Not Your Father’s Security Data Lake

Security Analytics “Big Data” Problem

There are two significant problems with log analytics and security analytics solutions today: a storage cost problem and a data silo problem. Both relate to how data is stored.

Data Storage Cost

Since the first recorded reference to “Big Data” by Michael Cox and David Ellsworth in their 1997 publication “Application-controlled demand paging for out-of-core visualization,” dealing with rapidly growing data volumes in an organization has been a never-ending challenge.

Security data is no exception, and analysts have always needed to store large amounts of log data. The migration to cloud infrastructure, along with the vast amounts of endpoint and user-activity data generated by a growing remote workforce, has made it a challenge for legacy monitoring and security solutions to keep up.

Current monitoring and security solutions have evolved over the years from simple log management and SIEM solutions, with rule-based alerting and simple anomaly detection on a manageable volume of log data, to advanced cloud analytics, threat hunting, and AI on logs and metrics from a vast cloud infrastructure and several SaaS applications. This has led to a significant increase in the data volume ingested by legacy log analytics applications, putting significant budget pressure on teams who must often resort to reducing data retention times, compromising their ability to interrogate historical data.

Data Silos

The number of security applications implemented in a SOC has increased as well. In addition to log management and SIEM, most organizations today have implemented solutions for UEBA, NTA, EDR, CASB, advanced threat detection, advanced analytics, and more. These solutions address specific needs and are often characterized as “point solutions,” storing data in separate silos that prevent data in different solutions from being correlated during analysis. As analysts realized that data in one solution would be valuable to analytics in another, SOAR products evolved to facilitate access to information across data silos. For example, SOAR gives SOC teams the ability to create playbooks that automate the collection of threat-related data from several sources and automate responses to low-level threats. However, this is merely a band-aid; it does not fully solve the problem.

So, you have multiple separate data silos, each with limited data retention, and you face steep increases in licensing and operational cost from the increase in data to be analyzed. These problems stem largely from the storage technology used in legacy solutions.

Storage Technology Evolution

Data storage technology has evolved over the past 30 years from basic filesystem storage to relational databases, data warehouses, index storage, the Hadoop stack, columnar databases, and most recently, cloud data warehouses. Over the past few years, data lakes entered the market to consolidate multiple data sources into one repository. Hadoop was a major driver as organizations embraced big data by creating data lakes for disciplines such as marketing, sales, and R&D. These data lakes were used primarily by data scientists, who would load huge amounts of geological, demographic, and meteorological data to aid in oil exploration, market segmentation, and weather forecasting, respectively.


There were also attempts to create Hadoop-based security data lakes by some forward-thinking enterprise security teams who recognized the storage and data silo problem, although these efforts were rarely successful for the following reasons:

  • Significant complexity in managing Hadoop, and the specialized data operations skills required
  • High operational overhead of managing Hadoop clusters
  • Expensive hardware that drove up costs
  • Lack of a pure SQL interface to the store
  • Proliferation of stores with different interfaces for Hive, HBase, Spark, etc.

What Is a Security Data Lake?

Let’s take a step back and look at what a security data lake is. In general, data lakes differ from data warehouses in that they aggregate all data “as-is,” independent of format or source, structured and semi-structured, into a single repository. In the case of a security data lake, the data can come from any device, entity, user, or network in the cloud or on-premises. A security data lake aggregates all enterprise log and event data in one repository, giving analysts easy access to all data across all sources and all applications. With schema-on-read, there is no requirement to understand how data is related when it is ingested; instead, end-users define those relationships as they consume the data. This makes a data lake very efficient at processing huge volumes of data cost effectively.
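As a minimal illustration of schema-on-read (the event data and field names here are invented), the same raw JSON log lines can serve two different analyst-defined schemas at query time, with nothing declared at ingest:

```python
import json

# Hypothetical raw events, stored "as-is" in the lake: nothing about
# their structure was declared at ingest time (schema-on-read).
raw_events = [
    '{"src_ip": "10.0.0.5", "action": "login", "user": "alice"}',
    '{"src_ip": "10.0.0.9", "bytes_out": 4096, "proto": "tcp"}',
]

def read_with_schema(raw, fields):
    """Apply a caller-defined schema at read time: pick out only the
    fields this consumer cares about, tolerating missing keys."""
    rows = []
    for line in raw:
        record = json.loads(line)
        rows.append({f: record.get(f) for f in fields})
    return rows

# Two consumers can impose two different schemas on the same raw data.
auth_view = read_with_schema(raw_events, ["src_ip", "user", "action"])
net_view = read_with_schema(raw_events, ["src_ip", "proto", "bytes_out"])
```

The point is that each consumer, not the ingest pipeline, decides which fields matter and how records relate.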

In the past, one disadvantage of a data lake was that it was usually not suited to quick, easy analytical processing. If you needed queries to return very fast, in a few seconds or less, the data lake didn’t give you that performance. It could also be difficult for less technical people to explore and ask questions of the data, given the lack of simple query-language support. This has now been addressed, and solved, with the emergence of the modern cloud data lake, like Snowflake, and applications like Elysium Analytics running on it.

A key difference between a legacy data lake, like Hadoop, and a cloud data lake, like Snowflake, is that a cloud data lake utilizes low-cost cloud storage to provide unlimited scale for both storage and access, whereas a Hadoop data lake has physical limits on how much it can store or expand. The modern cloud data lake provides not only truly unlimited low-cost storage, but also unlimited on-demand compute which, in the case of Snowflake, is billed by the second. Snowflake has also solved the search performance problem with its “Search Optimization Service,” bringing response times down by an order of magnitude compared to earlier data lake solutions.

With the introduction of Snowflake, security data lakes can be implemented with the following benefits:

  • Easy collection of data from all sources
  • Process for cleaning and enriching data in the pipeline to the store
  • Access to the data using standard interfaces – SQL, REST API
  • Metadata Catalog of the log data
  • Low operational overhead
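To make the cleaning-and-enrichment point concrete, here is a toy sketch of one step such a pipeline might perform; the field names and the threat-intelligence table are invented for illustration and do not describe any vendor’s actual pipeline:

```python
# Hypothetical threat-intelligence feed: source IP -> tag.
THREAT_INTEL = {"203.0.113.7": "known-scanner"}

def enrich(event):
    """Normalize field names, then tag events whose source IP appears
    in the threat-intelligence list, before loading into the store."""
    cleaned = {k.strip().lower(): v for k, v in event.items()}
    cleaned["threat_tag"] = THREAT_INTEL.get(cleaned.get("src_ip"))
    return cleaned

# Messy input from a collector: inconsistent casing and whitespace.
enriched = enrich({"Src_IP ": "203.0.113.7", "Action": "connect"})
```

Running this kind of step in the pipeline means analysts query already-consistent, already-tagged events downstream.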

This is why the team at Elysium Analytics ported our Hadoop-based security analytics solution to Snowflake.

When we initially built our solution on Hadoop, we experienced first-hand the operational challenges customers faced in scaling a complex, on-premises architecture cost effectively. In evaluating options for moving our solution from an on-premises architecture to a cloud-scale architecture, we had to satisfy three high-level requirements:

Cloud-Scale Compute — Access to “unlimited” processing power for deep machine learning and on-demand queries, given the “ebb and flow” of a security operations center, where massive compute capacity is required intermittently when a security event needs to be investigated. No on-premises or traditional PaaS architecture provides this elasticity.

Open Data Model — Security operations and threat detection are all about the data: you must be able to store both structured and semi-structured formats in their raw source form for data analysis and machine learning. Structured data needs to be enriched and parsed to accommodate a wide variety of reporting and analytical needs, but not normalized, so that flexible query capabilities are preserved. This also allows you to query multiple data stores and data types at once. Finally, an open data model lets analysts change data schemas and take differing views into the underlying data without the difficult process of re-indexing and re-loading it.

Predictive Analysis — Reducing the reliance on rules and queries to detect threat behaviors increases detection capabilities for outcomes that are challenging to detect via rules alone.
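As a toy example of what rule-free detection can look like: rather than hard-coding a “more than N logins” rule, flag users whose activity deviates strongly from the population baseline. The data and threshold below are invented for illustration:

```python
import statistics

# Invented daily login counts per user.
logins = {"alice": 12, "bob": 9, "carol": 11, "mallory": 85}

def anomalous_users(counts, z_threshold=1.5):
    """Return users whose count lies more than z_threshold standard
    deviations from the mean of all users, with no fixed rule on N."""
    values = list(counts.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all users behave identically; nothing stands out
    return [user for user, c in counts.items()
            if abs(c - mean) / stdev > z_threshold]
```

Real behavioral analytics models are far richer than a z-score, but the principle is the same: the baseline is learned from the data rather than written as a rule.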

The Modern Data Lake – Offered as SaaS

At Elysium Analytics, we have developed a Snowflake-native solution that meets all the above criteria and provides search, analytics, and graph on one data store with a full API interface and zero operations. With Snowflake as our data platform, we have a solution that provides:

  • Simple collection of data from all sources with a cloud-based “connect and collect” app.
  • Processing, cleaning, and enrichment of the data in the store pipeline with a cloud-scale data pipeline running on Snowflake.
  • Access to the data via full-text search, standard SQL, REST API, and KQL interfaces.
  • A catalog of log metadata with our open data model.
  • Zero operational overhead.

With Elysium Analytics, you can integrate with any security device log source “out of the box” and apply a data model. You can enable all log data in the lake for full-text search and profile all interesting data points with out-of-the-box behavioral analytics and explainable machine learning.
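To illustrate the idea behind full-text search over log data (real engines, including Snowflake’s Search Optimization Service, are far more sophisticated), a minimal inverted index over invented log lines might look like:

```python
from collections import defaultdict

# Invented log lines standing in for ingested events.
logs = [
    "failed login for user alice from 10.0.0.5",
    "connection accepted from 10.0.0.9",
    "failed password for root",
]

# Build an inverted index: token -> set of line ids containing it.
index = defaultdict(set)
for i, line in enumerate(logs):
    for token in line.lower().split():
        index[token].add(i)

def search(term):
    """Return the ids of log lines containing the term."""
    return sorted(index.get(term.lower(), set()))
```

The index is what lets a search return in milliseconds instead of scanning every stored line.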

On the front end of the solution, we have pre-built analytics dashboards for the most common sources and provide the ability to build your own dashboards with a few clicks. For visualization, we leverage and bundle Kibana, Looker, and Jupyter Notebook, all natively integrated.


Though it may seem obvious now that leveraging data lake technology for security analytics is the right way to go, the right technology is needed to make it viable in a production environment. With the availability of the highly efficient Snowflake platform, modern security solutions are finally able to break through the barriers that limit legacy solutions. We believe now is the time for security teams to embrace security data lakes.

Go to Elysiumanalytics.ai and sign up for a free trial for simple full text search and visualization of your data on Snowflake.