The Security Data Fabric Identity Crisis
Why success with decoupled data pipelines depends on overcoming a people problem
Data Collection Wants to Be Free
Big changes are happening to how security teams get their data. For years, data collection was a function of the SIEM. Splunk’s app store, for example, includes hundreds of supported connectors that integrate with everything from firewalls to vulnerability scanners. In parallel, large SOCs formed teams to manage open-source pipelines for shaping and routing data using technologies like Apache Kafka and NiFi.
Security organizations have now reached a tipping point for their data. On one side are the three V’s of data explosion: volume, velocity, and variety. Security teams are dealing with an avalanche of logs from endpoint agents, multi-cloud hybrid infrastructure, distributed workforces, and SaaS applications. Information must be collected from within the environment and from outside via APIs. It’s also more valuable than ever, with advances in AI enabling new insights on risks and threats. Add “value” as a fourth V driving the need for data pipeline investment.
On the other side, alternative destinations have opened up. Cheap cloud storage, subsidized cloud provider SIEMs, and powerful cloud data platforms could be better options than the monolithic SIEM for some datasets. The resulting dynamic has been described as The Great Splunkbundling, where previously integrated components are freed from the SIEM, starting with the data pipeline.
Cribl has dominated this space with its Cribl Stream product. More powerful and cost-effective than Splunk’s own solution, Cribl Stream gained widespread adoption by maintaining compatibility while opening up a range of sources and destinations. Investors value Cribl at over $2 billion because they know that security teams (as well as IT, DevOps, and other log consumers) value optionality and specialization.
As security operations embrace commercial off-the-shelf data collection solutions, more providers are entering the scene. Some, such as Databahn, tout their cybersecurity expertise and AI use cases. Others, such as Dassana, Monad, and DataBee, emphasize data sources beyond logs, including vulnerability, identity, and asset datasets that can support compliance analytics and security metrics. These newcomers identify as security data fabric solutions.
Investors and industry heavyweights have poured hundreds of millions of dollars into this emerging category. Zscaler’s Avalor acquisition and SentinelOne’s investment in Auguria are examples from last month. Cole Grolmus captured the activity in a post that drew plenty of comments, not least from security data fabric companies that didn’t get included in his cool diagram.
But how should security leaders evaluate newcomers like Tarsal (“One click to build your security data lake”)? And what is the identity crisis that threatens to crash the security data fabric party?
Demystifying Security Data Fabric
Gartner defines data fabric (generally, not just for security) as “an emerging data management design for attaining flexible, reusable and augmented data integration pipelines, services and semantics.” The top job for data fabric is making data available in the right places, in the right formats, and reliably. It sounds highly relevant for security operations! But where does it fit into the existing SOC stack?
As Pramod Gosavi points out, security data fabric is not a security data lake. Gosavi writes that it “ingests data from multiple feeds, then aggregates and compresses, standardizes, enriches, correlates, and normalizes that data before transferring a full-time-series dataset to a security data lake.” In other words, it is a pipeline that connects the multitude of logs, findings, and contextual datasets to several destinations. With many enterprises juggling multiple SIEMs and data lakes, that “many to many” fabric must address substantial complexity.
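To make the “many to many” idea concrete, here is a minimal Python sketch, assuming invented source types, field names, and routing rules rather than any particular product’s configuration: events are normalized onto a common envelope, enriched with asset context, and then routed to whichever destinations are configured for that source.

```python
# Minimal sketch of a "many to many" fabric pipeline.
# Source types, fields, and routing rules are hypothetical.
from datetime import datetime, timezone

ROUTES = {
    # source type -> destinations that should receive the normalized event
    "edr_alert":    ["siem", "data_lake"],
    "vuln_finding": ["data_lake"],
    "netflow":      ["archive"],  # very high volume, store cheaply
}

def normalize(raw: dict, source_type: str) -> dict:
    """Map a raw event onto a common envelope before routing."""
    return {
        "source_type": source_type,
        "observed_at": raw.get("timestamp") or datetime.now(timezone.utc).isoformat(),
        "asset": raw.get("host") or raw.get("asset_id"),
        "payload": raw,
    }

def enrich(event: dict, asset_inventory: dict) -> dict:
    """Attach business context (owner, criticality) from an asset inventory."""
    context = asset_inventory.get(event["asset"], {})
    event["owner"] = context.get("owner", "unknown")
    event["criticality"] = context.get("criticality", "low")
    return event

def route(event: dict, writers: dict) -> None:
    """Send the event to every destination configured for its source type."""
    for destination in ROUTES.get(event["source_type"], ["data_lake"]):
        writers[destination](event)

# Example wiring: in practice the writers would be SIEM, lake, and object-store clients.
writers = {name: (lambda event, n=name: print(f"-> {n}: {event['source_type']}"))
           for name in ("siem", "data_lake", "archive")}
inventory = {"web-01": {"owner": "platform", "criticality": "high"}}
raw = {"timestamp": "2024-04-01T12:00:00Z", "host": "web-01", "rule": "ransomware"}
route(enrich(normalize(raw, "edr_alert"), inventory), writers)
```

The point is not the code itself but the shape: normalization and enrichment happen once, while routing decisions stay declarative and per-source.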
For large enterprises, many sources are located within the organization, in the data center or private cloud. These require the kind of forwarder infrastructure that Cribl has perfected and that most security data fabrics do not offer. Enterprise log sources such as network flows and operating system events are very high volume, so Cribl’s data reduction capabilities are another big advantage.
Where the new security data fabric solutions may have a role to play is in collecting data from behind APIs. SaaS security tools like Okta, Wiz, and Axonius have important pieces of the overall SOC puzzle. When moving away from bundled SIEM connectors, security leaders can turn to security data fabric solutions for collecting, cleaning, and enriching cloud-based sources. As in the Avalor example below, these tend to describe posture, assets, and identities.
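As a rough sketch of what collecting from behind an API involves, the snippet below polls a hypothetical SaaS findings endpoint with cursor-based pagination and hands each record to the pipeline. The URL, parameters, and response fields are placeholders, not the actual Okta, Wiz, or Axonius APIs.

```python
# Hypothetical poller for a SaaS security API; the endpoint and fields are invented.
import os
import requests

API_URL = "https://api.example-saas.com/v1/findings"  # placeholder, not a real product API
TOKEN = os.environ["EXAMPLE_SAAS_TOKEN"]

def pull_findings(cursor=None):
    """Yield findings page by page, following the cursor returned by the API."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {TOKEN}"
    while True:
        params = {"limit": 500}
        if cursor:
            params["cursor"] = cursor
        resp = session.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        yield from body.get("findings", [])
        cursor = body.get("next_cursor")
        if not cursor:  # no more pages; a real collector would persist the last cursor
            return

for finding in pull_findings():
    # Hand off to the fabric: normalize, enrich, and route as in the earlier sketch.
    print(finding.get("id"), finding.get("severity"))
```

Multiply that by dozens of SaaS tools, each with its own authentication, rate limits, and schema changes, and the case for buying these connectors rather than building them becomes clear.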
The sources supported by different security data fabric solutions form one of the considerations for selection and implementation. The bigger challenge comes from what happens after the data is available.
Diagnosing the Identity Crisis
The original consumers of data fabrics and Extract-Transform-Load (ETL) products are data engineers and analysts. The trouble with applying these approaches to cybersecurity is that it’s an insular field with lots of specialized knowledge and not enough data skills. Fivetran can build a big business connecting Salesforce to Snowflake for customer relationship management. Its users have worked with earlier iterations of the same stack for decades. The cybersecurity equivalents of Fivetran face a customer base that’s often new to SQL, BI, and data science.
The result is an identity crisis where data collection products feel compelled to provide “last mile” analytics use cases. Dassana, for example, talks about “Revolutionizing security data ETL” on one page, while its solution includes an app for vulnerability management and another for security KPIs. These are great use cases to address, but they raise the question: can one startup solve the thorny issues of “many to many” data connectivity while also solving risk-based finding prioritization and scoring the SOC?
The answer may depend on the customer. Smaller security teams may prefer a security data fabric that includes connectors and prebuilt analytics. Larger security organizations, on the other hand, should strive to get as many integrations as they can off the shelf while developing their people’s data analytics capabilities.
Building and maintaining API connectors is a slog that is best outsourced, and doing so no longer requires SIEM lock-in. But “one click to build your security data lake” still requires you to consider how that data will be put to work for threat detection, risk management, and executive reporting.
Putting Security Data Fabric to Work
The value security data fabric delivers to an organization depends on the consumer side. Security leaders deploying a fabric or security ETL solution should thoroughly plan in advance for the people, processes, and tooling that will use the data.
One of the big differences between a security data fabric approach and traditional data tiering is that all the destinations should be usable. Data in the fabric is routed to where it can best be used rather than just stored for future rehydration. This means that the people involved in those use cases need broadly applicable analytics skills. A security data fabric architecture turns proprietary query languages into a liability, since they are only useful for one of the destinations in the fabric. SQL (a de facto standard across data platforms) and Python are increasingly valuable for SOCs because they are supported across a broad range of security products and adjacent tooling, such as Jupyter notebooks and Power BI.
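To illustrate why those standard skills travel well, the sketch below expresses one detection-style question in plain SQL and runs it through a generic DB-API connection, so the same logic can point at whichever destination in the fabric holds the data. The table and column names are hypothetical, and the interval syntax may need minor per-platform adjustments.

```python
# One portable query, many possible destinations; the schema is hypothetical.
FAILED_LOGIN_SPIKE = """
    SELECT user_name,
           COUNT(*) AS failures
    FROM   auth_events
    WHERE  outcome = 'FAILURE'
      AND  event_time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY user_name
    HAVING COUNT(*) > 20
"""

def failed_login_spikes(connection):
    """Run the query on whichever warehouse or lake the fabric routed auth data to."""
    cursor = connection.cursor()
    try:
        cursor.execute(FAILED_LOGIN_SPIKE)
        return cursor.fetchall()
    finally:
        cursor.close()

# The connection can come from any DB-API driver the destination supports,
# e.g. a Postgres, Snowflake, or Databricks connector supplied by the platform team.
```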
Consumer-side processes should also be considered early in the security data fabric initiative. With data from many sources being analyzed at many destinations, onboarding processes should ensure a consistent mapping to data models with standard schemas. Quality assurance, aging out old data, and iteratively reviewing which analytics platforms are used for each use case are all valuable processes to consider.
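One way to keep onboarding consistent is to treat each source’s mapping as reviewable code: a small mapper to the shared data model plus a quality gate that rejects records missing required fields instead of silently forwarding them. The schema and field names below are illustrative, not a published standard.

```python
# Illustrative onboarding pattern: one mapper per source, one shared quality gate.
REQUIRED_FIELDS = ("event_time", "source", "actor", "action")

def map_identity_provider_event(raw: dict) -> dict:
    """Hypothetical mapper from one vendor's log shape to the shared data model."""
    return {
        "event_time": raw.get("time"),
        "source": "identity_provider",
        "actor": raw.get("user"),
        "action": raw.get("event_type"),
        "raw": raw,  # keep the original record for investigations
    }

def passes_quality_gate(event: dict) -> bool:
    """Fail at onboarding, rather than in downstream analytics, when required fields are missing."""
    return all(event.get(field) for field in REQUIRED_FIELDS)
```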
Finally, on the tooling side, security leaders should opt for analytics solutions that plug into the stack without breaking it. Since most security products were not designed to work with a data fabric, this is easier said than done. From threat detection platforms to compliance automation, analytics products have traditionally been built to connect directly to individual sources or the data pipeline. This is convenient for smaller organizations that haven’t yet rolled out extensive connectors. For larger SOCs, however, the “inline” approach conflicts with their existing data flows. Having to ship data out to a third-party analytics solution only to bring it back into various storage and analytics platforms impacts cost, risk, and flexibility.
The “downstream” approach is preferable for enterprise security teams that have invested in their pipelines and want to avoid sensitive data leaving their perimeter. With downstream analytics, a vendor’s solution can plug into the customer’s data platforms for analytics. Not all products support this approach due to dependencies on streaming analytics and other design considerations. But products like CyberSaint for risk management and Anvilogic for SIEM that can work “downstream” from the customer’s data platforms are better aligned with security data fabric success.
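A minimal sketch of the downstream shape, assuming hypothetical detection content and table names: the vendor ships queries, the customer supplies a query runner bound to its own platform, and raw events never leave that platform.

```python
# Downstream analytics sketch: vendor-supplied content runs inside the
# customer's data platform. Detection names and tables are hypothetical.
from typing import Callable, Iterable

DETECTIONS = {
    "new_service_installed": (
        "SELECT host, user_name, COUNT(*) AS occurrences "
        "FROM windows_events WHERE event_id = 7045 "
        "GROUP BY host, user_name"
    ),
}

def run_downstream(run_query: Callable[[str], Iterable[tuple]]) -> dict:
    """Execute vendor detections using a query runner the customer controls."""
    return {name: list(run_query(sql)) for name, sql in DETECTIONS.items()}

# The customer binds run_query to its own warehouse or lake connection;
# only results, not raw logs, ever flow back to the vendor.
```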
Take these considerations into account when planning your security data fabric initiative. No single tool delivers successful security analytics across many sources connected to many destinations. But a combination of proven platforms and an exciting batch of well-funded innovators is set to unlock dramatic gains for data-driven security organizations.