How to Jumpstart Your Collaboration with the Data Team
To make a longstanding impact, security leaders don't need to look further than the company's existing data analysts
The term “DataSecOps” isn’t catchy. Nevertheless, incorporating data analytics best practices into the security operation can dramatically improve visibility and threat detection, not to mention the cost savings of using general-purpose data lake technology alongside proprietary security logging platforms.
While security organizations can build up data expertise over time, the best way to adopt data analytics best practices is through collaboration with the data team. Here’s proven advice on how to get started.
Are Data Dynamos Security Savvy?
Unfortunately, traditional SIEM solutions, with their proprietary backends and languages, have driven a wedge between cyber and data teams. The two organizations have lacked common ground for years.
The sophistication of recent ransomware, credential stuffing, and cryptomining attacks makes the divide even wider. For example, hacker tools like OpenBullet come packed with features and plugins designed to bypass defensive measures. Understanding present-day attacks to the point of effectively detecting and mitigating them takes years of domain expertise. We can’t expect all data practitioners to also be security experts.
This makes it critical to take the right approach to collaboration between cyber and data organizations. Your company’s Director of Data Analytics might not yet be familiar with the MITRE ATT&CK framework or the persistence methods of Linux cryptomining malware, but they can still start playing an important role in defending the business. This lines up with how most centralized data teams support other departments.
When reaching out to the central data team, you can give examples of threat detection challenges that can be seen as typical data problems. Framing use cases in familiar terms early in the conversation will establish confidence that the right people are in the room and no one is wasting their time.
Communication with known bad actors: Identifying communication requires point lookups in network traffic data against known malicious IPs and domains.
A series of actions associated with attack techniques: Piecing together event sequences using event correlation methods to detect known attack patterns.
Large amounts of data copied out of the organization: Calculating data volumes and checking for high or unusual results.
Failed login attempts recurring across accounts: Analyzing login logs for repetitive failed attempts using pattern recognition techniques.
Unexpected changes to critical systems and environments: Applying contextual datasets to log events from operating systems and cloud providers.
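To make the framing concrete, here is a minimal Python sketch of two of the use cases above expressed as ordinary data operations: a point lookup against a blocklist and a group-by over failed logins. The sample records, field names, and the threshold of three accounts are assumptions made for illustration, not a recommended detection.

```python
from collections import defaultdict

# Hypothetical sample data; the field names are assumptions for illustration.
KNOWN_BAD_IPS = {"203.0.113.7", "198.51.100.23"}

network_events = [
    {"src": "10.0.0.5", "dst": "93.184.216.34"},
    {"src": "10.0.0.8", "dst": "203.0.113.7"},
]

login_events = [
    {"source_ip": "192.0.2.10", "account": "alice", "success": False},
    {"source_ip": "192.0.2.10", "account": "bob", "success": False},
    {"source_ip": "192.0.2.10", "account": "carol", "success": False},
    {"source_ip": "192.0.2.44", "account": "dave", "success": False},
    {"source_ip": "192.0.2.44", "account": "dave", "success": True},
]


def match_known_bad(events, bad_ips):
    """Point lookup: keep events whose destination is on the blocklist."""
    return [e for e in events if e["dst"] in bad_ips]


def failed_logins_across_accounts(events, min_accounts=3):
    """Group failed logins by source IP and keep sources that hit many accounts."""
    accounts_by_ip = defaultdict(set)
    for e in events:
        if not e["success"]:
            accounts_by_ip[e["source_ip"]].add(e["account"])
    return {
        ip: sorted(accounts)
        for ip, accounts in accounts_by_ip.items()
        if len(accounts) >= min_accounts
    }


bad_traffic = match_known_bad(network_events, KNOWN_BAD_IPS)
spray_sources = failed_logins_across_accounts(login_events)
print(bad_traffic)
print(spray_sources)
```

A data analyst would likely write the same logic as a join and a GROUP BY in SQL; the point is that neither step requires security-specific tooling.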
You can also frame the new collaboration using the “Three Vs” that are familiar to any experienced data practitioner. The organization’s threat detection challenge involves volume, with terabytes of security data generated across the environments. High SIEM ingest and storage costs often limit data collection and retention.
It also involves velocity in that threats often take too long to identify, scope, and mitigate. Since many event log sources are never sent to the SIEM, activity records often must be reactively loaded and analyzed days or weeks after the initial breach event. Proactively analyzing such sources can help the security team keep up with attacks as they unfold.
Finally, the data/cyber partnership must deal with the variety of security data. Activity logs and contextual datasets come in diverse formats and schemas, which presents a challenge when integrating, analyzing, and correlating the feeds.
Enabling the New Collaboration
You need to put a few things in place to bring together the cybersecurity and data teams. The first is to establish a common data platform. This is the Achilles heel of locked-in SIEM platforms like Palo Alto Networks’ XSIAM: they keep security and data orgs apart. In contrast, general-purpose analytics platforms like Snowflake support both the enterprise’s typical data analytics use cases and cybersecurity. This provides the necessary foundation for effective collaboration.
Once you and your new friends from the data team agree on a joint data platform, you’ll need to agree on a data ingest strategy. Depending on the use cases you prioritize, this might involve native connectors within the data platform, managed connectors within the security platform, or a dedicated solution for collecting security data. Analytics teams will be familiar with the term ETL, which stands for Extract, Transform, Load, with leaders like Fivetran and Matillion in the data space. Security-specific equivalents have emerged in the form of observability pipelines and security data fabrics. As described in a previous post, this space is developing rapidly, and among its different approaches there is likely a solution that meets your needs.
Another prerequisite that the data/cyber partnership must address is the common language for collaboration. While many security products have their own proprietary search syntax, data analysts have converged on the Structured Query Language (SQL). Security teams may initially be daunted by the need to express detection logic in SQL, but recent improvements have lowered the barrier to adoption.
For example, Anvilogic provides a detection content library with thousands of SQL-based rules that can be deployed off the shelf or used as a starting point for custom detections. Correlated detection scenarios can be created in a low-code builder that converts logic from the UI to SQL code behind the scenes. Recent advances in Gen AI are directly relevant to the SQL skills gap, with security practitioners able to express their threat detection logic in English, which the AI copilot converts to code.
Taken together, a common data platform, collection strategy, and query language enable a fruitful collaboration between security operations and data analytics teams. Good options for each of these areas are proven and readily available. Once you agree on these, you’ll be ready to improve security outcomes together.
A Roadmap for Getting Started
A data-driven security operation is not a solution you buy. While various products can accelerate your initiative and increase its likelihood of success, the following areas should be built out in collaboration with the data team. Taken together, these form a roadmap for taking ownership of your data and aligning with how the rest of the enterprise turns data into insights.
1. Data Preparation
Collecting logs from your systems and environments involves streaming them to your data platform, but making them useful takes preparation. In the separate, siloed world of cybersecurity, this was called “parsing”. Data analytics best practices establish that more work upfront is a worthwhile investment.
According to an article by data management provider Talend on the topic, “Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and combining datasets to enrich data.”
Work with your data team to identify the necessary data preparation steps that will support reliable, high-fidelity threat detection. This includes having standard time zones, extracting commonly used fields from nested semi-structured data, and creating “views” that join together related datasets for easy querying. For example, you might have one table for Windows logs and another for Linux logs. A view that combines the two could be used for detections that are operating-system agnostic.
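The preparation steps named above can be sketched in a few lines of Python. This is an illustration only: the raw field names (TimeCreated, Computer, event_json, ts, hostname, message, acct) are hypothetical, and in practice the same logic would more likely live in SQL views on the data platform.

```python
import json
from datetime import datetime, timezone


def to_utc(timestamp):
    """Normalize an ISO 8601 timestamp that carries a UTC offset to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


def prepare_windows(raw):
    """Map a hypothetical raw Windows record onto the shared schema."""
    payload = json.loads(raw["event_json"])  # nested semi-structured data
    return {
        "event_time": to_utc(raw["TimeCreated"]),
        "host": raw["Computer"],
        "user": payload["user"]["name"],
        "os": "windows",
    }


def prepare_linux(raw):
    """Map a hypothetical raw Linux record onto the shared schema."""
    payload = json.loads(raw["message"])
    return {
        "event_time": to_utc(raw["ts"]),
        "host": raw["hostname"],
        "user": payload["acct"],
        "os": "linux",
    }


windows_raw = [{
    "TimeCreated": "2024-03-01T09:15:00-05:00",
    "Computer": "WS-01",
    "event_json": '{"user": {"name": "alice"}}',
}]
linux_raw = [{
    "ts": "2024-03-01T15:20:00+01:00",
    "hostname": "web-01",
    "message": '{"acct": "bob"}',
}]

# The combined "view": both sources in one schema, ordered by UTC time.
all_events = sorted(
    [prepare_windows(r) for r in windows_raw]
    + [prepare_linux(r) for r in linux_raw],
    key=lambda e: e["event_time"],
)
for event in all_events:
    print(event)
```

Once both sources share column names and a single time zone, a detection can be written once against the combined data instead of once per operating system.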
2. Health Monitoring
Security operations teams often struggle with health monitoring. I’ve seen third-party audit findings ding a SOC for missing sensors, dead feeds, and out-of-date field names that cause detection misses. Even worse is when these issues are uncovered in a security incident retrospective.
While some SOCs have responded by assigning dedicated headcount to ensure that expected data is available and valid, the partnership between cyber and analytics can unlock powerful automation for this area. Data teams have mature tools and techniques for monitoring data health.
For example, the open source tool Elementary is used by thousands of data engineers to bake health checks directly into their data platform. This gives them an early warning on issues affecting freshness, volume, or schema changes. A data team using Elementary for their enterprise data health monitoring could easily extend their tests to cover the security data lake.
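The kinds of checks described here (freshness, volume, and schema) can be illustrated with a small Python sketch. This is not Elementary's API or configuration; the feed summaries, field names, and thresholds are hypothetical and only show the shape of the logic.

```python
from datetime import datetime, timedelta, timezone


def check_feed_health(feed, now, max_lag=timedelta(hours=1), min_ratio=0.5):
    """Return a list of issues for one feed; an empty list means healthy."""
    issues = []
    # Freshness: the newest event should be recent.
    if now - feed["last_event_time"] > max_lag:
        issues.append("stale")
    # Volume: recent row count should not fall far below the usual level.
    if feed["rows_last_24h"] < min_ratio * feed["avg_daily_rows"]:
        issues.append("low_volume")
    # Schema: every field that detections rely on should still be present.
    missing = set(feed["expected_fields"]) - set(feed["observed_fields"])
    if missing:
        issues.append("missing_fields:" + ",".join(sorted(missing)))
    return issues


now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical feed summaries, as they might be computed from table metadata.
feeds = {
    "edr": {
        "last_event_time": now - timedelta(minutes=5),
        "rows_last_24h": 9500,
        "avg_daily_rows": 10000,
        "expected_fields": ["host", "user"],
        "observed_fields": ["host", "user", "pid"],
    },
    "vpn": {
        "last_event_time": now - timedelta(hours=6),
        "rows_last_24h": 120,
        "avg_daily_rows": 4000,
        "expected_fields": ["src_ip", "user"],
        "observed_fields": ["source_ip", "user"],
    },
}

report = {name: check_feed_health(feed, now) for name, feed in feeds.items()}
print(report)
```

The renamed field in the hypothetical vpn feed is the kind of silent change that causes detection misses, which is why a schema check belongs alongside freshness and volume.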
3. Reports and Dashboards
Building reports and dashboards is one of the most common ways in which data teams support the business. In collaboration with the security operations team, business intelligence (BI) professionals can deliver powerful and actionable insights. For example, calculating the SOC’s workload in terms of daily alert volume and how well the team is keeping up can influence hiring and training plans. Breaking down top contributors to alert volume can focus tuning efforts on noisy detection rules. And making detection coverage metrics available to leadership can raise confidence and give executive stakeholders a sense of the progress that the SOC is making in staying ahead of the threats facing the organization.
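As a rough sketch of the first two reports, the Python below computes daily alert workload and the noisiest rules from a list of alerts. The alert records and rule names are made up for illustration; a BI tool would typically run the equivalent aggregation as SQL against the alerts table.

```python
from collections import Counter

# Hypothetical alert records; field and rule names are assumptions.
alerts = [
    {"day": "2024-03-01", "rule": "noisy_dns", "closed": True},
    {"day": "2024-03-01", "rule": "noisy_dns", "closed": True},
    {"day": "2024-03-01", "rule": "noisy_dns", "closed": False},
    {"day": "2024-03-01", "rule": "impossible_travel", "closed": True},
    {"day": "2024-03-02", "rule": "noisy_dns", "closed": False},
    {"day": "2024-03-02", "rule": "noisy_dns", "closed": False},
    {"day": "2024-03-02", "rule": "new_admin", "closed": True},
]


def daily_workload(alerts):
    """Count alerts opened and closed per day to show whether the SOC keeps up."""
    opened = Counter(a["day"] for a in alerts)
    closed = Counter(a["day"] for a in alerts if a["closed"])
    return {
        day: {"opened": opened[day], "closed": closed[day]}
        for day in sorted(opened)
    }


def top_rules(alerts, n=3):
    """Rank detection rules by alert volume to focus tuning efforts."""
    return Counter(a["rule"] for a in alerts).most_common(n)


workload = daily_workload(alerts)
noisiest = top_rules(alerts, n=1)
print(workload)
print(noisiest)
```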
With the security team using the same data platform as the rest of the enterprise, there will surely be a well-maintained and broadly accessible BI tool where the new SOC reports can be hosted. I recommend regularly meeting with your data analysts to review how reports are being used and refine the dashboards to reflect the KPIs that drive real action across detection engineering, hiring, and architecture.
4. Built-in Functions
Another advantage of giving the cyber team access to the enterprise’s central data platform is the availability of powerful built-in functions. These functions deliver a range of capabilities relevant to threat detection, threat hunting, health monitoring, and incident response. What’s really cool is that they are delivered “as a service” without the need for developer cycles.
For example, Snowflake’s built-in Haversine function takes the geolocation of two points and returns the distance between them. Advanced functions for time window operations, data generation, and semantic classification can play a role in the detection of beaconing malware, synthetically generated domains, and other TTPs.
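To show what a Haversine calculation does and how it might feed a detection, here is a standalone Python sketch. It is a stand-in for illustration, not Snowflake's implementation; the mean Earth radius of 6,371 km, the login record fields, and the 900 km/h speed threshold are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(login_a, login_b, max_speed_kmh=900.0):
    """Flag two logins whose implied travel speed exceeds the threshold."""
    distance = haversine_km(login_a["lat"], login_a["lon"],
                            login_b["lat"], login_b["lon"])
    hours = abs(login_b["epoch_s"] - login_a["epoch_s"]) / 3600
    if hours == 0:
        return distance > 0
    return distance / hours > max_speed_kmh


# Hypothetical logins for the same account, one hour apart.
new_york = {"lat": 40.7128, "lon": -74.0060, "epoch_s": 1_700_000_000}
london = {"lat": 51.5074, "lon": -0.1278, "epoch_s": 1_700_003_600}

print(round(haversine_km(new_york["lat"], new_york["lon"],
                         london["lat"], london["lon"])))
print(is_impossible_travel(new_york, london))
```

On a data platform with a built-in distance function, the distance step collapses into a single function call inside the detection query, and only the speed comparison remains as custom logic.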
5. Data Science
We’ll never know how much damage resulted from security operations being late to the AI/ML revolution. It’s undeniable that the lack of data science access faced by most cybersecurity organizations has hampered their ability to succeed. Security data lakes and the partnership between cyber and data teams present an opportunity to catch up.
After lining up with the data org on an analytics platform, the security operation can start exploring which of its initiatives could benefit from data science. From insider threats to credential stuffing and “low and slow” detection evasion, data science can help with use cases that are top of mind for many security organizations. See my previous post on behavior analytics for ways to get started.
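As one very simple starting point, the Python sketch below compares each user's current activity against their own history using a z-score. This is a basic statistical baseline chosen for illustration, not a method prescribed here; the upload figures, user names, and the threshold of three standard deviations are assumptions, and a real data science effort would go well beyond it.

```python
from statistics import mean, stdev


def zscore(history, today):
    """How many standard deviations today's value sits from the user's mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if today == mu else float("inf")
    return (today - mu) / sigma


# Hypothetical megabytes uploaded per day over the previous five days.
upload_history_mb = {
    "alice": [100, 110, 90, 105, 95],
    "bob": [20, 25, 22, 18, 20],
}
uploads_today_mb = {"alice": 102, "bob": 400}

THRESHOLD = 3.0  # assumed cutoff for "unusual"

flagged = sorted(
    user
    for user, history in upload_history_mb.items()
    if zscore(history, uploads_today_mb[user]) > THRESHOLD
)
print(flagged)
```

Comparing each user with their own baseline, rather than with a single global limit, is what lets this kind of check surface a quiet account that suddenly moves a lot of data.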
As you can see, partnering with the data team can tremendously benefit the security operation. To get started, you don’t need to hire world-class data experts for the security team. Your company already has great data resources on staff. They’re just helping other teams.
The foundational requirements and roadmap outlined above can help you and your peers on the data side of the house know what to expect from a partnership. You’ll be surprised how much support cyber/data collaboration generates among senior leadership. Data practitioners are usually happy to help defend the enterprise against threat actors. This is a great way to do more with what your organization already has—a top CISO priority in 2024.