Cybersecurity attracts people with a particular kind of creativity. We’re hackers in the original sense of the word, resourceful at getting around technical obstacles. But the way in which many creative security engineers have gotten around the limitations of their SIEM has turned into the worst kind of vulnerability: a false sense of security.
How the SOC Creates Dark Data
Gartner defines dark data as “information assets organizations collect… but generally fail to use for other purposes.” The concept usually describes the risk of keeping things like customer call transcripts that could leak in a data breach. But the security organization faces a different kind of risk when it stores data that never gets used. In the case of the SOC, dark data is any log that gets collected but doesn’t help detect threats.
It’s easy to understand why dark security data is so prevalent. The SIEM solutions where detection engineering takes place make it too expensive to justify ingesting some sources, yet compliance policies still require those sources to be collected. So the security team turns to a storage service like AWS S3 and uses it as a dumping ground. Over time, reliance on S3 buckets grows to the point where S3 becomes the organization’s largest log repository. In my work with cloud-centric security teams, I’ve found that dark data is usually more than twice the volume of the data analyzed in the SIEM.
The broad shift from SIEM vendor data pipelines (e.g. Splunk Heavy Forwarder) to independent data pipelines (e.g. Cribl Stream) has accelerated this trend. There’s also more data being generated directly in cloud storage, like AWS CloudTrail. While the motivation for keeping logs in buckets is well understood, mature security teams should weigh the side effects.
The Impact on Security Posture
The problem with keeping your security data in S3 is that it doesn’t help you detect threats. While you’ve addressed the storage side of the problem, the result is still a reactive security posture. The indicators may be there, but you’ll only find them after the threat actor has worked their way down the kill chain. That’s far from ideal.
There are other issues with the dark security data dumped in cloud buckets. Data pipelines are brittle creatures, and over time there are sure to be broken feeds and missing sources. Spotting and fixing these data outages is itself an analytics challenge that requires more than a bucket. Access control over those buckets, and visibility into what they actually contain, are also important to consider.
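Even a crude freshness check over the prefixes each feed is supposed to write to will surface silent outages. Here’s a minimal sketch using boto3; the bucket name, prefixes, and staleness threshold are hypothetical and would come from your own feed inventory:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Hypothetical inventory of log feeds and the S3 prefixes they should write to.
EXPECTED_FEEDS = {
    "cloudtrail": "logs/cloudtrail/",
    "vpc-flow": "logs/vpc-flow/",
    "edr": "logs/edr/",
}
BUCKET = "example-security-archive"   # assumed bucket name
MAX_AGE = timedelta(hours=6)          # flag any feed silent longer than this

s3 = boto3.client("s3")

def newest_object_age(bucket: str, prefix: str) -> timedelta | None:
    """Return the age of the most recently written object under a prefix."""
    newest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    return None if newest is None else datetime.now(timezone.utc) - newest

for feed, prefix in EXPECTED_FEEDS.items():
    age = newest_object_age(BUCKET, prefix)
    if age is None or age > MAX_AGE:
        print(f"[STALE] {feed}: nothing fresh under s3://{BUCKET}/{prefix}")
    else:
        print(f"[OK] {feed}: last object written {age} ago")
```

In practice you’d scope the listing to the latest date partition rather than walking an entire prefix, but even this blunt check catches the most common failure mode: a shipper that quietly stopped writing weeks ago.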
In practice, you’re probably already doing this hack in pockets of the SOC, if not as official policy. Organizations that rely on storage buckets for significant security datasets should review whether they can actually use that data consistently and reliably, if at all, to identify threats in their environment.
Can’t I Just Query Inside These Buckets?
I’ve worked with teams that built their TDIR strategy around S3. For one large B2B SaaS vendor, that meant using AWS Athena ad hoc for incident response. When faced with a suspected breach, the security analysts found that searching across dozens of terabytes meant nerve-racking wait times for each query. The investigation ultimately concluded that there was no breach. While this was a relief for the CISO, the finance team was less pleased with the bill: Athena charges by bytes scanned, and the price tag for the investigation came as a shock.
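To get a feel for how quickly this adds up, Athena reports the bytes scanned for every query, and the bill is essentially that number times the per-terabyte rate for your region (roughly $5 per TB scanned at the time of writing; check current pricing). A rough sketch, assuming you’ve collected the execution IDs from an investigation:

```python
import boto3

athena = boto3.client("athena")

PRICE_PER_TB = 5.00  # assumed rate; confirm against the Athena pricing page for your region
query_ids = ["<execution-id-1>", "<execution-id-2>"]  # hypothetical execution IDs

total_bytes = 0
for qid in query_ids:
    stats = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Statistics"]
    total_bytes += stats.get("DataScannedInBytes", 0)

tb_scanned = total_bytes / 2**40  # close enough for an estimate
print(f"Scanned {tb_scanned:.2f} TB, roughly ${tb_scanned * PRICE_PER_TB:,.2f}")
```

An unpartitioned search over dozens of terabytes scans dozens of terabytes on every query, so a few dozen pivots during an incident land the bill in the thousands of dollars, before counting the analyst hours spent waiting on results.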
Although performance and cost-effectiveness can be improved through careful database management (ever run a VACUUM command?), that’s work most security teams are not doing today and would rather not take on. Another security team I worked with expected to search a hundred terabytes of CloudTrail in place; after a few days of struggle, they reached out for help loading the data into Snowflake. Pluralsight has a solid article on Athena and its limitations, and some customers have written about moving from S3/Athena to Snowflake. That’s one reason why Anvilogic, which I recently joined, chose Snowflake as the recommended security data lake platform to complement Splunk.
A False Hope: Log Rehydration
Rehydration, restoration, and replay are different names for the same process: bringing archived data back into the analytics platform when it’s needed. This looks good on paper, but in practice it’s a recipe for disaster.
Consider a security org that collects 10 TB/d and keeps it for one month in Microsoft Sentinel. For the rest of the year, data is archived and the team plans to restore it when needed. Then the SolarWinds event happens and the CISO wants to know if the company was affected. It’s only at this point that the team reads the fine print on the rehydration feature:
Restore data for a minimum of two days.
Restore data more than 14 days old.
Restore up to 60 TB.
Restore is limited to one active restore per table.
Restore up to four archived tables per workspace per week.
Limited to two concurrent restore jobs per workspace.
It turns out that this team can restore less than one week of logs at a time, and may need to shard their restores across multiple workspaces in order to stay within the limits. There is also a cost of $100 per terabyte restored. So for a ten-month investigation, 3,000 TB may need to be restored and analyzed at a cost of $300,000, with no ability to analyze across the slices of restored data. And when additional campaign IOCs and TTPs are released a week later, the exercise would need to be repeated.
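Working through that arithmetic makes the operational pain concrete. A back-of-the-envelope calculation using the figures from the example above (treat every number as illustrative):

```python
import math

daily_volume_tb = 10           # ingest rate from the example
archived_days = 300            # roughly ten months outside the hot tier
restore_cap_tb = 60            # maximum size of a single restore job
restores_per_workspace_week = 4
price_per_tb = 100             # assumed restore cost from the example

total_tb = daily_volume_tb * archived_days                # 3,000 TB
restore_jobs = math.ceil(total_tb / restore_cap_tb)       # 50 jobs of at most ~6 days each
weeks_on_one_workspace = restore_jobs / restores_per_workspace_week  # 12.5 weeks

print(f"{total_tb} TB to restore in {restore_jobs} separate jobs")
print(f"~{weeks_on_one_workspace:.1f} weeks of elapsed time on a single workspace")
print(f"Restore cost: ${total_tb * price_per_tb:,}")
```

Fifty disjoint slices of at most six days each, no way to query across them, and a quarter of a year of wall-clock time on a single workspace: that’s what hides behind “we’ll just rehydrate.”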
This isn’t to pick on Sentinel and its multi-tier storage architecture—the same challenges affect every rehydration strategy. If your analytics platform could hold the data you need, you wouldn’t be relying on rehydration in the first place. The bottom line is that security data that’s been dumped in storage is much less useful than it seems.
Cleaning Up Your Dark Data
Does all of this contradict last week’s post on the benefits of using a security data lake? Not at all: using cloud object storage as part of a security data lake (whether directly in a service like S3 or indirectly in a data platform like Snowflake) can be a great way to go, with proven benefits. The trouble begins when you do security architecture without first putting on your data engineer hat. There needs to be a plan for using that data consistently and reliably.
In other words, your approach to storing security data should be an integral part of your broader detection and response strategy. Wherever you house your security data, it should satisfy the following basic criteria:
The stuff you’re expecting to get is arriving.
The stuff that’s arriving is helping to detect threats.
The threat detections are being combined to reduce alert noise.
The stuff you’ve collected and the alerts you’ve generated are useful for investigations.
If you’re not sure that this is the case for all of your security data, create a list of your security data feeds and review what’s going where. Run a drill in which the indicator of compromise is in a source that’s routed somewhere other than your SIEM. Did you receive an alert? Ask an analyst to build a timeline of a certain user’s activity in that source system from six months ago. Or go full chaos monkey and pause a couple of log shippers to see if the outage gets flagged.
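For the timeline drill, it’s worth being honest about what “we can query it in the bucket” actually involves. Here’s a sketch of what that might look like against a hypothetical CloudTrail table in Athena; the database, table, column names, user, and results bucket are all assumptions, and the table itself only exists if someone did the data engineering up front:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical table; column names depend entirely on how your CloudTrail DDL was defined.
QUERY = """
SELECT eventtime, eventsource, eventname, sourceipaddress
FROM cloudtrail_logs
WHERE useridentity.username = 'alice'
  AND event_date BETWEEN '2024-11-01' AND '2024-11-30'
ORDER BY eventtime
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "security_archive"},                  # assumed database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # assumed bucket
)
qid = execution["QueryExecutionId"]

# Poll until the query finishes; on unpartitioned data, this wait is the painful part.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

print(f"Query {qid} finished with state {state}")
```

If that query takes twenty minutes and scans the whole table because nobody partitioned it by date, the drill has told you exactly what an analyst will face during a real incident.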
The key is to create a culture in your organization with a healthy degree of skepticism. Data that was collected might not be supporting your detections. When we say we have those logs, what do we really mean? A slightly cynical approach to dark data will appeal to the hacker instincts of your team members.