Behavior Analytics in Your Security Data Lake Just Got Way Easier
Snowflake's new machine learning functions bring ML to SOCs without data scientists
I’m no data scientist, but I know machine learning can be a SOC’s best friend. In a previous role, my security engineering team had to convince our management (and ourselves) that we would get alerted if the Capital One hacker tried the same attack against us. Machine learning for behavior analytics gave us a way to detect a compromised AWS account trying to steal documents from our cloud. Now, prebuilt machine learning functions in Snowflake cut the effort involved from weeks to hours, and you don’t need to be a data scientist to use them.
Could Capital One Have Detected the 2019 Breach?
The hacker “erratic” behind one of the largest breaches of all time was not stealthy. She bragged about copying over 30 GB of documents out of 700 cloud storage buckets. Despite all that activity, the breach stayed undetected from March to July. When we read the breach reports, my team knew we needed detection time that would not be measured in months.
Could behavior analytics help detect this kind of cloud breach? We considered how a compromised account would act differently from how it typically behaved. This is where ML can play a role by learning what is normal for users, like how many documents they download in an hour. Then, a detection rule could compare recent activity to this baseline and flag suspicious behavior. Since “erratic” copied out tens of thousands of documents, she would likely have tripped this alarm early in the exfiltration phase.
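To make the baseline idea concrete, here is a minimal sketch in Python. It is not the model my team built, and it is not what Snowflake runs under the hood: it swaps in a naive mean-plus-three-standard-deviations rule as a stand-in for a learned baseline. The function names, the multiplier k, and the download counts are all made up for illustration.

```python
from statistics import mean, stdev

def build_baseline(hourly_counts, k=3.0):
    """Return (expected, upper_bound) from a user's past hourly download counts.

    Assumption: a simple mean + k * standard deviation rule stands in for
    a real ML model. k=3.0 is an arbitrary illustrative choice.
    """
    expected = mean(hourly_counts)
    spread = stdev(hourly_counts) if len(hourly_counts) > 1 else 0.0
    return expected, expected + k * spread

def is_suspicious(recent_count, baseline):
    """Flag the most recent hour if it exceeds the baseline's upper bound."""
    _, upper_bound = baseline
    return recent_count > upper_bound

# Hypothetical history: downloads per hour for one user.
history = [40, 55, 48, 52, 60, 45, 50]
baseline = build_baseline(history)

print(is_suspicious(58, baseline))    # a busy but ordinary hour
print(is_suspicious(5000, baseline))  # bulk copying, exfiltration-style
```

A fixed statistical rule like this ignores time-of-day and weekly patterns, which is one reason a trained time series model is the better fit for production detections.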
Baselining Your Cloud with ML Functions
My team spent a long time building the machine learning code for our UEBA baselining project. We needed a data scientist with skills in R and Jupyter to plug into the data lake and move data back and forth. Luckily for security operations teams everywhere, Snowflake now provides fully automated built-in machine learning functions as a shortcut to UEBA.
Snowflake is a cloud data platform with cheap storage and a compute engine that scales up and down to run fast detections and investigations. That’s made it a popular option for security teams looking to augment their SIEM. Like everything in Snowflake, the new machine learning functions are delivered as a fully managed service. Snowflake’s Anomaly Detection function is described as follows:
Anomaly detection allows you to detect outliers in your time series data by using a machine learning algorithm. You use CREATE SNOWFLAKE.ML.ANOMALY_DETECTION to create and train the detection model, and then use the <name>!DETECT_ANOMALIES method to detect anomalies.
The official Snowflake documentation and blog posts like No Data Science Team? No Problem offer helpful examples. They aren’t just a good starting point; they’re evidence that data science is getting democratized. This is all great news for the SOC.
You'll need relevant log data to train the anomaly detection service in your Snowflake account. AWS can generate CloudTrail logs for S3 file downloads, but these “GetObject” events get so chatty in production that many security teams haven’t been able to collect them. This is where the cheap and limitless storage of the data lake comes in. As I showed in a previous post, Snowflake is so cost-effective for security data that you can afford to bring in datasets that would otherwise never get collected, and now you can use that data for behavior analytics.
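Before any model can be trained, the raw events have to be rolled up into a time series, such as downloads per user per hour. The sketch below shows that shaping step in Python, assuming CloudTrail-style records with the eventName, eventTime, and userIdentity.arn fields. In practice you would do this with a query or view inside the data lake; the sample events and the account ID are invented.

```python
from collections import Counter
from datetime import datetime

def hourly_download_counts(events):
    """Count GetObject events per (user ARN, hour bucket)."""
    counts = Counter()
    for event in events:
        if event.get("eventName") != "GetObject":
            continue  # only S3 downloads feed this baseline
        ts = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.replace(minute=0, second=0)  # truncate to the hour
        user = event["userIdentity"]["arn"]
        counts[(user, hour)] += 1
    return counts

# Made-up sample events for a single hypothetical user.
USER = "arn:aws:iam::111122223333:user/user123"
sample_events = [
    {"eventName": "GetObject", "eventTime": "2024-01-15T10:05:00Z",
     "userIdentity": {"arn": USER}},
    {"eventName": "GetObject", "eventTime": "2024-01-15T10:40:12Z",
     "userIdentity": {"arn": USER}},
    {"eventName": "PutObject", "eventTime": "2024-01-15T10:41:00Z",
     "userIdentity": {"arn": USER}},
    {"eventName": "GetObject", "eventTime": "2024-01-15T11:02:30Z",
     "userIdentity": {"arn": USER}},
]

counts = hourly_download_counts(sample_events)
for (user, hour), n in sorted(counts.items()):
    print(user, hour.isoformat(), n)
```

Each (user, hour, count) row is one point in that user's time series, which is the shape an anomaly detection model expects as training input.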
The result of baselining file downloads for your AWS users is a trained model representing what is normal for each one. You can also extend the ML function to baseline users together with additional fields, such as the user’s department. In this way, your model learns not just from activities but also from tags that represent the user’s context.
No installation, maintenance, or tuning is required to run the model training function. The same goes for the function that compares new data to the existing model. A single DETECT_ANOMALIES call returns a result like this for each user:
This example shows user123 downloaded 150 files from the organization’s S3 buckets in the last hour. The ML model had forecast 50 downloads for this user and set 70 as the upper bound of the normal range, so anything above 70 counts as an anomaly. The bounds of the “normal” range are determined automatically by the model during training and may be recalculated when you retrain the model. Since user123 is way out of the normal range, the function returns TRUE. This is Snowflake saying that the user is acting very fishy.
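The logic behind that TRUE can be sketched in a few lines of Python. The field names loosely mirror the columns described above, and the anomaly check is recomputed here only for illustration, since Snowflake returns the flag itself. The lower bound of 30 is a made-up value; the example above only gives the forecast and the upper bound.

```python
from dataclasses import dataclass

@dataclass
class AnomalyRow:
    series: str         # the user this row belongs to
    y: float            # observed downloads in the last hour
    forecast: float     # what the model expected
    lower_bound: float  # hypothetical value, not given in the example
    upper_bound: float  # top of the "normal" range

    @property
    def is_anomaly(self):
        """True when the observed value falls outside the normal range."""
        return not (self.lower_bound <= self.y <= self.upper_bound)

row = AnomalyRow(series="user123", y=150, forecast=50,
                 lower_bound=30, upper_bound=70)

print(row.is_anomaly)             # user123 is outside the range
print(row.y - row.upper_bound)    # downloads beyond the upper bound
```

A detection rule can use the size of the overshoot, not just the boolean, to set alert severity.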
From Anomaly to Threat Detection
To complete our UEBA project, my team needed a way to pull our homegrown anomaly detection model into our detection engineering process. At the time, this involved our wonky but lovable scripts and container clusters galore. This is still the way many enterprise SOCs use data lakes today. But Snowflake and its cybersecurity ecosystem have made progress that changes the game.
One option mentioned in the documentation linked above is the new Snowflake Alerts service. You can use it as a fully managed way to run the DETECT_ANOMALIES command on a schedule and receive an email containing the results. This has the advantage of not requiring an external solution but doesn’t provide a framework for developing and managing detections.
Another option is to use a SIEM solution that can serve as a security wrapper for Snowflake. I joined Anvilogic because security teams at big companies were choosing it as their way to turn Snowflake into a security data lake. Now, with its anomaly detection functions, Snowflake gives security teams more than just cost savings and performance at scale. Any threat scenario you create in Anvilogic can call the DETECT_ANOMALIES function as part of its detection logic. The function’s output can trigger an alert or serve as one indicator in a sequence of events for the SOC. Anomaly detection built into the data lake gives security teams without data science expertise an easy shortcut to ML-powered threat detection.