Anomalo, a provider of data validation and documentation services, today announced an integration with data lakehouse platform Databricks. With this integration, enterprises can connect their Databricks instance to Anomalo and start monitoring the quality of the data stored in the lakehouse in real time.
Ensuring data quality is critical when building an AI or analytics project. However, as the complexity and volume of data increase, new dependencies are introduced into the code and third-party data sources are added. This can degrade data quality, or leave data broken, stale or corrupt, affecting downstream BI and analytics tools as well as modeling and machine learning frameworks. Organizations typically tackle this challenge by writing data validation rules or setting limits and thresholds, which takes up a lot of time and resources.
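To give a sense of what hand-written validation rules and thresholds look like, here is a minimal Python sketch. It is purely illustrative: the `price` column, the 5% null-rate limit and the value range are assumptions made for the example, not rules taken from Anomalo or Databricks.

```python
from typing import Optional

# A row is a plain dict; None stands in for a missing value.
Row = dict[str, Optional[float]]


def null_rate(rows: list[Row], column: str) -> float:
    """Return the fraction of rows where `column` is missing."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_rules(
    rows: list[Row],
    max_null_rate: float = 0.05,                       # hypothetical threshold
    price_range: tuple[float, float] = (0.0, 1000.0),  # hypothetical limits
) -> list[str]:
    """Apply two hand-set rules to a hypothetical `price` column."""
    failures = []

    # Rule 1: too many missing prices.
    rate = null_rate(rows, "price")
    if rate > max_null_rate:
        failures.append(f"price null rate {rate:.0%} exceeds {max_null_rate:.0%}")

    # Rule 2: non-missing prices must fall inside the allowed range.
    low, high = price_range
    out_of_range = [
        row["price"]
        for row in rows
        if row.get("price") is not None and not (low <= row["price"] <= high)
    ]
    if out_of_range:
        failures.append(f"{len(out_of_range)} price value(s) outside [{low}, {high}]")

    return failures
```

Every new column or metric needs its own rule and its own hand-tuned threshold, which is where the time and resource cost comes from.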
Anomalo automates the process
Founded by former Instacart executives Elliot Shmukler and Jeremy Stanley, Anomalo addresses this by monitoring enterprise data and automatically detecting data issues.
The solution leverages machine learning to validate ever-evolving datasets in near real time and flag unusual changes. When it detects an issue, it surfaces a rich set of visualizations that contextualize and explain the problem, as well as an instant root-cause analysis that points to the likely source of the issue. This enables developers to resolve problems with their data, down to individual rows and columns, before the data is used to power analytics or AI models, saving precious time.
If required, enterprise users can also set up custom validations and add no-code rules to track key metrics for datasets they care about.
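The article does not describe Anomalo's models, so the following is only a simplified stand-in for the general idea of flagging unusual changes from a metric's own history instead of from a fixed, hand-set limit. It uses a basic z-score test on daily row counts; the function name and the threshold of 3 are assumptions for illustration.

```python
from statistics import mean, stdev


def is_unusual(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits far from the historical mean.

    A plain z-score check, not Anomalo's method: the distance from the
    mean is measured in units of the sample standard deviation.
    """
    if len(history) < 2:
        return False  # not enough history to estimate spread

    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is unusual.
        return latest != mu

    return abs(latest - mu) / sigma > z_threshold
```

With a history of daily row counts around 1,000, a day with 400 rows would be flagged while a day with 1,003 rows would not, without anyone having chosen a fixed row-count limit in advance.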
How to connect with Databricks
Anomalo can be used as a cloud-based software-as-a-service (SaaS) platform or as a remotely managed VPC deployment. The solution even offers web and Slack notifications with data analyses, visualizations, and statistical summaries.
To get started, enterprise users simply need to log into Anomalo, connect their Databricks instance as a new data source by entering the server hostname and HTTP path, and select the tables they want to monitor for data quality issues. Once this is done, the solution will automatically monitor the data freshness, volumes, missing values and anomalies.
The entire process takes less than five minutes and requires no major configuration or coding effort. Along with issue alerts, users also get access to Anomalo’s Pulse dashboard, which provides an overview of data health, complete with insights into data coverage, arrival times, trends and repeat offenders, among other things.
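For readers unfamiliar with the signals mentioned above, the sketch below shows what freshness, volume and missing-value measurements might look like for a single table. It runs on in-memory rows rather than a live Databricks connection, and `profile_table` and the `loaded_at` column are hypothetical names chosen for the example, not part of Anomalo's product.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def profile_table(
    rows: list[dict[str, Any]],
    timestamp_column: str,
    now: datetime,
) -> dict[str, Any]:
    """Summarize freshness, volume and missing values for one table."""
    # Collect every column name that appears in any row.
    columns = sorted({column for row in rows for column in row})

    # Freshness: hours since the most recent timestamp in the table.
    timestamps = [
        row[timestamp_column]
        for row in rows
        if row.get(timestamp_column) is not None
    ]
    latest = max(timestamps) if timestamps else None
    hours_since_latest = (
        (now - latest).total_seconds() / 3600 if latest is not None else None
    )

    return {
        "row_count": len(rows),  # volume
        "hours_since_latest": hours_since_latest,  # freshness
        "missing": {  # missing values per column
            column: sum(1 for row in rows if row.get(column) is None)
            for column in columns
        },
    }


# Example usage with two hand-made rows.
now = datetime(2023, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"loaded_at": now - timedelta(hours=3), "amount": 5.0},
    {"loaded_at": now - timedelta(hours=1), "amount": None},
]
profile = profile_table(rows, "loaded_at", now)
```

Tracking these numbers over time for each monitored table is what lets a monitoring tool notice late arrivals, sudden drops in volume or a spike in missing values.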
Snowflake already connects to Anomalo
While Databricks support in Anomalo will play a major role in helping lakehouse users monitor the quality of their data, it is worth noting that this is not the first integration for the data quality platform. The solution also connects to other data repositories, including Snowflake’s data cloud. It counts itself as a Snowflake Select Partner.