The uncomfortable truth about operational data pipelines

The world is filled with situations where one size does not fit all – shoes, healthcare, the number of desired sprinkles on a fudge sundae, to name a few. You can add data pipelines to the list.

Traditionally, a data pipeline handles the connectivity to business applications, controls the requests and flow of data into new data environments, and then manages the steps needed to cleanse, organize and present a refined data product to consumers, inside or outside the business walls. These results have become indispensable in helping decision-makers drive their business forward.
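
As a rough sketch of those steps, a minimal pipeline might look like the following hypothetical Python example, which pulls raw records from a source application, cleans them up and publishes a refined table. The table names, columns and cleanup rules are illustrative assumptions, not any particular product's design.

    # A minimal, hypothetical extract-transform-load pipeline: pull raw records
    # from a business application, clean them up, and publish a refined table.
    import sqlite3
    import pandas as pd

    def extract(conn) -> pd.DataFrame:
        # Connectivity and flow control: read raw sales orders from the source system
        return pd.read_sql("SELECT * FROM raw_sales_orders", conn)

    def transform(raw: pd.DataFrame) -> pd.DataFrame:
        # Cleanse and organize: drop incomplete rows, convert currency, normalize dates
        clean = raw.dropna(subset=["order_id", "amount"])
        clean = clean.assign(
            amount_usd=clean["amount"] * clean["fx_rate"],
            order_date=pd.to_datetime(clean["order_date"]).dt.date,
        )
        return clean[["order_id", "customer_id", "order_date", "amount_usd"]]

    def load(refined: pd.DataFrame, conn) -> None:
        # Present the refined data product to downstream consumers
        refined.to_sql("sales_orders_refined", conn, if_exists="replace", index=False)

    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract(conn)), conn)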

Lessons from Big Data

Everyone is familiar with the Big Data success stories: How companies like Netflix build pipelines that manage more than a petabyte of data every day, or how Meta analyzes over 300 petabytes of clickstream data inside its analytics platforms. It’s easy to assume that we’ve already solved all the hard problems once we’ve reached this scale.

Unfortunately, it’s not that simple. Just ask anyone who works with pipelines for operational data – they will be the first to tell you that one size definitely does not fit all.

For operational data – the data that underpins core parts of the business like financials, supply chain and HR – organizations routinely fail to deliver value from analytics pipelines, even when those pipelines are designed to resemble Big Data environments.

Why? Because they are trying to solve a fundamentally different data challenge with essentially the same approach, and it doesn’t work.

The issue isn’t the size of the data, but how complex it is.

Leading social or digital streaming platforms often store large datasets as a series of simple, ordered events. One row of data gets captured in the pipeline when a user watches a TV show, and another records each ‘Like’ clicked on a social media profile. All of this data gets processed through data pipelines at tremendous speed and scale using cloud technology.

The datasets themselves are large, and that’s fine because the underlying data is extremely well-ordered and managed to begin with. The highly organized structure of clickstream data means that billions upon billions of records can be analyzed in no time.
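
To make the contrast concrete, here is a hypothetical sketch of what such clickstream data looks like: each interaction is one flat, self-contained row, so even billions of them can be analyzed in a single pass. The schema and values are invented for illustration.

    # Hypothetical clickstream events: each interaction is one flat,
    # append-only row with a simple, fixed schema.
    import pandas as pd

    events = pd.DataFrame([
        {"user_id": 101, "event": "play", "item_id": "show_42", "ts": "2022-09-01T10:00:00"},
        {"user_id": 101, "event": "pause", "item_id": "show_42", "ts": "2022-09-01T10:12:30"},
        {"user_id": 202, "event": "like", "item_id": "post_77", "ts": "2022-09-01T10:13:05"},
    ])

    # Because every row stands on its own, analysis is a single pass over the data
    plays_per_item = events[events["event"] == "play"].groupby("item_id").size()
    print(plays_per_item)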

Data pipelines and ERP platforms

For operational systems, on the other hand – such as the enterprise resource planning (ERP) platforms most organizations use to run their essential day-to-day processes – it’s a very different data landscape.

Since their introduction in the 1970s, ERP systems have evolved to optimize every ounce of performance for capturing raw transactions from the business environment. Every sales order, financial ledger entry, and item of supply chain inventory has to be captured and processed as fast as possible.

To achieve this performance, ERP systems evolved to manage tens of thousands of individual database tables that track business data elements and even more relationships between those objects. This data architecture is effective at ensuring a customer or supplier’s records are consistent over time.

But, as it turns out, what’s great for transaction speed within a business process typically isn’t so wonderful for analytics performance. Instead of the clean, straightforward and well-organized tables that modern online applications create, there is a spaghetti-like mess of data spread across a complex, real-time, mission-critical application.

For instance, analyzing a single financial transaction posted to a company’s books might require data from upward of 50 distinct tables in the backend ERP database, often with multiple lookups and calculations along the way.
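
As a heavily abridged illustration of those lookups, the sketch below stitches together a handful of such tables; the table and column names are hypothetical, and a real ERP schema would involve far more joins.

    # A heavily abridged, hypothetical version of the lookups behind one
    # financial transaction: header, line items, customer master, chart of
    # accounts and currency rates all live in separate tables.
    import pandas as pd

    def enrich_transaction(doc_id: str, tables: dict) -> pd.DataFrame:
        lines = tables["doc_lines"][tables["doc_lines"]["doc_id"] == doc_id]
        enriched = (
            lines
            .merge(tables["doc_header"], on="doc_id")                    # posting date, company code
            .merge(tables["customers"], on="customer_id")                # customer master data
            .merge(tables["gl_accounts"], on="account_id")               # chart of accounts
            .merge(tables["fx_rates"], on=["currency", "posting_date"])  # currency conversion
        )
        enriched["amount_usd"] = enriched["amount"] * enriched["rate_to_usd"]
        # ...a real ERP schema would continue this chain across dozens more tables
        return enriched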

To answer questions that span hundreds of tables and relationships, business analysts must write increasingly complex queries that can take hours to return results. Too often, the answers don’t come back in time, leaving the business flying blind at a critical moment in its decision-making.

To solve this, organizations attempt to further engineer their data pipelines, routing data into increasingly simplified business views that reduce query complexity and make the queries easier to run.

This might work in theory, but it comes at the cost of oversimplifying the data itself. Rather than enabling analysts to ask and answer any question with data, this approach summarizes or reshapes the data to boost performance. Analysts get fast answers to predefined questions but must wait far longer for everything else.
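
A small, hypothetical example shows the trade-off: once data has been summarized into a view built for one question, other questions can no longer be answered from that view alone.

    # A hypothetical pre-aggregated "business view": fast for the question it
    # was built for, but the row-level detail behind it is gone.
    import pandas as pd

    line_items = pd.DataFrame({
        "region": ["EMEA", "EMEA", "APAC"],
        "customer": ["acme", "zeta", "acme"],
        "month": ["2022-08", "2022-08", "2022-08"],
        "revenue": [1200.0, 800.0, 450.0],
    })

    # The simplified view answers "revenue by region per month" instantly...
    monthly_by_region = line_items.groupby(["region", "month"], as_index=False)["revenue"].sum()
    print(monthly_by_region)

    # ...but a new question such as "revenue by customer" cannot be answered
    # from the view alone; it means another trip back to the source system.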

With inflexible data pipelines, asking new questions means going back to the source system, which is time-consuming and becomes expensive quickly. If anything changes within the ERP application, the pipeline breaks completely.

Rather than applying a static pipeline model that can’t respond effectively to highly interconnected data, it’s important to design for that level of connection from the start.

Rather than breaking the problem up into ever-smaller pipelines, the design should embrace those connections. In practice, this means addressing the fundamental reason the pipeline exists in the first place: making data accessible to users without the time and cost of expensive analytical queries.

Every connected table in a complex analysis puts additional pressure on both the underlying platform and those tasked with maintaining business performance through tuning and optimizing these queries. To reimagine the approach, one must look at how everything is optimized when the data is loaded – but, importantly, before any queries run. This is generally referred to as query acceleration and it provides a useful shortcut.

This query acceleration approach delivers performance many times that of traditional data analysis, and it does so without requiring the data to be prepared or modeled in advance. By scanning the entire dataset and preparing it before queries are run, there are fewer limitations on how questions can be answered, and the full scope of the raw business data remains available for exploration.
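
As a simplified sketch of the general idea – not any specific vendor’s implementation – the expensive joins can be resolved once at load time, so that later questions become cheap scans of a single prepared, wide table. The table names below are illustrative only.

    # A simplified sketch of load-time preparation: resolve the expensive joins
    # once, when the data lands, so later questions become cheap scans of a
    # single wide table. Table names are illustrative only.
    import sqlite3

    prepare = """
    CREATE TABLE IF NOT EXISTS accelerated_transactions AS
    SELECT l.doc_id, l.amount, h.posting_date, c.customer_name, g.account_name
    FROM doc_lines l
    JOIN doc_header h ON h.doc_id = l.doc_id
    JOIN customers c ON c.customer_id = l.customer_id
    JOIN gl_accounts g ON g.account_id = l.account_id;
    """

    with sqlite3.connect("warehouse.db") as conn:
        conn.executescript(prepare)  # runs once at load time, before any analyst queries
        # Later questions are simple scans of the prepared table, not 50-way joins
        rows = conn.execute(
            "SELECT customer_name, SUM(amount) FROM accelerated_transactions "
            "GROUP BY customer_name"
        ).fetchall()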

By questioning the fundamental assumptions in how we acquire, process and analyze our operational data, it’s possible to simplify and streamline the steps needed to move from high-cost, fragile data pipelines to faster business decisions. Remember: One size does not fit all.

Nick Jewell is the senior director of product marketing at Incorta.

