Data Skipping

“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore

Published: Friday, 15 November 2024 07:20 (EST)
By Tomás Oliveira

Big data is like a double-edged sword. On one hand, it gives us an unprecedented amount of information to analyze and work with. On the other hand, it can be a nightmare to process efficiently. If you’ve ever felt like your big data system is crawling at a snail’s pace, you’re not alone. Enter data skipping, a technique that’s been flying under the radar but can make a world of difference in performance.

So, what exactly is data skipping? It’s a method that allows your system to skip over irrelevant data during query processing. Instead of scanning through every single piece of data in your dataset, data skipping helps your system focus only on the chunks that matter. Think of it as a way to fast-forward through the boring parts of a movie to get to the action scenes. Sounds cool, right? Let’s dive deeper into how it works and why it’s a game-changer for big data.

How Does Data Skipping Work?

At its core, data skipping relies on metadata. When data is stored, metadata is generated to describe certain characteristics of that data, like the minimum and maximum values in a particular column. When a query is run, the system can check the metadata to see if a data block contains any relevant information. If the block doesn’t meet the criteria, it’s skipped entirely. This can save a ton of time, especially when dealing with massive datasets.
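
To make the idea concrete, here is a minimal sketch of the write side and the metadata check. It assumes fixed-size blocks over a single numeric column; the names `Block`, `write_blocks`, and `may_contain` are made up for illustration and don’t come from any particular system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A chunk of column values plus the metadata recorded when it was written."""
    values: List[float]
    min_value: float
    max_value: float


def write_blocks(values: List[float], block_size: int) -> List[Block]:
    """Split values into fixed-size blocks and record min/max for each one."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def may_contain(block: Block, low: float, high: float) -> bool:
    """True if the block's [min, max] range overlaps the query range [low, high]."""
    return block.max_value >= low and block.min_value <= high


blocks = write_blocks([3, 7, 5, 42, 40, 48, 90, 95, 91], block_size=3)
candidates = [b for b in blocks if may_contain(b, 40, 50)]
print([(b.min_value, b.max_value) for b in blocks])  # per-block metadata
print(len(candidates), "of", len(blocks), "blocks need to be read")
```

Note that the check only rules blocks out. A block that passes may still hold rows that don’t match, so the rows inside it have to be filtered as usual.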

Imagine you’re running a query to find all transactions above $10,000 in a dataset containing millions of transactions. Instead of scanning through every single transaction, data skipping allows the system to skip over blocks where the maximum transaction value is $10,000 or less, because no row in such a block can match. It’s like having a cheat sheet for your data, allowing you to bypass irrelevant sections and get to the good stuff faster.
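
The transaction query can be sketched in a few lines. The block contents and the `(min, max, amounts)` layout below are invented for the example; a real system would read the metadata from the file or table format rather than from a Python list.

```python
# Each block: (min_amount, max_amount, amounts). The numbers are made up.
blocks = [
    (12.50, 980.00, [12.50, 430.00, 980.00]),
    (150.00, 10_000.00, [150.00, 10_000.00, 2_300.00]),
    (8_900.00, 25_000.00, [8_900.00, 25_000.00, 11_250.00]),
    (40.00, 6_400.00, [40.00, 6_400.00, 75.00]),
]

THRESHOLD = 10_000.00

matches = []
blocks_read = 0
for min_amount, max_amount, amounts in blocks:
    # The query asks for amounts strictly above the threshold, so a block
    # whose maximum is <= threshold cannot contain a match and is skipped.
    if max_amount <= THRESHOLD:
        continue
    blocks_read += 1
    # Metadata only says the block *might* match; rows are still filtered.
    matches.extend(a for a in amounts if a > THRESHOLD)

print(matches)      # [25000.0, 11250.0]
print(blocks_read)  # 1
```

Only one of the four blocks is read. The second block is skipped even though its maximum is exactly $10,000, because the query asks for values strictly above that amount.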

Why Is Data Skipping Important for Big Data?

Big data systems are often bottlenecked by the sheer volume of data they need to process. Traditional methods like full table scans can be painfully slow, especially when you’re dealing with terabytes or even petabytes of data. Data skipping helps alleviate this bottleneck by reducing the amount of data that needs to be read and processed. The result? Faster query times and more efficient use of resources.

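Another reason data skipping is crucial is that it plays nicely with modern storage formats like Parquet and ORC, which are designed to store data in a columnar format. These formats naturally lend themselves to data skipping because they already keep statistics alongside the data: Parquet records minimum and maximum values for each column in every row group, and ORC keeps similar statistics at the file, stripe, and row-group level. A query engine that understands these statistics can decide which chunks to read before touching the data itself. So, if you’re already using one of these formats, you’re halfway there!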
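
The sketch below is a toy model of that layout in plain Python, not the actual Parquet or ORC API. It assumes a small in-memory table split into row groups, with min/max kept per column in each group; the column names, the data, and both helper functions are hypothetical.

```python
from typing import Dict, List

# A tiny table stored column by column. Column names and values are made up.
table: Dict[str, List] = {
    "customer": ["ana", "bo", "cy", "di", "ed", "flo"],
    "amount": [120, 80, 950, 15_000, 12_500, 300],
}

ROWS_PER_GROUP = 2


def build_row_groups(columns: Dict[str, List], rows_per_group: int) -> List[dict]:
    """Split every column into row groups and keep min/max per column chunk."""
    n_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, n_rows, rows_per_group):
        chunks = {name: col[start:start + rows_per_group] for name, col in columns.items()}
        stats = {name: (min(chunk), max(chunk)) for name, chunk in chunks.items()}
        groups.append({"chunks": chunks, "stats": stats})
    return groups


def customers_with_amount_over(groups: List[dict], threshold: int) -> List[str]:
    """Rough equivalent of: SELECT customer WHERE amount > threshold."""
    result = []
    for group in groups:
        _, max_amount = group["stats"]["amount"]
        if max_amount <= threshold:
            continue  # decided from metadata alone; no column chunk is read
        amounts = group["chunks"]["amount"]
        customers = group["chunks"]["customer"]
        result.extend(c for c, a in zip(customers, amounts) if a > threshold)
    return result


row_groups = build_row_groups(table, ROWS_PER_GROUP)
print(customers_with_amount_over(row_groups, 10_000))  # ['di', 'ed']
```

Because each column chunk carries its own statistics, the filter on `amount` is answered from metadata first, and the `customer` values are only read for the row groups that survive.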

When Should You Use Data Skipping?

Data skipping isn’t a silver bullet for every big data problem, but it shines in certain scenarios. It’s particularly useful when you have large datasets and each query needs only a small slice of them. Layout matters too: min/max metadata is most useful when the data is sorted or clustered on the column you filter by, so that each block covers a narrow range of values. For example, if you’re running analytical queries that only need to focus on a subset of your data, data skipping can significantly speed things up.

However, if your queries tend to touch most of the data in your dataset, the benefits of data skipping might be less noticeable. In these cases, other optimization techniques like partitioning or indexing might be more effective. But for selective queries, data skipping can be a game-changer.
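
A rough way to see the difference is to count how many blocks a min/max check can rule out for a selective query versus a broad one. The sketch below uses synthetic integers and a hypothetical `fraction_skipped` helper; it also compares sorted and shuffled storage order, on the assumption that blocks are written in the order the data arrives.

```python
import random


def fraction_skipped(values, block_size, low, high):
    """Share of blocks whose [min, max] range misses the query range [low, high]."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    skipped = sum(1 for b in blocks if max(b) < low or min(b) > high)
    return skipped / len(blocks)


rng = random.Random(0)               # fixed seed so runs are repeatable
sorted_values = list(range(10_000))  # values clustered in storage order
shuffled_values = sorted_values[:]
rng.shuffle(shuffled_values)         # same values, scattered across blocks

selective = (5_000, 5_099)  # matches about 1% of the values
broad = (500, 9_499)        # matches about 90% of the values

for name, values in [("sorted", sorted_values), ("shuffled", shuffled_values)]:
    for label, (low, high) in [("selective", selective), ("broad", broad)]:
        share = fraction_skipped(values, 100, low, high)
        print(f"{name:8s} {label:9s} skipped {share:.0%} of blocks")
```

On the sorted data, the selective query skips 99 of the 100 blocks, while the broad query skips only 10. On the shuffled data, nearly every block spans most of the value range, so the metadata rules out little or nothing for either query.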

In short, data skipping is a powerful tool in the big data toolbox, especially when used in the right context. It’s not a one-size-fits-all solution, but when applied correctly, it can drastically improve performance and efficiency.

As Geoffrey Moore once said, “Without big data, you are blind and deaf and in the middle of a freeway.” But with techniques like data skipping, you can at least make sure you’re not stuck in traffic.
