Why Your Data Warehouse Benchmarks Might Be Misleading - And Why Cloud Changes Everything

This blog post summarizes and analyzes the research paper “Cloud Analytics Benchmark” by van Renen, A., and Leis, V. (2023), published in PVLDB 16(6), 1413–1425.

Most organizations rely on benchmark results when comparing data warehouse systems. But what if the benchmark itself no longer reflects how modern cloud systems actually behave? This paper confronts that assumption head-on. It argues that traditional benchmarking can overlook critical components of cloud analytics: multi-tenancy, elastic resource allocation, workload bursts, and cost.

Why traditional benchmarks fall short

For years, TPC-H, TPC-DS, and similar benchmarks have been an important part of analytical database performance evaluation. They are good at measuring query engine behavior in isolation. The paper argues, however, that they were designed for an older model of computing in which hardware is fixed, workloads are relatively stable, and a single database system runs on its own.

Cloud data warehouses follow a very different operating model: compute resources can scale up or down, multiple tenants share infrastructure, and users care about workload cost as well as execution speed. As a result, a runtime-only benchmark gives a partial and potentially misleading view of system quality.

Takeaway: in the cloud era, the best data warehouse is not simply the fastest one. It is the one that delivers strong performance while using resources efficiently and controlling cost.

How cloud changes the rules of database systems

Traditional database research often assumes that the system runs on a given hardware setup and the main objective is to optimize execution time. Cloud computing changes that assumption. Storage and compute are often separated, resources are rented on demand, and users no longer buy hardware upfront. Instead, they pay for usage.

That change affects the very idea of optimization. A query that runs a little faster but uses far more resources may not be better from the user’s perspective. Performance in a cloud setting therefore has to be evaluated together with cost, workload variability, and the system’s ability to adapt to changing demand.

Important shift: cloud database systems are judged not just by speed, but by cost-efficient performance under dynamic and shared workloads.
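To make the speed-versus-cost trade-off concrete, here is a minimal sketch of the arithmetic. The node counts, runtimes, price, and the assumption that billing is proportional to node-seconds are all made up for illustration; they are not figures from the paper or from any vendor's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One hypothetical execution of the same query on a rented cluster."""
    label: str
    nodes: int              # number of compute nodes rented
    runtime_seconds: float  # wall-clock time of the query


def run_cost(run: RunResult, price_per_node_hour: float) -> float:
    """Cost in dollars, assuming billing proportional to node-seconds used."""
    node_hours = run.nodes * run.runtime_seconds / 3600.0
    return node_hours * price_per_node_hour


PRICE = 2.0  # assumed price per node-hour, purely illustrative

small = RunResult("4 nodes", nodes=4, runtime_seconds=120.0)
large = RunResult("16 nodes", nodes=16, runtime_seconds=90.0)

for run in (small, large):
    print(f"{run.label}: {run.runtime_seconds:.0f} s, ${run_cost(run, PRICE):.3f}")
```

Under these assumed numbers the larger cluster finishes 25% sooner but costs three times as much, which is the kind of outcome a runtime-only metric would report as a clear win.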

What real cloud workloads actually look like

The best thing about this paper is that its argument does not rest on theory alone. The authors analyze Snowset, a real-world telemetry dataset released by Snowflake. It covers millions of real queries, which allows synthetic benchmarks to be contrasted with actual cloud warehouse behavior.

The results are revealing. Real workloads are much more diverse than classic benchmarks suggest. Query sizes vary greatly, workloads fluctuate over time, and a large number of operations are not just read-only analytics. Many queries both read and write data, which points to modern ELT-style data transformation happening inside the warehouse itself.

Why this matters: a benchmark that ignores multi-user activity, cloud elasticity, and read/write transformation workloads cannot fully represent how real analytics platforms are used.

What makes real workloads complex

The Snowset analysis shows that cloud analytics workloads are not just big; they are lopsided. The great majority of queries are relatively small and finish quickly, while a very small number of large queries consume most of the CPU time and system resources. That makes scheduling and resource allocation harder than a benchmark baseline would suggest.

The authors also show that many workloads are highly variable over time. Some tenants generate steady activity, while others create short bursts of very intense demand. This kind of intermittency is exactly the kind of behavior cloud systems are expected to handle well, but traditional benchmarks rarely model it.

Few large queries consume a major share of resources
Many small queries create noisy, irregular demand
Read/write workloads suggest in-warehouse ETL and transformation
Different tenants behave very differently across time and scale

This is an important lesson for analytics systems: real-world usage is heterogeneous, and any useful evaluation method must capture that heterogeneity rather than average it away.
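The effect of averaging away a lopsided workload can be seen in a small simulation. This sketch does not use Snowset data: it assumes a Pareto distribution as a stand-in for heavy-tailed query sizes, and the shape parameter and sample size are arbitrary choices.

```python
import random


def top_share(cpu_seconds, fraction):
    """Share of total CPU time consumed by the largest `fraction` of queries."""
    ordered = sorted(cpu_seconds, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: synthetic heavy-tailed CPU times, not real telemetry.
cpu_seconds = [rng.paretovariate(1.2) for _ in range(100_000)]

median = sorted(cpu_seconds)[len(cpu_seconds) // 2]
mean = sum(cpu_seconds) / len(cpu_seconds)
share = top_share(cpu_seconds, 0.01)

print(f"median CPU time per query: {median:.2f} s")
print(f"mean CPU time per query:   {mean:.2f} s")
print(f"CPU share of the top 1% of queries: {share:.1%}")
```

With a distribution like this the mean sits well above the median, so a single average describes neither the typical small query nor the rare large one.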

Introducing the Cloud Analytics Benchmark (CAB)

To respond to these limitations, the paper introduces the Cloud Analytics Benchmark, or CAB. CAB is built specifically for cloud data warehouse environments. Instead of using one isolated dataset and one simplified run pattern, it models multiple tenants, different database sizes, and variable query arrival behavior.

Feature	What it captures
Multi-tenancy	Multiple users and databases running simultaneously
Elasticity	Dynamic scaling of compute resources
Workload patterns	Bursty, periodic, and variable query arrivals
Metrics	Query latency and total monetary cost

CAB uses TPC-H as its base because TPC-H is well known and easy to reproduce, but the benchmark changes how that workload is organized. Instead of back-to-back execution on a single dataset, CAB creates multiple databases and separate query streams, each representing a different tenant in a cloud environment.

This is one of the most meaningful contributions of the paper. It keeps benchmarking practical while still moving much closer to how cloud analytics services are actually used.

How CAB models real-world behavior

CAB does not simply replay a fixed list of SQL statements. It generates databases of different sizes, assigns them different workload intensities, and simulates realistic arrival patterns. Some tenants are modeled as having steady workloads, others as having periodic bursts, and still others as experiencing sudden spikes or daily jobs.

This is a valuable design decision because cloud systems are often evaluated on their ability to adapt to demand fluctuations. CAB therefore moves benchmarking away from a narrow “how fast is the query engine?” question and toward a broader “how well does the entire service handle real customer behavior?” question.

Practical interpretation: CAB treats a cloud data warehouse as a service system, not just a query processor.
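The arrival patterns described above can be sketched as follows. This is not CAB's actual generator: the function names, tenant names, scale factors, and rates are hypothetical, and Poisson arrivals are an assumption chosen only to keep the example short.

```python
import random


def steady(rng, duration_s, rate_per_s):
    """Poisson arrivals at a constant average rate over [0, duration_s)."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap between queries
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def periodic_bursts(rng, duration_s, rate_per_s, period_s, burst_s):
    """Arrivals only during the first burst_s seconds of every period."""
    return [t for t in steady(rng, duration_s, rate_per_s)
            if t % period_s < burst_s]


def single_spike(rng, rate_per_s, start_s, length_s):
    """One short window of intense activity, silent otherwise."""
    return [start_s + t for t in steady(rng, length_s, rate_per_s)]


rng = random.Random(7)
HOUR = 3600.0

# Hypothetical tenants: (database scale factor, arrival times in seconds)
tenants = {
    "steady_small": (1, steady(rng, 4 * HOUR, rate_per_s=0.02)),
    "hourly_batch": (10, periodic_bursts(rng, 4 * HOUR, 0.5, HOUR, 300.0)),
    "adhoc_spike": (100, single_spike(rng, 1.0, start_s=2.5 * HOUR, length_s=120.0)),
}

for name, (scale, arrivals) in tenants.items():
    per_hour = [0] * 4
    for t in arrivals:
        per_hour[int(t // HOUR)] += 1
    print(f"{name:13s} scale={scale:<4d} queries per hour: {per_hour}")
```

Each tenant would then draw its queries from the TPC-H query set against its own database, so the combined stream the system sees is the overlap of very different demand curves.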

What the experiments reveal

The experimental section is especially useful because it shows how CAB can reveal trade-offs that traditional benchmarks may hide. The authors test different Snowflake configurations and compare provisioned versus serverless behavior. Their results show that cluster sizing is not straightforward, and simply adding more resources does not always improve latency proportionally.

They also show that shared-resource designs and per-tenant designs each have strengths and weaknesses. Shared clusters may benefit large workloads, while smaller tenants can experience worse latency when they compete with more demanding workloads. In other words, architecture choices affect different users differently.

Small clusters can become overloaded and create large queues
Larger clusters improve latency up to a point, but returns diminish
Shared resources may disadvantage smaller or query-heavy tenants
Serverless systems can improve some performance outcomes, but sometimes at significantly higher cost

This is exactly why the paper’s choice of metrics is so important. A system that looks good on raw speed alone may be far less attractive once cost and queuing behavior are included.
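A toy queueing simulation shows why queuing and cost belong next to raw speed. It is a sketch under strong assumptions, not a reproduction of the paper's experiments: a cluster is modeled as a number of parallel execution slots, queries are served first-come first-served, adding slots adds parallelism but does not speed up individual queries, and the price is invented.

```python
import heapq
import random


def simulate(arrivals, service_times, slots):
    """FIFO queue with `slots` parallel slots; returns per-query latency."""
    free_at = [0.0] * slots  # min-heap of times at which each slot is free
    heapq.heapify(free_at)
    latencies = []
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, heapq.heappop(free_at))  # wait for earliest slot
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrive)  # queueing delay plus execution
    return latencies


rng = random.Random(1)
n = 2000
arrivals, t = [], 0.0
for _ in range(n):
    t += rng.expovariate(1.0)  # about one query per second on average
    arrivals.append(t)
service_times = [rng.expovariate(1 / 3.0) for _ in range(n)]  # mean 3 s

PRICE_PER_SLOT_SECOND = 0.001  # assumed price, purely illustrative
mean_latency = {}
for slots in (2, 4, 8, 16):
    lat = simulate(arrivals, service_times, slots)
    mean_latency[slots] = sum(lat) / n
    # Assumption: every slot is billed until the last query finishes.
    makespan = max(a + l for a, l in zip(arrivals, lat))
    cost = slots * makespan * PRICE_PER_SLOT_SECOND
    print(f"{slots:2d} slots: mean latency {mean_latency[slots]:8.1f} s, "
          f"cost ${cost:.2f}")
```

The offered load here is about three slots' worth of work, so two slots cannot keep up and the queue grows throughout the run, while slots beyond the first few mostly sit idle and add cost for little latency benefit.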

Insight: the “best” cloud warehouse is not necessarily the fastest. It is the one that reaches the best balance between latency, scalability, and cost.

Why this paper matters for analytics

This paper is highly relevant to analytics because cloud data warehouses now sit at the center of many modern data ecosystems. Dashboards, reporting systems, ELT pipelines, and large-scale business intelligence workflows increasingly rely on Snowflake, BigQuery, Redshift, and similar platforms.

If those systems are evaluated using outdated assumptions, organizations may make poor architectural or financial decisions. This paper therefore matters not only to researchers, but also to engineers and companies trying to choose the right platform for large-scale analytics.

It also connects strongly to course topics. It is directly about data warehouse systems, workload behavior, benchmarking, and cloud-native analytics. It additionally touches NoSQL-related ideas through the broader discussion of cloud architectures and modern distributed systems, even though its main contribution is centered on analytical SQL data warehouses.

Relation to our project and coursework

This paper is directly relevant to our coursework because it shows how modern analytical systems should be evaluated in realistic settings. For a project involving data warehousing, query workloads, or cost-performance optimization, CAB provides a strong conceptual foundation. It encourages us to think beyond raw performance and consider how systems behave under multi-user, cloud-based workloads.

In practical terms, the paper is useful because it links benchmarking to real operational concerns: tenant behavior, workload spikes, and the cost of running analytics at scale. It therefore helps students bridge academic research and real system design.

Critical evaluation of the paper

Although the paper is strong and well motivated, it is not without limitations. One concern is that CAB is heavily informed by Snowflake’s publicly released telemetry dataset. That gives the benchmark realism, but it also creates the risk of overfitting the design to one specific platform’s workload characteristics.

Another limitation is that the benchmark still relies on TPC-H derived queries rather than actual user SQL. This makes implementation practical, but it means CAB is still a hybrid approach: more realistic than traditional benchmarks, but not a full replay of real-world workloads.

The evaluation is also illustrative rather than exhaustive. The paper demonstrates CAB on selected systems and configurations, but broader comparisons across more vendors and more recent traces would make the benchmark even stronger.

Possible bias toward Snowflake-like workload patterns
No real SQL text in the source dataset, only telemetry
Limited cross-platform evaluation in the experiments

Balanced view: CAB is a major improvement over traditional benchmarking, but it should be seen as a strong step forward rather than the final answer to cloud workload evaluation.

Key learnings from the paper

1. Real-world cloud workloads are far more heterogeneous than classic analytical benchmarks suggest.

2. Cloud benchmarking must consider cost as well as performance.

3. Multi-tenancy and elastic scaling are not side issues; they are central to modern data warehouse design.

4. Real systems include significant ETL and read/write transformation activity inside the warehouse.

5. Benchmarking a cloud warehouse as a full service leads to better questions than benchmarking only its query engine.

How the work can be enhanced further

One of the strengths of the paper is that it naturally opens the door for future improvements. The most obvious extension would be to incorporate additional real-world workload traces from systems beyond Snowflake. This would reduce the risk of vendor-specific bias and make CAB more broadly representative.

1. Add more workload sources

Using traces from multiple cloud data warehouses would make the benchmark more representative and reduce platform-specific assumptions.

2. Extend to streaming and real-time analytics

Modern analytics increasingly includes near-real-time data flows. Future versions of CAB could include freshness-sensitive and streaming-style workloads.

3. Include ML-oriented and specialized tenants

Cloud warehouses increasingly support machine learning and advanced data science tasks. Adding such workloads would improve the benchmark’s relevance to modern practice.

Final reflection

What makes this paper significant is not simply that it proposes another benchmark. It reframes the benchmarking problem for the cloud era. It argues that a cloud data warehouse should be evaluated as a dynamic, multi-tenant, cost-sensitive service rather than as a static query engine running on fixed hardware.

That is an important shift in perspective. It helps explain why old benchmarks are no longer enough, and why database systems research must evolve alongside modern deployment models. The paper successfully shows that realistic benchmarking is not just a technical detail; it is essential for fair comparison, good system design, and practical decision-making.

In modern analytics systems, true performance is not just about speed.
It is about how intelligently a system handles real workloads, real users, and real cost.
