Data Warehouse in the Cloud Benchmark

1. Summary

Product Profile and Evaluation: Microsoft Azure SQL Data Warehouse and Amazon Redshift

Data-driven organizations rely on analytic databases to load, store, and analyze volumes of data at high speed to derive timely insights. This benchmark study focuses on the performance of cloud-enabled, enterprise-ready, relationally based, analytical workload solutions from Microsoft Azure SQL Data Warehouse and Amazon Redshift.

For our benchmark, we compare the performance of SQL Data Warehouse and Amazon Redshift using a workload derived from the well-recognized industry standard TPC Benchmark™ H (TPC-H) [1]. The intent of the benchmark’s design was to simulate a set of basic scenarios to answer fundamental business questions and report business outcomes relevant to many industry sectors. In our opinion as field practitioners, this benchmark is a fair representation of the majority of queries that are run on data warehouses, excluding the low-end basic queries and the high-end data science queries.

The benchmark tested the scalability of corporate-complex workloads in terms of data volume with 30TB of data. The testing was conducted using as similar a configuration as can be achieved across Azure and Amazon Web Services (AWS) offerings.

In our benchmark, all 22 queries of the TPC-H (with only necessary platform-specific syntax changes) were executed serially as a single set three (3) times (after a warmup). The individual times of each query were added, and the set with the fastest overall run of the set was chosen for comparison (in aggregate and for individual query comparisons).

The benchmark results were insightful in revealing the query execution performance of Azure SQL Data Warehouse and Redshift and some of the differentiators in the two products. Azure SQL DW outperformed Redshift in 56 of the 66 queries ran. Overall, the performance advantage was 1.67 times faster.

Please note these results are as of July 2018. Future enhancements to either platform could shift the results.

[1] Due to the cloud configuration, performance data from this benchmark cannot be compared with published official TPC-H results, which only have published results for on-premise configurations.

2. Cloud Analytics Platform Offerings

Big data analytics platforms load, store, and analyze volumes of data at high speed, providing timely insights to businesses. This data can be relational and structured, semi-structured, or unstructured from a variety of sources. Data-driven organizations are leveraging this data, for example, for advanced analysis to market new promotions, operational analytics to drive efficiency, and predictive analytics to evaluate credit risk and detect fraud. Customers are leveraging a mix of relational analytical databases and data warehouses to gain analytic insights.

This paper focuses on relational analytical databases in the cloud because deployments in the cloud are at an all-time high and poised to expand dramatically. The cloud offers opportunities to differentiate and innovate with these database systems at a much more rapid pace than was possible ever before. The cloud has been a disruptive technology by offering elastic scalability vis-à-vis on-premise deployments, enabling faster server deployment and application development, less costly storage. For these reasons and others, many companies have leveraged the cloud to maintain or gain momentum as a company.

This paper focuses on benchmarking Azure SQL Data Warehouse and Amazon Redshift, two relational analytical databases based on scale-out cloud data warehouses and columnar-based database architectures. Despite these similarities, there are some distinct differences in the two platforms.

In terms of storage, Amazon Redshift has a tight coupling between compute nodes and storage; data is stored on the compute nodes with a backup on AWS S3. While tight coupling keeps the data in close proximity to compute resources, it creates difficulty when scaling. Changing the instance size requires data movement to repartition and redistribute data on the new topology. Also, when a Redshift cluster is shut down, it is actually deleted. In fact, this is the only way to pause compute billing. Upon restoration, the local storage on the compute nodes has to be loaded from the external backup on S3, which involves data movement once again.

The largest drawback to this tight coupling of compute and storage is the rigid data volume sizes of Redshift nodes. At the enterprise class, Redshift dense compute nodes (dc2.8xlarge) have 2.56TB per node of solid state drives (SSD) local storage. Their dense storage nodes (ds2.8xlarge) have 16TB per node, but it is on spinning hard disks (HDD) with slower I/O performance. Given those two options and two much smaller configuration options (which in our opinion are not large enough for enterprise scale), Redshift has limited ways to configure and balance storage and compute. Customers who are happy with their Redshift compute resources but need more storage must purchase additional compute nodes to increase storage. This can become costly as data volumes expand. For example, if a Redshift customer has a 10-node dc2.8xlarge cluster, they have a hard limit of 25.6TB compressed. At current pricing, this would cost $48.00 per hour. If they needed to double their storage capacity, they would have to add 10 more nodes, doubling their cost to $96.00 per hour.

On the other hand, Azure SQL Data Warehouse storage is separate from compute, and allows data to grow independently of the compute Data Warehouse Unit (DWU).  This allows Azure SQL Data Warehouse to have potentially unlimited columnar storage capacity without changing the compute resources. Inversely, Azure SQL Data Warehouse can have additional compute power added to a cluster on the fly. This capability allows for adjustment to various workload demands, giving the potential for cost savings when demand is low. Azure SQL Data Warehouse can pause and resume capability that stops the compute billing, where only the storage is billed during the paused time.

Finally, although concurrency was not tested in the benchmark, Azure SQL Data Warehouse supports 128 concurrent queries. This is more than twice as many as Amazon Redshift, which supports a maximum concurrency of 50 across queue defined by their workload Management feature.

Platform Summary

Azure SQL Data Warehouse

Redshift

Company Microsoft Amazon
First Released 1989 (SQL Server)

2005 (Analytics Platform System)

2016 (Azure SQL DW)

2014
Current Version 12.0.2000.8 1.0.2294
Compute-Storage Separate Local/Tightly Coupled
SQL Transact-SQL PostgreSQL 8
Scale Out Cloud Data Warehouse
Columnar
AWS Cloud
Azure Cloud

3. Benchmark Setup

The benchmark was executed using the following setup, environment, standards, and configurations.

Benchmark Data

The data sets we used in the benchmark werea workload derived from the well-recognized industry standard TPC Benchmark™ H (TPC-H) [2].

To show the data model, the following diagram was taken from page 13 of the TPC-H Revision 2.17.3 specification document.

To give an idea of the data volumes used in our benchmark, the following table gives row counts of the database when loaded with 30TB of TPC-H data:

TPC-H Table 30TB Row Count
Customer 4,500,000,000
Line Item 180,000,000,000
Orders 45,000,000,000
Part 6,000,000,000
Supplier 300,000,000
PartSupp 24,000,000,000

Cluster Environments

Our benchmark included six (6) different cluster environments—three each for Azure SQL Data Warehouse and Amazon Redshift, scaled at 10, 30, and 60 nodes.

Queries

We sought to replicate the TPC-H Benchmark queries modified only by syntax differences required by both platforms. The benchmark is a fair representation of enterprise query needs.

The TPC-H testing suite has 22 queries, which are described by the table below. The SQL operations used in each query are noted.

Q# Description Sum Sub-
query
Join Min/
Max
Avg Count Top/
Limit
1 Pricing Summary Report
2 Minimum Cost Supplier
3 Shipping Priority
4 Order Priority Checking
5 Local Supplier Volume
6 Forecasting Revenue Change
7 Volume Shipping
8 National Market Share
9 Product Type Profit Measure
10 Returned Item Reporting
11 Important Stock Identification
12 Shipping Modes and Order Priority
13 Customer Distribution
14 Promotion Effect
15 Top Supplier
16 Parts/Supplier Relationship
17 Small Quantity Order Revenue
18 Large Volume Customer
19 Discounted Revenue
20 Potential Part Promotion
21 Suppliers Who Kept Orders Waiting
22 Global Sales Opportunity

[2] More can be learned about the TPC-H benchmark at http://www.tpc.org/tpch/.

4. Benchmark Results

This section analyzes the query results from the best run of the three sets of 22 TPC-H queries. The difference in the three query runs was usually minimal (< 5%).

Legend: 

Query 1 is the only query that uses only Sum, Average, and Count. Azure SQL DW was progressively faster as nodes were added, with performance ranging from 1.93 to 2.81 times faster than Redshift. Query 2 was one of two queries that contained a Min/Max. Azure SQL DW was, on average, nearly twice as fast as Redshift.
TPC-H query 3 was close to the overall average performance advantage that Azure SQL DW had over Redshift in all the queries (1.49 and 1.67 times faster). Query 4, with a subquery and a count, had the best relative query performance for Azure SQL DW, outperforming Redshift by nearly 5 times on average across the three-node configurations.
Query 5, which only employs a sum aggregation, favored Azure SQL DW as well. The simple SUM of Query 6 was another high relative performer for Azure SQL DW, outperforming Redshift by over 4 times on average across the three-node configurations.
Although Azure SQL DW outperformed Redshift by nearly 2 times on average across the three-node configurations, at 60 nodes Redshift had a 20% better query performance. For Query 8, Redshift was 21% faster at 10 nodes, and Azure SQL DW was 11% and 5% faster at 30 and 60 nodes respectively. This query performance was one of the closest Redshift got to Azure SQL DW performance, which was only 32% faster on average across the three-node configurations.
In Query 9 (sub-select and a sum operation) Azure SQL DW’s performance was roughly twice as fast. Query 10 uniquely has a Sum and a Top/Limit. The query was close to the overall average performance advantage that Azure SQL DW had over Redshift in the queries (1.57 to 1.67 times faster).
Query 11 (another sub-select and a sum operation) saw performance favoring Azure SQL DW. Query 12, another simple SUM, was another plus performance query for Azure SQL DW, with Azure SQL DW outperforming Redshift by over 2 times on average across the three-node configurations.
Query 13 is the only TPC-H query with an explicit JOIN. Azure SQL DW was twice as fast on average. For Query 14, Redshift was 28% faster at 10 nodes, and Azure SQL DW was 23% and 37% faster at 30 and 60 nodes respectively. This query was one of the closest Redshift got to Azure SQL DW performance although Azure SQL DW was 44% faster on average across the three-node configurations.
Query 15 showed Azure SQL DW to be roughly 20% faster across all node counts. For Query 16, Redshift outperformed Azure SQL DW across all node configurations, nearly doubling Azure SQL DW performance.
Conversely, Azure SQL DW nearly doubles Redshift performance across the board for Query 17. Query 18 saw mixed results with Azure SQL Server leading in 2 out of 3.
Query 19 was another strong performer for Azure SQL DW, with 3.7, 3.0 and 5.2 times Redshift performance across the node configurations. Query 20 showed Azure SQL DW to be twice as fast or faster.
Query 21 showed Azure SQL DW between 30% and 50% faster. Query 22 was close but Redshift outperformed Azure SQL DW in 2 of the 3 configurations.

Overall, the relative performance advantage of Azure SQL DW over Redshift did not change much based on whether the node count was 10, 30, or 60. Azure SQL DW was 1.70, 1.83, and 1.67 times as fast on average on 10, 30, and 60 nodes, respectively. Overall, the performance advantage was 1.67 times faster.

Azure SQL DW outperformed Redshift in 56 of the 66 queries, while Redshift outperformed Azure SQL DW in 4 (of 22) queries at 10 nodes, 2 (of 22) queries at 30 nodes and 4 (of 22) queries at 60 nodes.

Specific elements in the query produced the following performance advantages for Azure SQL DW:

SQL Element Performance Advantage (Azure SQL DW x faster)
Min/Max 1.14
Average 1.97
Count 1.36
Top/Limit 1.44

5. Price Per Performance

To measure price-performance, we used the TPC-H metric of cost of ownership divided by composite query per hour. If you contiguously ran all 22 of these queries on the platforms to completion of the set, the cost at an hourly basis is indicated in the chart below. Azure SQL DW provided consistently better price-performance in our queries.

 

6. Conclusion

Cloud databases are a way to avoid large capital expenditures, provision quickly, and provide performance at scale for advanced analytic queries in the enterprise. Relational databases with analytic capabilities continue to support the advanced analytic workloads of the organization with performance, scale, and concurrency. In a representative set of corporate-complex queries from the well-known TPC-H, Azure SQL Data Warehouse consistently outperformed Redshift.

Overall, the benchmark results were insightful in revealing the query execution performance and some of the differentiators for Azure SQL Data Warehouse and Redshift, with Azure SQL Data Warehouse query response times on the 30TB data set running 1.67 times as fast.

In terms of cost, we found Azure SQL DW to be about 25% less expensive as a platform for running the query set.

Overall, Azure SQL Data Warehouse is an excellent choice for companies needing a high-performance and scalable analytical database in the cloud or to augment the current, on-premises offering with a hybrid architecture at a reasonable cost.

7. About William McKnight

William McKnight is a former Fortune 50 technology executive and database engineer. An Ernst & Young Entrepreneur of the Year finalist and frequent best practices judge, he helps enterprise clients with action plans, architectures, strategies, and technology tools to manage information.

Currently, William is an analyst for GigaOm Research who takes corporate information and turns it into a bottom-line-enhancing asset. He has worked with Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals, and Verizon, among many others. William focuses on delivering business value and solving business problems utilizing proven approaches in information management.

8. About Jake Dolezal

Jake Dolezal is a contributing analyst at GigaOm. He has two decades of experience in the information management field, with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has solved technical problems across a broad range of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.

9. About GigaOm

GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.

GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.

GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.

10. Copyright

© Knowingly, Inc. 2018 "Data Warehouse in the Cloud Benchmark" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.