1. Summary
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multifaceted needs by offering multifunction data management and analytics that solve the enterprise's most pressing data and analytics challenges in a streamlined fashion. They also need a platform choice that allows a worry-free experience with the architecture and its components.
The chosen platform should bring a multitude of data services onto a single, cohesive space. A key differentiator among platforms is the overarching management, deployment, governance, billing, and security of those services, which can reduce complexity in administration and scaling data pipelines. As more components are added, and more integration points among those components arise, complexity will increase substantially. Greater complexity will lead to more technical debt and administrative burden as organizations cobble together and maintain the flow of data between point solutions.
We analyzed four leading platforms for machine learning. We learned that the cloud analytics framework selected for an enterprise, and for each enterprise project, matters in terms of cost.
By looking at the problem from a cost perspective, we’ve learned to be wary of architectures that decentralize and decouple every component by business domain, which enables flexibility in design, but blows up the matrix of enterprise management needs.
Some architectures look integrated but, in reality, may be more complex and more expensive. When almost every additional demand of performance, scale, or analytics can only be met by adding new resources, it gets expensive.
Based on our approach described in the next section, and using the assumptions listed in each section mimicking a medium enterprise application, Azure was the lowest-cost platform. It had a three-year cost of $3M to purchase the analytics stack for a “medium-size” organization. AWS was 19% higher, while Google and Snowflake were more than double the cost.
Highlights of the Azure stack include Azure Synapse Dedicated, Azure Synapse SQL Pool, Azure Data Factory, Azure Stream Analytics, Azure Synapse Spark, Azure Synapse Serverless, Power BI Professional, Azure Machine Learning, Azure Active Directory P1, and Azure Purview.
The AWS stack includes Amazon Redshift ra3, Amazon Redshift Managed Storage, AWS Glue, Amazon Kinesis, Amazon EMR + Kinesis, Amazon Redshift Spectrum, Amazon Quicksight, Amazon SageMaker, Amazon IAM, and AWS Glue Data Catalog.
The Google stack is Google BigQuery Annual Slots, Google BigQuery Active Storage, Google Dataflow Batch, Google Dataflow Streaming, Google Dataproc, Google BigQuery On-Demand, Google BigQuery BI Engine, Google BigQuery ML, Google Cloud IAM, and Google Data Catalog.
We labeled the fourth stack Snowflake, since that is the featured vendor for dedicated compute, storage, and data lake, but it is really a multi-vendor heterogeneous stack. This includes Snowflake database, AWS Glue, Kafka Confluent Cloud, Amazon EMR + Kinesis, Tableau, Amazon SageMaker, Amazon IAM, and AWS Glue Data Catalog.
Azure was also the lowest-cost platform for large enterprises, at roughly $9M over three years. AWS was 32% higher, while Google and the Snowflake stack were more than two times higher.
Dedicated compute is the largest configuration cost, ranging from 54% for the AWS stack to 78% for the Google stack. Data integration is second in all stacks.
A three-year total cost of ownership analysis for medium enterprises, which includes labor costs, reveals that Azure is the platform with the lowest cost of ownership at $3 million. AWS is at $4 million, Google at $7.6 million, and Snowflake is $8 million. For large enterprises, Azure three-year TCO is $8.5 million, AWS $12.3 million, Google $19.2 million, and Snowflake $22 million. (Figure 1)
Figure 1. Three-Year Total Cost of Ownership for Each Platform
2. Modernizing Your Use Case
Total cost of ownership is calculated, formally or informally, for many enterprise programs, and it is happening more frequently than ever. Well-meaning teams sometimes use TCO projections to justify a program, but measuring the actual TCO after the fact can be a daunting experience, especially if the original justification was entered into lightly.
Perils of TCO measurement aside, enterprise applications should be attaining high returns. However, if an application is not implemented to a modern standard, using a machine learning platform as described herein, it carries large inefficiencies and competitive gaps in functionality. Many enterprises are therefore considering leveling up or migrating these use cases now and reaping the benefits.
This paper will focus on the platform costs for medium- and large-sized configurations, broken down by category, across four major platforms: Azure, Snowflake, Amazon, and Google.
The categories, or components in a modern enterprise analytics stack, we included in our TCO calculations are as follows:
- Dedicated compute
- Storage
- Data integration
- Streaming
- Spark analytics
- Data lake
- Business intelligence
- Machine learning
- Identity management
- Data catalog
A performance test using the Gigaom Analytic Field Test queries (derived from the TPC-DS) was used to establish equivalency for our pricing and help determine the medium- and large-sized configurations of the four platforms.
Since each platform prices its services differently (with no way to align on hardware specifications), we did our best to align on overall price and not give any of the four platforms a bottom-line advantage in this category. For time-based pricing (e.g., per hour), we assumed 24/7/365 operations. We leave it to the reader to judge the applicability of our decisions.
In addition to these configuration components, the labor cost factors for the following functions are estimated using our cost multipliers for migrating to data warehouse ecosystems on AWS, Azure, Google, and Snowflake:
- Data migration
- ETL integration
- Analytics migration
- Ongoing support and continuous improvement
We then rated each of the platforms across complexity of maintenance, complexity of setup, and complexity of operation and administration. We came up with a support and improvement cost to arrive at our final three-year total cost of ownership figures for the study.
3. Performance Comparison
A performance test was used to establish equivalency for our pricing of the four platforms. The GigaOm Analytic Field Test, used across all vendors, was designed to emulate the TPC Benchmark™ DS (TPC-DS) and adhered to its specifications. This was not an official TPC benchmark. The queries were executed using the setup, environment, standards, and configurations described below. For more details on how the testing was conducted, see the Appendix.
Field Test Results
This section analyzes the query results from the fastest of the three runs of the GigaOm Analytic Field Test queries (derived from the TPC-DS) described in the Appendix. The primary metric used was the aggregate total of the best execution times for each query. Three power runs were completed. Each of the 103 queries (99 plus part 2 for 4 queries) was executed three times in order (1, 2, 3, … 98, 99) against each vendor cloud platform, and the overall fastest of the three times was used as the performance metric. These best times were then added together to obtain the total aggregate execution time for the entire workload.
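To make the aggregation concrete, the sketch below (with purely hypothetical run times, since the measured values are not reproduced here) shows how the best-of-three times roll up into the aggregate metric:

```python
# Minimal sketch of the primary metric: keep each query's best time across
# the three power runs, then sum those best times for the whole workload.
# The run times below are hypothetical placeholders, not measured results.

run_times = {
    "query_01": [12.4, 11.9, 12.1],   # seconds for runs 1, 2, 3
    "query_02": [45.0, 44.2, 44.8],
    # ... one entry per query, 103 in total
}

best_per_query = {q: min(times) for q, times in run_times.items()}
total_aggregate_seconds = sum(best_per_query.values())
print(f"Aggregate of best execution times: {total_aggregate_seconds:.1f} s")
```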
As previously mentioned, the best total aggregate execution time was taken for the entire workload. The chart in Figure 2 shows the overall performance of each platform in terms of total time it took to execute the entire set of 103 queries in the GigaOm Analytic Field Test.
Figure 2. Analytic Field Test 30 TB Execution Times in Seconds (lower is better)
Price per Performance
The next step was to determine the price per performance. Our goal was to have price performance be as close as possible across all four platforms. This would allow us to choose the right platform size to include in our TCO calculations.
The price-performance metric is dollars per query-hour ($/query-hour). This is defined as the normalized cost of running the GigaOm Analytic Field Test workload on each cloud platform. It was calculated by multiplying the best on-demand hourly rate (expressed in dollars) offered by the cloud platform vendor at the time of testing by the number of compute nodes used in the cluster, and dividing that amount by the aggregate total of the best execution times for each query (expressed in hours).
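A minimal sketch of that calculation, using purely illustrative rate, node count, and runtime values (not the measured or priced figures from this study), looks like this:

```python
# Price per performance ($/query-hour) as defined above:
# (on-demand hourly rate x number of nodes) / aggregate best execution time in hours.
# All three inputs below are illustrative assumptions.

on_demand_rate_per_node_hour = 8.00     # hypothetical $ per node-hour
node_count = 32                         # hypothetical cluster size
total_best_exec_seconds = 5_400         # hypothetical aggregate of best query times

total_best_exec_hours = total_best_exec_seconds / 3600
price_per_performance = (on_demand_rate_per_node_hour * node_count) / total_best_exec_hours
print(f"${price_per_performance:.2f} per query-hour")
```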
Figure 3 provides the price performance of each platform as configured. This is based on running all 103 of the queries contiguously to completion of the set.
Figure 3. Analytic Field Test 30 TB Price per Performance (lower is better)
As you can see, we could not achieve an identical price per performance across platforms. For Azure, we used the DW7500c we originally tested for the medium-sized configuration and DW15000c for large. For Snowflake, 3XLarge (medium) and 4XLarge (large) were used in our TCO calculations. Redshift price performance was close to Azure; we found 32 nodes of ra3.4xlarge performed most closely to our Azure and Snowflake medium-sized configurations, with 16 nodes of ra3.16xlarge for large. For BigQuery, we tested 5,000 and 10,000 Flex Slots; performance with 10,000 slots was equivalent to the medium configurations of the other three platforms. Thus, for our TCO calculations, we used a 10,000 annual slot commitment for medium organizations and a 20,000 annual slot commitment for large enterprises.
4. Total Cost of Ownership
A full analytics stack in the cloud is more than just a data warehouse, cloud storage, and a business intelligence solution. This TCO study required consideration of 10 categories to establish both equivalence among the four analytics platforms’ offerings and a fair estimate of pricing. In our experience, all of these components are essential to having a full enterprise-ready analytics stack.
Though the capabilities of the components across the stacks are not equal, all four stacks have been utilized successfully to build machine learning applications. Every component has six enterprise needs that must be met. These are security and privacy, governance and compliance, availability and recovery, performance and scalability, skills and training, and licensing and usage. The capability difference is made up in labor (addressed below) using our “cost multipliers.”
These stacks can be used for a variety of machine learning applications, including customer analytics, fraud detection, supply chain optimization, and IoT analytics. Of course, each application could use a slightly different set of components or quantity of each component. Our vision for developing our stacks is based on customer analytics.
Primary Components
The categories or components in a modern enterprise analytics stack that we included in our TCO calculations are as follows:
- Dedicated compute
- Storage
- Data integration
- Streaming
- Spark analytics
- Data lake
- Business intelligence
- Machine learning
- Identity management
- Data catalog
Dedicated Compute
The dedicated compute category represents the heart of the analytics stack—the data warehouse itself. A modern cloud data warehouse must have separate compute and storage architecture. The power to scale compute and storage independently of one another has transitioned from an industry trend to an industry standard. The four analytics platforms we are studying have separate pricing models for compute and storage. Thus, this TCO component deals with the costs of running the compute portion of the data warehouse (see Table 1). Storage of data comes next.
Table 1. The Four Vendor Stack Offerings for Dedicated Compute
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Synapse Analytics Workspace | 1-year reserved ($0.9513/hour per 100 DWU) |
AWS | Amazon Redshift RA3 | 1-year commitment all upfront ($8.61 effective hourly) |
GCP | Google BigQuery | Annual slot commitment ($1,700 per 100 slots) |
Snowflake | Snowflake Computing | Enterprise+ ($4.00 per hour per credit) |
Source: GigaOm 2022 |
You can find the pricing data for each offering here:
- https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
- https://aws.amazon.com/redshift/pricing/
- https://cloud.google.com/bigquery/pricing
- https://www.snowflake.com/pricing/
For Azure, we opted for the unified workspace experience in Azure Synapse Analytics. While you can purchase reserved capacity for a dedicated SQL pool resource under their legacy pricing model, at the time of this writing, reserved capacity pricing was unavailable for Azure Synapse Analytics Workspace. For AWS, we chose its latest RA3 family of clusters. Redshift RA3 includes addressable Redshift Managed Storage in its price. However, there is a separate storage charge that falls into the separate Storage category discussed in the next section. For Google, we chose BigQuery with dedicated slots, which is much more economical than their on-demand pricing model that charges $5 for each TB scanned by each query. For a stack built on Snowflake, the only choice is, of course, Snowflake. We used its Enterprise+ Azure pricing model, which offers multi-cluster capabilities, HIPAA support, PCI compliance, and disaster recovery.
Azure Synapse and Amazon Redshift both have the ability to pause compute (and therefore compute billing) manually when the resource is not needed. Snowflake automatically pauses itself after a customer-defined period of inactivity (e.g., 5 minutes) and wakes automatically upon the first query. These three platforms also allow you to scale the compute size up and down, and utilizing these features could result in some savings. With a BigQuery annual slot commitment, you get the best pricing, but there is no way to “pause” billing—you would need the more expensive Flex Slots for that.
To determine the number of DWUs, nodes, slots, and credits needed for Synapse, Redshift, BigQuery, and Snowflake, respectively, we used our field test to establish a like-for-like performance equivalence. See the Performance Comparison section for a rundown of this methodology and its results.
We assumed the data warehouse in a modern enterprise would be running 24/7/365. Thus, we priced it as running 8,760 hours per year.
Storage
The dedicated storage category represents storage of the enterprise data. Formerly, this data was tightly coupled to the data warehouse itself, but modern cloud architecture allows for the data to be stored separately (and priced separately). See Table 2.
Table 2. The Four Offerings of the Vendor Stacks for Storage
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Synapse Analytics SQL Pool | $0.023 per GB-month |
AWS | Amazon Redshift Managed Storage | $0.024 per GB-month |
GCP | Google BigQuery Storage | $0.20 per GB-month (uncompressed) |
Snowflake | Snowflake Computing Storage | $40.00 per TB-month |
Source: GigaOm 2022 |
- https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
- https://aws.amazon.com/redshift/pricing/
- https://cloud.google.com/bigquery/pricing
- https://www.snowflake.com/pricing/
- Note that for Snowflake we used on-demand storage rather than up-front capacity pricing to be consistent with the on-demand pricing of other platforms.
For all four vendor stacks, we chose the de facto storage that comes with each dedicated compute component discussed above. While these come with compute resources, they are priced separately according to the size of the customer’s data.
We priced storage at 30 TB of uncompressed data (compressed to 7.5 TB) for the medium-tier enterprise and 120 TB of uncompressed data (compressed to 30 TB) for the large-tier enterprise for Synapse, Redshift, and Snowflake. BigQuery prices data storage based on uncompressed size.
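As a back-of-the-envelope check of the medium-tier storage line items, the sketch below applies the Table 2 rates to the 7.5 TB compressed footprint assumed above; the results land on the storage subtotals reported later in Table 11.

```python
# Rough check of annual storage cost for the medium tier (7.5 TB compressed),
# using the per-GB-month rates in Table 2 and 12 months of storage.

def annual_storage_cost(gb, rate_per_gb_month):
    return gb * rate_per_gb_month * 12

compressed_gb = 7_500                                    # 30 TB uncompressed -> 7.5 TB compressed
print(annual_storage_cost(compressed_gb, 0.023))         # Azure Synapse SQL Pool   -> 2070.0
print(annual_storage_cost(compressed_gb, 0.024))         # Redshift Managed Storage -> 2160.0
print(7.5 * 40.00 * 12)                                  # Snowflake ($40/TB-month) -> 3600.0
```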
Data Integration
The data integration category represents the movement of enterprise data from source to the target data warehouse through conventional ETL (extract-transform-load) and ELT (extract-load-transform) methods. Table 3 shows the rundown of options.
Table 3. Vendor Stack Options Chosen for Data Integration
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Data Factory (ADF) | $0.25 per DIU-hour + $1.00 per 1,000 activity runs |
AWS | AWS Glue | $0.44 per DPU-hour |
GCP | Google Dataflow (Batch) | $0.0828 per worker-hour |
Snowflake | AWS Glue | $0.44 per DPU-hour |
Source: GigaOm 2022 |
- https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/
- https://aws.amazon.com/glue/pricing/
- https://cloud.google.com/dataflow/pricing
- https://aws.amazon.com/glue/pricing/
For Azure, we considered Data Factory pipeline orchestration and execution using integration runtime pricing and Data Integration Unit (DIU) utilization. However, the compute power of a DIU is not published on Azure’s website. AWS Glue was priced for ETL jobs using Data Processing Units (DPUs); a single Glue DPU provides 4 vCPUs and 16 GB of memory. Google Dataflow pricing units are worker-hours; a default Dataflow worker provides 1 vCPU and 3.75 GB of memory. We also used AWS Glue for the Snowflake stack. For Glue and ADF, we considered 64 units (DPUs and DIUs, respectively) for the medium-tier enterprise and 256 units for large, running 8,760 hours per year (24/7/365). For Dataflow, we used 256 workers for the medium tier and 1,024 for large (since its default workers have one-fourth the compute power of a DPU). For ADF, we also priced 1,000 activity runs per month for medium and 4,000 runs per month for large.
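As a quick sanity check on the sizing above, the AWS Glue line item (also used for the Snowflake stack) follows directly from the DPU counts, the Table 3 rate, and 24/7/365 operation:

```python
# AWS Glue data integration cost under the stated assumptions:
# 64 DPUs (medium) / 256 DPUs (large) at $0.44 per DPU-hour, 8,760 hours per year.

dpu_rate = 0.44
hours_per_year = 8_760

print(64 * dpu_rate * hours_per_year)    # ~246,682 per year (medium tier)
print(256 * dpu_rate * hours_per_year)   # ~986,726 per year (large tier)
```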
Streaming
The streaming category represents the movement of enterprise data via a streaming workload from event-driven and IoT (Internet of Things) sources. Table 4 shows the vendor stack options chosen for streaming.
Table 4. Vendor Stack Options Chosen for Streaming
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Stream Analytics (for Analytics) and Azure Event Hubs | $0.11 per streaming unit (SU) per hour + $1.233 per processing-unit per hour |
AWS | Amazon Kinesis Data Analytics | $0.11 per KPU-hour + $0.10 per GB-month for running application storage |
GCP | Google Dataflow (Streaming) | $0.352 per worker-hour |
Snowflake | Confluent Cloud (Kafka) | $1.50 base per hour + $0.12 per GB write + $0.05 per GB read |
Source: GigaOm 2022 |
For each of the four platforms, we made reasonable assumptions about the workload requirements of the medium- and large-tier configurations. Since each platform prices its Streaming services in vastly different ways (with no way to align on hardware specifications), we did our best to align on overall price, so as not to give any of the four platforms a bottom-line advantage in this category. For time-based pricing (e.g., per hour), we assumed 24/7/365 operations. We leave it to the reader to judge fairness in our decisions.
Spark Analytics
The Spark analytics category represents the use of Apache Spark for data analytics workloads. Neither AWS nor Google builds Spark processing into its data warehouse service the way Azure Synapse does, so Amazon EMR and Google Dataproc remain the choices for Spark processing. Table 5 shows the options chosen here.
Table 5. Vendor Stack Options Chosen for Spark Analytics
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Big Data Analytics with Apache Spark | $0.143 per vCore-hour |
AWS | Amazon EMR + Kinesis Spark | $1.008 for EMR on r6g.4xlarge per hour + $0.015 per shard-hour for Kinesis |
GCP | Google Dataproc | $0.01 per CPU-hour + $1.0481 per hour (n2-highmem-16) |
Snowflake | Amazon EMR + Kinesis Spark | $1.26 for EMR on r6g.4xlarge per hour + $0.015 per shard-hour for Kinesis |
Source: GigaOm 2022 |
- https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
- https://aws.amazon.com/emr/pricing/
- https://cloud.google.com/dataproc/pricing
- https://aws.amazon.com/emr/pricing/
Again, we made reasonable assumptions about the workload requirements of the medium- and large-tier enterprises. Since each platform prices its Spark services in vastly different ways (with no way to align on hardware specifications), we did our best to align on overall price, so as not to give any of the four platforms a bottom-line advantage in this category. The one exception here is Google Dataproc, which is priced significantly lower than the other three competitors. For time-based pricing (e.g., per hour), we assumed 24/7/365 operations. We leave it to the reader to judge the fairness of our decisions.
Data Lake
The data lake category represents the use of a data lake that is separate from the data warehouse. This is common in many modern data-driven organizations as a way to store and analyze massive data sets of “colder” data that don’t necessarily belong in the data warehouse. Table 6 shows the options considered for data lakes.
We eliminated the Hadoop-based data lake components (EMR, HDInsight, Dataproc, and Cloudera) in favor of object storage: S3, ADLS, and Google Cloud Storage. Since all four data warehouses now have external table support for object storage, it made sense to use it directly.
Table 6. Vendor Stack Options Chosen for Data Lakes
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Synapse Serverless | $5.00 per TB of scanned data |
AWS | Amazon Redshift Spectrum | $5.00 per TB of scanned data + $2.15 per hour for ra3.4xlarge |
GCP | Google BigQuery On-Demand Infrastructure | $5.00 per TB of scanned data |
Snowflake | Snowflake External Tables | $4.00 per credit per hour |
Source: GigaOm 2022 |
For this comparison, we set aside performance considerations and aligned on price.
Business Intelligence
The business intelligence category represents the use of a BI tool to complement the data warehouse for business users. (See Table 7)
Table 7. Vendor Stack Options Chosen for Business intelligence
Vendor | Offering | Pricing Used |
---|---|---|
Azure | PowerBI Professional | $9.99 per Developer per month |
AWS | Amazon Quicksight | $18 per Author per month + $5 per Reader per month |
GCP | BigQuery BI Engine | $0.0416 per GB-month |
Snowflake | Tableau | $70 per Creator per month + $42 per Explorer per month + $15 per Viewer per month |
Source: GigaOm 2022 |
- https://powerbi.microsoft.com/en-us/pricing/
- https://aws.amazon.com/quicksight/pricing/
- https://cloud.google.com/bi-engine/pricing
- https://www.tableau.com/pricing/teams-orgs#online
For this comparison, we did not consider features and capabilities. Amazon Quicksight and BigQuery BI Engine are not as mature or fully capable as PowerBI or Tableau, and therefore may be less effective for the workload. That characteristic is difficult to quantify.
We did, however, use the same number of users for each. Note that PowerBI is the most economical because it only charges for Developers (the same as Creator/Author in Quicksight and Tableau). For Developer/Creator/Author, we chose 100 for medium-tier and 500 for large-tier enterprises. For Quicksight, we chose 1,000 and 5,000 Readers for medium/large, respectively. For Tableau, we divided this population into 800 Viewers and 200 Explorers for medium and 4,000 Viewers and 1,000 Explorers for the large tier. For BigQuery BI Engine, we priced it to scan the entire data warehouse once a week (30 TB for medium and 120 TB for large), although this usage is difficult to predict and budget.
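The sketch below applies the Table 7 rates to the medium-tier user counts and scan volume described above; it is a simple check of the business intelligence line items, not a full pricing model.

```python
# Annual BI cost for the medium tier implied by the assumptions above.
months = 12

power_bi   = 100 * 9.99 * months                        # 100 developers             -> ~11,988
quicksight = (100 * 18 + 1_000 * 5) * months            # 100 authors, 1,000 readers -> ~81,600
tableau    = (100 * 70 + 200 * 42 + 800 * 15) * months  # creators/explorers/viewers -> ~328,800
bi_engine  = (30_000 * 52 / 12) * 0.0416 * months       # 30 TB scanned weekly       -> ~64,896

print(round(power_bi), round(quicksight), round(tableau), round(bi_engine))
```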
Machine Learning
The machine learning category represents the use of a machine learning and data science platform on top of the data warehouse and/or data lake. Table 8 shows the options chosen here.
Table 8. Vendor Stack Options Chosen for Machine Learning
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Machine Learning | $0.504 per hour (ML on E8 v3) |
AWS | Amazon Sagemaker | $0.605 per hour (ml.r5.2xlarge) |
GCP | Google BigQuery ML | $5 per TB-scanned + $25,000 M/$100,000 L (estimated) per year for model creation |
Snowflake | Amazon Sagemaker | $0.605 per hour (ml.r5.2xlarge) |
Source: GigaOm 2022 |
- https://azure.microsoft.com/en-us/pricing/details/machine-learning/
- https://aws.amazon.com/sagemaker/pricing/
- https://cloud.google.com/bigquery-ml/pricing
- https://aws.amazon.com/s3/pricing/
For this comparison, we set aside feature considerations and again aligned on price. Azure ML is free as a service, so you only pay for the compute. Snowflake and Amazon SageMaker have integrations that allow you to couple them together. BigQuery ML uses its on-demand pricing model outside of any slot commitment you may have purchased, and it charges separate fees for model creation; we chose an estimated budget of $25,000 per year for the medium-sized configuration and $100,000 for large to cover these fees. For Azure ML and SageMaker, we priced 16 nodes for medium and 64 nodes for large enterprises. Also note that with BigQuery ML, you pay additional fees per model created beyond this budget. We did not factor these into our pricing because they would affect the bottom line by much less than 1% of the overall total.
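For the compute-based machine learning options, the annual line items follow from the node counts above running 24/7/365; a quick check for the medium tier:

```python
# Medium-tier machine learning compute under the stated assumptions:
# 16 nodes running 8,760 hours per year at the Table 8 hourly rates.
hours_per_year = 8_760
nodes = 16

azure_ml  = nodes * 0.504 * hours_per_year   # E8 v3          -> ~70,641
sagemaker = nodes * 0.605 * hours_per_year   # ml.r5.2xlarge  -> ~84,797
print(round(azure_ml), round(sagemaker))
```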
Identity Management
The identity management category represents the integration of users through IAM (identity and access management). Table 9 shows the options chosen for each vendor stack.
Table 9. Vendor Stack Options Chosen for Identity Management
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Active Directory | $6 per user per month |
AWS | Amazon IAM | free |
GCP | Google Cloud IAM | free |
Snowflake | Amazon IAM | free |
Source: GigaOm 2022 |
For this comparison, only Azure charges for its identity service, Azure Active Directory. While the free IAM services of the other platforms are attractive on the surface, many organizations have sophisticated security and IAM requirements that need to integrate with on-premises security and single sign-on (SSO). For this reason, Azure Active Directory is a popular choice, especially among organizations that already use Windows Active Directory for on-premises security.
Data Catalog
The data catalog category represents the use of data governance and a centralized data catalog for all data assets. Table 10 shows the options chosen for each vendor stack.
Table 10. Vendor Stack Options Chosen for Data Catalog
Vendor | Offering | Pricing Used |
---|---|---|
Azure | Azure Purview | $0.411 per Capacity unit-hour |
AWS | Amazon Glue Data Catalog | $1 per 100K objects |
GCP | Google Data Catalog | $10 per 100K API calls |
Snowflake | Amazon Glue Data Catalog | $1 per 100K objects |
Source: GigaOm 2022 |
- https://azure.microsoft.com/en-us/pricing/details/azure-purview/
- https://aws.amazon.com/glue/pricing/
- https://cloud.google.com/data-catalog/pricing
- https://aws.amazon.com/glue/pricing/
For this comparison, we used our best educated estimates. AWS Glue and the Glue Data Catalog now support Snowflake, so we dropped Talend and Alation from the Snowflake stack (pricing for both tends to be negotiable and, therefore, hard to project) and added Glue to simplify pricing.
Annual Subtotals
Taking all the above pricing scenarios into consideration, the following tables show the one-year (annual) cost to purchase the analytics stack from each cloud vendor. Table 11 addresses pricing for medium-sized enterprises.
Table 11. Medium-Sized Enterprise One-Year (Annual) Cost to Purchase an Analytics Stack
- | Azure | AWS | GCP | Snowflake |
---|---|---|---|---|
01-Dedicated Compute | $788,400 | $754,061 | $2,040,000 | $2,242,560 |
02-Storage | $2,070 | $2,160 | $8,280 | $3,600 |
03-Data Integration | $152,160 | $246,682 | $185,771 | $246,682 |
04-Streaming | $104,875 | $65,270 | $74,012 | $74,810 |
05-Spark Analytics | $40,086 | $43,730 | $42,332 | $43,730 |
06-Data Lake | $30,000 | $105,406 | $30,000 | $140,160 |
07-Business Intelligence | $11,988 | $81,600 | $64,896 | $328,800 |
08-Machine Learning | $70,641 | $84,797 | $147,880 | $84,797 |
09-Identity Management | $72,000 | $0 | $0 | $0 |
10-Data Catalog | $3,604 | $1,200 | $12,000 | $1,200 |
Medium-Ent Annual Subtotal | $1,275,823 | $1,384,906 | $2,605,171 | $3,166,339 |
Source: GigaOm 2022 |
The one-year cost to purchase for large enterprises is then depicted in Table 12.
Table 12: Large Enterprise One-Year (Annual) Cost to Purchase an Analytics Stack
- | Azure | AWS | GCP | Snowflake |
---|---|---|---|---|
01-Dedicated Compute | $1,576,800 | $1,507,771 | $4,080,000 | $4,485,120 |
02-Storage | $8,280 | $8,640 | $33,120 | $14,400 |
03-Data Integration | $608,640 | $986,726 | $743,083 | $986,726 |
04-Streaming | $419,499 | $261,082 | $296,047 | $259,822 |
05-Spark Analytics | $160,343 | $174,920 | $169,329 | $174,920 |
06-Data Lake | $120,000 | $421,624 | $120,000 | $560,640 |
07-Business Intelligence | $59,940 | $408,000 | $259,584 | $1,644,000 |
08-Machine Learning | $282,563 | $339,187 | $591,520 | $339,187 |
09-Identity Management | $144,000 | $0 | $0 | $0 |
10-Data Catalog | $36,004 | $12,000 | $36,000 | $12,000 |
Large-Ent Annual Subtotal | $3,416,068 | $4,119,950 | $6,328,684 | $8,476,815 |
Source: GigaOm 2022 |
Other Costs
In addition to these components, other cost factors were considered. We realize that it takes more than simply buying cloud platforms and tools—namely, people—to achieve a production-ready enterprise analytics stack. These resources are required to build up and maintain the platform to deliver business insight and meet the needs of a data-driven organization.
The additional cost factors include:
- Data migration
- ETL integration
- Analytics migration
- Ongoing support and continuous improvement
Labor Costs
To calculate TCO, one cannot overlook the people factor in the cost. It is people who build, migrate, and use the analytics platform. To figure labor costs for our TCO calculations, we used a blended rate of both internal staff and external professional services, as shown below. The hourly rates are based on our experience working in the industry.
- Internal staff: $67 per hour
- External staff: $150 per hour
For internal staff, we used an average annual cash compensation of $100,000 and a 22% burden rate, bringing the fully burdened cost to $122,000. We also estimated the year to have 1,824 working hours, which gives an effective hourly rate of $66.89. For external professional services, we chose a nominal blended rate of $150 per hour.
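The internal rate follows directly from those assumptions:

```python
# Derivation of the internal staff hourly rate used above.
annual_cash = 100_000
burden_rate = 0.22
working_hours_per_year = 1_824

internal_hourly = annual_cash * (1 + burden_rate) / working_hours_per_year
print(round(internal_hourly, 2))   # ~66.89, rounded to $67/hour in the text
```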
To support both the migration to the cloud data analytics platform and the ongoing maintenance and continuous improvement, we estimated a mixture of internal and external resources, as seen in Table 13.
Table 13. Mixture of Internal and External Resources
- | Migration Phase | Improvement Phase |
---|---|---|
Internal Staff | 50% | 75% |
External Services | 50% | 25% |
Blended Rate | $108 per hour | $88 per hour |
Source: GigaOm 2022 |
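The blended rates in Table 13 follow from the internal/external mix and the hourly rates above (the report rounds them to $108 and $88):

```python
# Blended hourly rates implied by the staffing mix in Table 13.
internal, external = 67, 150

migration_rate   = 0.50 * internal + 0.50 * external   # 108.5 -> ~$108/hour
improvement_rate = 0.75 * internal + 0.25 * external   # 87.75 -> ~$88/hour
print(migration_rate, improvement_rate)
```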
Migration
To calculate the base cost for migrating from an existing platform to the cloud analytics solution, we used the values shown in Table 14.
Table 14. Base Cost for Migration
Migration Component (and primary cost driver) | Items (Medium) | Items (Large) | Migration Effort* | Base (Medium) | Base (Large) |
---|---|---|---|---|---|
Data Migration (# of DW objects) | 819 | 3,276 | 0.65 | $53,827 | $215,310 |
ETL Migration (# of ETL feeds/workflows) | 82 | 328 | 7.64 | $67,840 | $271,360 |
Analytics Migration (# of reports) | 410 | 1,640 | 1.21 | $53,828 | $215,310 |
Total Migration Base Cost | $175,495 | $701,980 | |||
Source: GigaOm 2022 |
* Average hours to migrate each item
Data migration includes significant data objects (tables, views, stored procedures, triggers, functions, scripts, etc.) that must be exported from the old data warehouse, transformed, and loaded into the new data warehouse. ETL migration includes the feeds (connections to source app databases/systems) that need to be migrated to support ongoing data imports into the new data warehouse. Analytics migration includes reports and dashboards that need to be modified to work on (connect to) the new data warehouse.
The migration effort, or the average hours to migrate each item, varies due to the complexity of the environment’s legacy artifacts. Migrating from more modern on-premises platforms might be easier than, say, a legacy mainframe. Each organization should do a deep source system analysis to understand the challenges and complexity factors for their given situation. The situation presented here is within what we, in our experience, consider relatively “typical,” although your mileage will vary.
In addition, we considered the complexity and difficulty of migrating to the cloud on each of the four platforms covered in this paper. Based on experience and a rigorous independent assessment of each platform, we developed multipliers (shown in Table 15) that measure the degree of difficulty or complexity of each of these three migration categories on the vendor platforms priced here.
Table 15. Cost Multipliers for Migrating to Data Warehouse Ecosystems
- | Azure | AWS | GCP | Snowflake |
---|---|---|---|---|
Data Migration | 2.0 | 2.0 | 4.0 | 3.0 |
ETL Migration | 1.0 | 2.0 | 4.0 | 3.0 |
Analytics Migration | 0.2 | 1.0 | 1.0 | 1.0 |
Medium Total | $186,260 | $297,162 | $418,830 | $540,497 |
Large Total | $745,042 | $1,188,649 | $1,675,318 | $2,161,987 |
Source: GigaOm 2022 |
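As a spot check, applying the Table 15 multipliers to the medium-tier base costs in Table 14 reproduces the Azure and AWS medium totals:

```python
# Spot check of Table 15: each medium-tier base cost from Table 14 is scaled by the
# platform's migration multiplier, and the three categories are summed.
base = {"data": 53_827, "etl": 67_840, "analytics": 53_828}   # medium-tier base costs

azure = base["data"] * 2.0 + base["etl"] * 1.0 + base["analytics"] * 0.2   # ~186,260
aws   = base["data"] * 2.0 + base["etl"] * 2.0 + base["analytics"] * 1.0   # ~297,162
print(round(azure), round(aws))
```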
In terms of data migration, most enterprise data warehouses (e.g., Oracle, Netezza, Teradata) have these features: indexing, constraints, replicated tables, heap tables, and user-controlled partitioning, in addition to security features such as row + column security, masking, and column encryption. Redshift, BigQuery, and Snowflake have limited support for them. Without these features, migration can be more difficult.
For ETL migration, code conversion is a big pain point. It is felt to a lesser degree with Redshift, whose older PostgreSQL 8 syntax creates limitations, and to a greater degree with BigQuery, whose own flavor of SQL requires significant conversion time.
For analytics migration, we primarily considered BI conversion. We assert that PowerBI already has excellent integration with Synapse, and that will save conversion time.
For the overall three-year TCO calculation, migration costs will only be applied once.
Ongoing Support and Continuous Improvement
The cost of maintaining and improving the total analytics platform must be considered as well. This includes support work, such as database administration, disaster recovery, and security. However, no analytics platform is (or should be) static. Maintenance and operations include ongoing work in security/privacy, governance and compliance, availability and recovery, performance management and scalability, skills and training, and licensing and usage tracking.
The environment needs constant improvement and enhancement to grow with business needs. The work of project management and CI/CD integration is considered here as well.
The factors involved that can impact these costs include:
- Complexity of maintenance
- Complexity of setup
- Complexity of operation and administration
We rated each of the platforms across these metrics and came up with a support and improvement cost multiplier applied to the base cost (minus migration costs calculated in the previous section). This is shown in Table 16.
Table 16. Ongoing Support and Continuous Improvement Cost Multipliers
- | Azure | AWS | GCP | Snowflake |
---|---|---|---|---|
Maintenance | Easy | Hard | Easy | Easy |
Setup | Medium | Medium | Hard | Easy |
Operation and Administration | Medium | Medium | Hard | Medium |
Support/Improvement Multiplier | 25% | 35% | 35% | 20% |
Medium Ent. Annual Total | $404,153 | $616,113 | $825,744 | $1,030,340 |
Large Ent. Annual Total | $1,193,901 | $1,968,795 | $2,293,398 | $2,689,160 |
Source: GigaOm 2022 |
Three-Year TCO
Finally, we arrive at our final three-year total cost of ownership figures for the study. Table 17 provides a breakdown and grand total for each of the four cloud vendors’ full analytics stacks.
Table 17. Three-Year Breakdown with Grand Totals
- | Azure | AWS | GCP | Snowflake |
---|---|---|---|---|
Year 1 | ||||
1-Medium Total | $186,260 | $297,162 | $418,830 | $540,497 |
2-Large Total | $745,042 | $1,188,649 | $1,675,318 | $2,161,987 |
Year 2 | ||||
1-Medium Total | $1,594,779 | $1,869,623 | $3,799,606 | $3,516,981 |
2-Large Total | $4,270,085 | $5,561,933 | $10,172,178 | $8,543,723 |
Year 3 | ||||
1-Medium Total | $1,594,779 | $1,869,623 | $3,799,606 | $3,516,981 |
2-Large Total | $4,270,085 | $5,561,933 | $10,172,178 | $8,543,723 |
TOTAL | ||||
Medium Enterprise | $3,375,818 | $4,036,407 | $8,018,042 | $7,574,458 |
Large Enterprise | $9,285,212 | $12,312,515 | $22,019,674 | $19,249,434 |
Source: GigaOm 2022 |
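One way to read how the annual purchase subtotals and the support/improvement multipliers roll into the Year 2 and Year 3 rows above is sketched below; it is a rough cross-check against the Azure and AWS medium-tier columns, not the full cost model.

```python
# Rough cross-check of the Year 2 / Year 3 medium-tier figures in Table 17:
# annual purchase subtotal (Table 11) plus the support/improvement multiplier (Table 16).
azure_yearly = 1_275_823 * (1 + 0.25)   # ~1,594,779
aws_yearly   = 1_384_906 * (1 + 0.35)   # ~1,869,623
print(round(azure_yearly), round(aws_yearly))
```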
Finally, Figure 4 provides a visual depiction of the three-year total costs across the four platforms.
Figure 4. Three-Year TCO Breakdown
5. Conclusion
Based on our approach and using the assumptions listed in each section mimicking a medium enterprise application, Azure was the lowest-cost platform, with a three-year cost of $3,375,818 for the analytics stack. AWS was 19% higher, while Google and Snowflake were more than double the cost.
Azure was also the lowest-cost platform for large enterprises, at roughly $9M over three years. AWS was 32% higher, while Google and the Snowflake stack were more than two times higher.
For a three-year total cost of ownership for medium enterprises, which includes people costs, Azure is the platform with the lowest cost of ownership at $3 million. AWS is at $4 million, Google at $8 million, and Snowflake at $8 million. For large enterprises, Azure's three-year TCO is $8.5 million, AWS's $12.3 million, Google's $19.2 million, and Snowflake's $22 million.
With Azure Synapse, there is a single point of management for significant services shared with the Azure platform. Performance and scale for all analytic capabilities are managed together. Skills and training are greatly simplified.
6. Appendix: GigaOm Analytic Field Test
A performance test was used to establish equivalency for our pricing of the four platforms. The GigaOm Analytic Field Test, used across all vendors, was designed to emulate the TPC Benchmark™ DS (TPC-DS) and adhered to its specifications. This was not an official TPC benchmark. The queries were executed using the following setup, environment, standards, and configurations.
The GigaOm Analytic Field Test is a workload derived from the well-recognized industry-standard TPC Benchmark™ DS (TPC-DS). From tpc.org:
The TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general-purpose decision support system. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.
The data model consists of 24 tables—seven fact tables and 17 dimensions. To give an idea of the data volumes used in our field test, Table 18 shows row counts of fact tables in the database when loaded with 30 TB of GigaOm Analytic Field Test data.
Table 18. GigaOm Analytic Field Test Data Volumes
Fact Table | Row Count (Scale Factor 30,000, 30 TB) |
---|---|
Catalog Returns | 4,319,925,093 |
Catalog Sales | 43,200,404,822 |
Inventory | 1,627,857,000 |
Store Returns | 8,639,952,111 |
Store Sales | 86,399,341,874 |
Web Returns | 2,160,007,345 |
Web Sales | 21,600,036,511 |
Source: GigaOm 2022 |
The GigaOm Analytic Field Test is a fair representation of enterprise query needs. The testing suite has 99 queries, four of which have two parts (14, 23, 24, and 39), bringing the total to 103 queries. The queries used for the tests were compliant with the standards set out by the TPC Benchmark™ DS (TPC-DS) specification and included only minor query modifications, as set out by section 4.2.3 of the TPC-DS specification document. For example, minor query modifications included vendor-specific syntax for date expressions. Some queries in the specification also require row limits; thus, vendor-specific syntax (e.g., TOP, FIRST, LIMIT, and so forth) was used as allowed by section 4.2.4 of the TPC-DS specification.
Cluster Environments
Our benchmark included four different cluster environments for medium and for large configurations (Table 19).
Table 19. Cluster Environments
- | Azure Synapse Analytics Workspace | Amazon Redshift | Google BigQuery | Snowflake |
---|---|---|---|---|
Medium Configuration | DW7500c | ra3.4xlarge (32 nodes) | 5,000 Flex Slots | 3XLarge |
Large Configuration | DW15000c | ra3.16xlarge (16 nodes) | 10,000 Flex Slots | 4XLarge |
Source: GigaOm 2022 |
7. About William McKnight
William McKnight is a former Fortune 50 technology executive and database engineer. An Ernst & Young Entrepreneur of the Year finalist and frequent best practices judge, he helps enterprise clients with action plans, architectures, strategies, and technology tools to manage information.
Currently, William is an analyst for GigaOm Research who takes corporate information and turns it into a bottom-line-enhancing asset. He has worked with Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals, and Verizon, among many others. William focuses on delivering business value and solving business problems utilizing proven approaches in information management.
8. About Jake Dolezal
Jake Dolezal is a contributing analyst at GigaOm. He has two decades of experience in the information management field, with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has solved technical problems across a broad range of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.
9. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.
10. Copyright
© Knowingly, Inc. 2022 "Cloud Analytics Platform Total Cost of Ownership" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.