Sector Roadmap: Hadoop/Data Warehouse Interoperability
https://gigaom.com/report/sector-roadmap-hadoopdata-warehouse-interoperability/ | January 29, 2015

With virtually every Hadoop distribution vendor offering SQL-on-Hadoop solutions, the key factor in the market is now the integration between Hadoop and data warehouse technology.

SQL-on-Hadoop capabilities played a key role in the big data market in 2013. In 2014, their importance only grew, as did their ubiquity, making possible new use cases for big data. Now, with virtually every Hadoop distribution vendor and incumbent database vendor offering SQL-on-Hadoop solutions, the key factor in the market is no longer mere SQL query capability; it is the quality and economics of the resulting integration between Hadoop and data warehouse technology.
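To make that common SQL interface concrete, the short sketch below issues an aggregate query against a SQL-on-Hadoop engine over ODBC, an access path most of these engines expose. It is an illustrative sketch only, not drawn from the report: the DSN, table, and column names are hypothetical, and any of the engines discussed here (Impala, Hive, Drill, or a warehouse vendor's query layer over Hadoop) could sit behind the driver.

```python
import pyodbc

# Hypothetical DSN pointing at a SQL-on-Hadoop engine; the table and column
# names are also illustrative.
conn = pyodbc.connect("DSN=hadoop_sql", autocommit=True)
cursor = conn.cursor()

# Summarize raw clickstream events that live in Hadoop so the result can be
# joined against, or loaded into, the data warehouse.
cursor.execute("""
    SELECT visitor_id, COUNT(*) AS page_views
    FROM raw_clickstream
    WHERE event_date = '2015-01-28'
    GROUP BY visitor_id
""")

for visitor_id, page_views in cursor.fetchall():
    print(visitor_id, page_views)

conn.close()
```

The economic point matters as much as the technical one: the same SQL skills and BI tooling reach data that never has to be copied onto a proprietary warehouse platform first.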

This Sector Roadmap™ examines that integration, reviewing SQL-on-Hadoop solutions on offer from the three major Hadoop vendors: Cloudera, Hortonworks, and MapR; incumbent data warehouse vendor Teradata; relational-database juggernaut Oracle; and Hadoop/data warehouse hybrid vendor Pivotal. With this analysis, key usage scenarios made possible by these solutions are identified, as are the architectural distinctions between them.

Vendor solutions are evaluated over six Disruption Vectors: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization, and query engine maturity. These vectors collectively measure not just how well a SQL-on-Hadoop solution can facilitate Hadoop-data warehouse integration, but how successfully it does so with respect to the emerging usage patterns discussed in this report.

Key findings in our analysis include:

  • In addition to the widely discussed data lake, the adjunct data warehouse is a key concept, one with greater near-term relevance to pragmatist customers.
  • The adjunct data warehouse provides for production ETL, reporting, and BI on the data sources first explored in the data lake. It also offloads production ETL from the core data warehouse in order to avoid capacity additions on proprietary platforms that carry a 10- to 30-times cost premium.
  • MapR fared best in our comparison due to the integration capabilities of Apache Drill’s technology. It would have fared better still were Drill not in such a relatively early phase of development.
  • Hortonworks, given its enhancements to Apache Hive, and Cloudera, with its dominant Impala SQL-on-Hadoop engine, follow closely behind MapR.
  • Despite their conventional data warehouse pedigrees, Teradata, Pivotal, and Oracle are very much in the game as they make their comprehensive SQL languages available as a query interface over data in Hadoop.

[Figure: SQL-on-Hadoop vendor comparison across the six Disruption Vectors]

Key:

  • Number indicates company’s relative strength across all vectors
  • Size of ball indicates company’s relative strength along individual vector

Source: Gigaom Research

Bringing in-memory transaction processing to the masses: an analysis of Microsoft SQL Server 2014 in-memory OLTP
https://gigaom.com/report/bringing-in-memory-transaction-processing-to-the-masses-an-analysis-of-microsoft-sql-server-2014-in-memory-oltp/ | May 27, 2014

The emerging class of enterprise applications that combine systems of record and systems of engagement has geometrically growing performance requirements: these applications must capture more data per business transaction from ever-larger online user populations. They have many capabilities similar to consumer online services such as Facebook or LinkedIn, but they need to leverage the decades of enterprise investment in SQL-based technologies. At the same time as these new customer requirements have emerged, SQL database management system (DBMS) technology is going through the biggest change in decades. For the first time, there is enough inexpensive memory capacity on mainstream servers for SQL DBMSs to be optimized around the speed of in-memory data rather than the performance constraints of disk-based data. This new emphasis enables a new DBMS architecture.
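As a toy illustration of why taking the disk out of the critical path changes the design center, the sketch below uses Python's built-in sqlite3 module to run the same committed-insert workload against a file-backed database and an in-memory one. This is emphatically not SQL Server's In-Memory OLTP technology, which also removes latching and compiles procedures to native code; it only shows how much of a traditional engine's cost comes from durable disk writes. The row count and schema are arbitrary.

```python
import os
import sqlite3
import tempfile
import time

def time_inserts(conn, rows=5_000):
    """Insert rows with a commit per statement so the durability cost is visible."""
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    start = time.perf_counter()
    for i in range(rows):
        conn.execute("INSERT INTO orders VALUES (?, ?)", (i, i * 0.01))
        conn.commit()
    return time.perf_counter() - start

# Disk-based database: every commit pays for a write to stable storage.
disk_path = os.path.join(tempfile.mkdtemp(), "orders.db")
disk_seconds = time_inserts(sqlite3.connect(disk_path))

# In-memory database: the identical workload with no disk in the critical path.
mem_seconds = time_inserts(sqlite3.connect(":memory:"))

print(f"file-backed: {disk_seconds:.2f}s   in-memory: {mem_seconds:.2f}s")
```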

This research report addresses two audiences.

  • The first is the IT business decision-maker who has a moderate familiarity with SQL DBMSs. For them, this report explains how in-memory technology can leverage SQL database investments to deliver dramatic performance gains.
  • The second is the IT architect who understands the performance breakthroughs possible with in-memory technology. For them, this report explains the trade-offs that determine the different sweet spots of the various vendor approaches.

There are three key takeaways.

  • First, there is an emerging need for a data platform that supports a variety of workloads, such as online transaction processing (OLTP) and analytics at different performance and capacity points, so that traditional enterprises don’t need an internal software development department to build, test, and operate a multi-vendor solution.
  • Second, within Microsoft’s data platform, SQL Server 2014 In-Memory OLTP not only leverages in-memory technology but also takes advantage of the ability to scale up to 64 virtual processor cores, delivering a 10- to 30-times gain in throughput without the challenge of partitioning data across a cluster of servers.
  • Third, Oracle and IBM can scale to very high OLTP performance and capacity points, but they require a second, complementary DBMS to deliver in-memory technology. SAP’s HANA is attempting to deliver a single DBMS that supports a full range of analytic and OLTP workloads, with the industry closely watching how well it optimizes performance. NewSQL vendors VoltDB and MemSQL are ideal for greenfield online applications that demand elastic scalability and automatic partitioning of data.

What to know when choosing database as a service
https://gigaom.com/report/what-to-know-when-choosing-database-as-a-service/ | September 26, 2013

IT decision makers today must manage data of varying volumes, velocity, and variety from one end of the enterprise to the other. For now at least, that requires several types of databases, and understanding exactly how each works.

In the past couple of years, we’ve seen more innovation in the SQL database management system (DBMS) category than in the 30-plus years since commercial products became available. The past decade’s web 2.0 sites have mostly driven this innovation, which looks to bridge some of the gap between NoSQL DBMS and MySQL.

MySQL was the most common early foundation for these web apps, but it required partitioning storage across many servers when there was more than a modest amount of data. Here was the core problem these early websites faced: They could use MySQL distributed across many servers for scalability, or they could use it on one server and let the DBMS take care of data integrity and full SQL queries, the two foundations of 30 years of data-management best practices. But once customers traded away those two foundations in return for scalability, they began embracing alternatives. NoSQL DBMS featured not only scalability but also more flexibility in handling new data types and new ways of manipulating that data. These products included not only Hadoop but also Cassandra, Couchbase, and MongoDB, among others. More recently, NewSQL DBMS vendors such as Clustrix, MemSQL, NuoDB, and VoltDB have combined the elastic scalability of NoSQL products with increasingly comprehensive support for data integrity and SQL queries in a distributed environment.
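The manual sharding burden described above is easy to picture in code. The sketch below is a simplification, not taken from any particular product: it hashes a partition key to pick one of several MySQL shards, and everything the single-server DBMS used to guarantee becomes application logic. Closing that gap is exactly what the NoSQL and NewSQL systems discussed here set out to do.

```python
import hashlib

# Hypothetical shard names; in practice these would be connection strings.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: str) -> str:
    """Pick a shard by hashing the partition key modulo the shard count."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The application, not the DBMS, must route every read and write.
for user in ("user-42", "user-1001", "user-77"):
    print(user, "->", shard_for(user))

# Queries that span users (joins, aggregates, transactions) now require
# fan-out logic, and data integrity has no single point of enforcement.
```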

The Software-as-a-Service (SaaS) applications of the future, whether next-generation consumer web and mobile services or traditional enterprises connecting with their customers, will likely build on the largely complementary foundations of the emerging distributed SQL and NoSQL DBMS.

Unlike SaaS versions of traditional enterprise applications such as CRM, financials, and HR, these new applications are about more than administrative efficiency; they enable entirely new kinds of services. Some of the most common are:

  • Online advertising
  • Game-session management
  • Network-intrusion detection
  • Fraud detection
  • Risk management
  • Ecommerce

These applications share a common set of features, which we will use as a framework for examining this category of DBMS:

  • Database as a Service
  • Big and fast data
  • Elastic capacity
  • Real-time operational analytics
  • Priced to be an embedded service

This report will help IT business decision makers navigate the emerging requirements of distributed SQL DBMS supporting these new SaaS applications. Although application developers have assumed more influence with the growing importance of line-of-business applications, IT business decision makers still need to understand the requirements. They will ultimately have to support these DBMS as part of the services their companies deliver to their end customers. In addition, over time central IT will have to help manage the proliferation of DBMS by providing guidance to groups with common requirements. This report will take readers through the common requirements for a DBMS to support these new applications. We will explain what each requirement means, why it’s important, and more precisely what to look for in a product.

The emerging distributed SQL DBMS occupy the lower-right portion of the figure below: data capture speed (the x-axis) is high, and decision latency is low because analytics run in real time. The DBMS on which we are focusing in this report typically must interact with DBMS elsewhere on this spectrum of activities. Analyzing historical data for exploratory or production reporting might take place in a data warehouse. And predictive modeling on larger data sets might take place offline with Hadoop as the foundation.

[Figure: Database workloads arranged by data capture speed and decision latency. Source: IBM Global Technology Outlook]

How to manage big data without breaking the bank
https://gigaom.com/report/how-to-manage-big-data-without-breaking-the-bank/ | April 30, 2013

In the tsunami of experimentation, investment, and deployment of systems that analyze big data, vendors have seemingly been trying approaches at two extremes—either embracing the Hadoop ecosystem or building increasingly sophisticated query capabilities into database management system (DBMS) engines.

At one end of the spectrum, the scale-out Hadoop distributed file system (HDFS) has become a way to collect volumes and types of data on commodity servers and storage that would otherwise overwhelm traditional enterprise data warehouses (EDWs). The Hadoop ecosystem offers a number of ways to query data in HDFS, with SQL-based approaches growing in both variety and maturity.

At the other end of the spectrum are both traditional and NewSQL DBMS vendors, with IBM, Microsoft, and Oracle among the former and Greenplum, Vertica, Teradata Aster, and many others emerging among the latter. These companies are driving unprecedented innovation in analytic query sophistication. Accessing row-organized tables stored on disk via SQL is no longer enough. Vendors have been adding the equivalent of new DBMS engine plug-ins, including in-memory caches for performance, column storage for data compression and faster queries, advanced statistical analysis, and even machine learning technology.

While the NewSQL vendors have introduced much lower price points than the traditional vendors, as well as greater flexibility in using commodity storage, they have made less progress in keeping the growth of required storage hardware from tracking the growth in data volumes.

For some use cases, there appears to be room for a third approach that lies between the extremes and borrows from the best of each. RainStor in particular and the databases focusing on column storage more generally have carved out a very sizable set of data storage and analytics scenarios that have been mostly ignored. Much of the data that needs to be analyzed doesn’t need to be updated — it can instead be stored as an archive in a deeply compressed format while still online for query and analysis. Databases with column store technology, such as Vertica and Greenplum, have taken important steps in this direction, and the incumbent vendors are also making progress in offering this as an option.

Organizing data storage in columns makes it easier to compress. Column stores can accelerate queries by scanning just the relevant and now smaller columns in parallel on multiple CPU cores. But the storage and database engine overhead of mediating potentially simultaneous updates to and reads from the data still remains. In other words, the column stores are a better data warehouse. They are not optimized to serve as archives, however. An online archive can compress its data by a factor of 30 to 40 because it will never have to be decompressed for updates. New data only gets appended. Without the need to support updates, it’s much easier to ingest new data at very high speed, and without the need to mediate updates, it’s much easier to distribute the data on clusters of low-cost storage.
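A small, self-contained experiment illustrates why column organization compresses so well. It is a sketch only, using zlib on synthetic data rather than any vendor's codec: the same table is compressed once in row order and once column by column, and the low-cardinality columns typically make the columnar layout markedly smaller.

```python
import csv
import io
import random
import zlib

random.seed(7)

# A synthetic fact table: date, country, channel, amount.
rows = [
    (f"2013-04-{day:02d}", random.choice(["US", "DE", "JP"]),
     random.choice(["web", "store"]), random.randint(1, 500))
    for day in range(1, 31) for _ in range(2_000)
]

def compressed_size(records):
    """Serialize records as CSV text and return the zlib-compressed size."""
    buf = io.StringIO()
    csv.writer(buf).writerows(records)
    return len(zlib.compress(buf.getvalue().encode("utf-8"), 9))

row_oriented = compressed_size(rows)                 # values interleaved per row
column_oriented = sum(                               # one column at a time
    compressed_size([(value,) for value in column]) for column in zip(*rows)
)

print(f"row-oriented:    {row_oriented:,} bytes compressed")
print(f"column-oriented: {column_oriented:,} bytes compressed")
```

An append-only archive pushes the same idea further: because the compressed columns never need to be rewritten for updates, far more aggressive encodings become practical.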

This paper is written for two audiences.

  • One is the business buyer who is evaluating databases and trying to reconcile data volumes growing at 50 percent to 100 percent per annum with an IT budget growing in the single digits. Of particular value to this audience are the generic use cases and the customer case studies. Also relevant is the price comparison with Oracle Exadata, which shows not just the capital cost of a traditional data warehouse solution but also the hidden running costs.
  • The other audience is the IT infrastructure technologist who is tasked with evaluating the proliferation of database technologies. For this audience, the more technical sections of the paper will be valuable. These sections focus on the different technology approaches to creating online analytic databases. The paper will compare mainstream data warehouse technologies and column stores in particular with a database that focuses more narrowly as an online analytic archive. In order to use a concrete example of an existing analytic archive, the paper will explain how RainStor’s database works.

Real-time query for Hadoop democratizes access to big data analytics
https://gigaom.com/report/real-time-query-for-hadoop-democratizes-access-to-big-data-analytics/ | November 7, 2012

The delivery of real-time query makes Hadoop accessible to more users — and by orders of magnitude. Its significance goes well beyond delivering a database management system (DBMS) kind of query engine that other products have had for decades. Rather, Hadoop as a platform now supports a whole new paradigm of analytics.

Real-time query is the catalyst for delivering a new level of self-service in analytics to a much broader audience. Interactive response and the accessibility of a structured query language (SQL) interface through open database connectivity/Java database connectivity (ODBC/JDBC) make the incremental discovery and enrichment of data possible for a greater and more varied audience of users than just data scientists. Hadoop can now reach an even wider array of users who are familiar with business intelligence tools such as Tableau and MicroStrategy.

That incremental discovery and enrichment process has two other major implications. First, it dramatically shortens the time between collecting data from source applications and extracting some signal from that data’s background noise. Second, it becomes a self-reinforcing exercise in crowdsourcing the process of refining meaning from the data. Both issues had previously represented major bottlenecks in the exploitation of traditional data warehouses.
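The incremental discovery loop can be seen in a few lines of client code. The sketch below is hypothetical: it assumes the impyla DB-API client for Impala, and the host, port, and table names are made up. The pattern rather than the specific engine is the point: a coarse exploratory query comes back interactively, so the next, more refined question can be asked right away.

```python
from impala.dbapi import connect  # impyla, one Python client for Impala

# Hypothetical coordinator host and table names.
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Pass 1: a coarse aggregate over raw, largely unrefined event data.
cur.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 20
""")
for event_type, events in cur.fetchall():
    print(event_type, events)

# Pass 2: drill into whatever looked interesting, refining the question.
cur.execute("""
    SELECT hour(event_ts) AS hr, COUNT(*) AS events
    FROM raw_events
    WHERE event_type = 'checkout_error'
    GROUP BY hour(event_ts)
    ORDER BY hr
""")
print(cur.fetchall())

conn.close()
```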

Hadoop’s traditional appeal

Historically, Hadoop has been a favorite among organizations needing to store, process, and analyze massive volumes of multistructured data cost-effectively. Its primary uses have included tasks such as index building, pattern recognition across multisource data, analyzing machine data from sources such as sensors and communications networks, creating profiles that support recommendation engines, and sentiment analysis.

However, several obstacles have limited the scope of Hadoop’s appeal. The MapReduce programming framework only operated in batch mode, even when supporting SQL queries based on Hive. Because Hadoop was a repository that collected unrefined data from many sources — and with little structure or organization — data scientists were required to extract meaning from it.

The traditional appeal of RDBMS-based analytic applications

Relational database management systems (RDBMS) have traditionally been deployed as data warehouses for analytic applications when most of the questions were known up front. Their care and feeding required a sophisticated, multistep process and a lot of time. This process supported the need for strong information governance, verifiability, quality, traceability, and security.

Traditional data warehouses are ideal for a certain class of analytic applications. Their sweet spot is running the same reports and queries and tracking the same set of metrics over time. But if the questions changed, things would break, and big parts of the end-to-end process would require redevelopment — often starting with the collection of new source data.

Moving toward a more unified platform for big data analytics

With the introduction of real-time query, Hadoop has taken a major step toward unifying the majority of big data analytic applications onto one platform. With that opportunity in mind, this research paper targets information technology professionals who have in-depth experience with traditional RDBMS and seek to understand where the Hadoop ecosystem and big data analytics fit.

In discussing this topic, we will address the following:

  • What’s driving the need for real-time analysis? (Real time can be broken down as either interactive or streaming.)
  • What’s driving the need for a more unified platform for big data analytics?
  • What will customers be able to do when they fully implement real-time query?
  • What are four key benefits of real-time query across customer use cases?
  • What does Impala look like under the covers?
  • How can we move toward a converged big data analytics platform?

A guide to big data workload-management challenges
https://gigaom.com/report/a-guide-to-big-data-workload-management-challenges/ | June 19, 2012

The explosive growth in the volume, velocity, variety and complexity of data has challenged both traditional enterprise application vendors and companies built around online applications. In response, new applications have emerged that are real-time, massively scalable and built around closed-loop analytics. Needless to say, these applications require very different technology underpinnings than what came before.

Traditional applications had a common platform that captured business transactions. The software pipeline extracted, cleansed and loaded the information into a data warehouse. The data warehouse reorganized the data primarily to answer questions that were known in advance. Tying the answers back into better decisions in the form of transactions was mostly an offline, human activity.

The emerging class of applications requires new functionality that closes the loop between incoming transactions and the analytics that drive action on those transactions. Closing the loop between decisions and actions can take two forms: Analytics can run directly on the transactional database in real time or in closely integrated but offline tasks running on Hadoop. Hadoop typically supports data scientists who take in data that’s far more raw and unrefined than the type found in a traditional enterprise data warehouse. The raw data makes it easier to find new patterns that define new analytic rules to insert back into the online database.
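A stripped-down sketch of that closed loop, with made-up numbers and rule names, looks like the following: an offline pass over raw historical data produces a rule, and the online path applies that rule to each incoming transaction. In a real deployment the offline step would run on Hadoop and the resulting rule would be written back into the operational database.

```python
from statistics import mean, stdev

# --- Offline step (the Hadoop / data-science side of the loop) ---------------
# Mine raw historical transactions for a pattern and distill it into a rule.
historical_amounts = [12.0, 9.5, 14.2, 11.8, 250.0, 10.4, 13.1, 9.9, 12.6, 11.2]
threshold = mean(historical_amounts) + 3 * stdev(historical_amounts)
rule = {"name": "unusually_large_amount", "max_amount": threshold}

# --- Online step (the transactional-database side of the loop) ---------------
# The rule derived offline is applied to every incoming transaction in real time.
def score_transaction(txn, rule):
    return "flag_for_review" if txn["amount"] > rule["max_amount"] else "accept"

incoming = [{"id": 1, "amount": 18.0}, {"id": 2, "amount": 900.0}]
for txn in incoming:
    print(txn["id"], score_transaction(txn, rule))
```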

This paper is targeted at technology-aware business executives, IT generalists and those who recognize that many emerging applications need new data-management foundations. The paper surveys this class of applications and its technology underpinnings relative to more-traditional offerings from several high-level perspectives: the characteristics of the data, the distinctiveness of the new class of applications, and the emerging database technologies — labeled NoSQL, for lack of a better term — that support them. Although the NoSQL label has been applied to many databases, this paper will focus on the class of systems with a rich data model typified by Cassandra. Other databases in this class include HBase, DynamoDB and Oracle NoSQL.

Report: Evolution of The Private Cloud
https://gigaom.com/report/report-the-evolution-of-the-private-cloud/ | May 11, 2010

This report looks at the future for hardware and software in enterprise adoption of cloud-like systems, or "private clouds," as well as the role that major players are likely to take in its ongoing development.

Every 15 years or so, the IT world undergoes a tectonic shift. Technological forces collide and grind against one another, creating an upheaval that leaves the landscape irrevocably changed. The latest such shift is currently underway: the transition to computing as a service, also known as cloud computing. This change promises to make computing more like a utility such as electricity or telephony — users plug in and get the resources they need without much manual effort on the part of service providers.

Cloud computing has brought these benefits to Internet titans like Google, Salesforce.com and Amazon, and to their customers. Traditional enterprise IT has long aspired to the same advantages, but with a crucial distinction. Businesses want the option of greater control over governance, security and management that comes with using their own infrastructure.

For the better part of the last decade, cloud computing within the enterprise appeared out of reach short of totally replacing existing hardware and software infrastructure to resemble that of large public web sites. Then came server virtualization, pioneered by VMware in the early part of the decade. At first, virtualization’s ability to tie disparate servers into a unified pool was used only for software development and testing. But gradually it became apparent that the technology was mature enough to deploy more widely. Suddenly, private clouds began to appear realistic.

This report is neither a comprehensive recipe for building a private cloud nor a complete review of all the products and vendors involved. Rather, it is a roadmap outlining the technology’s likely evolution, starting with the bottom layer in Figure 1. Readers familiar with cloud computing concepts at the infrastructure level will find the parts of the report that review lower layers of the IT stack somewhat remedial. They are there to set the context for the more forward-looking sections that describe how higher-level layers are likely to evolve.

[Figure 1: The layers of the IT stack addressed in this report]

What VMware’s SpringSource Acquisition Means for Microsoft
https://gigaom.com/report/what-vmwares-springsource-acquisition-means-for-microsoft/ | August 18, 2009

Customers will ultimately care about VMware’s acquisition of SpringSource because together the two will be able to offer a tightly integrated enterprise and cloud application platform similar to Microsoft’s server products, including the .NET application frameworks, the Windows Server application runtime platform, and the System Center management offerings. The tight integration that VMware, Microsoft, and ultimately IBM and Oracle aspire to offer — with slightly different approaches — is critical for dramatically bringing down the TCO of enterprise and cloud applications built on these platforms. This note examines the acquisition and its impact on a brewing battle between Microsoft and VMware.

On Aug. 10, VMware announced a definitive agreement to acquire privately held open source Java application framework and platform developer SpringSource for $420 million ($331 million in cash, $31 million in equity for vested options, $58 million for unvested stocks/options).  SpringSource, founded in 2004, had raised $25 million in two rounds of VC funding led by Accel and Benchmark Capital. The transaction is expected to close in September 2009.  VMware expects SpringSource to be cash flow positive by the first half of 2010, implying that billings will likely exceed $30 million on an annualized basis.

Will Storage Go the Way of The Server?
https://gigaom.com/report/will-storage-go-way-of-server/ | May 12, 2009

In essence, the future of storage is about storage software that increasingly absorbs intelligence that used to be hard-wired in a proprietary storage controller and array, which in turn is increasingly becoming an abundant pool of commodity disks. It is the pace of this transition that is at issue. In this report, we show how the different customer segments and associated workloads will evolve at different paces, and examine the associated opportunities for both incumbents and new market entrants.

The storage industry is on the cusp of the biggest structural change since networked storage (including SAN, NAS, and more recently iSCSI) began to substitute for direct-attached storage a decade ago. Despite being one of the fastest-growing technology sectors in terms of capacity, the economics for many participants are deteriorating. Several major technology and business model shifts will redefine the profit pools in the industry, leading to slimmer margins for all but the most innovative, software-driven players.

The long-term future of storage is about smart software that manages a large pool of cheap interchangeable hardware. However, in the near term, mainstream enterprise buyers continue to move cautiously while upgrading their existing installed base mostly with more of the same from vendors such as EMC and NetApp. But the current recession is making them more price-sensitive and creating pressure to try technology from newer vendors such as 3PAR and Data Domain for growing pockets of use cases. Cloud/online service providers are the most price-sensitive and open to new approaches since their storage capital and operating expenditures have a direct impact on their ability to offer competitive pricing.

Customers are transitioning from storage typically bought for a specific application to a more horizontal, virtual pool that better matches the shared resource model of their virtual servers. Much of the growth is occurring in two customer archetypes that are very different from the legacy enterprise data center characterized by scale-up architectures.

Scale-up describes a model of using bigger, more expensive machines, whether for storage or servers, to get more performance. In the scale-out model, many smaller, inexpensive machines work together to deliver better performance. Just about all enterprise software is written to scale up, while cloud-based software is primarily written to scale out.

  • Many new high-growth workloads in the enterprise are best handled as NAS-based file-oriented data, as opposed to highly structured SAN-based block-oriented data. They include web serving, film and animation processing for media companies, seismic modeling for oil exploration, financial simulation in banking, etc. These workloads generate so much data that customers have been willing to try newer vendors with less expensive scale-out architectures such as Isilon.
  • The very largest cloud/online service providers, such as Google, Yahoo, Amazon Web Services and Microsoft tend to build their own scale-out storage software to run on commodity storage hardware. This do-it-yourself model is an extreme example of what analysts are referring to when they say storage will become as commoditized as servers.

Storage technology is morphing in the direction of server technology, more slowly in the enterprise and faster in the cloud.

  • Server virtualization is putting a layer in front of storage that over the next several years will start to homogenize the differences between storage products for applications and administrators.
  • As modular or commodity storage manages more workloads, the storage software can sit either on the x86 controller or the x86 server. That will make it easier for customers to benchmark and put pressure on hardware prices, even if the software comes from the same storage vendor providing the controller and disk drives.
  • Customers are rolling out storage efficiency functionality that improves utilization much as server virtualization does. However, customers are using technologies such as snapshots, thin provisioning and de-duplication to absorb accelerating data growth, particularly in backup and nearline storage, rather than to meaningfully compress their storage budgets. (A minimal sketch of de-duplication follows this list.)
  • Flash memory will drastically improve the price/performance for virtually all classes of storage. In particular, over the next several years using flash as cache to augment magnetic disk performance will have a bigger impact than flash-based solid-state disks.
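To make the de-duplication mentioned in the list above concrete, here is a toy sketch of content-addressed chunking, not modeled on any particular product: identical chunks are stored once and referenced by fingerprint, which is why backup and nearline data, with heavy day-to-day overlap, benefit the most. Fixed-size chunking is a simplifying assumption; shipping systems typically chunk on content-defined boundaries.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks keep the sketch simple

def deduplicate(streams):
    """Store each unique chunk once; repeated chunks only add a reference."""
    store = {}    # chunk fingerprint -> chunk bytes
    recipes = []  # per stream, the ordered fingerprints needed to rebuild it
    for data in streams:
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            store.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        recipes.append(recipe)
    return store, recipes

# Two nightly "backups" that are mostly identical, as backup data tends to be.
monday = b"A" * 40_000 + b"unique-to-monday"
tuesday = b"A" * 40_000 + b"unique-to-tuesday"

store, recipes = deduplicate([monday, tuesday])
raw_bytes = len(monday) + len(tuesday)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(f"raw: {raw_bytes:,} bytes   stored after de-duplication: {stored_bytes:,} bytes")
```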

While these last two trends (storage efficiency and flash) are very significant for the storage industry, we are treating them as outside the scope of this report, which deals with the commoditization of storage in the midst of a transition to virtualization and cloud computing.

Changing customer workloads and emerging technologies are driving changes in vendor business models.

  • For all but the most high performance and resilient systems, storage hardware and software will increasingly be sold and priced as two distinct parts of one integrated product line, starting especially with cloud/online service providers. Even though these two components will likely come from the same vendor most of the time, this change will force storage vendors to sell software based on business value rather than systems based on capacity.
  • To the extent that customers shift more of their data to the cloud, aggregate industry demand for storage will move from a ‘just in case’ capacity, upfront capex model to a ‘just in time’ capacity, ongoing opex model. This is because online service providers run at much higher asset utilization than the typical customer, can add capacity in more granular increments, and are able to extract very favorable pricing from their suppliers. During this transition period, which we can think of as a form of industry-wide thin provisioning coupled with collective bargaining, storage vendors may see a temporary slowdown in revenue growth. More importantly, they may experience lower margins for a prolonged period.
  • Truly interchangeable storage software and commodity hardware will likely be limited to the largest cloud / online service providers, such as Google, Yahoo, Amazon and Microsoft. Enterprises lack the scarce talent required to combine third-party or open source storage software with commodity hardware in a way that ensures scalability and resilience. In other words, the ‘mix and match’ model of server hardware and software is not likely to become prevalent in mainstream storage anytime soon.
  • The storage vendor mix in traditional enterprises is unlikely to be radically reshuffled anytime soon, since the innovative storage software challengers have to contend with customers’ concerns about interoperability, supportability and resilience. A major OEM endorsement of a startup vendor such as Virsto or Sanbolic would change that dynamic. While Cisco is the most likely vendor to fill the role of disruptor, HP, Dell or IBM might be somewhat more conflicted about accelerating storage hardware commoditization.
