Table of Contents
- Summary
- Definition
- Motivation
- Considerations for Using Data Catalogs
- Key Players
- Key Takeaways
- About Andrew Brust
- About GigaOm
- Copyright
1. Summary
Data catalogs, a product category within the broad field of data governance, are growing in popularity. That popularity is driven by twin enterprise mandates: complying with data regulations and herding the growing number of repositories in the corporate data estate. But data catalogs are also a legacy product category, stemming originally from simple data dictionaries, which were essentially table layouts with plain-English descriptions of tables and fields. Today’s data catalogs have grown in capabilities, importance, and integration with other tools.
In a nutshell, data catalog platforms help organizations inventory their data by documenting data set content, location, and structure, and by aligning business and technical metadata. This organization yields control, and having control helps enterprises:
- Achieve compliance with data protection regulations, through documentation and inventory. End users know where to get data and will avoid duplicating it. Organizations can control access to entire data sets where necessary and can better enforce role-based access to data subsets within them. The EU’s General Data Protection Regulation (GDPR) is in effect now, with steep fines for non-compliance. The GDPR’s companion ePrivacy Regulation (ePR) is pending, and the California Consumer Privacy Act (CCPA) has been passed and will likely be in effect by the time you read this. These regulations demand the structure and controls that data catalogs provide.
- Improve data lake ROI by making data within the lake more discoverable and increasing the lake’s usability in general. A well-organized, searchable data catalog makes it easy to find relevant data, analyze it, derive insights, and make decisions with greater speed and conviction. These are the very reasons most enterprises built their data lakes in the first place.
- Unify the data landscape by creating a consolidated volume of information covering data lake, data warehouse, and operational databases. Implemented correctly, data catalogs integrate these components through a shared abstraction, helping customers derive new value from older warehouse and operational database assets.
- Bring data and the business closer together, by mapping business entity definitions onto data sets and columns within them. A great data catalog provides a business glossary that helps the business users find the data they need within the context of their own concepts, taxonomies, and vocabulary.
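To make the business glossary idea concrete, the following is a minimal sketch in Python of how a catalog might record data set location and structure, and map a business term onto technical columns. It does not reflect any vendor's product; the class names (`Column`, `DataSet`, `Catalog`), the `find_by_term` method, and the sample data sets are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Column:
    name: str                            # technical column name
    data_type: str                       # technical metadata
    business_term: Optional[str] = None  # glossary term mapped to this column


@dataclass
class DataSet:
    name: str
    location: str                        # e.g. data lake, warehouse, operational database
    columns: List[Column] = field(default_factory=list)


@dataclass
class Catalog:
    data_sets: List[DataSet] = field(default_factory=list)
    # Business glossary: term -> plain-English definition
    glossary: Dict[str, str] = field(default_factory=dict)

    def find_by_term(self, term: str) -> List[Tuple[str, str, str]]:
        """Return (location, data set, column) for each column mapped to the term."""
        wanted = term.lower()
        return [
            (ds.location, ds.name, col.name)
            for ds in self.data_sets
            for col in ds.columns
            if col.business_term is not None
            and col.business_term.lower() == wanted
        ]


# Hypothetical inventory spanning two repositories.
catalog = Catalog(
    glossary={"Customer Email": "Primary contact email address for a customer."}
)
catalog.data_sets.append(
    DataSet(
        "crm.contacts",
        "operational database",
        [
            Column("email_addr", "varchar", "Customer Email"),
            Column("created_at", "timestamp"),
        ],
    )
)
catalog.data_sets.append(
    DataSet(
        "events/signups",
        "data lake",
        [Column("user_email", "string", "Customer Email")],
    )
)

# A business user searches by their own vocabulary, not by column name.
hits = catalog.find_by_term("customer email")
```

In this sketch, one glossary term resolves to differently named columns in two repositories, which is the kind of alignment between business and technical metadata described above.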
The summary, then, is that catalogs protect enterprises from regulatory jeopardy and benefit them by delivering more value from existing assets. Today’s data catalogs enable collaboration between custodians of the data (“data stewards” in contemporary parlance) and business users by mapping out the organization’s data, which makes it more usable for analysis, and thereby benefits the organization.
The vendors discussed in this report all provide baseline functionality (discussed in the Definition section, below), and each has its own emphasis. Broadly speaking, the products break down into those that are more governance-focused and those that lean toward enabling self-service analysis in the organization by enhancing data discoverability and usability. Within those two broad categories are sub-emphases, detailed in the diagram below.
Figure 1: Data Catalogs: Categories and Priorities
Each of the above designations will become clearer through the course of this report.