There is an old dream of having all information available in one place, just as there is an equally old desire to squirrel away the information ‘we’ need in a place controlled by just ‘us’. Neither of these tendencies is ever likely to completely obscure the other, but there are technologies that attempt to bridge this gap between centralised order and localised specialisation.
I explored one such technology, the information catalogue, during a workshop by Mike Ferguson of Intelligent Business Strategies at last week’s Enterprise Data Conference in London. These catalogues are a response to the emergence of data lakes, which are repositories of data in whatever form it is found. That is, organisations have started to put copies of the databases of transactional systems, along with social media, click streams, office documents, cloud platform data, API outputs and much more besides, into a single place so that people can analyse, report, visualise and train machine learning models. In these lakes, raw data sits next to neatly processed data reports. A data lake can take the form of a single Hadoop installation or a cloud storage instance, or it can be a logical layer above a wide variety of data storage systems.
Needless to say, keeping track of what is in such a data lake is critical, and that is where the information catalogue comes in. Such a piece of software can:
- Discover data in the lake much like a web search engine crawls the web
- Keep track of changes in the data
- Profile data in terms of data quality and data lineage or provenance
- Automatically tag data for characteristics such as personally identifiable information, data protection category and more
- Enable users to collaborate on defining their own terms and tag data items with them (or have the platform do it for them)
- Attach governance policies to data items, including via tags
- Attach governance policies to data artefacts such as reports or ETL jobs
- Provide faceted search over all data
- Provide REST APIs to allow other tools access to its contents
- Provide a data market place for ready-made business intelligence applications
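Two of the capabilities above, automatic tagging and faceted search, can be illustrated with a toy sketch. This is not how any particular product works: the `CatalogueEntry` class, the two regular expressions and the sample entries are all hypothetical, and real catalogues rely on far richer classifiers and indexes. It only shows the basic idea of deriving tags from sampled values and then filtering on those tags together with other attributes.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical, deliberately naive patterns for spotting personal data.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "uk_phone": re.compile(r"\b0\d{10}\b"),
}


@dataclass
class CatalogueEntry:
    name: str             # data item as found in the lake
    source: str           # storage system it was discovered in
    sample_values: list   # a few values sampled while profiling
    tags: set = field(default_factory=set)


def auto_tag(entry):
    """Add a 'pii' tag (plus a specific one) if any sample matches a pattern."""
    for label, pattern in PII_PATTERNS.items():
        if any(pattern.search(str(value)) for value in entry.sample_values):
            entry.tags.update({"pii", f"pii:{label}"})
    return entry


def faceted_search(entries, tag=None, **facets):
    """Return entries that carry `tag` (if given) and match every other facet."""
    results = []
    for entry in entries:
        if tag is not None and tag not in entry.tags:
            continue
        if all(getattr(entry, key) == value for key, value in facets.items()):
            results.append(entry)
    return results


def facet_counts(entries, facet):
    """Count entries per value of one facet, as a search sidebar would show."""
    return dict(Counter(getattr(entry, facet) for entry in entries))


# Made-up entries standing in for what a crawler might discover.
catalogue = [
    CatalogueEntry("customers.csv", "s3", ["ann@example.com", "bob@example.org"]),
    CatalogueEntry("clicks.log", "hadoop", ["/home", "/pricing"]),
    CatalogueEntry("support_calls", "hadoop", ["07700900123"]),
]
for item in catalogue:
    auto_tag(item)

pii_names = [e.name for e in faceted_search(catalogue, tag="pii")]
print(pii_names)  # ['customers.csv', 'support_calls']
```

Narrowing the search with a second facet, for example `faceted_search(catalogue, tag="pii", source="hadoop")`, leaves only `support_calls`. A governance policy could then be attached to everything tagged `pii` rather than to individual data items.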
The scope of particular products varies, of course, and not every product supports every capability. Some are tied to a specific storage technology, as Apache Atlas is to Apache Hadoop’s file system, or AWS Glue to Amazon’s S3 storage service. Others are tied to specific business intelligence tools, such as Alteryx Connect or Qlik Podium Data Catalog. Others again are associated with data management and ETL platforms such as Informatica and Talend, and therefore tend to have a wider purview.
Clearly, the data governance and data discovery capabilities of such tools are highly attractive for any organisation with heterogeneous information stores, but they do raise the question of when it is worth implementing a dedicated information catalogue rather than making do with an assemblage of features of existing systems. Perhaps the best way to determine that inflection point is to make a ranked list of the tools you can use for data governance, metadata management and data discovery (from a pile of spreadsheets to a full-blown information catalogue) and agree at what point the pain of manually keeping track of all information outweighs the pain of buying and implementing a dedicated information catalogue.