
Information catalogues

There is an old dream of having all information available in one place, just as there is an equally old desire to squirrel away the information ‘we’ need in a place controlled by just ‘us’. Neither of these tendencies is likely ever to completely displace the other, but there are technologies that attempt to bridge this gap between centralised order and localised specialisation.

I explored one such technology, the information catalogue, during a workshop by Mike Ferguson of Intelligent Business Strategies at last week’s Enterprise Data Conference in London. These catalogues are a response to the emergence of data lakes, which are repositories of data in whatever form it is found. That is, organisations have started to put copies of the databases of transactional systems, along with social media, click streams, office documents, cloud platform data, API outputs and much more besides, into a single place to enable people to analyse, report, visualise and train machine learning models. In these lakes, raw data sits next to neatly processed data and reports. A data lake can take the form of a single Hadoop installation or a cloud storage instance, or it can be a logical layer above a wide variety of data storage systems.
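To make the idea of raw and curated data sitting side by side a little more concrete, here is a minimal sketch of browsing such a lake when it lives in an S3 bucket. The bucket name and zone prefixes are entirely hypothetical; any object store or Hadoop file system with a similar layout would do.

```python
# Minimal sketch: listing a few objects from hypothetical 'raw' and 'curated'
# zones of a data lake held in S3. The bucket and prefixes are made up.
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket name

for prefix in ("raw/crm/", "raw/clickstream/", "curated/sales_reports/"):
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=5)
    print(prefix)
    for obj in response.get("Contents", []):
        print("  ", obj["Key"], obj["Size"], "bytes")
```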

Needless to say, keeping track of what is in such a data lake is critical, and that is where the information catalogue comes in. Such a piece of software can:

  • Discover data in the lake much like a web search engine crawls the web
  • Keep track of changes in the data
  • Profile data in terms of data quality and data lineage or provenance
  • Automatically tag data for characteristics such as personally identifiable information, data protection category and more
  • Enable users to collaborate on defining their own terms and to tag data items with them (or have the platform do it for them)
  • Attach governance policies to data items, including via tags
  • Attach governance policies to data artefacts such as reports or ETL jobs
  • Provide faceted search over all data
  • Provide REST APIs to allow other tools access to its contents
  • Provide a data marketplace for ready-made business intelligence applications

The scope of particular products varies, of course; not every product supports all of these capabilities. Some are tied to a specific storage technology, such as Apache Atlas to Apache Hadoop’s file system, or AWS Glue to Amazon’s S3 storage service. Others are tied more to specific business intelligence tools, such as Alteryx Connect or Qlik Podium Data Catalog. Still others are associated with data management and ETL platforms such as Informatica and Talend and therefore tend to have a wider purview.
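To give a sense of what the API access in the capability list above can look like in practice, here is a small sketch against the AWS Glue Data Catalog, simply because it is mentioned above; other catalogues expose comparable REST or Python APIs. The database name, table name and search text are hypothetical.

```python
# Sketch: querying a catalogue programmatically, using the AWS Glue Data
# Catalog as the example. The database, table name and search text are
# hypothetical; search_tables and get_table are real boto3 Glue operations.
import boto3

glue = boto3.client("glue")

# Free-text search across catalogued tables, akin to the faceted search above.
found = glue.search_tables(SearchText="customer", MaxResults=10)
for tbl in found["TableList"]:
    print(tbl["DatabaseName"], tbl["Name"])

# Inspect one table's schema and any parameters (tags, classifications) on it.
table = glue.get_table(DatabaseName="sales_lake", Name="orders")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print(table.get("Parameters", {}))
```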

Clearly, the data governance and data discovery capabilities of such tools are highly attractive for any organisation with heterogeneous information stores, but they do raise the question of when it becomes worth implementing a dedicated information catalogue rather than making do with an assemblage of features of existing systems. Perhaps the best way to find that inflection point is to make a ranked list of the tools you could use for data governance, metadata management and data discovery – from a pile of spreadsheets to a full-blown information catalogue – and agree at what point the pain of manually keeping track of all information outweighs the pain of buying and implementing a dedicated information catalogue.

Data standardisation at the Environment Agency

Funny how very different organisations can have very similar challenges. One of the things I wanted to find out by going to the IRM UK Enterprise Data Conference Europe was how other organisations go about agreeing data standards for their own organisation.

I got it in one: the very first talk was about exactly that topic. Becky Russell from the Environment Agency, with help from Nigel Turner of Global Data Strategy Ltd, presented their approach to the collaborative development of data standards.

The problem they addressed was the following: they had lots of separate teams working on different aspects of the same processes, with many different systems. Because they worked on different aspects, they had different definitions for the same things, which made reporting on them difficult, costly and error prone. For example, one ‘thing’, or core data entity, for the EA is ‘catchment area’. The team unearthed 16 definitions of it, some differing subtly, some wildly. Some definitions shared the same labels, others did not. Sound familiar?

The core of the solution the EA team developed is a process for agreeing common definitions of major data entities such as ‘catchment area’. Where needed, they also develop a logical data model for those entities. Such a model specifies what attributes or dimensions an entity has, without quite specifying how a particular system needs to store it.
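As a purely hypothetical illustration of that distinction, a logical model for ‘catchment area’ might enumerate attributes along these lines, while leaving each system free to store them as tables, documents or files. None of these attributes are the EA’s actual agreed definition.

```python
# Illustrative only: a logical data model for a 'catchment area' entity,
# expressed as attributes without any storage decisions. All attribute
# names are invented for this example, not the Environment Agency's.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatchmentArea:
    catchment_id: str           # agreed business identifier
    name: str                   # agreed display name
    area_km2: float             # surface area in square kilometres
    river_basin_district: str   # parent grouping used in reporting
    description: Optional[str] = None  # optional free-text annotation
```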

That solves the reporting problem, but the process has challenges of its own. To deal with them, they stuck to a number of principles:

  • Be driven by a business problem, don’t just standardise something for the sake of it
  • Be business led: the business knows its domain best.
  • Have space for local as well as global standards. There can be good reasons for local exceptions to widely agreed definitions or data models.
  • Re-use external standards where you can.
  • Have supporting technologies in place.
  • Align the standards development to data governance structures.
  • Introduce the standard in new technology alone, because changing legacy systems is costly.

The last principle was tricky, because it means that the benefits of standardisation can take quite a while to materialise. Data warehouses can help in that respect.

This approach has worked well, and allows the team to iteratively and pragmatically bring more order to the organisation’s data landscape without having to spend huge resources upfront. As the effort progresses, the standardisation process can become more formal where it needs to be.