Author Archives: Wilbert Kraan

Information catalogues

There is an old dream of having all information available in one place, just as there is an equally old desire to squirrel away the information ‘we’ need in a place controlled by just ‘us’. Neither of these tendencies is likely to ever completely obscure the other, but there are technologies that attempt to bridge the gap between centralised order and localised specialisation.

I explored one such technology during a workshop on information catalogues by Mike Ferguson of Intelligent Business Strategies at last week’s Enterprise Data Conference in London. These catalogues are a response to the emergence of data lakes, which are repositories of data in whatever form it is found. That is, organisations have started to put copies of the databases of transactional systems, along with social media, click streams, office documents, cloud platform data, API outputs and much more besides, into a single place to enable people to analyse, report, visualise and train machine learning models. In these lakes, raw data sits next to neatly processed data and reports. A data lake can take the form of a single Hadoop installation or a cloud storage instance, or it can be a logical layer above a wide variety of data storage systems.

Needless to say, keeping track of what is in such a data lake is critical, and that is where the information catalogue comes in. Such a piece of software can:

  • Discover data in the lake much like a web search engine crawls the web
  • Keep track of changes in the data
  • Profile data in terms of data quality and data lineage or provenance
  • Automatically tag data for characteristics such as personally identifiable information, data protection category and more
  • Enable users to collaborate on defining their own terms and tag data items with them (or have the platform do it for them)
  • Attach governance policies to data items, including via tags
  • Attach governance policies to data artefacts such as reports or ETL jobs
  • Provide faceted search over all data
  • Provide REST APIs to allow other tools access to its contents
  • Provide a data marketplace for ready-made business intelligence applications

The scope of particular products varies, of course; not all of them support every capability. Some are tied to a specific storage technology, such as Apache Atlas to Apache Hadoop’s file system, or AWS Glue to Amazon’s S3 storage service. Others are tied to specific business intelligence tools, such as Alteryx Connect or Qlik Podium Data Catalog. Still others are associated with data management and ETL platforms such as Informatica and Talend, and therefore tend to have a wider purview.
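
To make the programmatic access mentioned in the list above a little more concrete, here is a minimal sketch of querying one such catalogue, the AWS Glue Data Catalog, through its API using Python’s boto3 client. The region, database name and search text are made-up examples.

```python
# Minimal sketch: programmatic access to an information catalogue, using
# the AWS Glue Data Catalog API via boto3. The region, database name and
# search text are hypothetical examples.
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# List the databases (collections of table definitions) the catalogue knows about
for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])

# List the tables a crawler has discovered in one (hypothetical) database
for table in glue.get_tables(DatabaseName="data_lake_raw")["TableList"]:
    print(table["Name"], table.get("Parameters", {}))

# Free-text search across the whole catalogue, the basis of faceted search
for table in glue.search_tables(SearchText="customer")["TableList"]:
    print(table["DatabaseName"], table["Name"])
```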

Clearly, the data governance and data discovery capabilities of such tools are highly attractive for any organisation with heterogeneous information stores, but they do raise the question of at what point it is worth implementing a dedicated information catalogue rather than making do with an assemblage of features of existing systems. Perhaps the best way to determine that inflection point is to make a ranked list of the tools you could use for data governance, metadata management and data discovery – from a pile of spreadsheets to a full-blown information catalogue – and agree at what point the pain of manually keeping track of all information outweighs the pain of buying and implementing a dedicated information catalogue.

Adrian Mouat on ‘Understanding Docker and Containerisation’

Sometimes, what can seem like just a useful innovation in IT infrastructure can have a significant effect higher up. Containerisation is one of those things, and one of its experts outlined the how and why in a Software Development Community of Practice industry talk.

One of the advantages of being in Edinburgh is that we have quite the tech scene on our doorstep. Sometimes literally, as when one of the pioneers of the now ubiquitous Docker container technology turns out to work out of the Codebase side of Argyle House. And that’s not the only connection Adrian has with us; he used to work at the EPCC part of the University. That made the idea of inviting him over for a general talk on Docker and containerisation both compelling and doable.

Being in IS, but somewhat removed from actually running server software, I was about as aware of the significance of containers as I was hazy on the details. Fortunately, I was the sort of audience Adrian’s talk was aimed at.

Specifically, he answered the main questions:

What is a container?

A portable, isolated computing environment that’s like a virtual machine, but one that shares its operating system kernel with its host. That makes it a lot more efficient than a virtual machine in image size, start-up time and so on. Docker is a technology for building and running such containers.
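
To give a flavour of what that looks like in practice, here is a minimal sketch of a Dockerfile that packages a hypothetical Python script (app.py) into a container image; the script and image names are invented for illustration.

```dockerfile
# Minimal sketch of a Dockerfile for a hypothetical Python script (app.py).
# It layers the script on top of an official Python base image, so everything
# the script needs to run travels with the image.
FROM python:3.11-slim

WORKDIR /app
COPY app.py .

CMD ["python", "app.py"]
```

Building and running it would then be a matter of ‘docker build -t hello-app .’ followed by ‘docker run --rm hello-app’, and the resulting image runs unchanged on any machine with a Docker engine.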

What problem does it solve?

Containers solve the “it works for me” problem, where a developer gets some software to work perfectly on her own machine, only to see it fail elsewhere because of any of a myriad of differences in the computing environment.

Why is it important?

Because it enables two significant trends in software development and architecture. One is the shift to microservices, which encapsulate functionality in small services that do just one thing, but do it well. Those microservices ideally live in their own environment, with no dependencies on anything else outside of their service interface. Container environments such as Docker are ideal for that purpose.

The other trend is devops: blurring the distinction between software development and operations, or at least bringing them much closer together. Because containers make the software environment portable and ‘copy-able’, it becomes much easier and quicker to develop, test and deploy new versions of running software.

What’s the catch?

No technology is magic, so it was good to hear Adrian point to the limitations of Docker as well. One is the danger of merely shifting the complexity of modern software from the inside of a monolithic application to a lot of microservices on the network. This can be addressed by good design and by making use of container orchestration technology such as Kubernetes.

The other drawback is that containers are, by their nature, not great at sharing complex state. Because each small piece of software lives in splendid isolation in its own container with its own lifecycle, making sure that every one of them is on the same page when they work together in a process is a challenge.

Overall, though, Docker’s ability to make software manageable is very attractive indeed, and, along with the shift to the cloud, could well mean that our Enterprise Architecture will look very different in a few years’ time.

The why and how of road mapping

When systems and processes are changing – and when the infrastructure they rely on doesn’t stay still either – it can become useful to see where the dependencies are. Roadmapping is a family of techniques to tackle that issue, and I went to a roadmapping workshop to delve deeper into them.


The workshop was organised by our architecture repository vendor (Avolution), but the idea is not dependent on that system. You can make a roadmap in PowerPoint. It’s just easier to marshal the data you need from a system that already has application and organisation data in it, and that knows about processes, projects and deliverables.


The range of data you need hints at why a roadmap can be necessary: it can show the gaps, overlaps and dependencies between change programmes and projects. That is, while all the resources and what happens to them can be well understood and planned within a single programme or project, there are aspects that easily get lost between multiple programmes. For example, programme A could simply assume that some peripheral but crucial data will still be available in two years’ time, only to find that project B is planning to change that data within a year.


Roadmaps can also show when multiple predictable issues are due: new regulations, for example, or the end of life of technologies. Those are simple dates in principle, but often have multiple dependencies between them, and from there to the processes and people who rely on those technologies. For example, it’s easy to miss that one widely used webtool may have no planned successor, even though the application technology it is built with has a dependency on a platform or language that will come out of support in a few months’ time.
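
To make that concrete, here is a small sketch of the kind of dependency check a data-driven roadmap automates; the technologies, applications and dates below are invented for illustration.

```python
# Sketch of the dependency check a roadmap makes visible: walk from each
# application, through the technology it is built with, to the platform that
# technology depends on, and flag anything going out of support soon.
# All names and dates are invented examples.
from datetime import date

# platform or language -> end-of-support date
support_ends = {
    "LegacyAppServer 7": date(2019, 3, 31),
    "OldScriptLang 2.x": date(2020, 1, 1),
}

# technology -> platform it depends on
depends_on = {
    "WebToolkit 3": "LegacyAppServer 7",
    "ReportingEngine": "OldScriptLang 2.x",
}

# application -> technology it is built with
built_with = {
    "Widely Used Webtool": "WebToolkit 3",
    "Monthly KPI Reports": "ReportingEngine",
}

horizon = date(2019, 6, 30)  # how far ahead to look

for app, tech in built_with.items():
    platform = depends_on[tech]
    end = support_ends[platform]
    if end <= horizon:
        print(f"{app} (built with {tech}) relies on {platform}, "
              f"which goes out of support on {end:%d %b %Y}")
```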


Roadmaps can take many forms, but there are broadly four different types, in ascending order of effort, complexity and sophistication:

Heatmaps and recommendation tagging can take the form of a simple table that uses colour to indicate which technologies are in what stage of their lifecycle. In the following example, a set of fundamental application server, database and operating system technologies are grouped together, tagged with their lifecycle stage and mapped to applications, databases, infrastructure and servers.

(Screenshot: heatmap of technologies colour-coded by lifecycle stage, mapped to applications, databases, infrastructure and servers, from Avolution ABACUS)

A lifecycle chart takes similar data and projects it on a timeline:

(Screenshot: lifecycle chart example from Avolution ABACUS)

Work package diagrams take more data and start to map out what changes in which work package (of which project or programme).

(Screenshot: work package diagram example from Avolution ABACUS)

Multiple architectures, finally, map out not one possible future, but multiple ones. This allows you to optimise the organisation around different goals, and compare what the results could be.

(Screenshot: multiple architectures example from Avolution ABACUS)

Pretty pictures, however, are not enough. Not even when they are based on real data. For success, a number of factors need attention:

  1. Provide clear direction
  2. Turn strategy into an executable plan
  3. Provide a compelling case for change
  4. Identify risks and the strategies to manage them
  5. Remain current, accurate and relevant.

For the University, change programmes such as Service Excellence cover points 1 through 3 well. Where data-driven roadmaps could add value is in point 4: the identification of risks between programmes and projects, and the support of strategies to manage them. The challenge, however, is point 5: keeping the roadmaps current, accurate and relevant. For that, we need to involve the right people, set up processes and automate data import and visualisation as much as possible. Stay tuned.


(All screenshots are from Avolution ABACUS example files and slides)

Data standardisation at the Environment Agency

Funny how very different organisations can have very similar challenges. One of the things I wanted to find out by going to the IRM UK Enterprise Data Conference Europe was how other organisations go about agreeing data standards for themselves.

I got there in one: the very first talk was about exactly that topic. Becky Russell from the Environment Agency, with help from Nigel Turner of Global Data Strategy Ltd, presented on their approach to the collaborative development of data standards.

The problem they addressed was the following: they had lots of separate teams working on different aspects of the same processes, with many different systems. Because they worked on different aspects, they had different definitions for the same things, which made reporting on them difficult, costly and error-prone. For example, one ‘thing’, or core data entity, for the EA is ‘catchment area’. The team unearthed 16 definitions of it, some subtly different, some very different. Some definitions shared the same labels, others not so much. Sound familiar?

The core of the solution the EA team developed is a process for agreeing common definitions of major data entities such as ‘catchment area’. They also develop a logical data model for those entities where one is needed. A data model specifies what attributes or dimensions an entity has, without quite specifying how a particular system needs to store it.
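
As a rough illustration of that distinction, here is a sketch of what an agreed ‘catchment area’ entity could look like as a logical model, expressed as a Python dataclass. The attributes are invented for the sake of the example and are not the EA’s actual model.

```python
# Sketch of a logical data model for a 'catchment area' entity: it names the
# agreed attributes and their types, but says nothing about how any particular
# system should store them. The attributes are invented for illustration.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CatchmentArea:
    catchment_id: str           # agreed unique identifier
    name: str                   # agreed common name
    area_km2: float             # surface area in square kilometres
    river_basin_district: str   # larger unit the catchment belongs to
    designation_date: Optional[date] = None  # when the boundary was agreed
```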

That solves the reporting problem, but the process has challenges of its own. To deal with them, they stuck to a couple of principles:

  • Be driven by a business problem, don’t just standardise something for the sake of it
  • Be business-led; the business knows its domain best.
  • Have space for local as well as global standards. There can be good reasons for local exceptions to widely agreed definitions or data models.
  • Re-use external standards where you can.
  • Have supporting technologies in place.
  • Align the standards development to data governance structures.
  • Introduce the standard in new technology only, because changing legacy systems is costly.

That last principle was tricky, because it means that the benefits of standardisation can take quite a while to materialise. Data warehouses can help in that respect.

This approach has worked well, and allows the team to iteratively and pragmatically create more order in the data landscape of the organisation without having to spend huge resources upfront. As the effort progresses, the standardisation process can become more formal where it needs to be.