A [data entity] by any other name…

When working with data, you’ll quickly realize how useful it is to have reliable unique identifiers to reference individual items or common entities in your data sets.

You want all references to the same customer to be easily identifiable, same with all references to the same location, all references to the same product, or language, or movie, or car or whatever.

If you want a quick lesson in the messiness of classifications and naming conventions, try to figure out the scientific name of a common red rose!

You will often come across data sets where this is not the case, and as pointed out in a previous post - Working with countries – in data - making this „right“ is not as straight forward as it might seem at first.

I like to illustrate this point with a story about the fall of Lehman Brothers. When the news broke in September 2008 that the bank had fallen, financial institutions around the world called for emergency meetings. They needed to understand their exposure to Lehman and what Lehman’s collapse might mean for their businesses. The financial world’s best analysts pored over their data to come up with quick estimates. Many of them came back with surprisingly good news. The exposure was not nearly as bad as they had assumed.

Well it turns out that „Lehman“ was not an easily identified, single entity you could search for in the data. In fact:

„The Lehman Brothers group consisted of 2,985 legal entities that operated in some 50 countries. Many of these entities were subject to host country national regulation as well as supervision by the Securities and Exchange Commission (SEC), through the Consolidated Supervised Entities (CSE) programme in the United States.“

This quote is taken from a report published by the Bank of International Settlements 18 months after Lehman’s fall. It took them that long to untangle the complicated mesh of subsidiaries and cross-ownership throughout the group, wading through different local regulations and registration methods - even with deep access to Lehman’s own data. And needless to say the analysts’ quick estimates 18 months earlier were often grossly understated, in some cases with grave consequences. In their defense, they had no unique identifiers for Lehman as a group to enable them to trace all these different entities in their data to the same owner.

So let it be said: Unique identifiers are serious business!

Pick an identifier, carefully

When you’re starting a project, either collecting new data or analyzing, cleaning up or consolidating existing data, you will want to carefully pick unique identifiers that make sense.

In existing data sets, look for identifying fields or columns that already exist in the data. They may not have a meaning on their own, but they could be internally consistent throughout a single set, such as product IDs in inventory data, or key fields from a database table where a numerical field in – say – the orders table refers to a customer in the customer table. Even if you don’t have both of those tables, the consistent IDs are nevertheless immensely useful.

These types of fields may already be enough for your purposes. If all you needed to do was to analyze orders by the top 10 customers – that’s it.

The problems arise when you either need to associate that data set with another data set from another origin, or if you don’t have a reliable ID field and you’ll have to determine it by analyzing text strings or even combinations of values in multiple fields or columns.

What’s that car?

Let’s say you have a data set about cars that originated in a free form user entry on a web site. How many different ways do you think people will come up with to spell „Mercedes Benz“? I have seen such a data set with over twenty different spellings – and I’m sure there is more innovation where that came from!

In this case it’s a question of reconciliation (sometimes referred to as entity resolution or deduplication). For many intents and purposes you just want to pick one way of spelling (preferably a somewhat correct one), and then replace all the other ways of spelling with that one. Note that you likely want to keep the original field for quality purposes or to display in a visualization or user interface, so creating a new column for the reconciled IDs is recommended. This is something that most data preparation tools do very well, and coming up with methods to do this without such tools in spreadsheet software or your favorite programming language is simple enough.

However, you may still have an issue if you need to combine or associate your data set with another one of a different origin. Before combining, you may need to clean the other set up in the same way, or determine what the numbers in a „Manufacturer_id“ column actually mean, then linking all the entries with the id „17” in that set with the „Mercedes“ value in your – now squeaky clean – column in the other table.

If you happen to be preparing a relational database model you’ll almost certainly end up using a (numerical) identifier in every table that references car manufacturers, and then a „manufacturers“ table with that ID as a primary key followed by the unified spelling, and other information about the manufacturer in subsequent fields.

But regardless of the tool or technique, you will want to use some sort of standard or widely used code as an ID rather than inventing your own if at all possible. That will allow you to piggyback on others’ work in collecting the codes, figuring out discrepancies and inconsistencies. It will also make it easier to associate data from other sources and to share your data with others if it comes to that.

For car manufacturers there is in fact a standard — ISO 3779 — determining what’s called the World Manufacturer Identifier (WMI). The WMI is in turn used as a part of the Vehicle Identification Number (VIN), which is a unique identifier for an individual vehicle. You can find more information about these IDs and common WMI codes on Wikipedia (the full VIN specification is a commercially available standard from ISO). Whenever possible, you should seek out and use such existing standard or commonly used codes. Inventing your own IDs should be a last resort.

Other useful codes

I’ve created a quick list of useful codes and standards for entities commonly encountered in data projects and posted in a separate post: Unique identifiers’ cheat sheet.

Pick an identifier, carefully

What’s that car?

Other useful codes

Stay in the loop