My Adventures in Coding: Data Entropy - Part 1

When I started this post, it was supposed to be about a page long. However, I've realized there is a lot more to cover than a single page will allow; so, I'm breaking it up.

Over the last week, I have spent between 10 and 20 hours cleaning up my contacts list, and I still have to put the corrected data back into my phone and email accounts. I have less than 1000 entries, but when there are no data entry standards or controls, things get pretty messed up. For example, I have all of the following in the Country field: "Canada", "Great Britain", "Israel", "RO", "United Kingdom", "United S", "United States", "United States of America", "United States ofAmerica","US","USA","ww", and "CA". There are 13 unique values but only 5 countries in which my contacts reside. "ww" is not one of them and I had to do an internet search to remember what "RO" meant. Although a person would be able to know what almost all of these different entries mean, a machine will have a much harder time, especially if you are coding from scratch and not using an existing heuristic or fuzzy match algorithm. And the Country field was just one of several fields that had been populated inconsistently. Just identifying which records were, in fact duplicates and not just similar took me nearly 2 hours.

The time I wasted fixing my data exemplifies an issue that has attracted my attention for nearly twenty years. I call it "Data Entropy". In thermodynamics, entropy is a measure of the energy that is not available for work. In the case of data entropy, it is a measure of the difficulty associated with collecting the data into useful information. Using my example above, if the useful information were how many of my contacts work in North America, a relatively low data entropy would allow the gathering of that information by simply counting the number of entries whose country field equaled "United States of America" or "Canada". Since my contacts list contains more than two values which mean either "United States of America" or "Canada", a more complex algorithm has to be employed to ensure these records are included in the count.

This is not just an academic exercise. There are real world, expensive consequences for database with a high level of data entropy. When companies need to move data from one application to another, such as when migrating to a new customer service application or adding data to a data warehouse, one of the most time consuming and complex challenges is cleaning up or transforming the data in order for it to be useful in the new environment. The effort required can cost millions of dollars for very large or complex data sets.

Like entropy in physics, the entropy of a data system will never decrease without external intervention (like manually cleaning up my contact list). More often than not, any database which is used regularly for any significant period of time, will suffer a continuing increase in its data entropy. My intention is to illustrate what factors contribute to increasing data entropy and what can be done to mitigate them.

I am 99.9999% certain the following list will change, but here are some of the items I plan to address.

Design early and often.
Data isn't formatted.
Oil and water don't mix, and neither do data types.
Murphy is a user. If he can do it, he will.
Everyone goes their own way; and uses their own abbreviations
Does it really mean what you think it means?
Where is Springfield? Context matters.
Time breaks all things. Meaning and use change over time
Mo' systems, Mo' trouble. Beware conflicting usages by different applications

On a slightly different note:
I did manage to write VBA code that extracts my contacts information from Outlook and writes it into my Orion spreadsheet. If you're interested, I've added the code to my sample page.

My Adventures in Coding

Friday, May 9, 2014

Data Entropy - Part 1 - Introduction

No comments:

Post a Comment