Saturday, May 10, 2014

Data Entropy, Part 2 -- Design early and often

"Just build it.  If there are problems, we'll fix it later."

I can't tell you the number of times I've heard project managers tell me that.  Nobody seems to want to spend time on design.  And those who are willing to spend time on application design don't seem to think that designing the data structure is important.  "We're going to handle that in the application" is often the refrain.  If some application coders had their way, the database would consist of a table for each screen, with nothing but variable character columns named col_1, col_2 and so on.

The truth is that data matters.  So why wouldn't you spend time planning how you will store it, manage changes to its structure, and account for future growth, not just in size but in structure as well?

Design Early

I start every design with the same standard, normalized, logical schema.  It consists of 7 top-level entity types ("Activity", "Agreement", "Asset", "Data", "Event", "Location", and "Party") and relationships for each intersection of two entities, for example, "Activity_Location".  I have yet to encounter an entity I could not model as a subtype of one of these 7 top-level entities or a relationship between them.  This allows for very flexible designs while maintaining a simple set of data definitions.  Additionally, it simplifies the process of extending the data structures downstream, since there is already a general case for any conceivable data element.
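As a quick illustration, the full set of pairwise intersection entities can be enumerated mechanically.  The entity names below come from the list above; the enumeration itself is just my sketch, not part of the original schema:

```python
from itertools import combinations

# The seven top-level entity types from the standard logical schema.
ENTITIES = ["Activity", "Agreement", "Asset", "Data",
            "Event", "Location", "Party"]

# One intersection (relationship) entity for each distinct pair,
# named by joining the two entity names, e.g. "Activity_Location".
intersections = [f"{a}_{b}" for a, b in combinations(ENTITIES, 2)]

print(len(intersections))  # 21 pairwise intersection entities
print(intersections[:3])
```

Seven entities yield 21 pairwise intersections, which is the whole relationship vocabulary the approach needs before any subtyping begins.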

Of course, the logical model has to be denormalized to implement an efficient physical database.  When subtyping from the standard schema above, the most common denormalization I find is moving the referential columns into one of the associated entity tables for 1:n or 1:1 relationships.  In the normalized schema above, even these relationships are modeled as subtypes of the intersection tables.
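Here is a minimal sketch of that denormalization.  The Asset/Location rows and column names are my own illustration, not tables from the post: a 1:n relationship held in intersection rows collapses into a single reference column on the "many" side:

```python
# Logical form: the 1:n relationship kept as intersection rows,
# as in the normalized standard schema.
asset_location = [
    {"asset_id": 1, "location_id": 10},
    {"asset_id": 2, "location_id": 10},
]

# Physical form: because each asset has at most one location, the
# reference collapses into a location_id column on the Asset table.
def denormalize(assets, intersection, ref_col="location_id"):
    refs = {row["asset_id"]: row[ref_col] for row in intersection}
    return [{**a, ref_col: refs.get(a["id"])} for a in assets]

assets = [{"id": 1, "name": "laptop"}, {"id": 2, "name": "printer"}]
flattened = denormalize(assets, asset_location)
```

The intersection table disappears from the physical design, but the logical model still records that it existed, which is exactly the mapping the next section argues you must keep revisiting.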

... and Often

A common mistake made at this point is to abandon the logical model and focus entirely on the physical database.  The problem is that by doing this you lose all the advantages of having started with a standard model.  More changes lead to greater variance from the original design and increase the real and potential data entropy of the system.

The way this should be handled is to revisit the logical model for every proposed change to the database.  In this way, the designer can identify the correct entity that needs to be changed, extended, or subtyped, and how it is mapped into the physical database.

A trivial example of this exists in my Bird Dog application.  One of the things I am tracking is my relationship to the people in the application.  Logically, this is a relationship between User (a subtype of Asset) and Person (a subtype of Party) and would normally be a many-to-many relationship.  However, since I am currently the only user, I took a shortcut and implemented the reference as a column in the Party table; I don't even have a User entity.  Eventually, I will need to implement the actual m:n relationship so that the system can support multiple users.  If the developer failed to consult the logical model, it's likely that the existing "party_current_user_relationship_type_name" column would be ignored completely, and new tables created for User and the intersection between User and Party.  Initially, the data in the old column would be duplicated in the new table; as time passed, it would grow stale and conflict with the new intersection table.  If the developers had consulted the logical model, they would have known to use the data in the column to populate the intersection table and then to remove the column.  Doing this would have avoided both data entropy problems.
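A rough sketch of the correct migration, in Python over plain records for clarity.  The column name comes from the post; the record shapes, relationship values, and user id are hypothetical:

```python
# Hypothetical Party rows carrying the shortcut column from the post.
parties = [
    {"party_id": 1, "name": "Alice",
     "party_current_user_relationship_type_name": "colleague"},
    {"party_id": 2, "name": "Bob",
     "party_current_user_relationship_type_name": "client"},
]

CURRENT_USER_ID = 1  # the single existing user becomes the first User row

def migrate(parties, user_id):
    """Populate the User_Party intersection from the old column,
    then drop the column so no stale duplicate is left behind."""
    user_party = []
    for p in parties:
        rel = p.pop("party_current_user_relationship_type_name", None)
        if rel is not None:
            user_party.append({"user_id": user_id,
                               "party_id": p["party_id"],
                               "relationship_type_name": rel})
    return user_party

links = migrate(parties, CURRENT_USER_ID)
```

The key point is the `pop`: the old column is consumed, not copied, so there is exactly one place the relationship lives afterward.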

On a slightly different note:
I was going to create a SQL script to implement the top level entities and add it to the Sample Code page.  But the fact is that until they are subtyped, the attributes are extremely limited.  Instead, here are the definitions I use.
Activity -- Anything that takes place over a period of time.  Examples include tasks and meetings
Agreement -- An accord among two or more Parties.  Examples include contracts and orders.
Asset -- Anything that has value or may be owned.  Examples include bank accounts, equipment, labor categories and user accounts.
Datum -- Information that is independent of an entity.  Examples include documents, media files, and category and classification definitions.
Event -- Anything that occurs at a point in time.  Examples include financial transactions, state transitions and status changes.
Location -- Where something may be found, sent or received.  Examples include coordinates, property addresses, geopolitical entities, URLs and file system locations.
Party -- Anything that can take action, enter into an agreement or possess assets.  Examples include people and companies.

Friday, May 9, 2014

Data Entropy - Part 1 - Introduction

When I started this post, it was supposed to be about a page long.  However, I've realized there is a lot more to cover than a single page will allow; so, I'm breaking it up.

Over the last week, I have spent between 10 and 20 hours cleaning up my contacts list, and I still have to put the corrected data back into my phone and email accounts.  I have fewer than 1000 entries, but when there are no data entry standards or controls, things get pretty messed up.  For example, I have all of the following in the Country field: "Canada", "Great Britain", "Israel", "RO", "United Kingdom", "United S", "United States", "United States of America", "United States ofAmerica", "US", "USA", "ww", and "CA".  There are 13 unique values but only 5 countries in which my contacts reside.  "ww" is not one of them, and I had to do an internet search to remember what "RO" meant.  Although a person would be able to know what almost all of these different entries mean, a machine will have a much harder time, especially if you are coding from scratch and not using an existing heuristic or fuzzy match algorithm.  And the Country field was just one of several fields that had been populated inconsistently.  Just identifying which records were, in fact, duplicates and not just similar took me nearly 2 hours.

The time I wasted fixing my data exemplifies an issue that has attracted my attention for nearly twenty years.  I call it "Data Entropy".  In thermodynamics, entropy is a measure of the energy that is not available for work.  In the case of data entropy, it is a measure of the difficulty associated with collecting the data into useful information.  Using my example above, if the useful information were how many of my contacts work in North America, a relatively low data entropy would allow the gathering of that information by simply counting the number of entries whose country field equaled "United States of America" or "Canada".  Since my contacts list contains more than two values that mean either "United States of America" or "Canada", a more complex algorithm has to be employed to ensure these records are included in the count.
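To make the "more complex algorithm" concrete, here is a minimal sketch.  The variant spellings come from my Country field above; the mapping table and record shapes are illustrative, and a real cleanup would need fuzzy matching on top of this:

```python
# Map each observed variant (lowercased) to a canonical country name.
# A high-entropy field forces you to build and maintain a table like this.
CANONICAL = {
    "united states": "United States of America",
    "united s": "United States of America",
    "united states of america": "United States of America",
    "united states ofamerica": "United States of America",
    "us": "United States of America",
    "usa": "United States of America",
    "canada": "Canada",
    "ca": "Canada",
}

NORTH_AMERICA = {"United States of America", "Canada"}

def north_american(contacts):
    """Count contacts whose country resolves to a North American country."""
    return sum(
        1 for c in contacts
        if CANONICAL.get(c.get("country", "").strip().lower()) in NORTH_AMERICA
    )

contacts = [{"country": "USA"}, {"country": "United States ofAmerica"},
            {"country": "CA"}, {"country": "Israel"}, {"country": "ww"}]
count = north_american(contacts)  # 3
```

With a low-entropy field, the whole function collapses to one equality test per row; every variant spelling adds another entry you have to discover, verify, and maintain.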

This is not just an academic exercise.  There are real world, expensive consequences for databases with a high level of data entropy.  When companies need to move data from one application to another, such as when migrating to a new customer service application or adding data to a data warehouse, one of the most time-consuming and complex challenges is cleaning up or transforming the data in order for it to be useful in the new environment.  The effort required can cost millions of dollars for very large or complex data sets.

Like entropy in physics, the entropy of a data system will never decrease without external intervention (like manually cleaning up my contact list).  More often than not, any database that is used regularly for any significant period of time will suffer a continuing increase in its data entropy.  My intention is to illustrate what factors contribute to increasing data entropy and what can be done to mitigate them.

I am 99.9999% certain the following list will change, but here are some of the items I plan to address.

On a slightly different note:
I did manage to write VBA code that extracts my contacts information from Outlook and writes it into my Orion spreadsheet.  If you're interested, I've added the code to my sample page.