My Adventures in Coding: Data Entropy, Part 2 -- Design early and often

"Just build it. If there are problems, we'll fix it later."

I can't tell you the number of times I've heard project managers tell me that. Nobody seems to want to spend time on design. And those who are willing to spend time on application design don't seem to think that designing the data structure is important. "We're going to handle that in the application." is often the refrain. If some application coders had their way, the database would consist of a table for each screen, with nothing but variable character columns named col_1, col_2 and so on.

The truth is that data matters. So why wouldn't you spend time planning how you will store it, manage changes to the structure and account for future growth, not just in size but in structure as well.

Design Early

I start every design with the same, standard, normalized, logical schema. It consists of 7 top level entity types ("Activity", "Agreement", "Asset", "Data", "Event", "Location", and "Party") and relationships for each intersection of two entities, for example, "Activity_Location". I have yet to encounter an entity I could not model as a subtype of one of these 7 top level entities or a relationship between them. This allows for very flexible designs, while maintaining a simple set of data definitions. Additionally, it simplifies the process of extending the data structures down stream, since there is already a general case for any conceivable data element.

Of course, the logical model has to be denormalized to implement an efficient physical database. Subtyping from the standard schema above, I find the most common denormalization is moving referential columns into one of the associated entity tables for 1:n or 1:1 relationships. In the normalized schema above, even these relationships are modeled as subtypes of the intersection tables.

... and Often

A common mistake made at this point is to abandon the logical model and focus entirely on the physical database. The problem is that by doing this you lose all the advantages of having started with a standard model. More changes will lead to greater variance from the original design and increases inthe real and potential data entropy of the system.

The way this should be handled, is to revisit the logical model for every proposed change to the database. In this way, the designer can identify the correct entity which needs to be changed, extended, or subtyped and how it is mapped into the physical database.

A trivial example of this exists in my Bird Dog application. One of the things I am tracking is my relationship to the people in the application. Logically, this is a relationship between User (a subtype of Asset) and Person (subtype of Party) and would normally be a many to many relationship. However, I am currently the only user so, I took a shortcut and implemented the reference as a column in the Party table. I don't even have a User entity. Eventually, I will need to implement the actual m:n relationship so that the system can support multiple users. If the developer failed to consult the logical model design, it's likely that the existing "party_current_user_relationship_type_name" be completely ignored. New tables for User and the intersection between User and Party created. Initially, the data in the old column will be duplicated in the new table and as time passes the data becomes stale and conflicts with the new intersection table. If the developers had consulted the logical model, they would have known to use the data in the column to populate the intersection table and to remove the column. Doing this would have avoided both of the data entropy problems.

On a slightly different note:
I was going to create a SQL script to implement the top level entities and add it to the Sample Code page. But the fact is until they are subtyped, the attributes are extremely limited. Instead, here are the definitions I use.
Activity -- Anything that takes place over a period of time. Examples include tasks and meetings
Agreement -- An accord among two or more Parties. Examples include contracts and orders.
Asset -- Anything that has value or may be owned. Examples include bank accounts, equipment, labor categories, user accounts
Datum -- Information that is independent of an entity. Examples include documents, media files, and category and classification definitions.
Event -- Anything that occurs at a point in time. Examples include financial transactions, state transitions and status changes.
Location -- Where something may be found, sent or received. Examples include coordinates, property addresses, geopolitical entities, URLs and file system locations.
Party -- Anything that can take action, enter into an agreement or possess assets. Examples include people and companies.

My Adventures in Coding

Saturday, May 10, 2014

Data Entropy, Part 2 -- Design early and often

Design Early

... and Often

No comments:

Post a Comment