My Adventures in Coding: Data Entropy, Part 4 -- Oil and water don't mix, and neither do data types

Have you ever tried to find the sum of "A", 23 and an artichoke? Would you be able to find it's average? Max? Min? Ask a database program and it's likely to return an answer. But is it the answer that you would expect? Many newbie coders (and even a few experienced coders) default every attribute of their data structures to varchar and float. The problem with this approach is when it comes time to put data together you have to find the data dictionary in order to avoid making mistakes that will corrupt the output. Even more egregious is the need to add code to format or transform data in order to make sure that errors won't occur. The Data Entropy in these systems is incredibly high. A lot of work must be put in to get useful information out.

I actually saw this table definition once:

CREATE TABLE personnel
     ( id         INT           PRIMARY KEY
     , pers_col1  VARCHAR(100)
     , pers_col2  VARCHAR(25)
     );

This is simplified from the actual implementation (there were more columns), but you get the gist. I'm convinced that the developer simply used the default wizard and increased the column size when it didn't work.

Granted, a detailed data dictionary that documents every column, it's usage and definition can be tremendously valuable. However, I propose that you start by defining a set of functional attribute types, map these to a default set of database column types and use the names of these attribute types (or a standard abbreviation) as the suffix of the type.

Here are the ones I always use with the default MySQL Column definition:

Attribute Type	Abbreviation	Definition	MySQL Column Type
identifier	id	An abstract value that uniquely identifies an entity.	int(11)
globally unique identifier	gd	Special case of an Identifier, where the value is a 32 character hexadecimal value designed to be unique in every context.	char(32)
name	nm	The name ascribed to an entity which is normally used to identify it.	varchar(100)
number	nb	A human readable identifying value. (e.g., Part Numbers or Social Security Number)	varchar(50)
code	cd	A short value identifying a classification or category. These are human readable and should be avoided unless the implementation requires coding according to a standards guide. For example, "CAT5" or "CAT6" are codes.	varchar(25)
date	da	Contains only the calendar date	date
time	tm	Contains only the time of day	datetime
quantity	qt	A numeric measure of something which uses some unit of merit, such as pounds, meters, or liters.	float
count	ct	The number of things in the identified set.	int
amount	am	A financial quantity, expressed in some defined currency.	decimal(30,15)
description	ds	A long text which is used to describe the entity to the user	varchar(500)
sequence	sq	A value used to order items within a group	int
timestamp	ts	An audit value indicating the system time at which something occurred.	datetime
flag	fg	A boolean value. This is used when an entity is or isn't. I prefer the term flag, over "boolean" because that's what these columns do, they "flag" something about the entity.	boolean

By having a clearly defined function based set of attribute types, you've set a foundation for building a system that is as self-documenting as is possible. From this point, it's a matter of consistent usage and maintaining the definitions over time.

On a slightly different note:
When naming columns also beware of homonyms and synonyms. They can bite you in the behind if you're not careful to avoid them whenever possible and document every time you can't avoid it.

My Adventures in Coding

Thursday, February 5, 2015

Data Entropy, Part 4 -- Oil and water don't mix, and neither do data types

No comments:

Post a Comment