Thursday, February 5, 2015

Data Entropy, Part 4 -- Oil and water don't mix, and neither do data types

Have you ever tried to find the sum of "A", 23 and an artichoke?  Would you be able to find it's average? Max? Min?  Ask a database program and it's likely to return an answer.  But is it the answer that you would expect?  Many newbie coders (and even a few experienced coders) default every attribute of their data structures to varchar and float.  The problem with this approach is when it comes time to put data together you have to find the data dictionary in order to avoid making mistakes that will corrupt the output.  Even more egregious is the need to add code to format or transform data in order to make sure that errors won't occur.  The Data Entropy in these systems is incredibly high.  A lot of work must be put in to get useful information out.

I actually saw this table definition once:
CREATE TABLE personnel
     ( id         INT           PRIMARY KEY
     , pers_col1  VARCHAR(100)
     , pers_col2  VARCHAR(25)
     );

This is simplified from the actual implementation (there were more columns), but you get the gist. I'm convinced that the developer simply used the default wizard and increased the column size when it didn't work.

Granted, a detailed data dictionary that documents every column, it's usage and definition can be tremendously valuable.  However, I propose that you start by defining a set of functional attribute types, map these to a default set of database column types and use the names of these attribute types (or a standard abbreviation) as the suffix of the type.

Here are the ones I always use with the default MySQL Column definition:


Attribute TypeAbbreviationDefinitionMySQL Column Type
identifieridAn abstract value that uniquely identifies an entity.int(11)
globally unique identifiergdSpecial case of an Identifier, where the value is a 32 character hexadecimal value designed to be unique in every context.char(32)
namenmThe name ascribed to an entity which is normally used to identify it.varchar(100)
numbernbA human readable identifying value.  (e.g., Part Numbers or Social Security Number)varchar(50)
codecdA short value identifying a classification or category.  These are human readable and should be avoided unless the implementation requires coding according to a standards guide.  For example, "CAT5" or "CAT6" are codes.varchar(25)
datedaContains only the calendar datedate
timetmContains only the time of daydatetime
quantityqtA numeric measure of something which uses some unit of merit, such as pounds, meters, or liters.float
countctThe number of things in the identified set.int
amountamA financial quantity, expressed in some defined currency.decimal(30,15)
descriptiondsA long text which is used to describe the entity to the uservarchar(500)
sequencesqA value used to order items within a groupint
timestamptsAn audit value indicating the system time at which something occurred.datetime
flagfgA boolean value.  This is used when an entity is or isn't.  I prefer the term flag, over "boolean" because that's what these columns do, they "flag" something about the entity.boolean
By having a clearly defined function based set of attribute types, you've set a foundation for building a system that is as self-documenting as is possible. From this point, it's a matter of consistent usage and maintaining the definitions over time.

On a slightly different note:
When naming columns also beware of homonyms and synonyms.  They can bite you in the behind if you're not careful to avoid them whenever possible and document every time you can't avoid it.