Thursday, February 5, 2015

Data Entropy, Part 4 -- Oil and water don't mix, and neither do data types

Have you ever tried to find the sum of "A", 23 and an artichoke?  Would you be able to find it's average? Max? Min?  Ask a database program and it's likely to return an answer.  But is it the answer that you would expect?  Many newbie coders (and even a few experienced coders) default every attribute of their data structures to varchar and float.  The problem with this approach is when it comes time to put data together you have to find the data dictionary in order to avoid making mistakes that will corrupt the output.  Even more egregious is the need to add code to format or transform data in order to make sure that errors won't occur.  The Data Entropy in these systems is incredibly high.  A lot of work must be put in to get useful information out.

I actually saw this table definition once:
CREATE TABLE personnel
     ( id         INT           PRIMARY KEY
     , pers_col1  VARCHAR(100)
     , pers_col2  VARCHAR(25)
     );

This is simplified from the actual implementation (there were more columns), but you get the gist. I'm convinced that the developer simply used the default wizard and increased the column size when it didn't work.

Granted, a detailed data dictionary that documents every column, it's usage and definition can be tremendously valuable.  However, I propose that you start by defining a set of functional attribute types, map these to a default set of database column types and use the names of these attribute types (or a standard abbreviation) as the suffix of the type.

Here are the ones I always use with the default MySQL Column definition:


Attribute TypeAbbreviationDefinitionMySQL Column Type
identifieridAn abstract value that uniquely identifies an entity.int(11)
globally unique identifiergdSpecial case of an Identifier, where the value is a 32 character hexadecimal value designed to be unique in every context.char(32)
namenmThe name ascribed to an entity which is normally used to identify it.varchar(100)
numbernbA human readable identifying value.  (e.g., Part Numbers or Social Security Number)varchar(50)
codecdA short value identifying a classification or category.  These are human readable and should be avoided unless the implementation requires coding according to a standards guide.  For example, "CAT5" or "CAT6" are codes.varchar(25)
datedaContains only the calendar datedate
timetmContains only the time of daydatetime
quantityqtA numeric measure of something which uses some unit of merit, such as pounds, meters, or liters.float
countctThe number of things in the identified set.int
amountamA financial quantity, expressed in some defined currency.decimal(30,15)
descriptiondsA long text which is used to describe the entity to the uservarchar(500)
sequencesqA value used to order items within a groupint
timestamptsAn audit value indicating the system time at which something occurred.datetime
flagfgA boolean value.  This is used when an entity is or isn't.  I prefer the term flag, over "boolean" because that's what these columns do, they "flag" something about the entity.boolean
By having a clearly defined function based set of attribute types, you've set a foundation for building a system that is as self-documenting as is possible. From this point, it's a matter of consistent usage and maintaining the definitions over time.

On a slightly different note:
When naming columns also beware of homonyms and synonyms.  They can bite you in the behind if you're not careful to avoid them whenever possible and document every time you can't avoid it.

Monday, January 5, 2015

Data Entropy, Part 3 -- Data isn't formatted.

If you use MS Outlook, then you are probably familiar with the way in which Outlook stores phone numbers; that is as separate Country Code, Area Code, Number and Extension.  If you make sure you store these values separately, then Outlook will display the numbers in a standard format.  However, there is nothing in the system that will prevent you from storing the numbers in a non-standard format.  Outlook simply throws the entire thing into the "Number" field and displays it as you entered.  What this illustrates is the fact that formats are and should be kept separate from the data which they display.

When data gets shared, the format in which it is stored can cause confusion or even errors.  Take for example a field that stores the amount of available storage used by files on your harddrive.  Given the wide variance in file sizes, we have grown accustomed to seeing the units appended to the end, like 45GB or 3.5TB.  But if you stored those values "45GB" and "3.5TB" in your database, how would you start adding them up?  How would you be able to total the amount of storage used by .gif files versus the amount used by .mp3s?  The logic required becomes mind numbing and maintaining that code requires considering all of the ways in which a user may enter the value.  The right solution is to decide on a unit of measure for the storage metric, (Personally, I like bytes in this case) and stick to it.  Then you can modify your display algorithms to provide units.  Additionally, you have a standard value by which you can transfer the data to any system without worrying about how the information is formatted.

On a slightly different note:
A group of colleagues and I started reminiscing about backup media and what it took to store all those floppies/CDs/DVDs "back in the day".  So, I decided to take a look back and see what it would take to store 1 terabyte of information using some common media from our past.  Here are the results.

Media Type Capacity # to store 1TB
5-1/4 In floppy disk 360 KB 2,982,617
3.5 in HD floppy disk 1.44 MB 728,178
8 in floppy disk 6.2 MB 169,126
SuperDisk - LS120 120 MB 8,739
SuperDisk - LS240 240 MB 4,370
CD-ROM 650 MB 1,614
Zip Disk 750 MB 1,399
Jaz Disk 1 GB 1,024
DVD (2-Layer) 9.4 GB 109