Saturday, July 11, 2020

Data Entropy, Part 5 -- Murphy is a user. If he can do it, he will.

Early in my career, I was developing an application where the workflow and data entry were straightforward (or so I thought).  However, a few days after I submitted the work to our test team, they came back with a number of issues.  As they described the steps that caused the problems, I realized they weren't following the workflow we had defined.  When I asked why they did what they did, the lead tester responded, "Because we can."  That opened my eyes to a stark reality: if users can do something, they will, and that includes entering bad data.  This is why we need constraints and why we need to enforce them.

Constraints are gateways that protect your data from errors.  Don't sidestep them.  Often, I see pick lists that allow additions whenever a user can't find the value she's searching for.  If users can add any value, then the foreign key you defined on the column isn't really being enforced.  If possible, use smart searches that look for variations in the way people enter values.
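As a minimal sketch (the table and column names here are hypothetical), this is the kind of foreign key a free-for-all pick list quietly defeats:

CREATE TABLE country
     ( country_id  INT           PRIMARY KEY
     , country_nm  VARCHAR(100)
     );

CREATE TABLE contact
     ( contact_id  INT           PRIMARY KEY
     , contact_nm  VARCHAR(100)
     , country_id  INT
     , FOREIGN KEY (country_id) REFERENCES country (country_id)
     );

The constraint only does its job if the application makes the user pick an existing country row instead of inserting whatever string she typed.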

You're a user too!
It's very tempting (and sometimes justified) to disable constraints for data loads.  By all means, weigh the performance hit you would take by leaving them enabled.  But, for goodness sake, please validate the data before and after you perform the load to make sure your constraints are actually satisfied.
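In MySQL terms, a hedged sketch of that discipline might look like the following (the orphan check assumes the hypothetical contact and country tables sketched above):

-- Validate first: find child rows that would violate the constraint
SELECT c.contact_id
  FROM contact c
  LEFT JOIN country y ON y.country_id = c.country_id
 WHERE c.country_id IS NOT NULL
   AND y.country_id IS NULL;

SET FOREIGN_KEY_CHECKS = 0;   -- disable enforcement for the bulk load
-- ... perform the load ...
SET FOREIGN_KEY_CHECKS = 1;   -- re-enable enforcement

Note that MySQL does not re-check existing rows when you re-enable the checks, which is exactly why the orphan query has to run again after the load.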

Thursday, February 5, 2015

Data Entropy, Part 4 -- Oil and water don't mix, and neither do data types

Have you ever tried to find the sum of "A", 23, and an artichoke?  Would you be able to find its average? Max? Min?  Ask a database program and it's likely to return an answer.  But is it the answer you would expect?  Many newbie coders (and even a few experienced coders) default every attribute of their data structures to varchar and float.  The problem with this approach is that when it comes time to put data together, you have to hunt down the data dictionary to avoid mistakes that will corrupt the output.  Even more egregious is the need to add code to format or transform the data just to make sure errors won't occur.  The Data Entropy in these systems is incredibly high: a lot of work must be put in to get useful information out.
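A quick illustration of that "likely to return an answer" problem, using MySQL's implicit casts (other databases behave differently):

SELECT 'A' + 23;   -- returns 23: MySQL silently casts 'A' to 0
SELECT 'A' = 0;    -- returns 1: the same cast makes 'A' "equal" to zero

The statements succeed, warnings and all, and happily hand back numbers that mean nothing.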

I actually saw this table definition once:
CREATE TABLE personnel
     ( id         INT           PRIMARY KEY
     , pers_col1  VARCHAR(100)
     , pers_col2  VARCHAR(25)
     );

This is simplified from the actual implementation (there were more columns), but you get the gist. I'm convinced the developer simply accepted the wizard's defaults and increased the column size whenever something didn't fit.

Granted, a detailed data dictionary that documents every column, its usage, and its definition can be tremendously valuable.  However, I propose that you start by defining a set of functional attribute types, map these to a default set of database column types, and use the names of these attribute types (or a standard abbreviation) as the suffix of each column name.

Here are the ones I always use, with the default MySQL column definition for each:

identifier (id) -- int(11)
    An abstract value that uniquely identifies an entity.
globally unique identifier (gd) -- char(32)
    A special case of an identifier, where the value is a 32-character hexadecimal value designed to be unique in every context.
name (nm) -- varchar(100)
    The name ascribed to an entity, which is normally used to identify it.
number (nb) -- varchar(50)
    A human-readable identifying value (e.g., part numbers or Social Security numbers).
code (cd) -- varchar(25)
    A short value identifying a classification or category.  These are human readable and should be avoided unless the implementation requires coding according to a standards guide.  For example, "CAT5" or "CAT6" are codes.
date (da) -- date
    Contains only the calendar date.
time (tm) -- datetime
    Contains only the time of day.
quantity (qt) -- float
    A numeric measure of something which uses some unit of measure, such as pounds, meters, or liters.
count (ct) -- int
    The number of things in the identified set.
amount (am) -- decimal(30,15)
    A financial quantity, expressed in some defined currency.
description (ds) -- varchar(500)
    A long text which is used to describe the entity to the user.
sequence (sq) -- int
    A value used to order items within a group.
timestamp (ts) -- datetime
    An audit value indicating the system time at which something occurred.
flag (fg) -- boolean
    A boolean value, used when an entity is or isn't something.  I prefer the term flag over "boolean" because that's what these columns do: they "flag" something about the entity.
By having a clearly defined, function-based set of attribute types, you've laid a foundation for building a system that is as self-documenting as possible. From this point, it's a matter of consistent usage and maintaining the definitions over time.
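Applied to the personnel table from earlier, a revised definition might look like this (the real columns behind pers_col1 and pers_col2 are lost to history, so these are guesses for illustration):

CREATE TABLE personnel
     ( personnel_id  INT            PRIMARY KEY
     , personnel_nm  VARCHAR(100)
     , hire_da       DATE
     , salary_am     DECIMAL(30,15)
     , active_fg     BOOLEAN
     );

No data dictionary required to guess what each column holds or how it may safely be combined with others.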

On a slightly different note:
When naming columns, also beware of homonyms and synonyms.  They can bite you in the behind if you're not careful: avoid them whenever possible and document every case where you can't.

Monday, January 5, 2015

Data Entropy, Part 3 -- Data isn't formatted.

If you use MS Outlook, then you are probably familiar with the way Outlook stores phone numbers: as separate Country Code, Area Code, Number, and Extension fields.  If you store these values separately, Outlook will display the numbers in a standard format.  However, nothing in the system prevents you from entering a number in a non-standard format; Outlook simply throws the entire thing into the "Number" field and displays it as you entered it.  What this illustrates is that formats are, and should be kept, separate from the data they display.

When data gets shared, the format in which it is stored can cause confusion or even errors.  Take, for example, a field that stores the amount of storage used by files on your hard drive.  Given the wide variance in file sizes, we have grown accustomed to seeing the units appended to the end, like 45GB or 3.5TB.  But if you stored the values "45GB" and "3.5TB" in your database, how would you add them up?  How would you total the amount of storage used by .gif files versus the amount used by .mp3s?  The logic required becomes mind-numbing, and maintaining that code means accounting for every way a user might enter the value.  The right solution is to decide on a unit of measure for the storage metric (personally, I like bytes in this case) and stick to it.  Then you can modify your display algorithms to provide the units.  Additionally, you have a standard value by which you can transfer the data to any system without worrying about how the information is formatted.
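As a sketch of that approach (the files table and its columns are hypothetical), storing bytes makes the totals trivial:

CREATE TABLE files
     ( file_id       INT           PRIMARY KEY
     , file_nm       VARCHAR(100)
     , extension_cd  VARCHAR(25)
     , size_qt       BIGINT        -- always bytes; convert to GB/TB only for display
     );

-- Totaling storage by file type is now a simple aggregate
SELECT extension_cd
     , SUM(size_qt) AS total_size_qt
  FROM files
 GROUP BY extension_cd;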

On a slightly different note:
A group of colleagues and I started reminiscing about backup media and what it took to store all those floppies/CDs/DVDs "back in the day".  So, I decided to take a look back and see what it would take to store 1 terabyte of information using some common media from our past.  Here are the results.

Media Type               Capacity   # to store 1 TB
5-1/4 in floppy disk     360 KB     2,982,617
3.5 in HD floppy disk    1.44 MB    728,178
8 in floppy disk         6.2 MB     169,126
SuperDisk LS-120         120 MB     8,739
SuperDisk LS-240         240 MB     4,370
CD-ROM                   650 MB     1,614
Zip Disk                 750 MB     1,399
Jaz Disk                 1 GB       1,024
DVD (2-layer)            9.4 GB     109

Saturday, May 10, 2014

Data Entropy, Part 2 -- Design early and often

"Just build it.  If there are problems, we'll fix it later."

I can't tell you the number of times I've heard project managers tell me that.  Nobody seems to want to spend time on design, and those who are willing to spend time on application design don't seem to think that designing the data structure is important.  "We're going to handle that in the application" is often the refrain.  If some application coders had their way, the database would consist of a table for each screen, with nothing but variable-character columns named col_1, col_2, and so on.

The truth is that data matters.  So why wouldn't you spend time planning how you will store it, manage changes to its structure, and account for future growth, not just in size but in structure as well?

Design Early

I start every design with the same standard, normalized, logical schema.  It consists of 7 top-level entity types ("Activity", "Agreement", "Asset", "Datum", "Event", "Location", and "Party") and relationships for each intersection of two entities, for example, "Activity_Location".  I have yet to encounter an entity I could not model as a subtype of one of these 7 top-level entities or as a relationship between them.  This allows for very flexible designs while maintaining a simple set of data definitions. Additionally, it simplifies the process of extending the data structures downstream, since there is already a general case for any conceivable data element.

Of course, the logical model has to be denormalized to implement an efficient physical database.  Subtyping from the standard schema above, I find the most common denormalization is moving referential columns into one of the associated entity tables for 1:n or 1:1 relationships.  In the normalized schema above, even these relationships are modeled as subtypes of the intersection tables.
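As an illustrative sketch (entity and column names invented for the example), here is the same relationship in both forms, assuming each Party has exactly one Location:

-- Normalized: even the 1:n relationship is a subtype of the intersection table
CREATE TABLE party_location
     ( party_id     INT
     , location_id  INT
     , PRIMARY KEY (party_id, location_id)
     );

-- Denormalized: the referential column moves into the associated entity table
CREATE TABLE party
     ( party_id     INT  PRIMARY KEY
     , location_id  INT  -- was a row in party_location
     );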

... and Often

A common mistake made at this point is to abandon the logical model and focus entirely on the physical database.  The problem is that by doing this you lose all the advantages of having started with a standard model.  Each change leads to greater variance from the original design and increases the real and potential data entropy of the system.

The way this should be handled is to revisit the logical model for every proposed change to the database.  In this way, the designer can identify the correct entity that needs to be changed, extended, or subtyped, and how it maps into the physical database.

A trivial example of this exists in my Bird Dog application.  One of the things I track is my relationship to the people in the application.  Logically, this is a relationship between User (a subtype of Asset) and Person (a subtype of Party) and would normally be a many-to-many relationship.  However, I am currently the only user, so I took a shortcut and implemented the reference as a column in the Party table; I don't even have a User entity.  Eventually, I will need to implement the actual m:n relationship so that the system can support multiple users.  A developer who failed to consult the logical model would likely ignore the existing "party_current_user_relationship_type_name" column entirely and create new tables for User and the intersection between User and Party.  Initially, the data in the old column would be duplicated in the new table, and as time passed it would go stale and conflict with the new intersection table.  A developer who consulted the logical model would know to use the data in the column to populate the intersection table and then remove the column, avoiding both data entropy problems.
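A hedged sketch of that migration (the user_party table and its columns are my invention; only party_current_user_relationship_type_name comes from the actual application):

-- Create the intersection table for the true m:n relationship
CREATE TABLE user_party
     ( user_id                INT
     , party_id               INT
     , relationship_type_nm   VARCHAR(100)
     , PRIMARY KEY (user_id, party_id)
     );

-- Seed it from the shortcut column instead of abandoning that data
INSERT INTO user_party (user_id, party_id, relationship_type_nm)
SELECT 1   -- the id of the single current user
     , party_id
     , party_current_user_relationship_type_name
  FROM party
 WHERE party_current_user_relationship_type_name IS NOT NULL;

-- Then drop the shortcut column so it can never go stale
ALTER TABLE party DROP COLUMN party_current_user_relationship_type_name;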

On a slightly different note:
I was going to create a SQL script to implement the top level entities and add it to the Sample Code page.  But the fact is until they are subtyped, the attributes are extremely limited.  Instead, here are the definitions I use.
Activity -- Anything that takes place over a period of time.  Examples include tasks and meetings.
Agreement -- An accord among two or more Parties.  Examples include contracts and orders.
Asset -- Anything that has value or may be owned.  Examples include bank accounts, equipment, labor categories, and user accounts.
Datum -- Information that is independent of an entity.  Examples include documents, media files, and category and classification definitions.
Event -- Anything that occurs at a point in time.  Examples include financial transactions, state transitions, and status changes.
Location -- Where something may be found, sent, or received.  Examples include coordinates, property addresses, geopolitical entities, URLs, and file system locations.
Party -- Anything that can take action, enter into an agreement, or possess assets.  Examples include people and companies.

Friday, May 9, 2014

Data Entropy, Part 1 -- Introduction

When I started this post, it was supposed to be about a page long.  However, I've realized there is a lot more to cover than a single page will allow; so, I'm breaking it up.

Over the last week, I have spent between 10 and 20 hours cleaning up my contacts list, and I still have to put the corrected data back into my phone and email accounts. I have fewer than 1,000 entries, but when there are no data entry standards or controls, things get pretty messed up. For example, I have all of the following in the Country field: "Canada", "Great Britain", "Israel", "RO", "United Kingdom", "United S", "United States", "United States of America", "United States ofAmerica", "US", "USA", "ww", and "CA".  That's 13 unique values but only 5 countries in which my contacts reside.  "ww" is not one of them, and I had to do an internet search to remember what "RO" meant.  Although a person would be able to tell what almost all of these entries mean, a machine will have a much harder time, especially if you are coding from scratch and not using an existing heuristic or fuzzy-match algorithm.  And the Country field was just one of several fields that had been populated inconsistently.  Just identifying which records were, in fact, duplicates and not merely similar took me nearly 2 hours.

The time I wasted fixing my data exemplifies an issue that has held my attention for nearly twenty years.  I call it "Data Entropy".  In thermodynamics, entropy is a measure of the energy that is not available for work.  In the case of data entropy, it is a measure of the difficulty of assembling the data into useful information.  Using my example above: if the useful information were how many of my contacts work in North America, relatively low data entropy would allow me to gather it by simply counting the entries whose Country field equaled "United States of America" or "Canada".  Since my contacts list contains more than two values meaning either "United States of America" or "Canada", a more complex algorithm has to be employed to ensure those records are included in the count.
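In SQL terms (assuming a hypothetical contacts table), the difference looks like this:

-- Low data entropy: one clean value per country
SELECT COUNT(*)
  FROM contacts
 WHERE country_nm IN ('United States of America', 'Canada');

-- High data entropy: every variant ever entered has to be enumerated,
-- and any variant I miss silently undercounts
SELECT COUNT(*)
  FROM contacts
 WHERE country_nm IN ('United States of America', 'United States', 'United S',
                      'US', 'USA', 'United States ofAmerica', 'Canada', 'CA');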

This is not just an academic exercise.  There are real-world, expensive consequences for databases with a high level of data entropy.  When companies need to move data from one application to another, such as when migrating to a new customer service application or adding data to a data warehouse, one of the most time-consuming and complex challenges is cleaning up or transforming the data so that it is useful in the new environment.  The effort required can cost millions of dollars for very large or complex data sets.

Like entropy in physics, the entropy of a data system will never decrease without external intervention (like manually cleaning up my contact list).  More often than not, any database that is used regularly for a significant period of time will suffer a continuing increase in its data entropy.  My intention is to illustrate what factors contribute to increasing data entropy and what can be done to mitigate them.

I am 99.9999% certain the following list will change, but here are some of the items I plan to address:
Designing the data structure early and often
Keeping formats out of the stored data
Matching data types to the data
Defining and enforcing constraints

On a slightly different note:
I did manage to write VBA code that extracts my contacts information from Outlook and writes it into my Orion spreadsheet.  If you're interested, I've added the code to my sample page.

Friday, April 18, 2014

Less is better

Once I got testing working, I turned to log generation. Nothing helps debugging more than log files. Apache just put out a release candidate for version 2 of its Log4j utility. I've used version 1 in the past and, of course, "new is always better".

Unlike my experience with v1, I decided to actually learn how the utility is supposed to work, so I could create real log files, and actually manage the output with valid configuration files.  I read the introduction and implementation documentation.  I figured I was ready to try it out.

I downloaded the zip file, unzipped it into my shared library folder, created a library in NetBeans that included all the jar files, linked it into my project, wrote some test code into my Java, and ...


Null Test:  Caused an ERROR
org/slf4j/ILoggerFactory
java.lang.NoClassDefFoundError: org/slf4j/ILoggerFactory
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
Test org.looman.birddog.data.DataEntityTypeTest FAILED

What!!!  I did everything the way I was told.  Back to the internet for help.  It turns out the documentation for Apache Log4j 2 doesn't include information on adding the jar files to your classpath.  I guess they assume that if you're using it, you must be advanced enough to know how to put jar files into your projects properly.

UGGGHHHHH!!!!!

More internet.  By far the best (and by "best" I mean "worst") answer I found was this gem:  "You have to provide one of the various SLF4J implementation .jar files in the classpath, as well as the interface .jar file. This is documented."  by the user EJP.

Documented?  Really?  Where?  What implementation .jar files are available?  No links were provided.  That was the whole answer.  And the really surprising thing is this answer got four upvotes!!

Finally, I found this post by SContributor.  The solution is that only two jar files are needed:  log4j-api-2.0-{version}.jar and log4j-core-2.0-{version}.jar.  The {version} part refers to the specific build of the jar file.  In my case, I have release candidate 1, so the file names are log4j-api-2.0-rc1.jar and log4j-core-2.0-rc1.jar.  SContributor's solution referenced the Beta 9 version, so the file names were slightly different.  (Log4j is about to move from Release Candidate to General Availability, so your downloaded files might be slightly different.)  I removed all the other jar files from the library and, lo and behold, it worked beautifully!

Sometimes including everything is not only overkill, but it kills your application altogether.

On a slightly different note:
It took me a bit to get Log4j to find my configuration file.  The solution:  Using the project properties, make sure the folder containing your configuration file is included in the list of source package folders.

Friday, April 11, 2014

Making Progress

In my last post, I lamented how difficult it is to unit test the persistence layer.  My problem revolved around creating a valid EntityManager.  Today, I revisited Joerg Gross' DevBlog entry "JPA Unit test setup with NetBeans". This time, I worked through each step of his sample code and managed to get my test case to compile and successfully instantiate an EntityManager.

But what was even more useful was that the failures in instantiating the EntityManager clearly identified several issues with my Entity classes.  Now that I've finally got the EntityManager working, I can move on to actually building test cases.  Stay tuned.

In my implementation, I did two things differently from Mr. Gross.  First, I don't have a web tier, so the reference to /WEB-INF/classes/ in the build file isn't relevant.  Also, I didn't create an orm.xml file. 

Here is the code as I implemented it:

In the Test class:
  @BeforeClass
  public static void setUpClass() {
    /* Code adapted from
     *  Author: Joerg Gross
     *  Title: DevBlog "JPA Unit test setup with NetBeans"
     *  URL: http://eercp.blogspot.com/2011/12/ide-setup-with-netbeans.html
     */
    Properties props = new Properties();
    props.put("javax.persistence.transactionType", "RESOURCE_LOCAL");
    props.put("javax.persistence.jtaDataSource", "");
    props.put("javax.persistence.nonJtaDataSource", "");
    props.put("eclipselink.jdbc.driver", "org.apache.derby.jdbc.EmbeddedDriver");
    props.put("eclipselink.jdbc.url", "jdbc:derby:testDB;create=true");
    props.put("eclipselink.jdbc.user", "");
    props.put("eclipselink.jdbc.password", "");
    props.put("eclipselink.ddl-generation", "drop-and-create-tables");
    props.put("eclipselink.logging.level", "INFO");

    try {
      // em is the test class's static EntityManager field (declared elsewhere)
      em = Persistence.createEntityManagerFactory("BirdDogDataPU", props)
                      .createEntityManager();
    } catch (PersistenceUnitLoadingException e) {
      // surface why the persistence unit failed to load, then rethrow
      System.out.println(e.getMessage());
      throw e;
    }
  }

In the build-impl.xml file, I added Mr. Gross's snippet (minus the /WEB-INF/classes/ reference); the XML didn't survive posting here, so see his entry linked above for the original.


On a slightly different note:
Does anyone else ever enter their password into the username field? About 20% of the time, my login pages will open without the username auto-populated and I automatically start typing my password, oblivious to the fact my cursor is in the wrong place. Have you ever done this?  No, just me. Yeah, I figured as much.