Monday, January 5, 2015

Data Entropy, Part 3 -- Data isn't formatted.

If you use MS Outlook, then you are probably familiar with the way in which Outlook stores phone numbers; that is as separate Country Code, Area Code, Number and Extension.  If you make sure you store these values separately, then Outlook will display the numbers in a standard format.  However, there is nothing in the system that will prevent you from storing the numbers in a non-standard format.  Outlook simply throws the entire thing into the "Number" field and displays it as you entered.  What this illustrates is the fact that formats are and should be kept separate from the data which they display.

When data gets shared, the format in which it is stored can cause confusion or even errors.  Take for example a field that stores the amount of available storage used by files on your harddrive.  Given the wide variance in file sizes, we have grown accustomed to seeing the units appended to the end, like 45GB or 3.5TB.  But if you stored those values "45GB" and "3.5TB" in your database, how would you start adding them up?  How would you be able to total the amount of storage used by .gif files versus the amount used by .mp3s?  The logic required becomes mind numbing and maintaining that code requires considering all of the ways in which a user may enter the value.  The right solution is to decide on a unit of measure for the storage metric, (Personally, I like bytes in this case) and stick to it.  Then you can modify your display algorithms to provide units.  Additionally, you have a standard value by which you can transfer the data to any system without worrying about how the information is formatted.

On a slightly different note:
A group of colleagues and I started reminiscing about backup media and what it took to store all those floppies/CDs/DVDs "back in the day".  So, I decided to take a look back and see what it would take to store 1 terabyte of information using some common media from our past.  Here are the results.

Media Type Capacity # to store 1TB
5-1/4 In floppy disk 360 KB 2,982,617
3.5 in HD floppy disk 1.44 MB 728,178
8 in floppy disk 6.2 MB 169,126
SuperDisk - LS120 120 MB 8,739
SuperDisk - LS240 240 MB 4,370
CD-ROM 650 MB 1,614
Zip Disk 750 MB 1,399
Jaz Disk 1 GB 1,024
DVD (2-Layer) 9.4 GB 109

Saturday, May 10, 2014

Data Entropy, Part 2 -- Design early and often

"Just build it.  If there are problems, we'll fix it later."

I can't tell  you the number of times I've heard project managers tell me that.  Nobody seems to want to spend time on design. And those who are willing to spend time on application design don't seem to think that designing the data structure is important.  "We're going to handle that in the application." is often the refrain.  If some application coders had their way, the database would consist of a table for each screen, with nothing but variable character columns named col_1, col_2 and so on.

The truth is that data matters.  So why wouldn't you spend time planning how you will store it, manage changes to the structure and account for future growth, not just in size but in structure as well.

Design Early

I start every design with the same, standard, normalized, logical schema.  It consists of 7 top level entity types ("Activity", "Agreement", "Asset", "Data", "Event", "Location", and "Party") and relationships for each intersection of two entities, for example, "Activity_Location".  I have yet to encounter an entity I could not model as a subtype of one of these 7 top level entities or a relationship between them.  This allows for very flexible designs, while maintaining a simple set of data definitions. Additionally, it simplifies the process of extending the data structures down stream, since there is already a general case for any conceivable data element.

Of course, the logical model has to be denormalized to implement an efficient physical database.  Subtyping from the standard schema above, I find the most common denormalization is moving referential columns into one of the associated entity tables for 1:n or 1:1 relationships.  In the normalized schema above, even these relationships are modeled as subtypes of the intersection tables.

... and Often

A common mistake made at this point is to abandon the logical model and focus entirely on the physical database.  The problem is that by doing this you lose all the advantages of having started with a standard model.  More changes will lead to greater variance from the original design and increases inthe real and potential data entropy of the system.

The way this should be handled, is to revisit the logical model for every proposed change to the database.  In this way, the designer can identify the correct entity which needs to be changed, extended, or subtyped and how it is mapped into the physical database.

A trivial example of this exists in my Bird Dog application.  One of the things I am tracking is my relationship to the people in the application.  Logically, this is a relationship between User (a subtype of Asset) and Person (subtype of Party) and would normally be a many to many relationship.  However, I am currently the only user so, I took a shortcut and implemented the reference as a column in the Party table.  I don't even have a User entity.  Eventually, I will need to implement the actual m:n relationship so that the system can support multiple users.  If the developer failed to consult the logical model design, it's likely that the existing "party_current_user_relationship_type_name" be completely ignored.  New tables for User and the intersection between User and Party created.  Initially, the data in the old column will be duplicated in the new table and as time passes the data becomes stale and conflicts with the new intersection table.  If the developers had consulted the logical model, they would have known to use the data in the column to populate the intersection table and to remove the column.  Doing this would have avoided both of the data entropy problems.

On a slightly different note:
I was going to create a SQL script to implement the top level entities and add it to the Sample Code page.  But the fact is until they are subtyped, the attributes are extremely limited.  Instead, here are the definitions I use.
Activity -- Anything that takes place over a period of time.  Examples include tasks and meetings
Agreement -- An accord among two or more Parties.  Examples include contracts and orders.
Asset -- Anything that has value or may be owned.  Examples include bank accounts, equipment, labor categories, user accounts
Datum -- Information that is independent of an entity.  Examples include documents, media files, and category and classification definitions.
Event -- Anything that occurs at a point in time.  Examples include financial transactions, state transitions and status changes.
Location -- Where something may be found, sent or received.  Examples include coordinates, property addresses, geopolitical entities, URLs and file system locations.
Party -- Anything that can take action, enter into an agreement or possess assets.  Examples include people and companies.

Friday, May 9, 2014

Data Entropy - Part 1 - Introduction

When I started this post, it was supposed to be about a page long.  However, I've realized there is a lot more to cover than a single page will allow; so, I'm breaking it up.

Over the last week, I have spent between 10 and 20 hours cleaning up my contacts list, and I still have to put the corrected data back into my phone and email accounts. I have less than 1000 entries, but when there are no data entry standards or controls, things get pretty messed up. For example, I have all of the following in the Country field: "Canada", "Great Britain", "Israel", "RO", "United Kingdom", "United S", "United States", "United States of America", "United States ofAmerica","US","USA","ww", and "CA".  There are 13 unique values but only 5 countries in which my contacts reside.  "ww" is not one of them and I had to do an internet search to remember what "RO" meant.  Although a person would be able to know what almost all of these different entries mean, a machine will have a much harder time, especially if you are coding from scratch and not using an existing heuristic or fuzzy match algorithm.  And the Country field was just one of several fields that had been populated inconsistently.  Just identifying which records were, in fact duplicates and not just similar took me nearly 2 hours.

The time I wasted fixing my data exemplifies an issue that has attracted my attention for nearly twenty years.  I call it "Data Entropy".  In thermodynamics, entropy is a measure of the energy that is not available for work.  In the case of data entropy, it is a measure of the difficulty associated with collecting the data into useful information.  Using my example above, if the useful information were how many of my contacts work in North America, a relatively low data entropy would allow the gathering of that information by simply counting the number of entries whose country field equaled "United States of America" or "Canada".  Since my contacts list contains more than two values which mean either "United States of America" or "Canada", a more complex algorithm has to be employed to ensure these records are included in the count.

This is not just an academic exercise.  There are real world, expensive consequences for database with a high level of data entropy.  When companies need to move data from one application to another, such as when migrating to a new customer service application or adding data to a data warehouse, one of the most time consuming and complex challenges is cleaning up or transforming the data in order for it to be useful in the new environment.  The effort required can cost millions of dollars for very large or complex data sets.

Like entropy in physics, the entropy of a data system will never decrease without external intervention (like manually cleaning up my contact list).  More often than not, any database which is used regularly for any significant period of time, will suffer a continuing increase in its data entropy.  My intention is to illustrate what factors contribute to increasing data entropy and what can be done to mitigate them.

I am 99.9999% certain the following list will change, but here are some of the items I plan to address.
  • Design early and often.
  • Data isn't formatted.
  • Oil and water don't mix, and neither do data types.
  • Murphy is a user.  If he can do it, he will.
  • Everyone goes their own way; and uses their own abbreviations
  • Does it really mean what you think it means?
  • Where is Springfield?  Context matters.
  • Time breaks all things. Meaning and use change over time
  • Mo' systems, Mo' trouble. Beware conflicting usages by different applications

On a slightly different note:
I did manage to write VBA code that extracts my contacts information from Outlook and writes it into my Orion spreadsheet.  If you're interested, I've added the code to my sample page.

Friday, April 18, 2014

Less is better

Once I got testing working, I turned to log generation. Nothing helps debug more than log files. Apache just put out a release candidate for version 2 of their Log4j utility. I've used version 1 in the past and, of course, "new is always better".

Unlike my experience with v1, I decided to actually learn how the utility is supposed to work, so I could create real log files, and actually manage the output with valid configuration files.  I read the introduction and implementation documentation.  I figured I was ready to try it out.

I downloaded the zip file, unzipped it into my shared library folder.  Created a library in NetBeans included all the jar files, linked them into my project, wrote some test code into my java and ...

Null Test:  Caused an ERROR
java.lang.NoClassDefFoundError: org/slf4j/ILoggerFactory
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(

What!!!  I did everything the way I was told.  Back to the internet for help.  Turns out the documentation for Apache Log4j 2 doesn't include information on implementing the jar files into your classpath.  I guess they assume if you're using it, you must be advanced enough to know how to put jar files into your projects properly.


More internet.  By far the best (and by "best" I means "worst") answer I found was this gem:  "You have to provide one of the various SLF4J implementation .jar files in the classpath, as well as the interface .jar file. This is documented."  by the user EJP

Documented?  Really?  Where?  What implementation .jar files are available?  No links were provided.  That was the whole answer.  And the really surprising thing is this answer got four upvotes!!

Finally, I found this post by SContributor.  The solution is that only two jar files are needed:  log4j-api.2.0-{version}.jar and log4j-core-2.0-{version}.jar.  The {version} may be different, as it refers to the unique version of the jar file.  In my case, I have release candidate 1, so the file names are log4j-api.2.0-rc1.jar and log4j-core-2.0-rc1.jar.  SContributor's solution referenced the Beta 9 version, so the file name was slightly different.  (Log4j is about to move from Release Candidate to General Availability, so your downloaded files might be slightly different).  I removed all the other jar files from the library and lo and behold, it worked beautifully!

Sometimes including everything is not only overkill, but it kills your application altogether.

On a slightly different note:
It took me a bit to get Log4j to find my configuration file.  The solution:  Using the project properties, make sure the folder containing your configuration file is included in the list of source package folders.

Friday, April 11, 2014

Making Progress

In my last post, I lamented how difficult it is to unit test the Persistence layer.  My problem revolves around creation of a valid EntityManager.  Today, I revisited Joerg Gross' DevBlog entry "JPA Unit test setup with NetBeans". This time, I worked through each step of his sample code and managed to get my test case to compile and successfully instantiate an EntityManager.

But what was even more useful, was that the failures in instantiating the EntityManager clearly identified several issues with my Entity classes.  Now that I've finally got the EntityManager working, I can move on to actually building test cases.  Stay tuned.

In my implementation, I did two things differently from Mr. Gross.  First, I don't have a web tier, so the reference to /WEB-INF/classes/ in the build file isn't relevant.  Also, I didn't create an orm.xml file. 

Here is the code as I implemented it:

In the Test class:
  public static void setUpClass() {
    /* Code adapted from
     *  Author: Joerg Gross
     *  Title: DevBlog "JPA Unit test setup with NetBeans"
     *  URL:
    Properties props = new Properties();
    props.put("javax.persistence.transactionType", "RESOURCE_LOCAL");
    props.put("javax.persistence.jtaDataSource", "");
    props.put("javax.persistence.nonJtaDataSource", "");
    props.put("eclipselink.jdbc.driver", "org.apache.derby.jdbc.EmbeddedDriver");
    props.put("eclipselink.jdbc.url", "jdbc:derby:testDB;create=true");
    props.put("eclipselink.jdbc.user", "");
    props.put("eclipselink.jdbc.password", "");
    props.put("eclipselink.ddl-generation", "drop-and-create-tables");
    props.put("eclipselink.logging.level", "INFO");

    em = Persistence.createEntityManagerFactory("BirdDogDataPU",props).createEntityManager();
    } catch(PersistenceUnitLoadingException e){
      throw e;


In the build-impl.xml file:

On a slightly different note:
Does anyone else ever enter their password into the username field? About 20% of the time, my login pages will open without the username auto-populated and I automatically start typing my password, oblivious to the fact my cursor is in the wrong place. Have you ever done this?  No, just me. Yeah, I figured as much.

Monday, April 7, 2014

Unit Testing JPA is Painful!!!

So, I decided since I'm rebuilding my BirdDog app from scratch, I'd take the time at each tier to implement thorough unit testing.  Should be easy enough, right.  Not so much.  First, there isn't any automatic help in NetBeans (or anywhere else it seems) for creating the necessary persistence objects in order to test the entity classes.  It is easy to create the JUnit Test classes, though.

So, I went in search of some help.  So far, I've found a few promising web pages.  I had a few bumps, but one thing remains broken.  I can't get the EntityManager to create properly.  UGH!!!!

Here is my current error:
Could not initialize class

PersistenceHelper is a static class for setting up a test database in Derby and creating the EntityManager.  Here's the code that's failing:

  private static final EntityManager entityManager;
  static {
    Properties props = new Properties();
    props.put("javax.persistence.transactionType", "RESOURCE_LOCAL");
    props.put("javax.persistence.nonJtaDataSource", "");
    props.put("eclipselink.jdbc.driver", "org.apache.derby.jdbc.EmbeddedDriver");
    props.put("eclipselink.jdbc.url", "jdbc:derby:testDB;create=true");
    props.put("eclipselink.jdbc.user", "");
    props.put("eclipselink.ddl-generation", "drop-and-create-tables");
    props.put("eclipselink.logging.level", "INFO");
    entityManager = Persistence.createEntityManagerFactory("BirdDogDataPU").createEntityManager();

[EDIT 11-Apr-2014: I found a great source for formatting the code to make it more readable as code.  This is an implementation of SyntaxHighlighter by Alex Gorbatchev. David Craft provided easy to follow instructions for using this on his blog, CraftyFella's Blog.]

These are the two pages I found most useful.
Unit Testing your Persistence Tier code - By Neil Buesing on Nov. 9 2010
DevBlog: JPA Unit test setup with NetBeans - Joerg Gross

I'm done for the night.  Hopefully, the light will shine on soon.

On a slightly different note:
If anyone is following this, I have a question for you.  Would it be beneficial for me to upload my code into SourceForge?  Personally, I don't need the version control, etc.  However, if people are interested in seeing the code or helping me learn, then I'll go through the trouble of creating a project.  You have to reply in the comments and let me know you'd like to see this.

Good night.

Thursday, April 3, 2014

Upgrading to NetBeans 8

I upgraded to NetBeans 8 a couple days ago and so far, I am really enjoying the changes.  One of my favorites is not well advertised and probably won't appear on anyone else's Top New Features List.

The refactoring in variable renaming is much improved.  First, when you rename an inner variable, you have the option of renaming all the associated Getters and Setters.  That alone is a significant time saver for me.

The second feature I've noticed in refactoring variable names happens in JPA Entity classes.  Relationships between classes depend on the inner variable names.  In the past when I changed the names, I would often forget one or two of the related class references and my code would fail.  Then I would spend hours trying to find the problem, because the log files weren't very clear on the problem.  Now, when you refactor a variable name in one class, the "MappedBy" value in the related class is updated automatically, removing one of my most common nemeses in debugging.

If you're interested in the new NetBeans IDE, you can download it here.  There are certainly enough new features to warrant the upgrade.  If you're new to NetBeans, I recommend it.  The learning curve is quick, better than Eclipse or JDeveloper, and the NetBeans Platform for developing Rich Client apps is a great time saver.