Friday, July 31, 2009

The Importance of Being Informative

If you are a public transit commuter in Washington DC, no doubt you are spending a fair portion of your time wondering about the state of our subway transit. I have been commuting via subway for a fair number of years now, and during that time I have often made observations about the dynamics of our subway, and how they relate to the design and architecture of applications that we build for the web.


It was not so long ago that WMATA installed status signs above the tracks in every subway station. As an example, I have included a photo of one of those signs.

This represents a big improvement over what had existed before, which was limited to a set of flashing lights at track level meant to warn you that a train was just arriving, but really nothing more than that. Without these signs, passengers would stand in the subway station without any clue as to when the next train would arrive.

As you can see, the color of the line (RED), the size of the train in cars (6), the direction (GLENMONT), and the arrival time (ARR) are presented on these status signs. This is a good thing because it helps newly arriving customers who enter the train station understand how long they have to wait before their train arrives.

But I submit that this is not good enough. As a matter of fact, with just a little more thought metro could have done much better. Here are all the things I think are actually wrong with the current implementation:
  1. These electronic dynamic signs are being used to present mostly static information. Over half of the sign is used to present the name of the line (RD) and the direction of the train (GLENMONT), but for a given platform on our subway this never really changes! The train station is littered with signs that tell you the direction of the train on the track, and the train itself tells you the final station, so having this information on an electronic sign is pretty much redundant. Displaying static information is not the best way to use a dynamic presentation sign.
  2. Although the "next train arriving" information is certainly relevant, it is not the most pertinent to a metro traveller. The status signs tell you how long you have to wait until the next train arrives, but is this really what you want to know? Or is there something more pertinent than this?
  3. The signs are in the wrong location. These signs provide important information about the state of the system, yet they are located far inside the metro. In most stations, most of these signs are not even visible from the areas where customers may purchase tickets.
So how could this have been done better? Well, let us look at the relevance issue first. The next train arrival information is good to have, but what I really care about is when I will arrive at my destination. When I arrive at a given metro station, I want to know whether the system is performing at peak capability for the route I want to travel. These signs give the traveler information that is really only important from WMATA's point of view, not from the passenger's.

So what would be a good way to represent this? What I am suggesting looks very much like the flight status boards you see in modern airport terminals, except instead of displaying departure times for trains leaving the station it should show the expected arrival times for all the stations in the system accessible from this track (direction) at this station.

And finally, the last change I would suggest is to make these signs viewable from the street entrances. Why? As a traveler I want to know the state of metro before I pay my fare. I should be able to decide whether I want to ride the train, take a taxi, or drive to my destination. This is very important because open systems with finite capacity (like subways and highways) cannot control the arrival patterns of new users. As a matter of course, systems like these should always provide real-time performance feedback to their end users. In the case of Washington DC's subway, this would allow the system to recover from performance issues more gracefully. When the system is overloaded, some customers would not enter the train station if they were made aware of issues at the gate. Taxi cab drivers would also be able to use this information to pick up passengers who, due to the current overloaded state of the subway system, would not want to rely on metro to get them to their destinations on time.

I have addressed the three aforementioned issues, but there is still one issue that I have not handled. It is a matter of money. Metro has already invested quite a bit of money in the signs they have, and money in this climate is hard to find. How could WMATA implement the improvements I am suggesting at minimal cost to the taxpayers of DC?

Well, one way to do this would be to reuse the existing signs. There is a problem with this approach. The Washington DC metro has over 80 metro stations. To show all 80 of those stations using the existing status boards would be a challenge to say the least. How could we present this amount of information given such a limitation in real estate?

What if WMATA used colored symbols (shapes) combined with a prioritized information set? One example of such a presentation could be to model this using a combination of popular station hops and system status colors. The picture below gives you an example of what one such sign would look like:


So instead of showing the end of the line information and the arrival time of the next train, we can show the expected arrival times for key stations in the system. What stations would qualify as key stations? This would depend on the station you are waiting in, but it would most often be transfer stations and popular exit points. So if you get on the Red line train at Takoma station and you are heading downtown during the morning rush hour, the message board should give you the expected arrival times for Takoma, Union Station, Gallery Place, and Metro Center. For the vast majority of travelers entering Takoma, I would expect that this would be more than enough to determine whether or not the system is working well. The color-coded symbols will tell you whether the expected arrival time is considered normal or abnormal.

In the sign above, the 7 minute delay between Gallery Place and Metro Center is considered below normal performance, so a red heptagon marks that segment as performing below expectations. The change in shape, not just color, is helpful for color-blind riders.
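
To make the idea a bit more concrete, here is a minimal sketch of the logic such a sign might run. Everything in it is an assumption of mine for illustration: the `StationEstimate` type, the travel times, and the 3 minute tolerance are made-up values, not WMATA data or an existing API.

```python
from dataclasses import dataclass


@dataclass
class StationEstimate:
    station: str
    normal_minutes: int    # typical travel time from this platform (made-up value)
    expected_minutes: int  # current estimate from this platform (made-up value)


def status(estimate, tolerance_minutes=3):
    """Classify an estimate as NORMAL or ABNORMAL.

    The tolerance is a hypothetical threshold; a real system would tune it.
    """
    delay = estimate.expected_minutes - estimate.normal_minutes
    return "ABNORMAL" if delay > tolerance_minutes else "NORMAL"


def render_sign(estimates, tolerance_minutes=3):
    """Build one line of sign text per key station."""
    lines = []
    for est in estimates:
        lines.append(
            f"{est.station:<14}{est.expected_minutes:>3} MIN  "
            f"{status(est, tolerance_minutes)}"
        )
    return lines


# Hypothetical key stations for a rider boarding at Takoma.
EXAMPLE = [
    StationEstimate("UNION STATION", normal_minutes=14, expected_minutes=14),
    StationEstimate("GALLERY PLACE", normal_minutes=18, expected_minutes=18),
    StationEstimate("METRO CENTER", normal_minutes=20, expected_minutes=27),
]

for line in render_sign(EXAMPLE):
    print(line)
```

On a real sign, the ABNORMAL label would be drawn as the red heptagon described above, and only the handful of key stations for that platform would need to fit on the board.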

In many ways, the information sharing challenges experienced by Washington DC's metro system are similar to many of the issues we architects experience in creating high performance web sites. Displaying accurate system status to end users is a design factor too often left out. This blog posting addresses this mistake in a highly used and very public transit system, but the techniques used to solve those problems can be directly transferred to similar situations on the Web.

Monday, July 27, 2009

Why Do We Continue to Join?

Just recently my wife and I built a new walk-in closet. When we were finished, in preparing to use the new space, I thought about how my clothes used to be packed and realized there were a lot of inefficiencies. As an example, I used to keep my boxers in a separate drawer from my undershirts. I had been doing this since my childhood. It meant that every day that I got ready for work I would open one drawer to get a pair of boxers, then open another drawer to fetch my undershirt. Thinking about it a bit, it seemed to be such a waste of time to keep them in separate drawers. Why didn't I just put my boxers and undershirts in the same drawer?

So that, along with other improvements, is exactly what I did. Now getting dressed in the morning is a little bit easier for me.

This is what I want to talk about today: revisiting old data schema arrangements and asking ourselves whether they are still the best schema for our product.

Most database schemas backing traditional CRM applications today are designed so that data entities belonging to the same customer record are stored in different tables. To bring those entities together into the entire customer record, these CRM applications rely heavily on relational database joins. This would make a lot of sense if these records were normally retrieved in parts, but more and more this is not the case.
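
As a small illustration of the pattern I am describing, here is a sketch using Python's built-in `sqlite3` module. The table names, columns, and sample data are hypothetical and far simpler than any real CRM schema; the point is only that one customer record lives in three tables and has to be joined back together on every read.

```python
import sqlite3

# In-memory database with a deliberately tiny, hypothetical CRM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address  (customer_id INTEGER REFERENCES customer(id), city TEXT);
    CREATE TABLE phone    (customer_id INTEGER REFERENCES customer(id), number TEXT);
    INSERT INTO customer VALUES (1, 'Ada Example');
    INSERT INTO address  VALUES (1, 'Washington');
    INSERT INTO phone    VALUES (1, '555-0100');
""")


def load_customer(conn, customer_id):
    """Reassemble a whole customer record from its separate tables."""
    row = conn.execute(
        """
        SELECT c.id, c.name, a.city, p.number
        FROM customer c
        JOIN address a ON a.customer_id = c.id
        JOIN phone   p ON p.customer_id = c.id
        WHERE c.id = ?
        """,
        (customer_id,),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "name": row[1], "city": row[2], "phone": row[3]}


print(load_customer(conn, 1))
```

Every call repeats the same joins, even though the caller always wants the record as a whole.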

Instead, the business expects us to retrieve customer records in whole. Now there is talk of the 360 degree view of the customer. Although there are varying opinions of what this means, it is generally understood to mean that all information about the customer should be retrievable via the customer's unique identification number. If you think this sounds like a hashtable key-value relationship, I will have to agree with you.

So why do we continue to store data in this manner? I believe that one word gives us the answer, and that word is "tradition". We have an entire generation of programmers, relational database administrators, and technical managers who believe that "good" database design always incorporates normalized entities that are connected via foreign keys to other entities. They think this without giving much thought to the way the data they are storing will actually be retrieved.

Should we not instead revisit the notion that customer records need to be separated into different data entities? I would submit that once this examination is completed, it will become apparent that relational schemas make little or no sense for the operational stores of modern CRM applications.

Does this mean that there is no place for relational databases? Of course not! Relational databases are great for creating reports and performing ad-hoc data analytics. Key value stores, as they are designed today, would be painful at best for reporting. But for the types of canned data requests that modern day call centers produce, a key value store is tough to beat.

Instead, what this does mean is that more consideration towards using key value stores in the enterprise is needed. There are several that are becoming popular. MemCache makes a lot of sense when caching joined information by key, as does EHCache. Using this approach, the joined information can be stored in a persistent relational database, but once the expensive join is completed, the now flattened information from that join can be stored in memory and retrieved via any number of keys. Amazon's S3 remains a viable option, as do open source distributed key stores like Voldemort, Cassandra, and CouchDB. Finally, Oracle's Berkeley DB bears some looking into as well.
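
The caching approach can be sketched in a few lines. This is only an illustration under my own assumptions: a plain Python dict stands in for a real cache such as the ones named above, and `slow_joined_lookup` is a hypothetical placeholder for the expensive relational join.

```python
class CustomerCache:
    """Keep flattened customer records in memory, keyed by customer id."""

    def __init__(self, load_from_database):
        self._load = load_from_database  # called only on a cache miss
        self._records = {}               # stand-in for a real key value store
        self.misses = 0

    def get(self, customer_id):
        if customer_id not in self._records:
            # Pay for the expensive join once, then keep the flattened result.
            self.misses += 1
            self._records[customer_id] = self._load(customer_id)
        return self._records[customer_id]


def slow_joined_lookup(customer_id):
    # Hypothetical placeholder for the multi-table join against the
    # persistent relational database.
    return {"id": customer_id, "name": "Ada Example", "city": "Washington"}


cache = CustomerCache(slow_joined_lookup)
first = cache.get(1)   # miss: runs the join
second = cache.get(1)  # hit: served from memory
print(cache.misses)
```

A production setup would also need to expire or invalidate entries when the underlying rows change, which this sketch leaves out.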

Wednesday, July 22, 2009

Latest Hannibal Release Sports Persistence via Amazon's S3 Key Store!

Now you can use S3 as a persistent store when using Hannibal. The wiki has all the details.

Hannibal Version 0.40 Release 593 just Uploaded.

This release contains numerous bug fixes as well as support for Amazon's S3 persistence. In addition, developers may now use Hannibal's S3Realm.

Here are the release notes.