Reducing SQL Lookup Tables and Function Properties in NHibernate
One of the points made in Jeffrey's agile development boot camp last week that struck a chord with me was that many database-centric designs have lookup tables that aren't really just data. That is, tables that hold values such as statuses that, if modified, can easily break existing code or won't function as expected without the addition of code. We have a lot of these in our codebase, and we've been struggling with the questions of how to address these values in our system. Do we use Enums? If so, do we try and keep the enums in sync with the database values? If not, do we want to use magic numbers/strings to refer to these codes? It's a real PITA to work with these things and really the only reason they're in the database at all is so we have some referential integrity locking down the allowable values of the status column(s).
You can avoid this problem if you push the status logic into objects and don't rely on the database to implement referential integrity for you. Create a Status class that has all of the logic necessary for rendering a status code in a variety of ways (for instance a short code, a long name, and a description, all of which are difficult to implement on Enums without attributes or other extension approaches). Have your primary class reference the status via composition (e.g. a property). Then, in your persistence layer, you can store just the short code value (or integer key) of the class, but without the requirement that there be a lookup table involved. If referential integrity is of great importance, a constraint can easily be applied to the database. The constraint would then be updated with the build at the same time as new values are added to the Status class. The status can also be used similarly to an Enum assuming it has some static properties representing its value instances, like so:
1: public class WorkOrderStatus
2: {3: public static readonly WorkOrderStatus Assigned = new WorkOrderStatus(
4: "ASD", "Assigned", "Assigned", 2, true);
5: 6: public static readonly WorkOrderStatus Cancelled = new WorkOrderStatus(
7: "CAN", "Cancelled", "Cancelled", 5, false);
8: 9: public static readonly WorkOrderStatus Complete = new WorkOrderStatus(
10: "CMP", "Complete", "Complete", 4, false);
11: ... 12: }NHibernate can map this status directly if you create a WorkOrderStatusType class that inherits from NHibernate.Type.PrimitiveType and overrides its methods. The mapping then is simply:
1: <class name="WorkOrder">
2: <property name="Status"
3: type="WorkOrderStatusType, Infrastructure"
4: not-null="true" />
5: ...6: </class>
One very common pattern I've noticed as software evolves is the migration from simple boolean state codes on objects to multi-value states and eventually to the need for some kind of an event log. In the example above, perhaps originally we started with a WorkOrderStatus that was simply a boolean with states for Complete and Incomplete, and it has since moved on to include the states listed above (and several others). Now the customer wants to know when the status changed, and by whom, so they can track the WorkOrder through its workflow.
Typically this requires a bunch of changes in the database and some application logic to get the current status from the event log as the most recent one. Further, the WorkOrder might expose dates for LastUpdated and CompletedDate which should be tied to this same event log, though prior to adding the event log they were set in application code when these state transformations occurred. This is an example wherein NHibernate's support for functions can be very helpful (see also this post by Ayende). Assuming there is an Event table which has columns for BeginStatus and EndStatus as well was a DateTime column, date properties for the WorkOrder could be mapped like so:
1: <class name="WorkOrder">
2: <property name="LastAssignedDate" formula="(
3: select max(e.DateTime) 4: from Event e 5: where (e.WorkOrderId = Id)6: and (e.EndStatus = 'ASD'))"/>
7: <property name="LastUpdatedDate" formula="(
8: select max(e.DateTime) 9: from Event e 10: where (e.WorkOrderId = Id))"/>
11: ...12: </class>




Comments
Josh said on 06 Oct 2008 at 2:54 PM
We used enums instead of lookup tables for ALL of our statuses were I work right now. It becomes a maintenance nightmare since its easier to create an enum, then a database table you don't have to think if its a good idea or not. So you will get 5 enums that are pretty much the same thing, just with an s or something added to one of the values. Then you spend time shifting the enums around because you don't want to have a circular dependency on the assembly.
Also how do you handle outside reporting? Do you code a case statement into a column to get the statuses to the report. If you do that then you run a chance of one of them changing names and you have to change 100+ reports.
I've just been burned by this approach from a maintenance perspective, and it hurts years down the road.
Michael K. Campbell said on 06 Oct 2008 at 3:26 PM
I'm not sure I'd advocate this approach either. It completely violates 3rd normal form. (Not that 3rd normal form is a requirement from a religious perspective - but because it's designed to avoid the problem of having CAN, CANCELLED, Cancel, etc all represent (roughly) the same logical entity.)
And, as Josh mentions, reporting and access to your data by external systems then requires those systems to learn way too much about your business logic.
In other words, you're not getting rid of that pesky coupling that makes this whole affair so difficult - you're spreading it around into potentially multiple applications instead of centralizing it.
Or... worse, you're building lots of sprocs and UDFs that will do lookups on these text or int values to see what they SHOULD mean and then adding that extra overhead to any reporting or interfaces from additional systems that want to use your data.
Case in point... image in medical records system that went this route. 2 years down the road with tons of data loaded into it, the prospect of moving that data to another system would become so painful that it would cease to be an option. (Think of all the translation effort that would need to go through different types of values preserved in non-existant 'lookups'. It would get unweildy quite quickly.)
So, personally, I'd avoid an approach like this unless you were certain that you wouldn't need to access your data from external sources.
In my own experience, even though it takes more time and effort to code it, I've found that the best approach to tackling this issue is to store lookups/values in tables/SQL in the 'traditional' way, and then just lazy-load them into cache as needed. In other words... if I've got a list of different work order states, those get stored in the DB. If i add a new state, then that's added to my DB, and then pulled out when needed, and stored in cache. That way I'm not continuously thumping the DB, but my logic also doesn't break. (Though, I'm not addressing versioning or concurrency concerns at all in this overly-simple explanation....)
Jeffrey Palermo said on 06 Oct 2008 at 7:31 PM
I've had lots of luck with this approach for a transactional database schema. One system owns the schema, so the schema isn't shared. A reporting schema would be denormalized and have the display text of the status in the table it belongs to. We don't report off of our transactional schema. We also don't allow system integration directly with the database. We expose endpoints on the application or do periodic data dumps.
I can appreciate the perspective, and that's why there will never be only 1 way to write software.
Josh said on 07 Oct 2008 at 8:32 AM
Jeffrey,
How would you handle a case of a bulk insert? We have a case where a third party gives us completed transactions, we then need to make a record of it in our system. Using endpoints and the objects of our system to insert these records correctly causes a huge performance problem. So if we get 50,000 records to insert, it takes us 3 hours. When if we had a schema with look up tables and values that the system didn't hold for us, it would take 5 minutes to get that data loaded into our system. Now your system may not be the same, but going from database, to code, to database just to insert records has a middle man that causes some slowdown.
ssmith said on 07 Oct 2008 at 10:01 AM
@josh,
There are often tradeoffs that have to be made between "ideal" or "pure" architectural solutions and performance. Consider denormalizing for performance as a case in point. Typically it is best to start out with the ideal architecture and downgrade as necessary for performance than to prematurely optimize. In your case, if I were faced with the same scenario, I would default to going through my domain objects and taking advantage of their validation and other logic to do the work, but if that couldn't meet my performance needs, then I might go the direct SQL->SQL route, realizing that I'd be duplicating business logic in the process (most likely).
Chris Marisic said on 07 Oct 2008 at 5:00 PM
I prefer separating the need from ever talking to the database for code tables and declare everything in Enums and take advantage of ComponentModel
public enum EventSeverity
{
[Description("Event has faulted")] Error,
[Description("Event may require further inspection")] Warning,
[Description("Event failure critical, immediate attention is required")] Critical,
[Description("Event was successful")] Informational = 0
}
Then you can store the character code value, integer code value (if needed), and a meaningful expression regarding the code value. You will still manually have to keep this in sync with the database but it alleviates the need to ever do stupid calls to the database to look up descriptions for code values or creating sprocs to return code value / description lists to populate combo boxes.
Ali Ozgur said on 23 Jan 2009 at 3:23 AM
@Chris Marisic,
I agree with you, we only need to write good tests to catch the situation when the lookup table has a new record and our domian does not support the new value.