Soft delete is a commonly-used pattern amongst database-driven business applications. In my experience, however, it usually ends up causing more harm than good. Here’s a few reasons why it can fail in bigger applications, and some less-painful alternatives to consider.

Tomato, Tomato

I’ve seen a few different implementations of this pattern in action. First is the standard deleted flag to indicate an item should be ignored:

SELECT * FROM Product WHERE IsDeleted = 0

Another style uses meaningful status codes:

SELECT * FROM Task WHERE Status = 'Pending'

You can even give an item a fixed lifetime that starts and ends at a specific time (it might not have started yet):

SELECT * FROM Policy WHERE GETDATE() BETWEEN StartDate AND EndDate

Highway of death

All of these styles are all flavours of the same concept: instead of pulling dead or infrequently-used items out of active set, you simply mark them and change queries to step over the corpses at runtime.

This is a trade-off: soft delete columns are easy to implement, but incur a cost to query complexity and database performance later down the track.

Complexity

To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude them. It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows.

This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast. And it is background noise that detracts away from the real intention of the query.

Performance

At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance.

However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and does a good job at optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself.

Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisations day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of that, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical; rarely used and kept only to maintain foreign key integrity and for reporting purposes.

Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.

This table is becoming an increasing concern for us — it represents a major portion of the application, and with around a million new rows being added each year, the performance issues are only going to get worse.

Back to the original problem

The trouble with implementing soft delete via a column is that it simply doesn’t scale well for queries targeting multiple tables — we need a different strategy for larger data models.

Let’s take a step back and examine the reasons why you might want to implement soft deletes in a database. If you think about it, there really are only four categories:

  1. To provide an ‘undelete’ feature.
  2. Auditing.
  3. For soft create.
  4. To keep historical items.

Let’s look at each of these and explore what other options are available.

Soft delete to enable undo

Windows 7 Recycle Bin

Human error is inevitable, so it’s common practice to give users the ability to bring something back if they delete it by accident. But this functionality can be tricky to implement in a RDBMS, so first you need to ask an important question — do you really need it?

There are two styles I have encountered that are achieved via soft delete:

  1. There is an undelete feature available somewhere in the UI, or
  2. Undelete requires running commands directly against the database.

If there is an undo delete button available somewhere for users, then it is an important use case that needs to be factored into your code.

But if you’re just putting soft delete columns on out of habit, and undelete still requires a developer or DBA to run a command against the database to toggle the flags back, then this is a maintenance scenario, not a use case. Implementing it will take time and add significant complexity to your data model, and add very little benefit for end users, so why bother? It’s a pretty clear YAGNI violation — in the rare case you really do need to restore deleted rows, you can just get them from the previous night’s backup.

Otherwise, if there really is a requirement in your application for users to be able to undo deletes, there is already a well-known pattern specifically designed to take care of all your undo-related scenarios.

The memento pattern

Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require.

It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or rollback later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time:

If you want to restore one, you simply deserialize it and insert it back in. This is much cleaner than messing up your transactional tables with ghost items, and lets you handle all undo operations together using the same pattern.

Soft delete for auditing

Another common practice I have seen is using soft delete as a means of auditing: to keep a record of when an item was deleted, and who deleted it. Usually additional columns are added to store this information:

As with undo, you should ask a couple of questions before you implement soft delete for auditing reasons:

  • Is there a requirement to log when an item is deleted?
  • Is there a requirement to log any other significant application events?

In the past I have seen developers (myself included) automatically adding delete auditing columns like this as a convention, without questioning why it’s needed (aka cargo cult programming).

Deleting an object is only one event we might be interested in logging. If a rogue user performs some malicious acts in your application, updates could be just as destructive so we should know about those too.

One possible conclusion from this thought is that you simply log all DML operations on the table. You can do this pretty easily with triggers:

-- Log all DELETE operations on the Product table
CREATE TRIGGER tg_Product_Delete ON Product AFTER DELETE
AS
	INSERT INTO [Log]
	(
		[Timestamp],
		[Table],
		Command,
		ID
	)
	SELECT
		GETDATE(),
		'Product',
		'DELETE',
		ProductID
	FROM
		Deleted

CREATE TRIGGER tg_Product_Update ON Product AFTER UPDATE
AS
-- ...etc

Contextual logging

The Hangover

If something goes wrong in the application and we want to retrace the series of steps that led to it, CREATES, UPDATES and DELETES by themselves don’t really explain much of what the user was trying to achieve. Getting useful audit logs from DML statements alone is like trying to figure out what you did last night from just your credit card bill.

It would be more useful if the logs were expressed in the context of the use case, not just the database commands that resulted from it. For example, if you were tracking down a bug, would you rather read this:

[09:30:24] DELETE ProductCategory 142 13 dingwallr

… or this?

[09:30:24] Product 'iPhone 3GS' (#142) was removed from the catalog by user dingwallr. Categories: 'Smart Phones' (#13), 'Apple' (#15).

Logging at the row level simply cannot provide enough context to give a true picture of what the user is doing. Instead it should be done at a higher level where you know the full use-case (e.g. application services), and the logs should be kept out of the transactional database so they can be managed separately (e.g. rolling files for each month).

Soft create

The soft delete pattern can also be extended for item activation — instead of simply creating an item as part of the active set, you create it in an inactive state and flick a switch or set a date for when it should become active.

For example:

-- Get employees who haven't started work yet
SELECT * FROM Employee WHERE GETDATE() < StartDate

This pattern is most commonly seen in publishing systems like blogs and CMSs, but I’ve also seen it used as an important part of an ERP system for scheduling changes to policy before it comes into effect.

Soft delete to retain historical items

Raiders of the Lost Database

Many database-backed business applications are required to keep track of old items for historical purposes — so users can go back and see what the state of the business was six months ago, for example.

(Alternatively, historical data is kept because the developer can’t figure out how to delete something without breaking foreign key constraints, but this really amounts to the same thing.)

We need to keep this data somewhere, but it’s no longer immediately interesting to the application because either:

  • The item explicitly entered a dormant state – e.g. an expired eBay listing or deactivated Windows account, or
  • The item was implicitly dropped from the active set – e.g. my Google Calendar appointments from last week that I will probably never look at again

I haven’t included deleted items here because there are very few cases I can think of in business applications where data is simply deleted (unless it was entered in error) — usually it is just transformed from one state to another. Udi Dahan explains this well in his article Don’t Delete — Just’ Don’t:

Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.

Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.

Jobs aren’t deleted – they’re filled (or their requisition is revoked).

Unless your application is pure CRUD (e.g. data grids), these states most likely represent totally different use cases. For example, in a timesheet application, complete tasks may be used for invoicing purposes, while incomplete tasks comprise your todo list.

Different use cases, different query needs

Each use case has different query requirements, as the information the application is interested in depends on the context. To achieve optimal performance, the database should reflect this — instead of lumping differently-used sets of items together in one huge table with a flag or status code as a discriminator, consider splitting them up into separate tables.

For example, in our job board application, we might store open, expired and filled listings using a table-per-class strategy:

Physically separating job listings by state allows us to optimize them for different use cases — focusing on write performance for active items, and read performance for past ones, with different columns, indexes and levels of (de)normalization for each.

Isn’t this all overkill?

Probably. If you’re already using soft delete and haven’t had any problems then you don’t need to worry — soft delete was a sensible trade-off for your application that hasn’t caused any serious issues so far.

But if you’re anticipating growth, or already encountering scalability problems as dead bodies pile up in your database, you might like to look at alternatives that better satisfy your application’s requirements.

The truth is soft delete is a poor solution for most of the problems it promises to solve. Instead, focus on what you’re actually trying to achieve. Keep everything simple and follow these guidelines:

  • Primary transactional tables should only contain data that is valid and active right now.
  • Do you really need to be able to undo deletes? If so, there are dedicated patterns to handle this.
  • Audit logging at the row level sucks. Do it higher up where you know the full story.
  • If a row doesn’t apply yet, put it in a queue until it does.
  • Physically separate items in different states based on their query usage.

Above all, make sure you’re not gold plating tables with soft delete simply out of habit!

November 20th, 2009 | 22 Comments

A while ago I wrote a post how to lose traction on a personal software project, based on mistakes I have made myself in the past that have slowed down or even completely stopped progress. Today I want to share a tip that has greatly improved my time management since I’ve started doing it, and helped combat many “programmer’s block” moments I am notorious for.

Coming back to work

After you take a break from a coding project for a while, getting back into the groove can be difficult. Stopping for a while can be a great way to gain perspective and maybe reevaluate your goals of what the application should and shouldn’t be, but it can also be really hard starting again where you left off.

If you know you will be leaving for a while, and want to return later, you can try leaving a test failing, or some other obvious problem that needs fixing (like a syntax error), but these only keep you going in the short term. It can be difficult choosing the next big piece of functionality to work on.

Keeping an eye on the big picture

Have you ever been in the situation where, working on an application, you surround yourself with a forest of rich infrastructure code — configuration, data access, logging, UI widgets etc — and then stop and realise that’s all your application is: an empty shell that doesn’t do anything useful yet?

Or alternatively, do you ever find yourself getting lost in detail, giving unfair attention to making one little component perfect while neglecting the other 90% needed to be up and running for the alpha release?

All those little extra “clean up” tasks you want to do

While coding I’m always spotting things I want to clean up like refactorings, missing comments, source code formatting, etc. I want to finish my current task before starting something new (and can’t until I check in my current work anyway), but I do want to do these little clean ups at some stage.

All these problems really boil down to one simple question — what should I work on next?

Patrick Star's to-do list

The programmers’ to-do list

I have found from own personal experience that sitting down to cut code without a well-thought-out plan of attack is asking for trouble. As well as the examples above, I generally skip from task to task, dabbling in things I find interesting, but not gaining much real traction towards a useful application.

Over the past few months, I have discovered that keeping a detailed to-do list is a great tool for combating these problems and staying on track. The premise is very simple: take a high-level block of work — e.g. implementing a single user story — and break it down into all the little programming steps you need to do to get there, no matter how insignificant or obvious.

Here’s a dumb example snippet of a to-do list for a web app I am working on in my spare time (the whole thing runs about four pages total). You can see the ‘high level’ tasks get vaguer towards the end, because I haven’t planned them out yet, and right at the bottom there are unimportant cleanup tasks.

A snippet of my to-do list

The key here is to be fine-grained; you want to see all the little tasks that needs to be done, and then tick them off to see your progress. Even if they are always assumed — like validation — it’s like a calendar, only useful if you know all your appointments are in there. And it doesn’t really matter where your high-level tasks start, as long as together they add up to a program that will be useful to a user.

Defer little tasks

If you think of an easy little ‘clean up’ task you’d like to do, don’t do it now — just write it down instead. Why? Because:

  • If you do it now, it could sidetrack you from your current focus
  • It might be a low priority and a poor use of your time right now
  • Instead of doing it now, it could be an easy way to get back into the groove later

For example, if you’ve got half an hour free before going somewhere, and want to spend it on your application, you don’t want to be starting fresh on some giant new piece of work. Why not spend the minutes crossing off one or two of those little clean-up tasks you’ve been meaning to do?

Planning sessions

Alternatively, if you’re not in the mood to code, you could use the time as a planning session and try to think of the next big task you want to achieve, and all the small steps that make it up. This is just as helpful as cutting coding, and makes you think about the big picture.

To-do list software

I’m a geek, so naturally, I want some flash tool for managing my to-do list. There are a lot to-do list applications on the web like Todoist and Remember the Milk. I tried a few, but eventually just went back to a bulleted, indented Word document because I found it by far the quickest to work with.

July 13th, 2009 | 25 Comments

This Monday, we released the first of several stages in a 6-month upgrade project to a big ERP system. As with any software project, it was a big rush at the end, but particularly so this time because I was lead, and we were doing a lot of architectural improvements (which I will talk about here in a series of articles over the next few weeks).

Anyway, in the race for the deadline, I worked late nights and weekends for sixteen days in a row. The stress was fine, because I was really enjoying the work, but by the end, I found I could hardly hold a conversation with someone about anything other than the project itself. It was as if all my social skills had somehow been sapped — the mind was willing, but the tongue had nothing to talk about.

I’ve never experienced such a thing before, and it was pretty strange. It’s important not to neglect your social life when you’re working overtime to finish a project. It’s very easy to start saying no to parties and spending time with your friends, believing you’re doing a good thing by having early nights and recharging your batteries. My advice to you is DON’T. You need the distraction right now more than ever.

June 25th, 2009 | 9 Comments

CouchDB logo

Over the past couple of weeks I’ve been playing with Apache CouchDB. If you haven’t seen it yet, you should check it out — it’s a document database that stores data as a flat collection of JSON objects (aka documents) with no schema, no normalization, no indexes, and no JOINs.

Communication is via RESTful HTTP commands — for example:

GET http://127.0.0.1:5984/books/42

…for retrieving a document ID ‘42′ from a database named ‘books’. This would return the following JSON document:

{
  "_id":"some_doc_id",
  "_rev":"946B7D1C",
  "Title":"Harry Potter",
  "Author":{
    "Initials":"JK",
    "Surname":"Rowling"
  },
  "Pages":"436",
  "Tags":["wizards", "children", "magic"]
}

It sounds pretty crazy if you’re used to big relational databases, but it’s an important shift that is becoming more and more popular with the likes of Google BigTable and Amazon SimpleDB. The fact is, if you’re not writing a transactional application where normalization is a priority, do you really need all the complexity of a relational database? Many common web applications like blogs and CMSs and Twitter could work quite happily with databases like CouchDB. Check out these two great articles that explain it further:

Anyway, CouchDB is still quite young, and hasn’t seen a lot of attention outside the Ruby and Python communities. This sucks. I would like to see a well-designed open-source library for .NET applications using CouchDB for persistence. Here are some features such a library might provide over and above raw HTTP WebRequests:

Requirement #1: Generics and automatic serialization/deserialization of .NET objects

At time of writing, I can only find one CouchDB available for .NET. It’s called SharpCouch, and only works at the string level:

string json = db.GetDocument("127.0.0.1", "books", "42");

Instead, it shouldn’t be too hard to use something like JSON.NET and generics to automatically map between JSON strings and live .NET objects:

Book book = db.GetDocument<Book>("127.0.0.1", "books", "42");

Requirement #2: mapping RESTful HTTP error codes to .NET exceptions

Being a REST app, CouchDB uses HTTP status codes to indicate different states of success or failure, e.g. 404 when a document doesn’t exist. Such errors should be mapped from raw WebExceptions into nicer types like DocumentNotFoundException and DocumentConflictException.

Requirement #3: Implicit Offline Optmistic Lock via document revisions

Offline Optimistic Lock is a pattern used to prevent concurrency problems when two people edit the same piece of data at the same time. Basically, instead of locking an object while one person uses it, you allow multiple users access at the same time, but each user checks to see if it has changed before they modify it.

Optimistic Offline Lock is usually implemented via versioning. As well as an ID, each object has a version number that gets automatically incremented each time the object is modified. Before a client saves changes to the object, it checks to see if the version number has changed since it retrieved it. If it has, then you know someone else has modified the object in the mean time, and your copy is now stale.

While this is optional in most relational databases (for example, if you use a rowversion column in SQL Server), document versioning is a non-negotiable feature of CouchDB: to update a document, you have to know its ID and it’s revision number.

The easiest way of keeping track of this would simply be to add a _rev property to all your domain entities:

public class Book
{
    public string Title { get; set; }
    public string Author { get; set; }
    public string Publisher { get; set; }
    public string _rev { get; set; } // hmm
}

Actually though, I’d prefer not to worry about revision numbers at all. They are only required by CouchDB, and I don’t want to let such concerns leak into my domain model. Instead, document revision numbers should be stored in a separate Identity Map — a list of references to all hydrated database objects currently active in my application.

Requirement #4: LINQ expressions for Map/Reduce functions

CouchDB uses map/reduce functions expressed in javascript to define its views. For example:

map: function(book) {
    emit(doc.NumberOfPages, doc);
  }
}
reduce: function(keys, values) {
   return sum(values);
}

This seems like a pretty good fit for LINQ expressions, no? Perhaps something like:

public BooksByNumberOfPages : IMapReduce<Book, int>
{
    public IEnumerable<int> Map(IEnumerable<Book> books)
    {
        return books.select(b => b.NumberOfPages);
    }

    public int Reduce(IEnumerable<int> keys, IEnumerable<Books> values)
    {
         return keys.Sum();
    }
}

This is not strictly required, and indeed probably wouldn’t get used that much because CouchDB views are very static, but could be a nice bit of syntactic sugar.

Requirement #5: ID generators like NHibernate

CouchDB encourages you to provide your own IDs for documents — otherwise it will give you big fat ugly GUIDs. I would like to see ID generators similar to NHibernate’s hi/lo algorithm, which produces human-readable and approximately-sequential IDs.

Final words

From these ideas, clearly revision tracking with an identity map is the most important — in fact this was the reason I decided not to use CouchDB for a recent project to which it was well suited (because I didn’t have time to write a persistence framework for it first). Long term, however, CouchDB has a lot to offer the .NET community, and combined with officially-maintained ports for Windows, a well-supported CouchDB .NET library could bring benefit to many projects.

May 2nd, 2009 | 10 Comments

So you’ve started writing your first program — great stuff! Here are a few tips and traps to watch out for along the way.

Long periods of time when your program doesn’t compile and/or run at all

You enthusiastically start work on some big changes (e.g. re-architecting your application), but stop because you hit a brick wall, or don’t have the time to finish it. The source code is left in a crippled and unfinished state. You can’t do anything with any of your code until this is fixed, and the longer you leave it, the more you’ll forget and the harder it will be to get started again.

In Continuous Integration this is called a broken build, and is a big no-no because it impacts other peoples ability to work. Without the pressure of a team environment pushing you forward, having a roadblock like this in your path makes it very easy to lose faith and motivation.

Make massive changes on a whim without revision control or a backup

There’s nothing worse than changing your mind about the design of something after you’ve already thrown away the old one. When you do something that involves changing or deleting existing code, be sure to make some sort of backup in case things don’t work out.

If you haven’t taken the plunge with revision control yet, I highly recommend looking at some of the free SVN or GIT hosting services out there — once you get into it, you’ll never look back.

Ignore the YAGNI principle and focus on the fun/easy bits

Focusing on things like validation, eye candy or even general purpose ‘utility’ functions is a great way to build up a large complex code base that doesn’t do anything useful yet. Focus on the core functionality of your software first — your main features should be complete before you start thinking about nice-to-haves.

Throw your code away and start from scratch

As Netscape famously discovered a few years ago, throwing away existing code to start afresh is almost never a good idea. Resist the urge and make a series of small, manageable code refactorings instead.

Start programming before you have a clear goal in mind

Instead of a command line tool, maybe my application would be better as a simple GUI application? Or, I was originally writing my homebrew game for my old Xbox360, but now that we’ve bought a Wii, I’ll port it to that instead!

What are you actually trying to achieve? Spend some time with a pen and some paper coming up with a really clear vision of what you’re trying to create — e.g. screen mock-ups. If you don’t know what you’re writing from the start, the goal posts will keep moving as you change your mind, and you’ll have no chance of finishing it.

Get carried away with project hype before you’ve actually got anything to show for yourself

Spending hours trying to think of the perfect name for your software, designing an icon, choosing the perfect open-source license and making a website won’t get you any closer to having a working application. Get something up and running first, and worry about telling people about it later.

Start a million new features and don’t finish any of them

Jumping from one idea to another without finishing anything is like spinning a car’s wheels and not going anywhere. Make a list of features your program should have, and put them in order of most-to-least important. Work on them, one-at-a-time, in that order.

February 12th, 2009 | 12 Comments