The trouble with soft delete

Soft delete is a commonly-used pattern amongst database-driven business applications. In my experience, however, it usually ends up causing more harm than good. Here are a few reasons why it can fail in bigger applications, and some less-painful alternatives to consider.

Tomato, Tomato

I’ve seen a few different implementations of this pattern in action. First is the standard deleted flag to indicate an item should be ignored:

SELECT * FROM Product WHERE IsDeleted = 0

Another style uses meaningful status codes:

SELECT * FROM Task WHERE Status = 'Pending'

You can even give an item a fixed lifetime that starts and ends at a specific time (it might not have started yet):

SELECT * FROM Policy WHERE GETDATE() BETWEEN StartDate AND EndDate
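All three queries assume extra columns on the table. For reference, here is a rough sketch of the kind of DDL behind them (names and types are illustrative only):

-- Style 1: a deleted flag
ALTER TABLE Product ADD IsDeleted BIT NOT NULL DEFAULT 0

-- Style 2: a status code
ALTER TABLE Task ADD Status VARCHAR(20) NOT NULL DEFAULT 'Pending'

-- Style 3: a fixed lifetime (nullable so the columns can be added to existing rows)
ALTER TABLE Policy ADD StartDate DATETIME NULL, EndDate DATETIME NULL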

[Image: Highway of Death]

All of these styles are flavours of the same concept: instead of pulling dead or infrequently-used items out of the active set, you simply mark them and change queries to step over the corpses at runtime.

This is a trade-off: soft delete columns are easy to implement, but they incur costs in query complexity and database performance later down the track.

Complexity

To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude inactive rows. It’s like a tax: a mandatory WHERE clause to ensure you don’t return any deleted rows.

This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget it in even one place, bugs can creep in fast. And it is background noise that detracts from the real intention of the query.
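To make this concrete, here is a hypothetical catalog query (the schema is illustrative) where every table involved carries a soft delete flag. Each table needs its own guard clause, and forgetting any one of them silently lets deleted rows back into the results:

SELECT p.Name, c.Name AS Category
FROM Product p
	INNER JOIN ProductCategory pc ON pc.ProductID = p.ProductID
	INNER JOIN Category c ON c.CategoryID = pc.CategoryID
WHERE p.IsDeleted = 0
	AND pc.IsDeleted = 0  -- easy to forget on link tables
	AND c.IsDeleted = 0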

Performance

At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance.

However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and do a good job of optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself.

Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisation’s day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of those, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical: rarely used, and kept only to maintain foreign key integrity and for reporting purposes.

Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.

This table is becoming an increasing concern for us — it represents a major portion of the application, and with around a million new rows being added each year, the performance issues are only going to get worse.

Back to the original problem

The trouble with implementing soft delete via a column is that it simply doesn’t scale well for queries targeting multiple rows — we need a different strategy for larger data models.

Let’s take a step back and examine the reasons why you might want to implement soft deletes in a database. If you think about it, there really are only four categories:

  1. To provide an ‘undelete’ feature.
  2. To audit deletions.
  3. To support soft create.
  4. To keep historical items.

Let’s look at each of these and explore what other options are available.

Soft delete to enable undo

[Image: Windows 7 Recycle Bin]

Human error is inevitable, so it’s common practice to give users the ability to bring something back if they delete it by accident. But this functionality can be tricky to implement in an RDBMS, so first you need to ask an important question — do you really need it?

I have encountered two styles of undelete, both achieved via soft delete:

  1. There is an undelete feature available somewhere in the UI, or
  2. Undelete requires running commands directly against the database.

If there is an undo delete button available somewhere for users, then it is an important use case that needs to be factored into your code.

But if you’re just putting soft delete columns on out of habit, and undelete still requires a developer or DBA to run a command against the database to toggle the flags back, then this is a maintenance scenario, not a use case. Implementing it will take time and add significant complexity to your data model, while adding very little benefit for end users, so why bother? It’s a pretty clear YAGNI violation — in the rare case you really do need to restore deleted rows, you can just get them from the previous night’s backup.

Otherwise, if there really is a requirement in your application for users to be able to undo deletes, there is already a well-known pattern specifically designed to take care of all your undo-related scenarios.

The memento pattern

Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require.

It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or roll back later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time.
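A minimal sketch of what this might look like (table names, columns and the XML serialization are illustrative, not prescriptive):

CREATE TABLE Job
(
	JobID INT PRIMARY KEY,
	Title VARCHAR(100) NOT NULL
	-- ... other live, transactional columns
)

CREATE TABLE JobSnapshot
(
	SnapshotID INT IDENTITY PRIMARY KEY,
	JobID INT NOT NULL,        -- the job this snapshot belongs to
	TakenAt DATETIME NOT NULL, -- when the snapshot was taken
	Data XML NOT NULL          -- serialized copy of the job row
)

-- Just before updating or deleting a job, put a copy aside:
INSERT INTO JobSnapshot (JobID, TakenAt, Data)
VALUES (@JobID, GETDATE(),
	(SELECT * FROM Job WHERE JobID = @JobID FOR XML AUTO, TYPE))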

If you want to restore one, you simply deserialize it and insert it back in. This is much cleaner than messing up your transactional tables with ghost items, and lets you handle all undo operations together using the same pattern.

Soft delete for auditing

Another common practice I have seen is using soft delete as a means of auditing: to keep a record of when an item was deleted, and who deleted it. Usually additional columns are added to store this information:
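Something like this (a sketch; the exact column names vary, but the shape is always the same):

ALTER TABLE Product ADD
	IsDeleted BIT NOT NULL DEFAULT 0,
	DeletedDate DATETIME NULL,  -- when it was deleted
	DeletedBy VARCHAR(50) NULL  -- who deleted it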

As with undo, you should ask a couple of questions before you implement soft delete for auditing reasons:

  • Is there a requirement to log when an item is deleted?
  • Is there a requirement to log any other significant application events?

In the past I have seen developers (myself included) automatically adding delete auditing columns like this as a convention, without questioning why it’s needed (aka cargo cult programming).

Deleting an object is only one event we might be interested in logging. If a rogue user performs some malicious acts in your application, updates could be just as destructive, so we should know about those too.

One possible conclusion from this thought is that you should simply log all DML operations on the table. You can do this pretty easily with triggers:

-- Log all DELETE operations on the Product table
CREATE TRIGGER tg_Product_Delete ON Product AFTER DELETE
AS 
	INSERT INTO [Log]
	(
		[Timestamp],
		[Table],
		Command,
		ID
	)
	SELECT
		GETDATE(),
		'Product',
		'DELETE',
		ProductID
	FROM
		Deleted
GO
-- (GO separates batches: CREATE TRIGGER must be the first statement in its batch)

CREATE TRIGGER tg_Product_Update ON Product AFTER UPDATE
AS
-- ...etc

Contextual logging

[Image: The Hangover]

If something goes wrong in the application and we want to retrace the series of steps that led to it, CREATES, UPDATES and DELETES by themselves don’t really explain much of what the user was trying to achieve. Getting useful audit logs from DML statements alone is like trying to figure out what you did last night from just your credit card bill.

It would be more useful if the logs were expressed in the context of the use case, not just the database commands that resulted from it. For example, if you were tracking down a bug, would you rather read this:

[09:30:24] DELETE ProductCategory 142 13 dingwallr

… or this?

[09:30:24] Product 'iPhone 3GS' (#142) was removed from the catalog by user dingwallr. Categories: 'Smart Phones' (#13), 'Apple' (#15).

Logging at the row level simply cannot provide enough context to give a true picture of what the user is doing. Instead it should be done at a higher level where you know the full use-case (e.g. application services), and the logs should be kept out of the transactional database so they can be managed separately (e.g. rolling files for each month).

Soft create

The soft delete pattern can also be extended for item activation — instead of simply creating an item as part of the active set, you create it in an inactive state and flick a switch or set a date for when it should become active.

For example:

-- Get employees who haven't started work yet
SELECT * FROM Employee WHERE GETDATE() < StartDate

This pattern is most commonly seen in publishing systems like blogs and CMSs, but I’ve also seen it used as an important part of an ERP system, for scheduling changes to policies before they come into effect.
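If you’d rather keep not-yet-active rows out of your main transactional tables entirely, one option (suggested again in the guidelines at the end of this article) is a separate queue table plus a scheduled job that promotes rows once their start date arrives. A rough sketch, assuming a Policy table and SQL Server-style syntax:

CREATE TABLE PendingPolicy
(
	PolicyID INT PRIMARY KEY,
	StartDate DATETIME NOT NULL
	-- ... same columns as Policy
)

-- Run periodically, e.g. from a SQL Agent job:
DECLARE @now DATETIME
SET @now = GETDATE()

BEGIN TRAN

INSERT INTO Policy (PolicyID, StartDate)
SELECT PolicyID, StartDate
FROM PendingPolicy
WHERE StartDate <= @now

DELETE FROM PendingPolicy
WHERE StartDate <= @now

COMMIT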

Soft delete to retain historical items

[Image: Raiders of the Lost Database]

Many database-backed business applications are required to keep track of old items for historical purposes — so users can go back and see what the state of the business was six months ago, for example.

(Alternatively, historical data is kept because the developer can’t figure out how to delete something without breaking foreign key constraints, but this really amounts to the same thing.)

We need to keep this data somewhere, but it’s no longer immediately interesting to the application because either:

  • The item explicitly entered a dormant state – e.g. an expired eBay listing or deactivated Windows account, or
  • The item was implicitly dropped from the active set – e.g. my Google Calendar appointments from last week that I will probably never look at again

I haven’t included deleted items here because there are very few cases I can think of in business applications where data is simply deleted (unless it was entered in error) — usually it is just transformed from one state to another. Udi Dahan explains this well in his article Don’t Delete — Just Don’t:

Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is cancelled too late.

Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.

Jobs aren’t deleted – they’re filled (or their requisition is revoked).

Unless your application is pure CRUD (e.g. data grids), these states most likely represent totally different use cases. For example, in a timesheet application, complete tasks may be used for invoicing purposes, while incomplete tasks comprise your todo list.

Different use cases, different query needs

Each use case has different query requirements, as the information the application is interested in depends on the context. To achieve optimal performance, the database should reflect this — instead of lumping differently-used sets of items together in one huge table with a flag or status code as a discriminator, consider splitting them up into separate tables.

For example, in our job board application, we might store open, expired and filled listings using a table-per-class strategy:
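Something like the following sketch (columns and indexes are illustrative; a FilledListing table would follow the same shape):

-- Active listings: small, write-heavy, lightly indexed
CREATE TABLE OpenListing
(
	ListingID INT PRIMARY KEY,
	Title VARCHAR(100) NOT NULL,
	ExpiryDate DATETIME NOT NULL
)

-- Past listings: read-mostly, free to be heavily indexed or denormalized for reporting
CREATE TABLE ExpiredListing
(
	ListingID INT PRIMARY KEY,
	Title VARCHAR(100) NOT NULL,
	ExpiredDate DATETIME NOT NULL
)

-- Expiring a listing becomes a move between tables:
INSERT INTO ExpiredListing (ListingID, Title, ExpiredDate)
SELECT ListingID, Title, GETDATE()
FROM OpenListing
WHERE ListingID = @ListingID

DELETE FROM OpenListing
WHERE ListingID = @ListingID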

Physically separating job listings by state allows us to optimize them for different use cases — focusing on write performance for active items, and read performance for past ones, with different columns, indexes and levels of (de)normalization for each.

Isn’t this all overkill?

Probably. If you’re already using soft delete and haven’t had any problems then you don’t need to worry — soft delete was a sensible trade-off for your application that hasn’t caused any serious issues so far.

But if you’re anticipating growth, or already encountering scalability problems as dead bodies pile up in your database, you might like to look at alternatives that better satisfy your application’s requirements.

The truth is soft delete is a poor solution for most of the problems it promises to solve. Instead, focus on what you’re actually trying to achieve. Keep everything simple and follow these guidelines:

  • Primary transactional tables should only contain data that is valid and active right now.
  • Do you really need to be able to undo deletes? If so, there are dedicated patterns to handle this.
  • Audit logging at the row level sucks. Do it higher up where you know the full story.
  • If a row doesn’t apply yet, put it in a queue until it does.
  • Physically separate items in different states based on their query usage.

Above all, make sure you’re not gold plating tables with soft delete simply out of habit!

November 20, 2009

35 Comments

Tobin Harris on November 21, 2009 at 7:52 pm.

Nice post, and I was pleasantly surprised to see yUML diagrams in there :)

Even for small systems I see the problems you highlight when using status flags. In our last collaboration tool we had status flags on every entity (Pending, Active, Deleted). Querying is more complex and the general flags hide the real business intention behind an operation.

Conceptually I bung this stuff under the “Entity lifecycle” problem. It’s still not easy enough to handle entity state transitions and workflow in applications. The patterns are there but still feel clunky and like hard work to implement.

I actually wonder if our database systems and programming languages are missing a few abstractions for expressing and dealing with entity state-related operations and workflows.

bezarah on November 21, 2009 at 10:58 pm.

Classic:

“Getting useful audit logs from DML statements alone is like trying to figure out what you did last night from just your credit card bill”

I’ve implemented soft delete over and over again. It’s usually being done because those driving the requirements don’t trust their users with a delete button.

The vast majority of the time, no one ever tries to recover the information; they just want to know what happened.

A solid audit trail is probably all that’s needed.

BZ

Joachim Schipper on November 21, 2009 at 11:29 pm.

Couldn’t you solve the performance issue rather simply by creating a view and indexing only on that (i.e. by using a materialized view or partial index)? You most likely don’t need all those indices on your historical data – which is unlikely to ever be accessed…

FlyingAvatar on November 21, 2009 at 11:50 pm.

Ditto Joachim.

Adding an index to the isDeleted/status column will be as efficient as having the data in a different table. Arguably the overall load may be lower, since you don’t need to move the row to a different location to perform the delete.

Creating a view for non-deleted data removes the need for a where clause, which is the only “complexity” to speak of.

“Undo” / “Audit Trail” is an entirely separate problem, but the suggestion to make a table per class breaks normalization rules, is no more efficient and is more complex to implement than just having a status column in a single table.

Blah on November 22, 2009 at 12:15 am.

Check out the book “Developing Time-Oriented Database Applications in SQL” for a systematic look at some options for storing historical data. A wider topic than just soft deletion, but related.

It’s out of print and the author made it available on his site:
http://www.cs.arizona.edu/people/rts/tdbbook.pdf

Most of the time, I use a main table representing the current state and also maintain a history table.

Kane Barton on November 22, 2009 at 1:06 am.

Hi Richard

You’ve put together a well written and comprehensive article (something which is difficult to do). However in response to your statement “The truth is soft delete is a poor solution for most of the problems it promises to solve” it seems that you might be suggesting developers focus on the technology solution rather than modelling the business scenario. Udi Dahan has an excellent post that describes why you should always prefer a “soft” delete: http://www.udidahan.com/2009/09/01/dont-delete-just-dont/

Matthew W on November 22, 2009 at 2:51 am.

Another reason to soft-delete instead of delete is the “massive cascading foreign key dependencies” problem.

It’s much easier to just set a flag on something than worry about all the triggers and hooks that need to run to tidy up associated data after something is removed.

In simple cases a few “on delete cascade”s will do the trick, but often you need application-level trigger code too, and it can get very hairy.

Not saying this is right or wrong, but it’s often a factor in the decision.

By the way do you have any references for

“most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and do a good job of optimizing queries that use them.”

Are there really optimisations geared specifically to these kinds of columns, or is it just the same stuff that applies more generally to frequently-used indexed columns?

I’ve found the extra where clauses can kill performance on certain kinds of queries (especially those involving joins). Often you need to add the relevant deleted flag as a column to certain indexes you’re using, and in some cases denormalize by storing the deleted flag of associated rows alongside foreign keys to those rows and then including it in appropriate indexes on the associated tables too.

Bit of a PITA.

Adam Hayward on November 22, 2009 at 3:07 am.

Wow, really classy use of a photo of the Highway of Death

http://en.wikipedia.org/wiki/Highway_of_Death

Peter Wilson on November 22, 2009 at 4:32 am.

I agree with most of what you say. Some of your solutions require union queries to access historic data – not a great win.

The real problem is the lack of support by DB vendors for this feature. If your DB supports indexed views (without a major pain in the butt; yes MSSQL, we are talking about you) then the efficiency of maintaining active views derived from large datasets would be moot. Programs could access an Active view without extra WHERE clauses.

David on November 22, 2009 at 6:50 am.

You say: “The trouble with implementing soft delete via a column is that it simply doesn’t scale well — we need a different strategy for larger tables.”

This is a blanket generalization and is incorrect. You should have said that they don’t scale well for queries that target multiple records. If 99.999% of your queries access a record by ID, then you don’t have a problem. This is our use-case for one of our systems at work, this particular one serving millions of users. Users can delete their records on the database, and this sets the time_removed to the current unix epoch time from its original value of 0.

This is a good blog post, but it could be better. Here are some ideas:

1) use a non-judgmental title such as “Alternatives to soft delete”

2) incorporate some examples of scalable uses of soft-delete that work well in practice.

Cody on November 22, 2009 at 7:46 am.

I know it’s DBMS-specific, but the MySQL Archive engine can be a good solution to leverage. It only supports SELECT and INSERT operations. You could have an automated process that migrates X-month-old data into their Archive table counterparts.

If your application must adhere to government regulation (Sarbanes-Oxley, HIPAA, etc.) then the issues you speak of are very real, and become not just an “annoyance” but a hard requirement.

peter on November 22, 2009 at 7:56 am.

Most RDBMSes use the soft delete mechanism under the covers to flag rows as deleted. The act of Shrinking a DB actually tells the DB to delete those rows and regroup the undeleted rows.

Richard on November 22, 2009 at 8:15 am.

@ Matthew W: no references, that’s just what I’ve found from experience.

@ Kane: developers should definitely focus on the business scenario. Most of my posts are about DDD – I link to that same article!

@ David: fair point. Fixed.

BTW the class-per-table arrangement in this article is just one proposed solution, and not necessarily the best. Sharding, partitioning, indexed views, temporal databases etc are some other possibilities that might work better but are beyond the scope of this article (and my knowledge) to go in-depth on. Would be great to see some follow ups/benchmarks exploring this.

Mark Harrison on November 22, 2009 at 9:22 am.

Nice article, and some interesting points to consider.

I have some situations analogous to soft delete, and have created “where status = active” views so people won’t have to remember to do this themselves.

erwin smout on November 22, 2009 at 11:07 am.

Exceptionally good post, if you really are what you claim: “a young software developer”.

However.

There are two aspects I fundamentally disagree with:

“If a row doesn’t apply yet, put it in a queue until it does.”
“Keeping historical data”

If the USER (I mean, NOT the AUDITOR) of the application has a business requirement to be able to record future-dated information, then there is no way around the fact that you have to build support for that requirement.

If the USER (once again, NOT the AUDITOR) has a business requirement to be able to query historical data, then that historical data has to be considered as “operational data just like any other”, and there is no way around the fact that you have to include support for that requirement in your application.

(True temporal dbms’s would be a great help to do that, but they don’t exist.)

Richard on November 22, 2009 at 12:01 pm.

@ erwin: a queue could be another table, or some other persistence mechanism. You’re right that the system should support the business requirement to record future-dated information — I was just suggesting the system put it aside in a special place until that date arrives and the item becomes active.

Likewise with historical data — keep it, but store it differently to the live stuff.

I’m 23 by the way!

Think Before Coding on November 24, 2009 at 4:02 am.

The user usually doesn’t mean delete the row from the database (as Udi points out).
Then, depending on the context, there are different technical ways to do what the user is requesting.
*If you use an RDBMS* and the data should not be used anymore in the context, delete the row; otherwise, mark it.

The important thing is that there are usually several contexts, and thus several models, in an application. Don’t try to fit everything into a single model if you have several contexts with different requirements.
In a history reporting context, you’ll add new entries on an OrderCancelled event; in a context that works on current state, you’ll delete it; in a context whose goal is to track that the order has been cancelled on the shipping site, it’s still another story.

Steven Quick on November 25, 2009 at 3:59 pm.

The probably from a database level perspective it the database can’t enforce that Job ID xyz should never be in both the Expired and the ActiveListing itself easily.

It makes the sub-type approach a little scary because you’re totally dependent on your domain/data services always doing the right thing… i.e. always remove the row from the ActiveListing after inserting the row into the ExpiredListing.

Steven Quick on November 25, 2009 at 4:01 pm.

Opps first paragraph above should have read:

From a database level perspective, the database itself can’t enforce that Job ID xyz should not be in both the Expired and the ActiveListing at the same time, not easily like with unique indexes and constraints on a single table.

Wesam Alalem on February 4, 2010 at 3:05 pm.

thanks a lot,
Great points mentioned, I’ll keep this as reference.
Good luck,

Nalin Jayasuriya on June 11, 2010 at 1:09 am.

There is another downside to soft delete with Microsoft’s Entity Framework and DomainService usage.

As of Visual Studio 2010 (and EF 4), there is no simple way to express fetching primary entities while ‘including’ related entities and excluding the ‘soft deleted’ related entities. There is no syntax that expresses this requirement in the LINQ to Entities model.

One more reason to re-consider having that ‘isDeleted’ flag!

- Nalin J

Ron on August 4, 2010 at 12:01 pm.

Excellent!

From my experience, end users want to know what happened (the history of actions), so it is better to keep the deleted items in a separate table instead of implementing soft-delete.

Implementing soft delete requires extra attention to the primary database table to delete old rows, and if your application is a huge one that inserts huge amounts of data per minute, soft delete will definitely affect performance.

Tarek Mahmud Apu on August 5, 2010 at 6:58 am.

It’s a nice post. Thanks a lot.

I have faced lots of problems with this issue. To maintain foreign key integrity and for reporting purposes, we normally change the status to “deleted”.

I like the technique of storing the inactive data in a separate table more than soft delete. But do you think creating a view based on status could give a better solution?

I liked that you mentioned “developers should definitely focus on the business scenario”. Sometimes clients force you to do it because they think it is the standard way, even though they don’t need soft delete, so it causes some extra development effort.

Stefan Bundy on August 6, 2011 at 10:52 am.

Thanks for the post. I’m bookmarking your blog and will certainly return!

Ranu Mandan on September 19, 2011 at 5:50 pm.

There is one more way of handling archived data: writing a purging job against IsDeleted = true records.

Brian Bokor on April 10, 2012 at 5:40 am.

Do you have a good example of how you might attack needing data and associations around for reporting, but not actually needing the record visible? I guess what I mean is that having a status on a table or an archive table seems to bring the same complexities, from my viewpoint. Either it is more if statements or more intense SQL joins, and given most bottlenecks always seem to come from the database with crap SQL, I think I would rather see more if/else. Do you see an easier way to minimize joins and scenarios for data that has to be around for years (like a PCI-compliant service)… data mining would be terrible. Just wanted some thoughts, reactions?

Richard on April 29, 2012 at 1:51 pm.

@Brian: move the archive data into a separate archive/reporting store, denormalized, read-optimized and including active and inactive data.

Depending on the size and constraints, this could simply be a separate denormalized table, or a separate schema, a separate DB or even a cube or something. But basically the idea is to separate your transactional (live, normalized, write-optimized) data from historical reporting data (inactive, denormalized, read-optimized). If you try to do both in the same table, concerns will leak from one side to the other (e.g. reports will have to do too many JOINs, and the write side will have to have WHERE Active = 1 all over the place).

Lelala on October 2, 2012 at 3:32 pm.

Thanks for that very good article!
The discussion in the comments was especially helpful in deciding.

Pingback: benpowell.org
