Friday, 16 July 2010

A SQL deadlock with only two transactions querying only one table.

Ok - first write-up! We found an issue where two SQL transactions deadlocked. Both were doing the same job - INSERT values into a data table and then SELECT the data back, each within a transaction. There was other data being inserted into other tables within the same transactions which helped to make the issue appear as if it might have been more complicated than it turned out to be. It didn't matter about the other data in the end. Re-running the SQL traces in Query Analyser showed the deadlock with just the insert and select queries. We had created a deadlock simply by inserting some data into a table and then reading it back in two places at once. Madness.

Background:

A customer reports that publishing a sub-clip in our system fails occasionally. A "clip" (to us) is the metadata associated with a piece of media. A "sub-clip" is a metadata reference to a small piece of that media. So for example, J. Random Journalist is watching a recording of the latest Reuters feed (recording as a clip in our system), and wants to add some footage to a news bulletin. They mark up the sub-clip and publish the data into our system for use later.

In the database, the sub-clip is represented by a load of "standard" clip metadata but is also described in a many-to-one relationship with our original clip. That lets us track usage of the original media etc.

It's also valid for an application to register a "shot selection" in this way - maybe the journalist wants 3 or 4 sections from that news feed to be edited together for their story. Same many-to-one relationship, but multiple source clips this time.

Reproduction:
Simply publishing two shot selections or sub-clips at the same time would trigger the issue. We got a server-side SQL trace. Separating queries from the SPIDs for the two threads of the offending network service, we have:

SPID 1 and SPID 2:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
SELECT * FROM OffendingTable WHERE ExternalKey = 3562668 ORDER BY SomeOfTheOtherData
SELECT * FROM OffendingTable WHERE ExternalKey = 3562666 ORDER BY SomeOfTheOtherData


It was a little more interleaved in the trace, and a separate issue where hideously inefficient LIKE queries were being used spaced out all the calls over several seconds, making the window of opportunity for this issue large enough to cause a problem.

Surely this can't cause a deadlock? Surely we'd expect one of the processes to block while the other completed. There's nothing difficult here, is there? Madness!

Running the SQL against our development database on MSSQL 2005 - no problem. Running the SQL against our regression database on MSSQL 2000 - no problem (the customer was running SQL 2000 - it's a long-established site). A WTF moment... where do we go from here?

With a moment of clarity, our DBA is looking at the site backup and notices that OffendingTable has NO indexes. We all know that tables without indexes are bad. But of course, our database knowledge in the company grew organically, as did our database design and upgrade strategy. That table is meant to have an index on it. It has no primary key, but that's another story entirely ;-)

Sure enough - running the two SQL traces on the site backup deadlocked.

Putting the index in place stopped the deadlocks.

The "gut feeling" I have is that if we were to investigate the reproduction deeper into the SQL lock mechanics, we'd find some locks on some system resource somewhere that MSSQL is using to cover for the lack of index. We'll find that the inserts somehow lock parts of this resource as they are writing, but that the selects require a lock on the whole resource. Hence the deadlock. SPID 1 has some of the resource locked; SPID 2 has some more of the resource locked and then both try to lock the whole resource.

Good here, isn't it.

Monday, 28 June 2010

It's Hanselmann's fault...

This post on Scott Hanselman's blog made me get around to setting up a lovely geeky blog. If just one person manages to solve a problem because I shared some of the wierd stuff I uncover when trying to develop or fix software, it'll have been more than worth it.

Alternatively, I might be able to save myself a load of time if I can point people at this, instead of explaining in gory detail the latest stupid problem I just found. It could save many friends the pain of having me rant on at them about it too. Win win win.

I seem to have an unerring ability to get involved in software issues where Googling the answer leads to no more than twenty links. Fifteen will be totally irrelevent. Four will be only slightly irrelevent, badly phrased forum questions from people with similar-ish problems. The answers will be wrong. Some will be from "experts" (spelt "i-d-i-o-t-s") who don't know what they are talking about but posted anyway because nature abhors a vacuum. Some will be from well-meaning individuals who simply haven't read the question properly. If you're lucky a Microsoft MVP will have regurgitated the official line, having totally missed the point. The MVPs are usually very helpful, but are often answering subtly the wrong question. Sometimes I get an answer to a problem I wasn't actually researching there :)

If I'm lucky, one answer will be right. Sometimes it will be a combination of posts that paint the full picture. I can't be the only person looking for these answers. Maybe if I put together posts joining some of the dots, there will be two successful search results for the other guys.

Of course, most of the stuff here won't be that deep. Never mind eh ;-)