Background:
A customer reports that publishing a sub-clip in our system fails occasionally. A "clip" (to us) is the metadata associated with a piece of media. A "sub-clip" is a metadata reference to a small piece of that media. So for example, J. Random Journalist is watching a recording of the latest Reuters feed (recording as a clip in our system), and wants to add some footage to a news bulletin. They mark up the sub-clip and publish the data into our system for use later.
In the database, the sub-clip is represented by a load of "standard" clip metadata but is also described in a many-to-one relationship with our original clip. That lets us track usage of the original media etc.
It's also valid for an application to register a "shot selection" in this way - maybe the journalist wants 3 or 4 sections from that news feed to be edited together for their story. Same many-to-one relationship, but multiple source clips this time.
Reproduction:
Simply publishing two shot selections or sub-clips at the same time would trigger the issue. We got a server-side SQL trace. Separating queries from the SPIDs for the two threads of the offending network service, we have:
SPID 1 and SPID 2:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3562385, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562666, 3546238, ...)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
INSERT INTO OffendingTable (ExternalKey,SourceClipExternalKey,OtherData) VALUES (3562668, 3545736, ...)
SELECT * FROM OffendingTable WHERE ExternalKey = 3562668 ORDER BY SomeOfTheOtherData
SELECT * FROM OffendingTable WHERE ExternalKey = 3562666 ORDER BY SomeOfTheOtherData
It was a little more interleaved in the trace, and a separate issue where hideously inefficient LIKE queries were being used spaced out all the calls over several seconds, making the window of opportunity for this issue large enough to cause a problem.
Surely this can't cause a deadlock? Surely we'd expect one of the processes to block while the other completed. There's nothing difficult here, is there? Madness!
Running the SQL against our development database on MSSQL 2005 - no problem. Running the SQL against our regression database on MSSQL 2000 - no problem (the customer was running SQL 2000 - it's a long-established site). A WTF moment... where do we go from here?
With a moment of clarity, our DBA is looking at the site backup and notices that OffendingTable has NO indexes. We all know that tables without indexes are bad. But of course, our database knowledge in the company grew organically, as did our database design and upgrade strategy. That table is meant to have an index on it. It has no primary key, but that's another story entirely ;-)
Sure enough - running the two SQL traces on the site backup deadlocked.
Putting the index in place stopped the deadlocks.
The "gut feeling" I have is that if we were to investigate the reproduction deeper into the SQL lock mechanics, we'd find some locks on some system resource somewhere that MSSQL is using to cover for the lack of index. We'll find that the inserts somehow lock parts of this resource as they are writing, but that the selects require a lock on the whole resource. Hence the deadlock. SPID 1 has some of the resource locked; SPID 2 has some more of the resource locked and then both try to lock the whole resource.
Good here, isn't it.