Software development can be a pretty scary business. Things randomly break, and you sometimes don’t even know why. Since Halloween is this week, let’s look at some examples from people here at Don’t Panic Labs.
When Optimizations Don’t
The first is something many of us have experienced: everything in production works fine…until it doesn’t. In this situation, we had a stored procedure that worked fine for months. It typically took only one second to finish, but it suddenly started taking several minutes to run.
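The culprit was parameter sniffing. When SQL Server compiles a stored procedure, it builds an execution plan around the parameter values passed in at that moment, then caches that plan and reuses it for later calls. If later calls pass values with a very different data distribution, the cached plan can be a poor fit. Plans also get recompiled over time (for example, when statistics change or the plan falls out of the cache), so your stored procedure could work for quite a while and suddenly begin dragging or timing out. The thing that makes parameter sniffing particularly bad is that if you pull the query out of the stored procedure and run it on its own, it will often work just fine, because it gets a freshly compiled plan.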
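To make the idea concrete, here is a toy Python model of a plan cache. It is only a sketch and not how SQL Server works internally: the table size, row counts, cost numbers, and names such as `run_procedure` and `choose_plan` are all made up for illustration. It shows a plan picked for the first parameter value being reused for a value it fits badly.

```python
# Toy model of plan caching. All numbers and names are hypothetical.
ROWS_PER_CUSTOMER = {"small": 10, "huge": 1_000_000}  # made-up data skew
TABLE_ROWS = 1_000_010

def choose_plan(customer):
    """Pick the cheaper plan for this one parameter value."""
    # A seek wins for a handful of rows; a scan wins when most of the table matches.
    return "index_seek" if ROWS_PER_CUSTOMER[customer] < 1000 else "table_scan"

def cost(plan, customer):
    """Made-up cost units: a seek pays per matching row, a scan reads every row."""
    if plan == "index_seek":
        return ROWS_PER_CUSTOMER[customer] * 3
    return TABLE_ROWS

plan_cache = {}

def run_procedure(customer):
    # The plan is chosen for the first parameter seen, then reused for every later call.
    if "get_orders" not in plan_cache:
        plan_cache["get_orders"] = choose_plan(customer)
    plan = plan_cache["get_orders"]
    return plan, cost(plan, customer)

first = run_procedure("small")              # compiles an index seek, which is cheap here
later = run_procedure("huge")               # reuses the seek, which is now very expensive
fresh = cost(choose_plan("huge"), "huge")   # what a freshly chosen plan would cost

print(first, later, fresh)
```

In this model the second call costs about three times what a fresh plan would, which mirrors why the same query can look fine when you run it outside the stored procedure.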
So how do you work around this? The easiest way to see if this is the problem is to copy the parameters into local variables within the stored procedure, use those local variables in the query, and see if the problem goes away. This local variable hack isn’t the best way to solve the problem, but it is useful for diagnosing it quickly.
(This story comes from Andy Unterseher)
The Missing QA Magic
Another classic example is when software works great in QA, but not so much in production. Nothing can be more frustrating than trying to ship some software and being stuck on a production release for hours.
We ran into this issue years ago when we were shipping a big dashboard / warehouse solution. Queries were taking minutes in production but were super-fast on QA. After hours of frustration, the answer to this problem turned out to be simple: we needed to rebuild the database statistics. The query optimizer relies on those statistics to estimate row counts, so stale statistics can lead it to pick poor plans. The lesson here is to always rebuild statistics if you are making significant database changes as part of a release.
(This story comes from Tony Wilsman)
My favorite spooky story happened when I was working at a CDN company. I got an error email late one night saying we had an overflow on an identity column. After a bit of investigation, I realized that the error message was accurate. We had overflowed the max value of a 32-bit integer on our primary ID column (yep, a little over 2 billion). This was particularly scary because it meant we had to shut down our log processing while we fixed the issue, and repairing a database table with two billion records isn’t a quick task. We had to change the column from a 32-bit integer to a 64-bit integer. The lesson here is to make sure you design your solutions (and databases) for success.
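For a sense of the numbers, here is a rough headroom calculation in Python. It is a back-of-the-envelope sketch: the current ID and the insert rate are invented for the example, and `days_until_overflow` is a hypothetical helper, not something from the system in the story.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647, the largest signed 32-bit value
INT64_MAX = 2**63 - 1   # the largest signed 64-bit value

def days_until_overflow(current_max_id, rows_per_day, limit=INT32_MAX):
    """Whole days left before an auto-incrementing ID passes the limit."""
    remaining = limit - current_max_id
    return remaining // rows_per_day

# Hypothetical figures: 1.5 billion rows so far, 5 million new rows per day.
days_32 = days_until_overflow(1_500_000_000, 5_000_000)
days_64 = days_until_overflow(1_500_000_000, 5_000_000, limit=INT64_MAX)

print(days_32)  # 129
print(days_64)
```

At that made-up rate a 32-bit column has about four months left, while a 64-bit column would outlast the hardware by a very wide margin.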
The Phantom Unplugger
Another classic is the oscillating fan problem. At a previous job, we had a server that kept going off and on, which was causing more and more angst as we tried to deploy software. The ultimate culprit turned out to be an oscillating fan. As it moved back and forth, it kept pulling out a network cable. The lesson here is to make sure everything is plugged in and stays that way.
(This story comes from Nathan Wilkinson)
Spooky software problems are everywhere, and all developers have these kinds of stories. They are fun to talk about after the fact, but at the time they are just scary and frustrating. Share some of your experiences in the comments below.