Down, working on it ...
Incident Report for Simplero
Postmortem

Phew. That was a looooong f'ing downtime. Almost two hours.

My wife Nomi and I were on our way home from dinner at one of our favorite local spots, Lucien, when I got the notice that the server was down. Up in the elevator to the 20th floor, of course people had to get out on 6 other floors before us...

Not that it ended up mattering much. It was not an easy fix.

Or actually, it was, but it took me a looooong time to fix it.

Starting with the end:

What was the cause of the downtime?

A disk filling up on one of the database replicas. With the disk full, the main database server refused to write any new data, causing pretty much every single request to wait for about a minute, then fail. With all requests hanging, all backend processes were pretty quickly occupied, and nothing was getting through.

The immediate fix was to just take that replica offline. Then the main database server happily started storing data again. We were back up!

But getting to that point took quite a while. I'll get to why in a minute.

There were several other things that made it worse. I should get alerts when disks are 80% full, but they did not fire in this instance. I'll have to figure out why and make sure they happen in the future.
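
For the curious, the check I have in mind is nothing fancy. Here's a minimal sketch of the kind of thing that should be paging me well before a disk fills up; the 80% threshold and the mount points are just example values, not my actual monitoring config:

```python
#!/usr/bin/env python3
# Minimal sketch of a disk-usage check that should page me well before a disk
# fills up. Threshold and mount points are example values, not my real config.
import os
import shutil

THRESHOLD = 0.80                      # alert when a filesystem is 80% full or more
MOUNTS = ["/", "/var/lib/mysql"]      # hypothetical paths to watch

def check_disks():
    alerts = []
    for mount in MOUNTS:
        if not os.path.exists(mount):
            continue                              # path doesn't exist on this box
        usage = shutil.disk_usage(mount)          # total, used, free (bytes)
        used_fraction = usage.used / usage.total
        if used_fraction >= THRESHOLD:
            alerts.append("%s is %.0f%% full" % (mount, used_fraction * 100))
    return alerts

if __name__ == "__main__":
    for message in check_disks():
        print("ALERT: " + message)    # in real life this would page someone
```

The hard part isn't writing the check, it's figuring out why the one I already had didn't fire.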

Another factor is that the database has grown more in the last two days than it has in the month before that. So the growth has really accelerated, which caught my systems off guard.

What am I doing to prevent it in the future?

I'm actually pretty far down the road to having a new server setup. I've learned a ton since I created the current setup about a year ago. Before that, I had a sysadmin guy in Denmark doing stuff for me, but he decided to move on, and I decided to take it on myself and do something that was more scalable and better documented.

Now I'm going to be moving fully into the Amazon cloud. I think they can do a better job managing a database than I can, and I'm going to let them. What happened here tonight could not happen with the database on Amazon. Their infrastructure makes it easier to scale and manage.

I've already moved our background processing to Amazon, because background jobs are less sensitive to the latency between the code and the database server. I already have a replica of our main database running on Amazon, and I'm currently testing to make sure everything's ready for the big switch.
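
Part of that testing is simply watching how far the Amazon replica is behind. Something like this reads the lag straight off the replica; it's only a sketch using the pymysql library, and the hostname and credentials are placeholders:

```python
# Sketch: check how far the Amazon replica is lagging before cutting over.
# Uses the pymysql library; the hostname and credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="replica.example.rds.amazonaws.com",   # placeholder hostname
    user="monitor",
    password="secret",
    cursorclass=pymysql.cursors.DictCursor,
)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    lag = status["Seconds_Behind_Master"]        # None means replication is broken
    print("Replication lag: %s seconds" % lag)
conn.close()
```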

It'll be another 2-3 days but then we should be ready to make the switch.

In the current setup, I have one box that basically runs everything, with two additional boxes just for database replication and backups.

With the new setup, I'll have the database as a separate Amazon service (RDS), with replication across multiple geographical zones and automatic failover. Then I'll have two separate boxes processing requests, sitting behind a load balancer, and two separate boxes running background jobs. This setup means we can easily spin up more boxes for front-end requests or background jobs, depending on need, as well as take one box offline at a time to apply security patches or run other types of upgrades. Everything is documented and scripted, so new servers can be put into production in a matter of minutes.
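
To give a flavor of what "scripted" means here, the database piece of the setup boils down to a few API calls. This is only a sketch using boto3 (the AWS SDK for Python); the identifiers, instance size, and credentials are placeholders, not the real configuration:

```python
# Sketch: the database piece of the new setup, scripted with boto3 (the AWS SDK
# for Python). Identifiers, instance size, and credentials are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="simplero-db",      # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m4.large",           # example instance size
    AllocatedStorage=200,                    # GB, example value
    MultiAZ=True,                            # standby in another availability zone,
                                             # with automatic failover
    MasterUsername="admin",
    MasterUserPassword="change-me",          # placeholder
    BackupRetentionPeriod=7,                 # automated backups kept for 7 days
)
```

The point of the Multi-AZ flag is exactly the scenario from tonight: if a database box dies, the standby takes over automatically instead of waiting for me to get out of an elevator.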

There are a couple of other services involved, but this is the bulk of it. I'm sure this new setup will have its own learning curve, but it should be much more reliable and serve us well far into the future.

So why did it take you so long?

Why did it take me nearly TWO HOURS to find out that a disk had filled up?

Because I kept looking in all the wrong places.

At first, I thought it was a memory issue. I saw messages saying processes had been "reaped" and "killed", which can be a sign of the server running out of memory.

Then I thought it had to do with the server being flooded with requests, because I saw a lot of requests coming in from some bots.

Then I thought it was a locking issue. I've been having some problems with transaction deadlocks lately, which I've been able to fix so far. And all the error messages said something about "lock wait timeout exceeded", as in: it was waiting for a lock and couldn't get it (presumably because some other process was holding that lock).

Then I thought it had to do with how we were doing locking on some of the database statements. I saw a few statements come up again and again, and figured it had something to do with them. I tried making some changes to the code to see if that changed things. It didn't.
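
For reference, the kind of thing I kept checking for blocking transactions looks roughly like this; it's a sketch using pymysql against MySQL's information_schema, and the connection details are placeholders:

```python
# Sketch: list running InnoDB transactions to look for lock contention.
# Uses pymysql; connection details are placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute(
        "SELECT trx_id, trx_state, trx_started, trx_query "
        "FROM information_schema.INNODB_TRX "
        "ORDER BY trx_started"
    )
    for trx in cur.fetchall():
        print(trx["trx_id"], trx["trx_state"], trx["trx_started"], trx["trx_query"])
conn.close()
```

In hindsight, lock contention wasn't the root cause at all; transactions were piling up because the database couldn't write anything.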

All the while, the web server looked fine, the application servers looked fine, the database server looked fine. I kept prodding, but nothing unusual showed up.

Then I tried to do a simple insert statement on the database command line, and it hung. Hm. Again, I thought it was a locking issue, because that's what it said, and that's what I'd recently struggled with.

Finally, I decided to log into one of the replicas, just to see if that was feeling any better. And that's when I noticed: The disk was full.

And I know from experience that that can take down the whole database cluster.

I quickly found the culprit (the binlogs had grown like crazy), shut that replica down, and then the main database server started accepting inserts and updates again. Phew, we were back online.
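
If you've never had to do this: checking how much space the binlogs are eating and purging the old ones looks roughly like the sketch below. The connection details are placeholders, and purging is only safe once nothing downstream still needs those files:

```python
# Sketch: see how much space the binary logs are taking and purge the old ones.
# Connection details are placeholders; only purge files nothing still needs.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW BINARY LOGS")          # one row per binlog file, with sizes
    logs = cur.fetchall()
    total_bytes = sum(row["File_size"] for row in logs)
    print("%d binlog files, %.1f GB total" % (len(logs), total_bytes / 1e9))

    # Drop everything older than the most recent binlog file.
    newest = logs[-1]["Log_name"]
    cur.execute("PURGE BINARY LOGS TO %s", (newest,))
conn.close()
```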

Immediate next steps

I'm doubling the disk size on that particular server so we can bring it back online without taking Simplero down. That's proving problematic: even though I told Linode to resize the disk, it hasn't happened yet, so the minute I bring the box back online, it takes Simplero down again. Maybe if I change the IP address on it ...

Anyway, I don't really need it to come back online, except for forensics. I want to find out WHY my disk space alerts did not fire, so I can make sure they do in the future. After that, I can let it go, and just create a new replica. Or drop it, and move the whole thing to Amazon sooner.

Finally, I'm going to double down on my efforts to migrate everything to the Amazon cloud, especially the database, so I don't have to be the one managing database infrastructure when Amazon clearly has better talent on staff for that particular task.

Post-Post-Mortem

As I was writing this, the disk on the other replica filled up, and we went down again. This time I found out quickly, though.

It also means we're not safe yet. That box had plenty of free disk space, and it's filling up fast.

I think this whole situation was caused by another change I made earlier today. Every time an email we send is sent, delivered, bounced, or marked as spam, our email provider, SendGrid, notifies Simplero about the event. As our email volume has gone up, so have these notifications, to the point where they can sometimes flood our server and prevent "regular" requests from getting through.

The change I made was to move all the processing of these notifications to our background queue. But it seems there's a significant overhead to this in the form of database updates that get logged in the binlog, and that's what's now causing the binlogs to grow like crazy.
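
One direction I'm considering (just a sketch of the idea, not code that's running anywhere) is to take each webhook payload, which SendGrid delivers as a JSON array of events, and turn it into a handful of batched updates instead of one write per event. The function, table, and column names below are made up for illustration; Simplero itself is a Rails app:

```python
# Sketch: process one SendGrid event webhook POST as a few batched updates
# instead of one database write per event. Function, table, and column names
# are made up for illustration; this is not Simplero's actual (Rails) code.
import json
from collections import defaultdict

def handle_sendgrid_events(raw_body, db_conn):
    # SendGrid delivers a JSON array of events (delivered, bounce, spamreport, ...).
    events = json.loads(raw_body)

    # Group message ids by event type, so each type becomes a single UPDATE.
    by_type = defaultdict(list)
    for event in events:
        by_type[event["event"]].append(event["sg_message_id"])

    with db_conn.cursor() as cur:
        for event_type, message_ids in by_type.items():
            placeholders = ", ".join(["%s"] * len(message_ids))
            cur.execute(
                "UPDATE email_messages SET last_event = %s "
                "WHERE sg_message_id IN (" + placeholders + ")",
                [event_type] + message_ids,
            )
    db_conn.commit()
```

Fewer individual statements should mean less binlog churn for the same amount of work, though I still need to measure that.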

So for now, I've changed it back to the old way, while I figure out how to deal with it better.

As we complete the move to AWS, this particular problem should go away, since we can do away with binlogs (and anyway, it's their problem then), but I'm still sure there are things here that could be improved.

So I'll keep hacking away at this.

Keeping a service like Simplero up and running is harder than one might think. I'm learning every day.

Posted Nov 11, 2015 - 10:17 EST

Resolved
This incident has been resolved.
Posted Nov 10, 2015 - 22:32 EST
Monitoring
Phew!

That was nasty.

We're back up. I'll keep monitoring, and write up a post-mortem.
Posted Nov 10, 2015 - 22:30 EST
Update
I'm still working on this. I've never experienced anything quite like this. It's proving way tricky to get it back online. I'm so sorry guys.
Posted Nov 10, 2015 - 22:03 EST
Investigating
We are currently investigating this issue.
Posted Nov 10, 2015 - 20:41 EST