Down right now
Incident Report for Simplero
Postmortem

Today we had an outage that lasted roughly 100 minutes. Way too long, obviously.

The root cause was a disk that filled up on a secondary database server.

It happened while I was at an event at the Empire State Building. At first I tried fixing it from my iPhone, but I had to accept that I couldn't and instead hop on a train back home to my laptop.

Once home, it still took me a while to find out what was going on. While on the train, I had realized that the last thing to happen right before the outage was a database transaction timeout. That pointed me to the database being the problem.

After a lot of searching, I finally took out the sledgehammer and simply restarted the database. That helped, but the whole cluster didn't come up right.

That's when I noticed that the disk had filled up on one of the secondary database servers in the cluster. I immediately deleted some files that were no longer needed, which freed up the space, but then I had problems actually getting the cluster to start back up properly. You'd think these things were pretty straightforward, but by golly, they're not. And because six months can easily pass between the times you actually have to do any of these operations, it doesn't exactly get any easier.

But I finally did manage to get the whole cluster back up, and we're now back in business.

I'm so sorry about this, and naturally, the very first thing I'm going to do is make sure that all the servers in the cluster have alerts and monitoring set up to ensure we have enough free disk space.
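
As a rough illustration of the kind of free-disk-space check described above, here is a minimal Python sketch. The path, the 15% threshold, and the function names are assumptions made for illustration, not the actual monitoring setup. In practice something like this would run on a schedule on every server in the cluster and send the warning to an alerting channel.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of disk space still free on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_disk(path="/", min_free=0.15):
    """Return a warning string if free space is below min_free, else None.

    min_free=0.15 (15%) is an arbitrary example threshold, not a recommendation.
    """
    frac = free_fraction(path)
    if frac < min_free:
        return f"LOW DISK: {path} has only {frac:.0%} free"
    return None


if __name__ == "__main__":
    # Hypothetical usage: print a warning; a real setup would page someone instead.
    warning = check_disk("/")
    if warning:
        print(warning)
```

Alerting on a free-space threshold, rather than waiting for the disk to be completely full, leaves time to clean up before the database is affected.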

Thank you for your continued support. It means the world to me. Together, we'll go make great things happen.

Posted May 11, 2015 - 20:31 EDT

Resolved
Finally managed to crack this one. So sorry folks.
Posted May 11, 2015 - 20:11 EDT
Update
Now I managed to identify the real root cause here: A disk that had filled up. Getting things back online. Knock on wood...
Posted May 11, 2015 - 19:40 EDT
Identified
I've identified the problem as being a lock in the database. Had to take the subway home to be able to fix it. I'm still figuring out what to do about it.
Posted May 11, 2015 - 19:25 EDT
Investigating
I'm on it. Don't have an ETA right now.
Posted May 11, 2015 - 18:42 EDT