Database upgrade
Incident Report for Simplero
Postmortem

On Thursday we set in motion some infrastructure upgrades, very carefully and behind the scenes. But it turns out MariaDB has a bug that causes it to “leak” memory when using a certain kind of data compression. Over the course of several hours, the database consumed all available memory, slowed down, and rebooted itself. That caused the few minutes of downtime on Thursday evening.

We’ve never had a problem like that before, but there’s a first time for everything. Now we have alarms so we’ll be notified of any memory issues with the database long before they cause a problem.
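
For readers curious what an alarm like this can look like, here is a minimal sketch. It is not our actual configuration. It assumes the database runs on Amazon RDS and is watched through CloudWatch's FreeableMemory metric; the instance name, threshold, and notification topic are hypothetical.

```python
def memory_alarm_params(db_instance_id, threshold_bytes, sns_topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    The alarm fires when average freeable memory stays below the
    threshold for three consecutive 5-minute periods.
    """
    return {
        "AlarmName": f"{db_instance_id}-low-freeable-memory",
        "Namespace": "AWS/RDS",              # assumption: database hosted on RDS
        "MetricName": "FreeableMemory",      # reported in bytes
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id},
        ],
        "Statistic": "Average",
        "Period": 300,                       # seconds per evaluation period
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],     # where the notification is sent
    }


# Hypothetical values, for illustration only.
params = memory_alarm_params(
    "example-db",
    2 * 1024**3,  # alert when less than 2 GiB is free
    "arn:aws:sns:us-east-1:123456789012:example-alerts",
)

# With AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

A slow leak shows up as freeable memory shrinking over hours, so a threshold set well above zero leaves time to react before the database runs out.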

We also decided to upgrade MariaDB to a version that fixes the memory leak bug. It’s a so-called minor version upgrade, and Amazon even offers to do these for you automatically during a short, regularly scheduled maintenance window, so we expected a few minutes of downtime. Instead, as you know, there was over an hour on Saturday night when the database (and hence the entire application) was inaccessible. And once the process started there was no stopping it: we were at the mercy of Amazon Web Services.

Going forward, we’ll announce ahead of time on status.simplero.com and in our Facebook group any time we plan even a few minutes of downtime.

And we’re putting a plan in place so that we can upgrade the database with (for real) no more than a few minutes of downtime.

Posted Oct 26, 2020 - 16:59 EDT

Resolved
And, we're back! Sorry that took a bit longer than expected. All is safe and sound and operational.
Posted Oct 24, 2020 - 22:10 EDT
Identified
We're currently doing a database upgrade. We expect to be back online in a few minutes. Sorry for the wait.
Posted Oct 24, 2020 - 21:32 EDT
This incident affected: Web interface and API and Background processing.