Those of you who tried to access phpBB.com over the past few days may have seen a variety of errors, most notably one about being unable to find the database `phpbb`. This is the full, slightly more technical post-mortem of the issues highlighted in the announcement here.
Infrastructure
Firstly, some background on our infrastructure. It is hosted very generously and free of charge by the Oregon State University Open Source Lab (OSUOSL or OSL), which is funded by donations and grants from individuals and large organisations (such as Google). We run our systems across a number of virtual machines (VMs), some of which are hosted on our own dedicated machines. Others were moved onto the main OSL cluster back in January as our old dedicated machines became increasingly unreliable. We also use a number of centralised, single-purpose services run by OSL, such as their database servers and mailing list instance, which are shared by a number of projects. OSL also very generously manage many of our VMs (using Chef), something which we’d like to write a blog post about in the future.
We now host all our databases on OSL’s main database cluster of two virtual machines, served by a virtual IP which uses server SQL1 when possible and fails over to SQL2 if SQL1 is down (which has happened a few times). SQL1 is their master (read and write), SQL2 is their slave (read-only), and SQL1 replicates to SQL2.
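As a rough illustration of the failover rule described above, here is a minimal Python sketch. It is not OSL's actual configuration: the `pick_backend` function and the health dictionary are hypothetical, and only the server names SQL1 and SQL2 and their order of preference come from this post.

```python
from typing import Dict, Optional

# Preference order taken from the post: SQL1 first, SQL2 as the fallback.
PREFERENCE = ["SQL1", "SQL2"]


def pick_backend(health: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy server in preference order, or None if all are down.

    `health` is a hypothetical map of server name -> "is it reachable?".
    """
    for name in PREFERENCE:
        if health.get(name, False):
            return name
    return None


if __name__ == "__main__":
    print(pick_backend({"SQL1": True, "SQL2": True}))    # normal operation
    print(pick_backend({"SQL1": False, "SQL2": True}))   # SQL1 down, fail over
    print(pick_backend({"SQL1": False, "SQL2": False}))  # nothing left to serve
```

The last case is the one that matters for this incident: once both servers have lost the data, there is nothing for the virtual IP to fail over to.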
Replication Errors
On July 15th a number of issues with replication from SQL1 to SQL2 were noticed with some session tables, which caused replication to be paused and a large number of statements to be skipped. OSL then restarted replication, but on the 16th we began to experience far more issues than on the previous day, and on the 20th a decision was made to reload SQL2 entirely. Anticipating that the only effect would be a slight slowdown on SQL1 due to the large number of reads, OSL began the maintenance at 00:00 on the 31st of July. The standard procedure for this is to break replication, delete the databases from SQL2 one by one, and resync from SQL1.
SQL1 Issues
At 00:20 SQL1 started experiencing errors and investigations began immediately. Although the two servers were believed to be in a master-slave configuration, they were in fact set up for master-master replication. This meant that although SQL1->SQL2 replication had been halted, SQL2->SQL1 replication had not, so the database drops on SQL2 were replicated back to SQL1 and the databases were removed from both machines. The dropping of databases was halted immediately, so some other projects' databases were unaffected, but the phpBB databases had already been dropped.
As we had no valid failover to a read-only server (normally this would be SQL2), OSL were left to restore from backups. However, the backup server had quite slow I/O and there were a lot of databases to restore (just one of our databases is ~33GB, we have a number of databases, and there are a number of other projects on the cluster). The backup restoration finished at 10:49 on the 1st August; it took so long because of the reasons above and because of issues with problematic database structures. Once the restoration was complete, the binlogs (essentially a log of all SQL queries executed) were replayed to bring the restored data up to just before the maintenance. The binlog replay finished at around 04:22 and the databases then began to be moved back into production. More details on the SQL1 and replication issues can be found in OSL’s own postmortem.
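To make the failure mode above concrete, here is a toy Python simulation. It is only a sketch under simplifying assumptions and does not model MySQL itself: the `Server` class, the `run_maintenance` function and the `other_project` database name are all hypothetical. Each server forwards every statement to its replicas, and a statement is never sent back to the server that issued it.

```python
class Server:
    """Toy database server that forwards each statement to its replicas."""

    def __init__(self, name):
        self.name = name
        self.databases = set()
        self.replicas = []  # servers that receive statements executed here

    def execute(self, action, db, origin=None):
        origin = origin or self
        if action == "create":
            self.databases.add(db)
        elif action == "drop":
            self.databases.discard(db)
        else:
            raise ValueError(f"unknown action: {action}")
        # Forward to replicas, but never back to the issuing server, so the
        # two-server model cannot loop forever.
        for replica in self.replicas:
            if replica is not origin:
                replica.execute(action, db, origin)


def run_maintenance(master_master):
    """Run the reload procedure and return SQL1's databases afterwards."""
    sql1, sql2 = Server("SQL1"), Server("SQL2")
    sql1.replicas.append(sql2)        # SQL1 -> SQL2
    if master_master:
        sql2.replicas.append(sql1)    # SQL2 -> SQL1, the link nobody expected

    for db in ("phpbb", "other_project"):  # "other_project" is a made-up name
        sql1.execute("create", db)

    sql1.replicas.remove(sql2)        # step 1: break SQL1 -> SQL2 replication
    for db in sorted(sql2.databases): # step 2: drop the databases on SQL2
        sql2.execute("drop", db)
    return sorted(sql1.databases)


if __name__ == "__main__":
    print("master-slave: ", run_maintenance(master_master=False))
    print("master-master:", run_maintenance(master_master=True))
```

In the master-slave case SQL1 keeps both databases. In the master-master case the drops travel back over the SQL2->SQL1 link and SQL1 ends up empty, which is the behaviour described above.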
Maintenance Page on .com
Unfortunately, throughout most of this time we were just displaying an error saying that the `phpbb` database did not exist. Due to the time of year, most of the team members who would normally put up a maintenance page (and take it down once things were fixed) were away on holiday or leave. They were either without internet, without the login credentials they would need (such as SSH keys or sudo logins), or on restricted internet connections that did not allow SSH. The team members who did have internet access did not have the necessary access to our repositories or servers. As a result, the only information available about the downtime came from our Twitter account and Facebook page. For this we sincerely apologise, as we are aware that many of you did not know why our site was down.
Missing Data
On Wednesday we began to realise there was some data missing after we discovered that some posts had disappeared. We take backups from SQL2, and replication from SQL1 to SQL2 broke on the 15th July, so the backup OSL had restored was the one from the 15th July. OSL found the binlogs (/var/lib/mysql) on SQL2 and replayed them in order to bring the state of the database back to the present. Unfortunately, the command used to replay the binlogs only replayed the SQL commands from the 17th July onwards, and this gap was only noticed after OSL had blown away /var/lib/mysql on SQL2 in order to restart replication from SQL1 to SQL2 once we were back up and in production.
This means all actions performed between the 15th and 17th July on any *.phpbb.com site have been lost.
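A gap like the one described above can be checked for before the binlogs are discarded. The following Python sketch shows one way to do so. It is not the procedure OSL used: the `replay_gap` function is hypothetical, and the year and times of day in the example are placeholders, since this post only gives the dates.

```python
from datetime import datetime, timedelta


def replay_gap(backup_taken_at: datetime, first_replayed_at: datetime) -> timedelta:
    """Return how much time lies between the backup and the first replayed event.

    A result of zero means the replay starts at or before the backup, so no
    changes fall between the two.
    """
    return max(first_replayed_at - backup_taken_at, timedelta(0))


if __name__ == "__main__":
    YEAR = 2000  # placeholder: the post does not state the year
    backup = datetime(YEAR, 7, 15)        # backup from the 15th July
    replay_start = datetime(YEAR, 7, 17)  # replay began at the 17th July

    gap = replay_gap(backup, replay_start)
    if gap > timedelta(0):
        print(f"Replay leaves {gap} of changes uncovered; keep the binlogs.")
    else:
        print("Replay covers everything since the backup.")
```

Running a check like this before removing the binlog directory would flag the two-day window while the logs needed to close it still exist.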
Notifications
Also, due to some unusual behaviour, everyone’s notification settings on phpBB.com were reset to never send out emails for notifications. To ensure people don’t miss notifications, we have now set everyone’s settings to send an email for the events listed below. You can change your settings here, and you can of course revert this change if you wish:
- Someone replies to a topic to which you are subscribed
- Someone quotes you in a post
- Someone creates a topic in a forum to which you are subscribed
- Someone sends you a private message
Into the future
OSL are implementing a number of policy changes and changes to their backup procedures (read more here) in order to prevent this sort of thing from happening again, as are we. We understand many of you depend on phpbb.com for support as well as for the resources it provides (downloads, documentation etc.). We will be looking at ways to make our infrastructure less interdependent and to remove single points of failure. This incident has also highlighted some rare edge-case bugs in phpBB which we are looking to patch as a matter of priority. We’ll also look at how we can better communicate downtime in the future, make appropriate maintenance pages easier to display, and ensure we always have people around who can deal with such situations.
Often when these kinds of situations arise, we receive questions about how our community can help us. While we appreciate the gesture of making a donation to us, the phpBB project does not accept financial donations. If you would still like to make a financial gift, you can support us indirectly by donating to OSUOSL. You can also help us directly by being active in our support forums, IRC, or by submitting source code patches. We are only here because of the community behind us, so anything you do to help the rest of the community helps us.
We do apologise once again for any problems this might have caused you.
As a note, all times are UTC+1 (British Summer Time).