This post covers three lessons learned while troubleshooting the error messages of a failed MySQL database container for my WordPress site on AWS Fargate. Consider it a sort of “public service announcement” should anyone run into the same issue running this particular solution design as discussed in my earlier post: New Way To Run WordPress on AWS Fargate
First, let’s recap the solution:
- The WordPress Service is made up of one ECS task, which comprises two containers: WordPress (with Apache/PHP on Linux) and MySQL. Images are available for free on Docker Hub.
- The containers are set up with mounts to an external EFS volume, specifically to different access points (directories) on that volume. WordPress mounts its access point as /var/www/html and MySQL mounts its access point as /var/lib/mysql. Backups of the volume are regularly scheduled via the AWS Backup service.
- This particular setup allows the containers to share the network interface: WordPress can connect to MySQL by specifying localhost:3306. You can think of the setup as two processes on a single machine.
- Containers are configured so that WordPress does not start until MySQL has started and is ready for connections.
- The containers are protected by being in their own private subnet (no internet access). When WordPress needs to reach out to the internet, its requests are routed to a NAT Gateway*, which does not allow incoming traffic.
- The load balancer and NAT Gateway* are in the public subnet. They connect to the Internet via the Internet Gateway (not shown). These services are fully managed by AWS.
* NAT Gateway has been replaced with NAT Instances: Saving Money (Pretty Easily) With NAT Instances
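The startup ordering and the per-container EFS access points are expressed in the ECS task definition. Here is a minimal sketch of the relevant fields; the container tags, file system ID, access point IDs, and health check command are illustrative, not taken from my actual stack:

```json
{
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "mysql",
      "image": "mysql:8.0",
      "mountPoints": [{ "sourceVolume": "db-data", "containerPath": "/var/lib/mysql" }],
      "healthCheck": {
        "command": ["CMD-SHELL", "mysqladmin ping -h 127.0.0.1 --silent"],
        "interval": 10,
        "retries": 5
      }
    },
    {
      "name": "wordpress",
      "image": "wordpress:6.1",
      "mountPoints": [{ "sourceVolume": "wp-html", "containerPath": "/var/www/html" }],
      "dependsOn": [{ "containerName": "mysql", "condition": "HEALTHY" }]
    }
  ],
  "volumes": [
    {
      "name": "db-data",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-12345678",
        "transitEncryption": "ENABLED",
        "authorizationConfig": { "accessPointId": "fsap-11111111" }
      }
    },
    {
      "name": "wp-html",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-12345678",
        "transitEncryption": "ENABLED",
        "authorizationConfig": { "accessPointId": "fsap-22222222" }
      }
    }
  ]
}
```

The dependsOn condition HEALTHY requires the mysql container to define a healthCheck; that check is what delays WordPress until the database is accepting connections.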
Troubleshooting the Database Container
A review of the database container log showed the following error:
[ERROR] [MY-012526] [InnoDB] Upgrade is not supported after a crash or shutdown with innodb_fast_shutdown = 2. This redo log was created with MySQL 8.0.29, and it appears logically non empty. Please follow the instructions at http://dev.mysql.com/doc/refman/8.0/en/upgrading.html
First, I did not expect my database version to change over time. From the error message, it seems minor upgrades are happening automatically. When I set up the task definition I specified MySQL 8.0, since it’s best to specify a tag version rather than let the image default to latest. My expectation, therefore, was that my database version would remain at 8.0 until I decided to update. So how were the upgrades taking place?
Recall that Docker Hub is a public repository that Fargate can pull from. It’s normal for Fargate to remove and recreate tasks as needed; it is managing the containers, not you. In this case, when a task is recreated, the image is pulled again from Docker Hub.
A review of the Dockerfile for the official MySQL 8.0 image shows a recently posted update for 8.0.31. So this image for 8.0 is not static, as I had assumed, but is updated periodically. (This may not be a bad thing; it’s just something I did not expect.) As of this writing, the image tags for MySQL 8.0 and 8.0.31 appear to have the same Dockerfile content. So, as my DB container was cycled, the image pulled for MySQL 8.0 changed over time.
So, if I want a specific version of MySQL, there seem to be at least two options:
- keep Docker Hub as the image repository and specify a full version like 8.0.31, or
- create my own repository to store a specific image version, and reference that image in the task definition
There are merits to each option, but for my use case I’m not interested in maintaining an image in my own repository.
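That said, even a full tag like 8.0.31 can in principle be re-pushed. If you want true immutability with the first option, you can pin by content digest instead. A sketch, assuming Docker is installed locally:

```shell
# Pull the specific tag, then resolve it to an immutable content digest.
docker pull mysql:8.0.31
docker inspect --format '{{index .RepoDigests 0}}' mysql:8.0.31
# Prints a mysql@sha256:... reference; use that as the image in the
# task definition instead of the floating "8.0" tag.
```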
Redo Subdirectory Error
Searching for remedies for the above error message, it seemed removing the redo logs was the way to go. I SSH’d to my EC2 instance, mounted the EFS volume used by the database, and removed the redo logs: ib_logfile0 and ib_logfile1.
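That cleanup can be sketched as follows; the mount point is hypothetical, and in practice DATADIR is wherever the database access point is mounted on the instance:

```shell
# Hypothetical mount point for the database access point:
DATADIR="${DATADIR:-/mnt/efs/mysql}"
# Remove the InnoDB redo logs (-f: no error if a file is already gone):
rm -f "${DATADIR}/ib_logfile0" "${DATADIR}/ib_logfile1"
```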
After updating the service to deploy a new task, the logs then showed this error:
[ERROR] [MY-013862] [InnoDB] Neither found #innodb_redo subdirectory, nor ib_logfile* files in ./
I then went back into the volume and created the #innodb_redo directory, making sure its ownership matched the other files in the directory. Yet another service update…and this time we’re up and running!
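A sketch of that directory fix, again with a hypothetical mount point. MySQL 8.0.30 and later keep redo logs in the #innodb_redo subdirectory, which must be owned like the rest of the data directory:

```shell
# Hypothetical mount point for the database access point:
DATADIR="${DATADIR:-/mnt/efs/mysql}"
# Note the directory name starts with '#', so keep it quoted:
mkdir -p "${DATADIR}/#innodb_redo"
# Match ownership to the data directory itself (999:999, the mysql
# user in the official image, in my case):
chown --reference="${DATADIR}" "${DATADIR}/#innodb_redo"
```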
Learning About the Problem
After recently learning that this site was down and taking a look at the DB container log, I saw that for the better part of a month the site had been down. Wonderful. Did I miss the event notification?
I had configured a Route53 health check to keep an eye on the site by periodically testing a page and, in the event of an invalid or failed response, notifying me via email. And yet, with the site down, Route53 was still reporting the health check in a “HEALTHY” status. So what gives?
Upon closer inspection, the health check was disabled; I immediately enabled it. When disabled, Route53 reports the health check as…(you guessed it) “HEALTHY”.
How and when it became disabled I may never know, but it does explain why I never received a notification.
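Two CLI calls make this state hard to miss; the health check ID below is hypothetical. get-health-check reports whether the check is disabled, while get-health-check-status shows what the Route53 checkers actually observe:

```shell
HC_ID="abcdef01-2345-6789-abcd-ef0123456789"   # hypothetical health check ID
# "Disabled": true means Route53 is not evaluating the endpoint at all:
aws route53 get-health-check --health-check-id "$HC_ID" \
    --query 'HealthCheck.HealthCheckConfig.Disabled'
# What the Route53 checkers currently report:
aws route53 get-health-check-status --health-check-id "$HC_ID"
```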
- Ensure your backups are running, and try a restore or two to get a sense of how long it may take. I initially thought my data might have been corrupted and tried a data restore. My EFS volume for the entire site is just shy of 100MB, yet the restore jobs took over 1.5 hours to complete. That was unexpected given the amount of data, but reasonable given the high durability and availability of a service that replicates data across multiple Availability Zones.
- Ensure your monitoring configuration is set up correctly. Just looking for a “healthy” status in the Route53 console is not good enough: make sure the health check is actually enabled.
- Check your task configuration, especially when using images you do not own. Terraform and the like make it easy to ensure task configuration does not change, and to pull it back in line if it has; this is some of what IaC tooling buys you. (This is a good synopsis of IaC and popular tools.) But IaC does not check the images themselves. Even specifying a specific image version, rather than relying on the latest tag, may not be good enough; it was not for me with MySQL. Double-check the Dockerfile, or alternatively, build and maintain your own images.
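If the stack is managed with Terraform, drift detection is one command, with the exit code distinguishing a clean stack from one that has drifted:

```shell
# Exit code 0: no changes; 1: error; 2: live state differs from the configuration.
terraform plan -detailed-exitcode
```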
There are many ways to architect solutions. The cloud presents myriad services and tools that enable solutions to fit your particular use case and needs. However, the perceived ease and speed with which cloud solutions can be developed also mean you must pay attention to operational and configuration needs like monitoring and detecting configuration drift: whether a resource’s actual configuration differs, or has drifted, from its required configuration. All of this and more is addressed in the AWS Well-Architected Framework Operational Excellence and Reliability pillars.