Deploying an old monolith in cluster mode
We live in the exciting times of microservices and single page apps, cloud solutions and performant concurrent engines, agile programming and continuous delivery frameworks that spit out green glowing charts for every successful release. And yet, in most organisations there is a dark, shameful corner somewhere on a server, hosting a legacy part of the system that is still used for those 3 things that the company cannot survive without. In our case, that legacy part is called WHMCS.
What is WHMCS
WHMCS is a pretty big, specialized PHP platform mainly targeted at web and server hosting companies. It has a lot of bells and whistles on top, like CRM, support tickets and billing functionality, and also a bunch of third-party modules that extend it even further. We are not exactly a hosting company, but for historical reasons it was the system we started with, so all past and current clients have an account there.
The problem
We heavily rely on WHMCS, but not just as a standalone system. It has an extensive API with tons of functions and parameters. For a better user experience, our central platform (thankfully developed in-house with those sweet microservices and single-page apps mentioned in the beginning) uses this API to integrate with WHMCS for a lot of its core functionality. Unfortunately, with the growing volume of data in WHMCS, our single-server deployment began to slow down and occasionally overload. Some of our requests took more than 10 seconds to finish, as tends to happen in such situations. The straightforward way to solve such problems is to scale vertically: get bigger, fatter, noisier servers and hope that the hardware manufacturers keep up with the demands of our data, so that performance stays tolerable for our users. Or we could try to force the PHP monolith closer to our infrastructure and onto our clusters, which at least sounds like a lot more fun.
The basics
We deploy our apps on Amazon, so we decided to move WHMCS to the cloud. We found official WHMCS guidelines on how to deploy it in cluster mode. Considering that Amazon offers a managed, scalable MySQL service (Aurora), the remaining pieces looked pretty easy and straightforward to configure.
Load Balancing
This one was pretty easy. Our WHMCS instance was already running on a separate domain, and we had a load balancer in front of our infrastructure anyway. A few slight modifications to the routing rules and we were good to go.
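As an illustration, a host-based forwarding rule on an AWS Application Load Balancer can be added along these lines; the listener ARN, target group ARN, domain and priority below are placeholders rather than our actual values:

aws elbv2 create-rule \
  --listener-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:listener/app/main/abc123/def456 \
  --priority 10 \
  --conditions Field=host-header,Values=whmcs.example.com \
  --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/whmcs/0123456789abcdef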
Database
This one was also easy. Amazon’s Aurora is a drop-in replacement for stock MySQL. Checking for compatibility issues between Aurora and MySQL, we couldn’t find anything worrying. We did a few tests, and everything worked as expected. Our enthusiasm was growing.
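As a quick sanity check, it helps to confirm the MySQL compatibility level the application will actually see. Something like the following does the job; the endpoint and user are placeholders, and the aurora_version variable may vary with the engine version:

mysql -h whmcs-cluster.cluster-xxxxxxxx.eu-west-1.rds.amazonaws.com -u whmcs -p \
  -e "SELECT VERSION(), @@aurora_version;"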
WHMCS persistent storage
WHMCS needs persistent storage to keep customers’ uploaded files. This was a bit troublesome because we didn’t want to mount persistent storage to our virtual machines and deal with the hassle of sharing it between cluster nodes (or even between clusters). After a bit of searching, it turned out that WHMCS supports S3 as a storage backend for these particular files. We didn’t have any problems here either.
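If files have already accumulated on local disk, moving them to the bucket is a one-off copy. Something along these lines works; the bucket name is a placeholder, and the exact WHMCS directories depend on your installation:

aws s3 sync /var/www/whmcs/attachments s3://whmcs-files/attachments
aws s3 sync /var/www/whmcs/downloads s3://whmcs-files/downloads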
Sessions
It is somewhat inconvenient for the user of a system to be spontaneously logged out just because the load balancer forwarded him to a new container for the next piece of server-side content he needs. We had to be sure that each WHMCS container would be able to parse and understand everything important about the user based on the cookies his requests carry. WHMCS suggested using a table in the database as storage for the server-side session data, with the cookies carrying only identifiers into that storage. We are not big fans of that solution, and we still have our concerns about it, but eventually we decided to follow the advice and see whether there was any significant performance drop. Fortunately, Amazon did an excellent job with their Aurora setup, and the additional queries for the web session had no significant cost. So, while not optimal, so far it was good.
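For reference, a database-backed session store generally boils down to a table keyed by the identifier carried in the cookie. A generic sketch (not WHMCS’s actual schema) looks something like this:

CREATE TABLE sessions (
  session_id    VARCHAR(64) PRIMARY KEY,  -- identifier carried by the cookie
  user_id       INT,
  data          TEXT,                     -- serialized server-side session state
  last_activity DATETIME
);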
Caveats
The above was the easy part. But as you can imagine, the parts that were missing from the docs were much more problematic.
Licenses
As a commercially-sold solution, WHMCS and its third party modules require licenses to work. Those are most commonly activated for a particular domain name and IP address. The domain part is something we can live with, but IP addresses are not as static in a scalable cloud environment as the license sellers would like. Most of the third party modules had some variation of the license that allowed you to deploy it on multiple WHMCS instances (sometimes even called cloud license). Unfortunately, several of the modules we were using didn’t provide such options. In our case, all of those turned out to be non-essential and we managed to either replace or remove them. You have to thoroughly plan this part of the process, consider any such migrations of a license-based product and check with each vendor if proper licensing options are available. At the very least, it might require purchasing additional or more expensive licenses, which needs to be properly budgeted.
WHMCS crons
And as usual, the biggest issues came from the most trivial place. WHMCS relies on regular cron runs to pull emails from a mailbox as support tickets and replies, to issue invoices, to mail warnings, etc. Since it did have documentation about a possible highly-available deployment, we assumed it had some internal mechanism preventing non-idempotent tasks from running twice.
We learned from our mistake: never assume; either confirm it in the documentation or straight up ask in a support ticket. WHMCS cron tasks do NOT have any internal synchronization or locking mechanism in place. They are simple jobs that run independently of each other and take action based merely on what they can pull out of the mailbox and the database. In our case, this caused a wide array of minor and more severe issues: duplicated tickets and ticket replies in the issue tracker, duplicated line items in autogenerated invoices, empty invoices, etc. Considering what third-party modules are able to do in a billing + CRM system, we probably got lucky here, as we managed to clean everything up.
We contacted WHMCS support (still assuming that such a setup was supported and we just hadn’t found the correct parameter to use). They told us that the solution was easy: just ensure no two cron jobs run concurrently. An amazing piece of advice</sarcasm> for a team that uses automation jobs to deploy and scale its clusters and prefers to think of its VMs and containers as transient infrastructure. Their solution was a no-go for us.
We could run the cron jobs in a central location. However, those tasks use the internal WHMCS API, so they need access to the WHMCS file system. That would basically mean going back to the hassle of sharing a filesystem between cluster nodes and clusters… so that one was a no-go either.
The only other solution we could think of was a protocol that elects a master from all the running WHMCS instances. For that, we needed a central piece of software to perform the election. We thought of deploying something like Zookeeper or Hazelcast and using their shared lock functionality as a simple mutex. Both are decent products and would solve our problem. However, we’ve always hated using a sledgehammer to crack a nut and decided against such overkill for a set of simple cron jobs.
We thought there should be a better solution. We went back to the drawing board and noticed that we already had a shared piece of the system that supports mutexes: our relational database in Aurora. Updates in the database are atomic, which means no concurrent update can sneak in between the moment a statement matches its rows and the moment the SET part is applied. In effect, each row is a mutex. So what we did was write a pretty simple bash script that:
1. creates a table if it is missing;
2. inserts a record with a specific id (e.g. 1);
3. tries to lock the record with that particular id, by updating the timestamp and the lock owner (in our case, the container hostname);
4. exits with a non-zero status code if the lock owner is different from the current process. This technique is called optimistic locking.
This allows us to invoke the WHMCS PHP cron scripts only upon successful execution of our improvised lock handler.
You can see an example script below.
#!/bin/bash
set -e

hostname=$(hostname)

# Connection details (host, user, password, database) are assumed to come
# from a MySQL option file such as ~/.my.cnf; adjust to your setup.
MYSQL_COMMAND="mysql -s -e"

# Create the lock table if it doesn't exist
$MYSQL_COMMAND "CREATE TABLE IF NOT EXISTS \`tblcronlock\` (\`id\` int NOT NULL, \`cron_identifier\` VARCHAR(50), \`updated_at\` DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY(id));"

# Insert the single record of the mutex table with a predefined id.
# If another client has already inserted it, INSERT IGNORE silently skips the
# row instead of failing on the primary key constraint.
$MYSQL_COMMAND "INSERT IGNORE INTO tblcronlock (id, cron_identifier) VALUES (1, '$hostname');"

# Touch the record if we own the lock, or take it over if it is stale (older than 30 minutes)
$MYSQL_COMMAND "UPDATE tblcronlock SET cron_identifier='$hostname', updated_at=NOW() WHERE (cron_identifier='$hostname' OR updated_at < NOW() - INTERVAL 30 MINUTE) AND id = 1;"

# Get the hostname that currently holds the lock
lockowner=$($MYSQL_COMMAND "SELECT cron_identifier FROM tblcronlock WHERE id = 1;")

if [ "$lockowner" = "$hostname" ]; then
  exit 0
else
  echo "Not the cron lock owner"
  exit 1
fi
The code of WHMCS’s cron is closed source, and we are not 100% sure whether the scripts are synchronous or not. That means we cannot assume that the tasks triggered by the script will complete before the script exits. Therefore, to be on the safe side, we decided that each lock would be valid for 30 minutes: once an instance gets the lock, no other instance can execute the cron for the next 30 minutes. That looks like enough headroom, since WHMCS recommends running the tasks every 5 minutes.
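For completeness, this is roughly how the lock handler can be wired into each container’s crontab; the paths to the lock script and to WHMCS’s cron entry point are assumptions that depend on the installation:

*/5 * * * * /usr/local/bin/whmcs-cron-lock.sh && php -q /var/www/whmcs/crons/cron.php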
Conclusion
The main takeaways for us in this tumultuous migration are:
- Even monolithic applications with bad architecture can be (to an extent) clusterized and scaled horizontally.
- Never assume that parts of a system are well-implemented just because the correct way is trivial to set up internally. Like electricity, developers always choose the path of least resistance and will code the bare minimum if the scope allows it.
- Always consider the license options and costs associated with license-based products.
- Understanding the basic concepts of the tools you’re using is always paramount. If you have an idea of what you want to achieve, you can often find simple solutions with your existing products and infrastructure; in our case, using the database as a mutex to achieve a sort of master/slave election.
May the force be with anyone trying to maintain and scale legacy applications with minimal control.