Here at Sauce Labs, we recently celebrated the completion of a significant project to improve our service uptime and reliability, as we transitioned the last of our CouchDB databases to MySQL. We’d outgrown CouchDB, to the point that a majority of our unplanned downtime was due to CouchDB issues, so wrapping up this migration was an important milestone for us.
CouchDB was a very positive experience at first, and its reliability isn’t out of bounds for a database that is, after all, only on version 1.2. But our service is very sensitive to reliability issues, and ultimately we decided that this transition was the most efficient path forward for us.
Once we decided on MySQL (specifically, we’re now using Percona, with its InnoDB-based XtraDB storage engine), we rearchitected our DB abstraction layer and one by one migrated all our databases, large and small. Our uptime was dramatically improved over the past couple months as we worked through the migration, and performance was slightly improved in the bargain.
This post describes our experience using CouchDB, and where we ran into trouble. I’ll also talk about how this experience has affected our outlook on NoSQL overall, and how we designed our MySQL setup based on our familiarity with the positive tradeoffs that came with using a NoSQL database.
First, how did we get into trouble?
Everything Was Going to be Great
When we first started Sauce Labs back in 2008, we thought we were building something very different from the service we run today. We were excited to try a NoSQL db, having spent too many years using MySQL in ways that the designers of relational databases never imagined. CouchDB seemed well suited to our needs.
Our original product design featured a REST API for storing data on behalf of our customers, and the plan was to drop almost straight through to CouchDB’s already RESTful API. This let us get a prototype up and running in a hurry. It didn’t matter that CouchDB was new and not yet hardened by a lot of real-world usage, we thought, because our database I/O needs were meager, our app was naturally horizontally scalable, and our product was fault-tolerant. We could easily bridge any reliability gap just by keeping replicas and retrying things when they failed. What could go wrong?
What Could Go Wrong
As our little company grew, and we learned about the problems our customers faced, our product underwent several major changes. It made less sense over time to partition data so strictly by user. We came to rely more on database I/O performance. In general, we found ourselves using CouchDB very differently from how we’d originally imagined we would, and much more the way most web apps use databases. That was still a reasonable way to use CouchDB, but the margin of safety we thought we had when choosing it slowly evaporated as our product evolved. And, as it turned out, we needed that margin of safety.
Sauce Labs’ service is more sensitive to reliability issues than the average web app. If we fail a single request, that typically fails a customer’s test, and in turn their entire build. Over time, reliability problems with CouchDB became a serious problem. We threw hardware at it. We changed the way we used CouchDB. We changed our software to rely much less on the database and do much less database I/O. Finally, we decided the best next step was to switch.
Again, none of this speaks particularly badly about CouchDB. It’s a young database, and reliability and performance issues are to be expected. And in a way it’s too bad, because we have no love for classical relational databases. We’re convinced that NoSQL is the future. We’re just not convinced it’s the present.
Some Things We Really Liked about CouchDB
- No schemas. This was wonderful. What are schemas even for? They just make things hard to change for no reason. Sometimes you do need to enforce constraints on your data, but schemas go way too far. With CouchDB, adding new fields to documents was simple and unproblematic.
- Non-relational. Relational databases grew up solving problems where data integrity was paramount and availability was not a big concern. They have a lot of features that just don’t make sense as part of the database layer in the context of modern web apps. Transactional queries with 6-way joins are tempting at first, but just get you into trouble when you need to scale. Preventing them from day one is usually easy.
- No SQL. It’s 2012, and most queries are run from code rather than by a human sitting at a console. Why are we still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query? Things like SQL injection attacks simply should not exist. They’re a consequence of thinking of your database API as a programming language instead of a protocol, and it’s just nuts that vulnerabilities still result from this poorly thought out 1970s design today.
- HTTP API. Being able to query the DB from anything that could speak HTTP (or run curl) was handy.
- Always-consistent, append-only file format. Doing DB backups just by copying files was simple and worry-free.
- Indexes on arbitrary calculated values seemed like a potentially great feature. We never ran into a really brilliant way to use them, though it was straightforward to index users by email domain name.
- Finally, it’s worth pointing out that even under stress that challenged its ability to run queries and maintain indexes, CouchDB never lost any of our data.
The Problems We Encountered with CouchDB
- In our initial setup, slow disk performance made CouchDB periodically fail all running queries. Moving to a much faster RAID setup helped, but as load increased, the problems came back. Percona is not breaking a sweat at this load level: our mysqld processes barely touch the CPU, we have hardly any slow queries, the cache is efficient enough that we’re barely doing disk reads, and our write load is a very comfortably small percentage of the capacity of our RAID 10 arrays.
- Views sometimes lost their indexes and failed to reindex several times before finally working. Occasionally they’d get into a state in which they’d just reindex forever until we deleted the view file and restarted CouchDB. For our startup, this was agony. Surprise reindexing exercises were the last thing we needed as a small team already taking on a giant task list and fighting to impress potential big customers.
- Broken views sometimes prevented all views from working until the poison view file was removed, at which point view indexing restarted its time-consuming and somewhat unreliable work. I don’t know how many times one of us was woken up by our monitoring systems at 4am to learn that our service was down because our database had suddenly become a simple key/value store without our permission.
- Compaction sometimes silently failed, and occasionally left files behind that had to be removed to make it work again. This led to some scary situations before we tightened up our disk usage alarms, because we discovered this when we had very little space left in which to do the compaction.
- In earlier versions, we ran into three or four different bugs relating to file handle usage. Bug reports led to quick fixes for these, and these problems were all gone by version 1.0.2.
- There’s really only one thing to say here, and that’s that view query performance in CouchDB wasn’t up to the level of performance we’d expect from roughly equivalent index-using queries in MySQL. This was not a huge surprise or a huge problem, but wow, a lot of things are quicker now, and our database machines are a lot less busy.
- When CouchDB fails, it tends to fail all running queries. That includes replication and compaction, so we needed scripts to check on those processes and restart them when necessary.
- View indexes are only updated when queried — insertion does not update the index. That means you have to write a script to periodically run all your views, unless you want them to be surprisingly slow when they haven’t been queried in a while. In practice we always preferred view availability to any performance boost obtained by not updating indexes on insertion, but writing reliable scripts to keep view indexes up to date was tricky.
- The simple copying collector used for compaction can spend a lot of time looking at long-lived documents. That’s particularly bad news when a database has both long-lived and short-lived documents: compaction takes a long time, but is badly needed to keep disk usage under control. Plus, you have to run compaction yourself, and monitoring to make sure it’s working is non-trivial. Compaction should be automatic and generational.
- CouchDB’s design looks perfect for NoSQL staple features like automatic sharding, but this is not something it does.
- What is the point of mapreduce queries that can only run on a single machine? We originally assumed this feature was headed toward distributed queries.
- It was never clear to us what the CouchDB developers considered its core use cases. We saw development focus on being an all-in-one app server, and then on massive multi-direction replication for mobile apps. Both interesting ideas, but not relevant to our needs.
(We’re told that a few of these issues have already been addressed in the recently-released CouchDB 1.2.)
We were able to work with CouchDB’s performance, and over time we learned how to script our way around the maintenance headaches. And while we were worried that CouchDB seemed to be gravitating toward use cases very different from our own, it was the availability issues that eventually compelled us to switch. We talked about a number of possible choices and ultimately settled on a classic.
MySQL, the Original NoSQL Database
So why not switch to another document-oriented database like MongoDB, or another NoSQL database? We were tempted by MongoDB, but after doing some research and hearing a number of mixed reviews, we came to the conclusion that it’s affected by a lot of the same maturity issues that made CouchDB tough for us to work with. Other NoSQL databases tended to be just as different from CouchDB as MySQL — and therefore just as difficult to migrate to — and a lot less well known to us. Given that we had experience with MySQL and knew it was adequate for our needs, it was hard to justify any other choice.
We’re familiar with MySQL’s downsides: among other things, it’s terrible to configure (hint: the most important setting for performance is called innodb_buffer_pool_size), and its query engine, besides being SQL-oriented, guesses wrong about how to perform queries all the time. Experienced MySQL users expect to write a lot of FORCE INDEX clauses.
The InnoDB storage engine, on the other hand, is pretty great overall. It’s been hardened by heavy use at some of the biggest internet companies over the past decade, dealing with workloads that resemble those faced by most modern developers. At the lowest level, almost any database is built on the same fundamentals of B-trees, hashes, and caching as InnoDB. And with respect to those fundamentals, any new database will have to work very hard to beat it on raw performance and reliability in real-world use cases. But maybe they won’t all have to: Percona’s forward-thinking key/value interface is a good example of how the solid InnoDB storage engine might make its way into true NoSQL architectures.
In switching to MySQL, we treated it as much like a raw storage engine as we reasonably could. So now we’re back to using MySQL in the way that inspired so much NoSQL work in the first place:
- We ported our CouchDB model layer to MySQL in a way that had relatively minor impacts on our codebase. From most model-using code, using MySQL looks exactly the same as using CouchDB did. Except it’s faster, and the DB basically never fails.
- We don’t use foreign keys, or multi-statement transactions, or, so far, joins. When we need to horizontally scale, we’re ready to do it. (But it’ll be a while! Hardware has gotten more powerful since the days when sharding was invented, and these days you can go a long way with just a single write master.)
- We have a TEXT column on all our tables that holds JSON, which our model layer silently treats the same as real columns for most purposes. The idea is the same as Rails’ ActiveRecord::Store. It’s not super well integrated with MySQL’s feature set — MySQL can’t really operate on those JSON fields at all — but it’s still a great idea that gets us close to the joy of schemaless DBs.
It’s a nice combination of a proven, reliable database storage engine with an architecture on top of it that gives us a lot of the benefits of NoSQL databases. A couple months into working with this setup, we’re finding it pretty hard to argue with this best-of-both-worlds approach.