Over the last year, the Sauce customer base has been growing tremendously. We have heard your concern about our recent service interruptions. We have always had our sights set on 99.99% uptime, and are continually doing everything we can to achieve that. Examples include our recent datacenter move, constant capacity upgrades, and complete re-architecture of our Mac cloud.
Unfortunately, our growth has caused some hiccups that have negatively impacted your experience using Sauce. Knowing ourselves how important a smooth development process is to productivity, our recent quality of service is simply unacceptable and we vow to do better. We want it to be 100% clear that reliability is our top priority, and every time we can’t run jobs within single digit seconds, it becomes a red alert of the highest priority for our team.
To give you insight into how we are assessing situations such as the one yesterday: when our average launch time for a job exceeds 30 seconds, we consider that event an outage and launch an investigation to fully understand what happened, and why, with the onus on not only fixing the problem, but improving our process going forward. These include such things as eliminating single points of failure, scaling our cloud capacity far beyond the foreseen needs of our users, and introducing more stringent testing and analysis into the development and pre-deploy processes.
In terms of communication, when an outage happens we update @sauceops immediately and also status.saucelabs.com, which shows the average wait time for jobs. Our status page is continuously monitoring the situation and updating every few seconds. We also take care to update @sauceops when we notice that job wait times are exceeding an average of 10 seconds, though we don’t count that as an outage.
Additionally, we’d like to make a suggestion to help isolate your build from the effects of unexpected elevated wait times and internet latency should they happen in the future. If you are using a CI system or test runner that times out and fails tests, please consider increasing your time-out threshold to five minutes to isolate your suite for when the Sauce Labs cloud experiences longer wait times. This will allow your jobs to remain queued rather than fail on Sauce. Again, you can go to http://status.saucelabs.com at any time to see average wait times and tweets from @sauceops.
Rest assured that we take events like yesterday extremely seriously, and are working to prevent this from happening in the future. We are fortunate to have very loyal customers and don’t want to let you down.
– Adam Christian
VP of Engineering