Gremlin looks to bring "chaos engineering" to cloud masses

Written by FinTech Futures
12th December 2017

Early aviators blamed accidents on mischievous sprites they called “gremlins”. The stories gained popularity during World War II.

A start-up called Gremlin, founded by engineers from Netflix, Google, Amazon and other web-scale companies, is looking to help enterprises improve cloud applications’ reliability by using “chaos engineering” to build up the system’s defences.

Enterprise Cloud News (Banking Technology’s sister publication) reports that the system takes out components of an internet application – for example, individual servers or connections – on a controlled basis, to test whether the system recovers gracefully. These planned outages help engineers develop system resiliency in the face of real, unplanned outages and damage, Kolton Andrus, Gremlin CEO and co-founder, tells Enterprise Cloud News.

Gremlin launched out of stealth and made its service generally available today (12 December), with $8.75 million funding from Amplify Partners and Index Ventures. Customers include Twilio and Expedia, Andrus says.

Netflix is generally credited with developing chaos engineering, starting with a tool it called the “chaos monkey”. As described on the Netflix Technology Blog in 2011, chaos monkey is “a tool that randomly disables our production instances to make sure we can survive this type of failure without any customer impact”. The tool works as if Netflix were “unleashing a wild monkey” in its data centre, breaking things. The goal is to test component failures to be sure they don’t bring down the entire service.
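The idea of randomly disabling an instance and checking that customers are still served can be sketched in a few lines of Python. This is a toy illustration only, not Netflix's tool: the instance names, the `handle_request` function and the fixed random seed are all assumptions made for the example.

```python
import random


def handle_request(instances):
    """Serve from the first healthy instance; fail if none remain."""
    for name, healthy in instances.items():
        if healthy:
            return f"served by {name}"
    raise RuntimeError("outage: no healthy instances")


# Hypothetical pool of three redundant instances, all healthy at the start.
instances = {"a": True, "b": True, "c": True}

# Pick one instance at random and disable it (seeded so runs are repeatable).
victim = random.Random(7).choice(sorted(instances))
instances[victim] = False

# The experiment passes if a request is still served by a surviving instance.
response = handle_request(instances)
```

If `handle_request` raised an error here, the experiment would have exposed a single point of failure before a real outage did.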

Netflix developed an entire suite of tools, which it called the “Simian Army”, to test failures such as poor latency, as well as finding and shutting down instances that don’t conform to best practices, and testing for instance health and security violations.

Andrus says Amazon, where he worked at the time, was doing the same sort of work at about the same time as Netflix. Prior to that, “a lot of what we were doing was reactive,” Andrus says. “It was whack-a-mole. We were getting paged at night. We wanted to be proactive.”

Andrus later joined the Netflix team to continue working on failure testing and chaos engineering.

Now, with Gremlin, Andrus and his team of 15 are looking to bring chaos engineering to enterprises and other cloud application developers.

The problem is that cloud applications have made reliability more difficult, Andrus says. In the world of monolithic data centre applications, many problems could be solved with redundancy. Now, cloud applications require myriad microservices, relying on third parties for infrastructure.

“It’s very difficult for an engineer to be able to hold all that in their head, to be able to understand what might go wrong,” Andrus says.

Chaos engineering is like a flu shot or vaccine, Andrus says. “It sounds counter-intuitive, but injecting a little harm helps us understand how the system behaves, and helps us build up our defence against the damage.”

Gremlin supports containers, and is cloud-agnostic, working with Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform and bare metal servers in the data centre.

The service relies on three key principles: safety, security and simplicity. For safety, every change can be rolled back and Gremlin also limits the “blast radius” of a change – the amount of damage it can potentially do.
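The two safety ideas, a capped blast radius and a rollback, can be pictured with a small Python sketch. It does not use Gremlin's API or reflect how the service is built: the `Host` and `Experiment` classes, the `blast_radius` parameter and the host names are hypothetical, chosen only to show the principle.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    healthy: bool = True


@dataclass
class Experiment:
    """Hypothetical chaos experiment that disables a bounded sample of hosts."""
    hosts: list
    blast_radius: float  # maximum fraction of hosts the experiment may affect
    affected: list = field(default_factory=list)

    def run(self, rng):
        # Cap the number of targets so the potential damage stays bounded.
        limit = int(len(self.hosts) * self.blast_radius)
        self.affected = rng.sample(self.hosts, limit)
        for host in self.affected:
            host.healthy = False

    def rollback(self):
        # Restore every host this experiment touched.
        for host in self.affected:
            host.healthy = True
        self.affected = []


hosts = [Host(f"web-{i}") for i in range(10)]
experiment = Experiment(hosts, blast_radius=0.2)

experiment.run(random.Random(42))
down = sum(not h.healthy for h in hosts)  # at most 20% of the 10 hosts

experiment.rollback()
restored = all(h.healthy for h in hosts)
```

With ten hosts and a blast radius of 0.2, no more than two hosts are ever disabled, and the rollback returns all of them to a healthy state.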

For security, Gremlin only communicates over SSL, and supports precautions such as permission controls, single sign-on, and role-based access controls.

And for simplicity, Gremlin uses intuitive user interfaces to walk people through running experiments, reporting and controlling tests. The service includes an API to integrate with third-party software, as well as a command line interface for advanced users, Andrus says.

Gremlin tests for a variety of types of failures: CPU failures, disk and memory overconsumption, virtual machine failures, container failures, failures to synchronise clocks, network problems such as failures to resolve DNS, AWS S3 failures, and more.

“It’s a bit like a fire drill,” Andrus says. “You want to test these things properly, you want to give people an opportunity to practice it, during the day, when their caffeine has kicked in.” That way, when the real failure comes in the middle of the night, IT will be ready.