Reflecting on the second half of 2012, as we geared up for the Umbrella launch, there’s one moment that truly stands out, and it was not a pleasant one. But it provides insight into how we worked together as a team to launch a new service, and what we were able to accomplish as a result of communication, teamwork, and determination.
“Not again,” I thought to myself on the train ride in to work. I fired off “Pants on Fire” across email, text, and iMessage (I didn’t think of placing a phone call). I had lost access to the Internet while using the beta version of what would eventually become the Umbrella Mobility iOS VPN Service. Again. But unlike before, this time we were ready.
We are extremely proud of the reliability we’ve built into the OpenDNS network infrastructure and operations around DNS (see Andree’s post), but reliability doesn’t come for free. You need to work for it and earn it, and a new team working quickly to deliver an innovative new VPN cloud service had not yet put in the work.
During the early development phases of the VPN service, we had trouble even identifying issues. When they arose, it was not clear how severe they were or who was on point. To solve the problem, we developed a common language and a set of behavioral expectations, and committed to visibility and transparency for the whole company. Pants on Fire (PoF) was born.
Pants on Fire means users are affected: they can’t enroll, log in, or, worst case, access the Internet while connected to the burgeoning Tahoe (the code name for what became the Umbrella Mobility iOS VPN service). Pants on Fire means everyone drops what they are doing and fixes the problem. It doesn’t matter that this is a beta or has only a few users. Pants on Fire means that afterwards we figure out why we had an outage. We put in more instrumentation to alert us in the future. We re-examine our development and test methodologies, and the cooperation between dev and ops, so that this won’t happen again (the 5 Whys ended up working very well for this). Pants on Fire means another entry in red and orange on the board for all to see.
The board grows. We also have some fun with it. The days run down each column, and each new week is a new column to the right. So the earliest day is in the upper left, and the most recent is marked 19 (brown pants, not on fire), the date this picture was taken.
With this visibility, patterns change. At first, the frequency of updates decreases (fewer big green dots) as we move cautiously. Soon, we gain confidence and accelerate again. Fewer Pants on Fire occur. We set a goal: we won’t ship this unless we go a full four weeks with no Pants on Fire.
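For readers who prefer code to whiteboards, the four-week goal can be expressed as a simple check. This is only a minimal sketch, assuming PoF incidents are recorded as a list of dates; the function names and the example dates are hypothetical and are not taken from any real tooling.

```python
from datetime import date


def days_since_last_pof(pof_dates, today):
    """Return the number of days since the most recent PoF.

    Returns None when no PoF has been recorded on or before `today`.
    """
    past = [d for d in pof_dates if d <= today]
    if not past:
        return None
    return (today - max(past)).days


def meets_release_goal(pof_dates, today, quiet_weeks=4):
    """True when at least `quiet_weeks` full weeks have passed with no PoF."""
    days = days_since_last_pof(pof_dates, today)
    return days is None or days >= quiet_weeks * 7


# Hypothetical example: a single PoF exactly three weeks before "today".
incidents = [date(2012, 10, 29)]
today = date(2012, 11, 19)
print(days_since_last_pof(incidents, today))
print(meets_release_goal(incidents, today))
```

A check like this only answers whether the stated goal was met. Whether to ship anyway remains a judgment call for the team.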
We approach our release date. We’ve had only one PoF in six weeks, but it was three weeks ago. Is this good enough? Will we deliver on what our customers expect? We decide to release!
Ignore the smoldering flames; they are just an indication of the pressure around the release. No pants in that vicinity are aflame. Also, no chickens were harmed in the making of Umbrella. The rubber chicken does scream loudly, however, when attention is required.
Now, three months post-release, we know when the students using our solution get out of school in Hawaii, because our instrumentation shows the surge in traffic as they all get online at once (we are guessing this is when they get home from school). Most recently, we had a new situation: our alerting flagged a potential issue before it could impact any users. We treated it like a PoF and addressed it before customers were affected.
Smoke, but no fire!
Shared language and understanding, transparency, and a little fun work hand in hand with the technology approaches that we’ve outlined in the past to make sure that we deliver the experience our customers count on. And they prevent Pants on Fire!