Shrinking Production Incidents

MYK TAYLOR: So hello everyone. All right, everyone here looking
for Shrinking Production Incidents? We're going to talk
about how to keep outages from affecting your users,
keeping them happy and keeping everyone happy. So whenever you go to
a talk, you generally want to know who is this person,
and why should I spend an hour listening to them? So who is Myk Taylor? Well, he is me. I spent a decade or so building
architectures and services for companies large and small. And then I got interested
in this newfangled thing called DevOps. So I came on board
at Google as an SRE. And I’ve been in SRE
on-call for Google Cloud for about three years now. And in those three
years, I think I found more about preparing
for, mitigating, and avoiding production incidents than I had
in my entire previous career. So I’m here with
you today to share some of the best things I’ve
learned about production incidents or, as we
usually call them, outages and how to shrink them. But first, it’s
important to cover why I’m talking about
production incidents and how to shrink them. The head of Google Cloud
Customer Reliability Engineering, Dave
Brinson, who you may have seen walking around here– he always says that reliability
is your most important feature of any service. Users may not notice
when you have it. But boy, do they notice
when you don’t have it. Every time your
service has an outage, you burn away some of
your users’ goodwill. Do this too often,
and your service will be too frustrating to use. And your users will
go somewhere else. If you want to keep
your users, even if they’re purely internal
to your own company, reliability is key. And it's definitely not free. Reliability takes a lot of work. I've never even
heard of a two-nines service, and that's just
99% availability, that stays up without
constant engineering effort. So that's what we're going to
talk about today: what you can do to keep things up and happy. So we've already established
user happiness is important. Let’s graph it across the
timeline of an incident. So your service has a
level of reliability, the border between the blue and
the red areas of the graph– that’s still kind of blue– that, when you are
below it, your users will be unhappy in some sense. Does anyone know what
this level is called? Anyone out there? Have you heard of it? I’m sure we’ve
mentioned it every day. AUDIENCE: SLO. MYK TAYLOR: SLO, that’s right–
your service level objective. Five points to Gryffindor. That’s an S just a little
bit off the screen there. Now, usually the
further below your SLO that you go, the unhappier
your users become. Above the target
level for your SLO– and this is really
the amazing part– your users become
indifferent to reliability. You can be hovering just
above the red or way up at the top of the graph,
and your users just don’t perceive any real
difference, at least in the long run. I mean, you’re not going
to stop having problems. And users will still
see those blips. But as long as, on average,
you’re above that level, then your users won’t feel
like it’s that bad of a deal. So the blue area up there, that
represents your error budget. When people say, don’t
sweat the small stuff, the small stuff is what fits
up there in your error budget. You can use it like what
Gary was just talking about for fault injection. This is where you do your
testing, when you’re still up in that blue area. That’s what you can use it for. But that’s kind of a
different topic from what I’ll be talking about. I’m focusing today on
keeping you out of the red. So what’s happening
in this graph? Well, not much yet. Let’s say it’s the user
happiness graph for a chat app because Google loves
chat apps, and we love building graphs for chat apps. So we start off nice and
high in the blue region when, bam, an outage begins. Maybe the release we just pushed
is overloading a dependency. Maybe some developer added a
log statement to a hot path and I/O was blocking. Something bad happened,
and it’s getting worse. Now, the SREs on
call for this service are blissfully
unaware that anything is wrong at this point. In fact, they just went
out to lunch as usual. In the meantime,
though, your users are starting to get
a little annoyed. But their patience
hasn’t run out yet. Once enough errors
have accumulated, the alerting system
decides it’s time to page the on-call who checks her
phone, acks the page, stuffs a little more
food into her mouth, and hurries back to her desk
to find out what’s going on. Now, this is a
particularly bad outage, so we’re starting to
dip below our SLO. And if you check
online, maybe you’ll start seeing a few complaints
pop up here and there. So it takes some time
for the on-caller to figure out what’s wrong. But they figure it out. They mitigate the problem. And then your service starts
running reliably again. So now your users are starting
to forget about the outage. It was bad, but
they got through it. And now things are OK. And the longer it
stays OK, the better they’re going to feel about it. But it never stays that way. And once you have
another outage, then you might take another hit. Now, of course,
you want your users to have the best experience. So let’s look at this
graph and break it down to the parts you could improve. You want to minimize
your time to detect– this part– and the
time to repair– this part– while maximizing
the time between failures. Now, that’s all pretty obvious. But think about how this graph
changes if the time to detect or the time to
repair were shorter or if the slope of the
graph during the outage were less steep or
if you had longer to recover before the
next outage began. You’d have much less danger
of dipping back into the red. This is what I mean
by shrinking here. If you reduce the duration,
impact, and frequency of production incidents, it
helps keep your users out of the unhappy red zone. So how do you know
how much you invest for improving each phase? Which phase do you target first? And where do you get
the data for analysis? So three is a magic
number, right? These three things
underpin everything I’m going to talk about today. Number one, of course– and
you’ve heard a lot about these– SLOs. You have to create and
maintain SLOs for your service. They’re the basis for
your error budgets. They define the desired
measurable reliability of your service. And they affect the
entire incident cycle because they tell you
when something is bad and, more importantly, they
tell you how bad it is. Number two is postmortems. You have to write postmortems. Postmortems are all about
how to make things better the next time. So when do you write one? Every time your SLO takes
a hit, write a postmortem. Sometimes something
happens where you get lucky and your SLO didn’t take
a hit, but it could have. You still want to
write a postmortem then because you don’t want
to depend on luck to keep your service reliable. It’s the SRE motto. You see it everywhere. Hope is not a strategy. The reason we’re talking
about postmortems here is because they record
what happened, when it happened, and why it happened. I'll be talking more
about them later. But this is a very fundamental
thing that we have here. And finally, if you
want accurate data, it really, really helps to have
a strong, blameless culture. A blameless culture knows
that, whatever you did, there’s probably
something better that you could have done. Everybody’s had that feeling. You’re trying to
handle something– hindsight is always
20/20, but you don’t have hindsight in
the middle of an outage. Whatever you do, it made
sense to you at the time. And you did the
best that you could. If it’s hard for you to
figure out what to do or if it’s too easy for you to
do something counterproductive, it’s the system
that should change. You should not be held
responsible for doing the best job that you could with the
information and the tools that you had. A blameless culture
is essential so that you aren’t afraid to reach
out for help during the outage. And you can be honest and
open in the postmortem. This makes the
postmortem actually useful as a learning tool. Here’s the thing. If you don’t have a
blameless culture, then you’re living
in fear of what? Blame. Your primary motivation
becomes to avoid blame, not to help your users. It’s to protect yourself,
to protect your co-workers. Would you really dive in
and try to find a root cause if you think that the
results of your research could get someone fired? That's not the environment
that you want to live in. That’s not going
to help anybody. So a blameless
culture, a postmortem built on the blameless culture,
and SLOs to underpin it all. These are the three
fundamental pillars, so remember these as we’re
going through everything else. OK, so you’ve got your SLOs. You know if you need to shrink
to keep your users happy. And you’ve got your
blameless postmortems. So you have a clue
where you need to shrink to keep your users happy. Now, what do you do to shrink
to keep your users happy? You know overall what you
want the graph to look like. But let’s zoom in on
each phase and figure out what kind of tasks we can
focus on to bring each of them under control. Now, nothing I’m going to talk
about here is revolutionary. I’m sure everyone here has heard
of everything I’m going to say, and you know they’re
all important. But what I’m trying
to do here is relate each task to which
phase they affect most so that, once you figure out where
you need to help yourself, you can then look and say, OK,
these are the kinds of things that we can work on. And this is what’s going to be
the best benefit to our team. What I want you to
think about as I go through the
next few slides is how is my service doing here? How does this apply to me? Where can I invest more? Where is it going
to help me most? And at the end,
we’ll have a Q&A. And we can go deeper into
whatever questions you have. I should really note, though,
that even if you don’t already have a blameless culture or
postmortems or even SLOs, all the stuff in
this presentation is pretty fundamental. So you can use it as a checklist
to get an instant readiness assessment of your service. And they’re all need to do. It’s just how much you
need to do each one. And then you can use
future postmortems to focus in on the ones
that help you most. All right, let’s get started. So you’re having an outage
when your error budget starts burning so fast that,
if you ignore it, it’ll burn away all of
your remaining budget. You can see there
in the yellow box that slope is definitely going
to intersect with the red really soon. So that's the condition that
you have to fire your alert on.
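As a rough illustration of that alert condition, here is a minimal sketch of a multi-window burn-rate check in Python. The metrics client and its error_ratio helper are assumptions standing in for whatever monitoring stack you run; the 14.4x factor is the common rule of thumb for a fast burn against a 30-day budget.

    SLO_TARGET = 0.999                   # 99.9% availability objective
    ERROR_BUDGET = 1.0 - SLO_TARGET      # fraction of requests allowed to fail

    def should_page(metrics) -> bool:
        """Page when the error budget is burning fast enough that,
        ignored, it would consume the remaining budget."""
        # Burn rate = observed error ratio / allowed error ratio.
        long_burn = metrics.error_ratio(minutes=60) / ERROR_BUDGET
        short_burn = metrics.error_ratio(minutes=5) / ERROR_BUDGET
        # Requiring both windows keeps a brief spike from paging anyone.
        return long_burn >= 14.4 and short_burn >= 14.4
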
The time to detect is from the start of the outage to when a person
who can actually work on fixing the
problem is alerted. Remember, as a person
on-call for a service, you can’t even begin to think
about mitigating an issue until you know it’s there. So some proactive
work here is required. You really don’t want your
first postmortem to say, service died. Nobody knew. We lost a lot of money. It’s not a good way to go. All right, so we have
another good rule of three here for shrinking
time to detect. First off, you need a
good signal-to-noise ratio for your SLI. That’s your service
level indicator that your SLO is based on. You know whether you have a
problem because of your SLIs. Good SLIs only move up or down
when your users are actively getting happier or sadder. If this isn’t true for
the SLIs that you have, you need to fix them
since you really only want to alert on problems
that your users actually care about. So make sure your
SLIs map very well to how your users
are actually feeling. How do you do that? Well, you have to
have some backup way to figure out how your
users are feeling, internet complaints,
service calls, phone calls from angry
customers, something to tell you that I’m in the
red and they’re calling me. That’s a good signal. I’m in the red and
no one’s calling me, maybe your SLI is too sensitive. Maybe you have a lot
of jitter in your SLI and you can’t use
it very effectively. A lot of people have
like a batch process that will hit their service
really hard for 5 minutes or 10 minutes at midnight. You don’t really want to have
a bunch of errors showing up at midnight for a batch
process that you control and have it page all the
people on your on-call service at midnight. So maybe take that traffic and
split it off to a separate SLI. Or some people
have SLIs where every error returned
from the service counts against the SLI. But 400s, 404s, maybe the users
don't care about those. Maybe that's not a good signal.
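To make that concrete, here is a minimal sketch of an availability SLI that leaves out the noise just described: traffic from a known batch client, and 4xx responses that are the caller's fault. The request fields and client name are illustrative assumptions.

    BATCH_CLIENTS = {"nightly-batch-job"}   # hypothetical internal batch caller

    def availability_sli(requests) -> float:
        """Fraction of user-facing requests that succeeded."""
        counted = [r for r in requests
                   if r.client_id not in BATCH_CLIENTS     # batch load gets its own SLI
                   and not 400 <= r.status < 500]          # client errors excluded
        if not counted:
            return 1.0
        good = sum(1 for r in counted if r.status < 500)
        return good / len(counted)
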
Where are we? Yeah, so in the beginning, for your SLIs, it's most important just
to have something and then iterate, iterate, iterate. Every time you have a
postmortem, look back on it and think, what did my SLI
tell me about what was actually going on in this outage? Is it worse than
what the SLI said? Tweak it. Is it better? Go back the other way. Am I measuring the wrong thing? Go back and make
a different SLI. All right, the next is– so you have your SLI
actually reading something. You actually have to
get your readings fresh so your alerting system
can do something with them. This means that you actually get
your alerts before everything explodes, hopefully. One company I worked for
generated all their metrics from analyzing logs. Now, this worked out
pretty well for them analyzing historic behavior. But it introduced
a lot of latency for their actual alerts. So we helped them move it from
a batch log processing system to a streaming metrics
collection system that goes directly from their service
process to the metrics back end and is analyzed inline. And this helped them get
a lot of that latency out.
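As one possible shape for that kind of in-process, streaming collection, here is a small Python sketch using the prometheus_client library (an assumption about your stack); the handler and metric names are illustrative.

    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter("chat_requests_total", "All chat API requests")
    ERRORS = Counter("chat_request_errors_total", "Failed chat API requests")

    def do_the_work(request):
        return "ok"   # placeholder for the real request handler

    def handle_request(request):
        REQUESTS.inc()
        try:
            return do_the_work(request)
        except Exception:
            ERRORS.inc()   # visible to alerting within seconds, not hours
            raise

    start_http_server(9100)   # expose metrics for the collector to scrape
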
So the final element to shrinking time to detect is the alerts themselves. What does it mean for an
alert to be effective? Effective alerts are user
focused, sensitive, actionable, and delivered appropriately. Now, I’ll go through
that one by one. You want user focused alerts
that only fire when users are actually being affected. This is where your
SLO definitions come into play like I said. If your SLOs are user
focused, then your SLIs will be user focused and your
alerts will be user focused. They all fit
together really well. If you base your SLI
on some indirect metric like CPU utilization,
it’s not a good signal. It’s a very low signal SLI. Sometimes it means something. Sometimes it doesn’t. You want your alerts to be
both sensitive and actionable. But it’s hard to get them
both at the same time. Alert too quickly, and
you’ll be paging at midnight for something that
goes away in seconds. And your on-callers will
start looking at it, and they’ll see it’s
already gone away. And they’ll be annoyed. But if you alert too
slowly, then you’ll be wasting time
that you could be using to mitigate real outages. It takes some analysis and
time to find the right balance.
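One way to strike that balance, sketched here as an assumption rather than a prescription, is to page only when the bad condition has held for a sustained window, so a blip that clears itself in seconds never wakes anyone. The error_ratio_series helper is a hypothetical metrics-client call.

    SUSTAIN_MINUTES = 10
    ERROR_THRESHOLD = 0.02   # page if more than 2% of requests are failing

    def should_page_sustained(metrics) -> bool:
        # One sample per minute over the sustain window.
        samples = metrics.error_ratio_series(minutes=SUSTAIN_MINUTES)
        return (len(samples) >= SUSTAIN_MINUTES
                and all(sample > ERROR_THRESHOLD for sample in samples))
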
It's also important to make sure that the alerts get to the right person. Smaller companies don't have
as much of a problem with this if they only have
one on-call team. But larger companies that
have multiple on-call teams, you need to make sure you
loop in all the right people as fast as you can. If you can do any kind of
automated root cause analysis and automatically bring
those people into the alert, then that will save
a lot of minutes as you’re trying to
get things figured out. Finally, and this is
important, make it loud. Use a service like PagerDuty,
Opsgenie, VictorOps, FireHydrant, something
that sends alerts as a phone call or a page. Don’t just use email. I repeat, do not just use email. Email is to be ignored. That’s all it’s good for. I repeat, it is not good
at all to use email. The time will come when the
person on-call for your service has got to take a shower or
eat dinner or go to sleep. And very few people check their
email while doing these things. So with these three,
your time to detect would be in pretty good shape. So shrinking time to detect
was all technical tasks. Shrinking time to
repair, though, is mostly about people,
policies, training, and, most importantly, stress management. Even if you alert
super efficiently, what good is it if everybody
panics when you get the alert? In my opinion, this is the most
important section to focus on. You’ll always have outages. They’re not going away. You can automate fast
solutions to many of these but not all of them. You hire engineers because
they’re smart and creative. And they’re ultimately
the best resource you have for fixing the problems
that slip through the cracks. Of course, they’re also
the slowest mechanism you have for fixing
the problems. But they’re your last,
most capable defense. And it pays to
prepare them properly. So a quick reminder before
we begin this section. An incident
responder’s first duty is to protect the user
and the user’s data. During your outage, keep
your focus just on that. Don’t go and try finding
all the causal factors. It just doesn’t
help at that point. So a good response is a
temporary failover, a version rollback, an emergency
request for more resources. That’s all good. A bad response–
go into the code, debug it, fix it, check it
in, build a new release, push it out. That takes way too long. You don’t have time for this. Do that later when the service
is not actively burning. So what can you do to
improve your time to repair? First, you have to
train your responders. You need emergency
procedures defined and tested so that everyone involved
can be calm and productive during an outage. You need procedures that
provide enough structure so that it’s all organized and
people know what’s going on. But you need them to
be flexible enough so that a lot of ad hoc debugging
and impromptu substructures can take shape. And don't just train
the people on call. Train everybody who might
participate in incident response– devs, product
managers, project managers, even CEOs, anybody who
might contribute and have to make some sort of call. And don't train them just once. Incident response training gets
rusty really, really quickly. To keep it fresh, a
lot of teams at Google run weekly failure drills. We call them wheels
of misfortune because we choose
somebody at random and make them try something
like fix some sort of problem in production. We call them volunteers. But we put that in
quotes because they’re more voluntold what to do. So they go, and they’re
given some sort of alert. And they talk through the
steps what they would do. And then another
member of the team, someone who organized
the drill, will play the part of any external
systems or external teams that they're in contact with. A simple example
could be an alert comes in that users can’t
log into your service. What do you do? And you go talk through
all the tools that you use, all the things you
investigate, and all the people that you contact. You can also hold practical
disaster recovery tests where you actually engineer
real, controlled outages and then mitigate them. This is a lot what
Gary was talking about in the last presentation. This gives everyone practice
with actually communicating with each other and,
among other things, tests all your
escalation channels. You can get the whole
company together to do a whole company
drill like this. A scenario that
Google likes to run is pretend that California
had a major earthquake and slipped into the ocean. So Google headquarters
is just gone. What does the rest
of the company do to continue functioning? And we do this almost
every year because it’s important to figure
out how you continue working if your headquarters
just disappeared. Number two here is that you
have to write alert runbooks. Some people call them playbooks. A runbook is the first
thing that an on-caller sees when they get an alert. You need to write one for
every alert that can fire. Write down things like
special response procedures, links to dashboards,
what queries you would run in the debug
logs to find more information. Everything someone who’s
trying to handle the alert would like to know. I’m not saying that you need
to know every possible cause and fix before you
even write an alert. But the runbook should
give you some sort of clue for how to
start the investigation and where to go for help. Now, something I
see a lot is someone writes a runbook but then
nobody else can understand it. Don’t do this. When you write a
runbook, even if you think you’re just
writing it for yourself, keep it brain dead simple. Write it for a
version of yourself that just got back from
vacation and has completely forgotten how to do everything. This is actually
a good opportunity for a failure drill. Give the runbook to an alert– give an alert for the
runbook to a volunteer and see if they can follow it. Is there anything
that they needed to do that wasn’t
in the runbook? Add it to the runbook. It’s just way too easy to
forget simple basic steps in the stress of an outage. So next up– you need
to create a pager SLO. So this is different
from your service SLO. This is the policy
for how long it can take for the person on call
to start working on a page. Regardless of what you
actually set it to, it’ll help even out
your time to repair. It’ll become more consistent,
and that’s a good thing. But what do you set it to? It’s easy to think,
duh, just start working as soon as you get paged. But it’s not quite that simple. Say you set your
pager’s SLO to something short like five minutes. Excellent, your average time
to repair just went down. But what’s the cost? It’s easy to meet
a five minute pager SLO when you’re at the office. But what about the
evenings, the weekends? If you’re on call, you can never
be more than five minutes away from your work computer. Grocery shopping? No way. Going out to the park for
a picnic with your family? It’s not going to happen. You can bring your
laptop, but you have to think–
what’s it going to be like trying to solve a real
problem with your laptop, going through your phone, and
going over a slow connection? It’s not fun. There are some teams at
Google that do actually have a five minute pager SLO. But many teams find that 30
minutes is more reasonable. Think of your pager SLO
as your jumpiness factor. How jumpy do you need
your on-call team to be? Choose this carefully. It’s going to determine
how quickly you get started on
incidents, but it’s also going to affect people’s lives. In the same theme, actually
very much the same theme, you want to start to
reduce responder fatigue. As you might guess,
burnout is a real problem. Being on call is
a hard job to do. If you want to keep your
on-callers on the team that they’re on
call for, you better make their happiness a priority. You can automate routine
operational tasks. This takes a lot of the load
off their everyday life. We call this eliminating toil. A classic example of this
is release management. How much of the release
can you automate so that you don’t have to
babysit it and click buttons every half an hour? You can get more
people on the rotation so your turn for on call
doesn’t come up as often. At Google, there’s a general
rule of six people minimum on a rotation. You might not actually
have that many SREs. But if you’re short on that,
then pull in some devs. Some senior developers make
really good on-callers. And remember,
you’re never alone. So if the primary on-caller
doesn’t know what to do, there’s always someone else on
backup that they can contact. This is always a fun one. Get rid of unactionable alerts. You can tune them to
be more effective. But maybe you can
just delete them. I can tell you that there’s no
greater joy in an SRE’s life than proving an alert is useless
and then going and deleting it. We always throw a
little celebration whenever that happens. And finally, you might be
able to relax your SLOs. This can be your
pager SLO so that you don’t have to be so
jumpy all the time, even when you’re
not getting pages. Or it could be your service SLO. Do you really need
all five nines? Maybe not. Five nines is hard to hit. Maybe four nines, or three and a half. Anything that brings
it down a little bit and still keeps your
users happy will make life a lot more
bearable for the people that have to support it. But wait, there’s more. And this one’s important, too. You need to make sure your
responders are empowered to do the job that
they’re paid to do, both technically
and politically. What do I mean by this? You should make your on-callers
have authority during outages. Give them control over the
system that they have to fix. You don’t want the outages to
go on longer than they have to because your on-callers
have to go call the CEO and ask for permission
to do something. Or they go to try something that's
really important and find out their access is denied. And they have to go and find
out why their access is denied. It delays everything, and
it’s not good for your users. And you also have to
make sure that everyone else who’s working
on the incident understands that,
when an on-caller asks you to do something, you do it. The flip side of this is that
you need a solid audit logging framework so everyone is still
accountable for their actions. Remember, blameless
does not mean invisible. Another way for you to
speed up your response is to improve your
recovery tools. If you know what you
need to do right away but it takes forever to
get it done, then maybe you need to work on this. You can start by writing
tools for fast rollbacks, because the first,
best thing that any incident responder can try is rolling
back the version that just went out, because that very
often fixes the problem. Or automated failover:
when something goes wrong in that particular
part of your application, can you just fail over to
your backup somewhere? This is very dangerous
to do manually. You want to automate
this as best you can.
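A minimal sketch of what such a tool might look like, with the deploy_api client standing in as an assumption for whatever your release tooling actually exposes: the responder runs one audited command instead of improvising under stress.

    import logging

    audit_log = logging.getLogger("audit")

    def rollback(service: str, deploy_api) -> str:
        """Roll the service back to the last version known to serve healthily."""
        current = deploy_api.current_version(service)
        previous = deploy_api.last_known_good(service)
        audit_log.info("rollback: %s %s -> %s", service, current, previous)
        deploy_api.set_version(service, previous)
        deploy_api.wait_until_healthy(service)
        return previous
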
This comes up all the time in postmortems. When you write about what you
could do better next time, think about what
tools you’ve used and what processes
can be made faster. And you need good monitoring. When you get a page, the first
place you go is the runbook. But the second is
usually the monitoring. You want to see visual
time series graphs for all of your internal metrics. These are not the SLIs. This is stuff like
error rates to back end services, queue depth,
memory pressure, all the technical
details of your system that give you a clue where to
bring your investigation next. And finally, you need
queriable logging. Like monitoring, you need it
when investigating an alert. It should tell you the error
details for what’s going wrong. And if you ever can’t
get the details this way, say so in the postmortems
so the devs can go in and add the lines so the
next time it won't be so bad. You do have to make
sure that you don’t have too much logging, though. Every line that
you log has a cost. And you don’t want to
overload whatever system you’re using on the back
end for logs management. That can get very
expensive, too. So try not to go overboard
with your logging. You also need to be aware
of user privacy issues. You can’t just log user
name and all the data that goes along with that. Talk to your legal
team about this and find out what you
can actually record and how fast you
need to delete it. This can be a really big issue. And it's obvious, but
I still have to say it: do not write passwords to logs. It never ends well.
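For what queryable logging can look like in practice, here is a small sketch using only the Python standard library: each log line is a JSON object, so the log backend can filter on fields like request_id or error_code, and nothing sensitive is ever written. The field names are illustrative.

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "severity": record.levelname,
                "message": record.getMessage(),
                # Identifiers only; never credentials or raw user data.
                "request_id": getattr(record, "request_id", None),
                "error_code": getattr(record, "error_code", None),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("chat-app")
    log.addHandler(handler)

    log.error("backend call failed",
              extra={"request_id": "req-123", "error_code": 503})
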
OK. So you've mitigated your outage. Good job. Your users are slowly rebuilding their confidence in your service. You want this phase to last
as long as possible, right? At Google we found that the
most common trigger of outages is a code or config change. So to maximize time
between failures, just eliminate all
changes, right? Well, that might work
for a little while. And actually, sometimes
it’s a really good idea like, say, you have
a business that gets almost all its revenue
in one particular weekend. Maybe don’t make any
changes during that weekend. Why take the risk? There’s no purpose. But in general, you don’t
want changes to stop. There are always features
to push, bugs to fix, new hardware to support,
new threats to counter. Changes, however,
can be made safer. And that’s what we’ll
talk about here. But first, let me go back
on what I said a second ago. You don’t necessarily want this
to last as long as possible. There are some companies I’ve
worked with that are just laser focused on this section. They think, just no outages. The best way to go
is just no outages. But it turned out it was
the wrong thing for them to focus on. And it’s probably the wrong
thing for you to focus on, too. I’m not saying, don’t
focus on it at all. But do not focus on this to the
exclusion of everything else. What happens if you go too
long without an outage? Your on-callers get
out of practice. And when you finally
do have an outage– and you will– what do they do? They have to go and look
back at the procedures. They don’t have
the muscle memory to figure out what to do. And everything ends up
being much worse than it would have been otherwise. Also companies that prioritize
time between failures too highly often slow
the release process down to a crawl. And this is bad for
a number of reasons. It’s bad for dev
velocity, and it’s really bad for response
time to market forces. When things happen in
your competitor companies, you won’t be able to
respond as quickly. In general, it’s better to
focus on shrinking your time to detect and your time to
repair and just take this as you need it. But that being said, most
of us don’t have the problem of having too few outages. So we still need to
work on this stuff. Number one is our old friend,
the blameless postmortem. Write one after every
incident to capture metrics and analyze how things went. And most importantly,
you discuss action items to make it better. And also, most importantly, make
sure you do the action items. These are not just to
be written and ignored. Make sure they get
prioritized very highly against the other features
that devs are working on. Postmortems are the key to
preventing the same incident from happening twice or
at least making it better when it happens again. Remember, there are always going
to be new kinds of postmortems. And you can’t afford
to let them build up. You should write your
postmortems collaboratively with everyone else
that is involved to get all the different
perspectives and all the different viewpoints. And so everyone can
contribute to saying, this is what I
thought was going on, and this is why
I did what I did. It’s not that complicated. They sound hard. But they’re really
not bad at all. It’s just a fact finding
document and a little bit of, this is what I was thinking. If you’ve seen the
Google SRE book online, there’s a sample
postmortem in the back. And you can use
that as a template. Next up is CI/CD. You heard about this
earlier in the day, too. That’s continuous integration
and continuous delivery. This is how you
can deploy changes with reasonable confidence
that they won't blow up in prod, probably. Sometimes they blow up anyway. But this will help them become
smaller and less impactful. You need to automate your tests,
unit tests, integration tests, end-to-end tests, and
rerun them really often to catch regressions. At Google, we basically
run every related test on every code change. And we simply block the change
from ever being committed if it doesn’t pass all of our tests. This is a high bar to pass. But it also gives us
a lot of confidence that, when we put code out,
it’s not going to blow up. And you want
continuous delivery so that you can get new code
out as often as possible. This way each release will be
nice and small, which means smaller things can go wrong. It’s much easier for you to
figure out what is going wrong when there’s only a few
changes per release. Now, I want to talk about
something really important with continuous delivery here. And that’s gradual rollouts. Gradual rollouts allow you to
discover problems and roll back before all your
users are affected. The alternative here is just
shipping your entire binary and putting it out to
all services at once. This gets the new code
out there quickly. But if there’s a problem,
you’re now 100% down. A gradual rollout will only
affect a few people at a time. And you can compare
the error rates for those few people against
the control for all the people on the previous version and
see if those error rates, the percentages, increase. And once you have a good
statistically significant number of them, you
can say, OK, we’re reasonably confident that
this new release is bad. Let’s roll it back even
though we haven’t fully rolled it out yet. It saves a lot of
your error budget. It makes that slope of the graph
at the beginning go down a lot more gradually. And that's a very good thing. We call this canary analysis.
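As a rough sketch of that comparison, here is one way to test whether the canary's error rate is worse than the control's with enough statistical confidence to roll back early; using scipy here is an assumption about your available libraries.

    from scipy import stats

    def canary_looks_bad(canary_errors, canary_total,
                         control_errors, control_total,
                         alpha=0.01) -> bool:
        """True if the canary errors more than control, confidently enough
        to justify rolling back before the rollout finishes."""
        table = [[canary_errors, canary_total - canary_errors],
                 [control_errors, control_total - control_errors]]
        _, p_value = stats.fisher_exact(table, alternative="greater")
        return p_value < alpha

    # Example: 40 errors in 2,000 canary requests vs. 150 in 100,000 control.
    if canary_looks_bad(40, 2000, 150, 100_000):
        print("Halt the rollout and roll back the canary.")
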
All this filters back to the dev side as well. Since there's always
two active versions of the code in
production, you need your data structures
and all your code to be both forward and
backward compatible. Why both forward and backward? Because you may get
partially rolled out, change some of your back end
data, and then roll back. So you need your
previous version to be able to read the data
written by your next version and vice versa. So whenever you
make a change, make sure they can read up
to two versions back. And whenever you’re expecting
a change to be there, make sure it’s there for
at least two versions before you change it.
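A small sketch of that two-version rule, assuming JSON-style records and illustrative field names: readers tolerate data written by the version before them, and writers keep emitting the old field until every running version understands the new one.

    def read_profile(record: dict) -> dict:
        return {
            "user_id": record["user_id"],
            # New field: older writers never set it, so default safely.
            "display_name": record.get("display_name", record.get("name", "")),
        }

    def write_profile(profile: dict) -> dict:
        return {
            "user_id": profile["user_id"],
            "display_name": profile["display_name"],
            # Keep writing the legacy field until old readers are gone.
            "name": profile["display_name"],
        }
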
Next is about your service architecture overall. So what makes an
architecture robust? Your architecture
is robust if it continues serving your users
despite significant internal failures. Think redundancy, no
single points of failure, graceful degradation. This means reduced
functionality when there’s an outage instead
of no functionality when there’s an outage like,
for example, if you serve some data gathered
from some back end and that back end goes down. The page that you
display to users can still successfully be
shown to users but maybe with stale data
or maybe just without that particular
piece of data. Make sure you write
your user facing code such that it can
handle partial completions from the back end.
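A minimal sketch of that kind of partial-completion handling, with the backend client, cache, and build_page helper standing in as assumptions for your own components:

    def render_home_page(user, backend, cache, build_page):
        try:
            recs = backend.get_recommendations(user.id, timeout_seconds=0.2)
            cache.set(user.id, recs)
        except Exception:
            # Reduced functionality beats no functionality: fall back to
            # possibly stale cached data, or omit the section entirely.
            recs = cache.get(user.id, default=[])
        return build_page(user, recommendations=recs)
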
And finally, chaos engineering. You hear about this a lot nowadays. Now, I want to say I don't
really like this name. And Gary said this before, too. We’re not trying
to create chaos. We’re trying to
create resilience. Whenever you see the
word chaos engineering, think resilience engineering. Anyway, why use it? It’s so you can find problems
under controlled conditions so you don’t find them in
uncontrolled conditions. If you think your architecture
is robust, poke it and verify. When I talked before
about disaster recovery training or testing, it was
for training the people, but you can also use it to
test the systems this way. And the cool thing is
it can all be automated. Just write a test to break
something in production. And what do you expect? You expect that nothing
happens, at least to users. Internally, things might be
happening all over the place. You automatically failover. Your redundancy kicks in. This is all expected. But your service
remains within your SLO. And that's the important part. Sometimes it doesn't. Sometimes you find
a real problem. That's why you do this in the first place.
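A small sketch of such an automated resilience test, with the cluster and slo_monitor objects as assumptions about your tooling: terminate one replica under controlled conditions and assert that the user-facing SLI stays within the SLO while the system heals itself.

    import time

    def test_single_replica_failure(cluster, slo_monitor):
        victim = cluster.pick_random_replica("chat-frontend")
        cluster.terminate(victim)              # the controlled failure
        deadline = time.time() + 300           # give failover five minutes
        while time.time() < deadline:
            assert slo_monitor.availability(minutes=5) >= 0.999, \
                "SLO violated during single-replica failure"
            time.sleep(15)
        # Redundancy should have restored the replica count by now.
        assert (cluster.replica_count("chat-frontend")
                == cluster.desired_count("chat-frontend"))
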
in the first place. So what are the most important
things we’ve covered today? Pretty much all of them are
ignore at your own peril. But I think the most important
ones are the two at the top– SLOs and blameless culture. This is what underpins
everything else. Your SLOs give you that
common language to say, we’re out of SLO. We need to freeze. You, the PM, you told
me in the beginning what you need the user reliability
to be, and we’re out of that. So you, as the PM, need to
work with the devs and say, we need to prioritize the
reliability to get back up. Your SLOs are your main
force of communication across different disciplines. And blameless culture–
so you can make things better so you can communicate. So you can work together and
not be in fear of each other. [MUSIC PLAYING]
