Google Production Environment

Google Production Environment

name is Luis Quesada Torres, and I work as Site
Reliability Engineer at Google. In this tech talk, I am going
to present the Google Production Environment or, at least a
self-consistent way in which Google infrastructure has worked. Some details may have
changed over time. I am going to
introduce everything needed to develop, deploy,
and run a large-scale service. Please note that I
based these slides on the SRE book,
which I’ll link to at the end of the talk. Credits for the contents
of this tech talk go to the authors and
contributors of the book, and especially to
JC van Winkel, who authored the “Production
Environment at Google” chapter in it. Let’s start by
briefly discussing some concepts related to
the physical setup of Google infrastructure. That is, network and machines. Google infrastructure relies
on three different levels of network. The Edge Network connects
Google infrastructure with ISPs to get traffic from
and send traffic to users. Data centers communicate
with each other by means of a globe-spanning
backbone network called B4. B4 is a software-defined
networking architecture that uses the OpenFlow
open-standard communications protocol. The network fabric deployed
inside data centers is called Jupiter. Jupiter consists of hundreds
of hardware switches that form a virtual switch
with tens of thousands of ports and a bisection bandwidth
of 1.3 petabits per second. Now, in terms of
request handling, most external requests, after they
go through the Edge Network, hit Google Front End, or GFE. This front end acts
as a reverse proxy that routes the requests and
the replies between the users and the processes
running the application. The last missing piece is the
Global Software Load Balancer, or GSLB, which balances requests
at three different levels. It does geographic load
balancing for DNS requests like to, it load
balances at the user service level, for example to YouTube
or Google Maps and it load balances individual
Remote Procedural Calls, or RPCs. So a specific user request would
go from their local network to their ISP, then to
Google’s Edge Network. Once in Google network,
it would hit GFE. And from there,
the payload would go through GSLB,
B4, and Jupiter, potentially with several
levels of indirection, until it hits the server. Then the response goes all
the way back to the user– all of this in just
a few milliseconds. The physical layout of
computers within a data center is the following. Google production fleet
runs on many campuses. Within each campus, there can
be one or more data centers. Within each data center, there
can be one or more clusters. Within each cluster,
there can be several rows. Within each row, there
are several racks. And within each rack,
there are several machines. As you can imagine, this adds
up to too many machines. It is impossible to manage
that many machines manually. The teams that maintain
concrete services need an easy way to schedule and
manage jobs in these machines. There is software that is
in charge of managing what runs in each of the
machines in a cluster. Let’s see how it works. Let’s imagine users or teams
want to run jobs in a cluster. Those jobs can be indefinitely
running servers or batch processes, like MapReduce. Each job can consist
of more than one and sometimes thousands
of identical tasks. That is for reliability,
but also for scalability. A single process may
not be enough to process all the traffic. Each job has concrete
resource requirements. For example, three
CPU cores per task or two gigabytes
of RAM per task. Users launch these jobs by using
tools that read config files. Requests arrive to a
borg manager process. The borg manager
checks with a scheduler in which machines the
tasks should be scheduled to optimize the bin packing. Then it asks the
borglets running them to start the tasks. If a task starts malfunctioning,
it is restarted, possibly in a different machine. Please note that given
that tasks are fluidly scheduled over machines,
IPs and port numbers mutate, and it is not possible to rely
on them to refer to the tasks. So, there is an extra
level of indirection– Borg Naming Service, or BNS. A BNS address like this resolves
to the current IP address and port of a concrete task. BNS addresses have to
be always up to date and consistently stored. Where is the mapping between BNS
addresses, and IPs, and ports stored? Well, the mapping
between BNS addresses, and IP addresses,
and ports as well as locks to elect the new borg
manager if the current one malfunctions and
potentially some other data are stored in Chubby. Chubby is a lock service that
provides a filesystem-like API. It handles locks
across data centers and uses the Paxos algorithm
for asynchronous consensus. With this, borg
manages a cluster and turns it into some
sort of huge distributed computer in which users
can run processes. How can users store
data in this computer? A first approach would be to
directly use the hard disk drives and solid-state
drives in the machines where the tasks are running. However, given that tasks
are fluidly scheduled, these drives are not a
good option for these tasks to store permanent data. Indeed, D, which
stands for Disk, is a file server that runs
on machines in a cluster and uses HDD and SSD to
store data temporarily. Colossus is a successor to
GFS, the Google File System. It is the layer on top of D
that creates a cluster-wide filesystem that can store
data permanently. It offers usual filesystem
semantics, and replication and encryption capabilities. Bigtable is a NoSQL
database system that works on top of Colossus. It can handle databases
that are petabytes in size. I will say this slowly. Bigtable stores maps that
are distributed, sorted, persistent, sparse,
and multi-dimensional. Each value stored in the map is
an uninterpreted array of bytes. The clients using the table
are responsible for serializing and interpreting the
data before writing and after reading
it from the table. Bigtable also supports
eventually consistent cross-cluster
replication. Spanner is another
storage system that works on top of Colossus. It offers consistency
across the world. Plus, there’s plenty
of other storage options built on top
of Bigtable, Colossus, and Spanner. The storage systems
complement Borg to turn a cluster into
a huge distributed computer that can not only run
processes, but also store data. But it is more fragile
than a personal computer, because many machines
are more likely to break than a single one. How can this computer guarantee
that all the processes are always running fine? It does so by means
of monitoring. Monitoring services monitor all
the tasks running in a cluster. Borgmon is a whitebox
monitoring approach. Tasks running in
a cluster export a set of values
that can be gathered using an HTTP request. The Borgmon scraping layer
gathers all these values. This layer consists
of several tasks that aggregate partial data. It’s several tasks for
scalability purposes. Then, the cluster-level
Borgmon task gathers all the partially
aggregate values for a cluster and aggregates them
into per-cluster values. This setup is the
same in every cluster, and then there’s a
global layer that gathers aggregate data
from all the clusters and aggregates it together. The cluster-level and
the global Borgmons export the values to
a time series database and can trigger alerts– for example, if request error
ratio for a concrete service in a cluster or
globally is too high. Probers are a blackbox
monitoring approach. They send requests to a task
and evaluate their response times and response payload– for example, the HTML
contents of an HTTP response. Probers can be set up at the
same time behind the load balancer and directly pointing
at servers in a data center. This makes it easier to figure
out whether traffic is still being properly served
when a data center fails, and it makes it easier to detect
where a failure has occurred. Let’s now discuss how tasks– for example, Probers
in this slide– can communicate with other tasks
within and between clusters. Stubby is a Remote
Procedural Call mechanism, or RPC mechanism. It is HTTP-based and is
defined using protocol buffers. Indeed, requests and responses
are protocol buffers. Protocol buffers are
a language-neutral, platform-neutral,
extensible mechanism for serializing structured data. It is like XML but smaller,
faster, and simpler. Protocol buffers take, as
input, the definition of a data structure and generate
source code to write and read the structured data to and
from a variety of data streams. The open source implementation
of protocol buffers supports at least Java, Python,
C++, Ruby, and C#, and works across platforms. This is the last
piece needed to have a fully-functioning production
environment for distributed computing. So, how do engineers develop,
deploy, and run programs in this production environment? Let’s first discuss
where the code lives. Piper is the repository
where the code lives. In 2016, it stored about 1
billion files, 2 billion lines of code, and 9 million
unique source files, totaling about 86 terabytes
of data and 35 million commits in 18 years. It stores, among others, source
code, configuration files, documentation, files copied
into release branches, and supporting data files. So, let’s imagine an engineer
wants to do a change to code in the repository. The engineer has to write a
changelist, namely CL, sticking to Google’s style guides. It is Google culture to
have all code peer-reviewed, so another engineer
reviews the CL. Finally, the owner of the
modified files or directories is also requested
to approve the CL. Please note that, in some cases,
the owner and the reviewer can be the same person. The engineer
triggers the submit. And then pre-submit checks do
automated testing and analysis on the change. If everything is
fine, the change is submitted to the repository. How are these projects
with code in the repository built, tested, and deployed? Blaze is a build tool that
supports building binaries from a range of languages, including
C++, Java, Python, Go, and JavaScript. The engineers
specify targets which are the output of a build run– for example, a JAR file. For each target, it is
possible to specify sources such as files with source code. It is also possible
to specify dependencies on other targets and tests. Blaze can also build MPM
packages, where MPM stands for Midas Package Manager. These packages are versioned and
signed to ensure authenticity. A framework runs continuous
testing on the code. The last passing version
is periodically picked up by the release tool Rapid, which
runs Blaze again to generate MPMs and labels them so that they
are identifiable by a name and a version. Finally, Sisyphus does
the push to production. Sisyphus is a framework that
allows rollouts as complicated as necessary. For example, they can
be pushed immediately with canarying stages
or to successive clusters over a period of several hours. And that’s it. That’s everything. Let’s now do a quick recap. Let’s see how an engineer
would use all this technology to develop a service, build
it, run it in Google Production Environment, make it serve
external user requests, and keep it up to date. Let’s assume that
an engineer wants to develop, for example, a
high-traffic online forum. They use the development
infrastructure to write the code and tests
and get them peer-reviewed and submitted. And they build the
first project MPMs. They write the
configuration files and start the jobs in
Borg, the cluster manager. Borg uses Chubby,
the lock service, to store the mappings
from BNS addresses to IP and port of the
tasks running the service. External user requests
go through Edge, GFE, B4, GSLB, and Jupiter,
and get to a concrete task running in a cluster in the form
of protocol buffers via Stubby calls. The tasks use, for
example, Spanner to write and read the
messages posted in the forum. Borgmon and Probers
are constantly running, and they fire alerts when
request latency or error ratio increase. The server code is continuously
tested and regularly built by Rapid and rolled
out by Sisyphus, so that the tasks pick up
the changes to the code and dependencies. This is already a
service whose maintenance is practically automated,
which engineers can improve, and which relies on Site
Reliability Engineers to deal with any
eventual alerts. That said, let’s wrap up. This slide contains a list
of open source projects that are alternatives to the
different components that I’ve mentioned during this tech talk. I’ll give you a few seconds
to skim through the list. OK. The Site Reliability
Engineering book, also known as the
SRE book, contains more detailed information on
these and other components as well as some other
topics of interest related to production environment, such
as reliability and scalability. Feel free to have a look
at it in the link listed in the slide. So I hope you enjoyed
this tech talk. I really enjoyed preparing
and recording it. Thank you for watching. Cheers. [MUSIC PLAYING]


9 thoughts on “Google Production Environment”

Leave a Reply

Your email address will not be published. Required fields are marked *