Serving Models in Production with TensorFlow Serving (TensorFlow Dev Summit 2017)

[MUSIC PLAYING] NOAH FIEDEL: Hi, everybody. I’m Noah Fiedel, and I’m
going to share with you today how you can serve your
models in production with TensorFlow Serving. So we’re going to start with
that image on the bottom right, and really this is just
showing how, as an industry, software engineering is
quite mature and advanced. And machine learning is still
early on in the early days. So if you think about all
the great things we’re used to with software engineering
and best practices that we’ve all
come to understand, things like source control
and continuous builds, we might take for granted. And machine learning,
we don’t even know what some of
those things are yet, and that motivates a lot of what
we do with TensorFlow Serving. So starting with a few stories– who has taken a commercial
airline in the last 10 years? OK, so half of us are awake
and the other half probably have as well. So going back
about 12 years ago, I was working on a system
that did the flight planning software for the majority of the
world’s commercial air flights. This included things
you might care about like how much fuel to take on
your plane to get somewhere. It didn’t have source
control when I joined. It did when I left. And I was helping do
that, but this kind of motivates that even
though source control was an understood thing of the time,
just because we have a best practice doesn’t mean
everybody uses it yet. And that’s something
with TensorFlow Serving. We want to make best
practices that you all can use and make them easy
enough that you get them kind of by default. So in 2010, I was at a small
startup doing photo sharing on mobile apps. It was really great. We had some great
engineering, we tried to do all
the best practices, we had source control and
continuous builds, and so on. And one day we decided to
add a machine-learned model to do face detection
and auto cropping, and it was really great. We loved it, our users loved it,
our investors really loved it, but we had no continuous
build from machine learning. And at the time, the tools
are kind of ad hoc enough that we checked in the weights
to the model in the execution environment, but you couldn’t
retrain the model, right? And now it’s a kind of
understood best practice. You want to be able to do that. So here we are today in 2017. We have lots and lots of great
tools, but really, you know, we’re getting started,
and we have a long way to go with machine learning. And hopefully,
all of us together can build a great ecosystem
around machine learning and best practices, and
so on, and share those. So here’s the agenda
for today’s talk. To answer the first question,
what is TensorFlow Serving? I’m actually going to
introduce what is serving. So some of you probably
know what it is. This might be, kind
of, the basics for you, but for those who
aren’t familiar with it yet, if you ever want to
deploy your machine learning into production, you’ll
want to become familiar with these concepts. So serving is how you apply
a machine learning model after you’ve trained it. Seems pretty simple
and straightforward. It has some attributes
though that might be unexpected or new to you. And, you know, taking
another view on this. Hopefully, everybody is
familiar with the orange boxes on the left side of the slide. So you have some data. You’re going to use that to
train a model in TensorFlow. And then you have that model,
and you have your application on the right. Let’s say you’re showing
video recommendations to users in your application. Somehow you need to
get that model such that it’s usable inside
your application. So the most common way to do
this is using an RPC server. And TensorFlow serving can
be used both as an RPC server and as a set of libraries inside
your app on embedded devices or in mobile. So a key attribute
for serving is that it’s online
and low latency, and this differs quite
a bit from training and other big data
pipelines you might run. Your users don’t want to
wait for the recommendations. They’re not going
to wait a minute. They’re probably not
going to wait 10 seconds. You really need to be fast
and consistently fast. Another thing that’s, you know,
different from many big data tools out there is that
you’ll have many models in a single process. So let’s say you had a
great production model, it’s serving great
recommendations to your users, and then you have a new
experiment you want to run. The most common
pattern that people do is they want to load, say,
a second experimental model in their server alongside
their main production model. So it’s really common. Another thing is an
emerging best practice that we’re seeing a
lot inside Google, and we want to bring
more to industry– is that you’ll
have many versions of your model over time. So the data that you
trained on last week might not be as relevant to the
data you gathered yesterday, and so you want to
continually train your models and continually deploy them. Last but not least,
I’m just going to mention the last bullet here. Is anybody familiar
with mini-batching? OK. Just a f– oh, a good number. So mini-batching is where
you take a bunch of examples at training time,
maybe queue them up, and then you run them
through your graph together to get more throughput
and higher efficiency. And this is really great. It lets you train models
more quickly, process a lot more training data, and
so the challenge here is how can you do that in
a production setting where all of your requests
arrive asynchronously and you want to keep a
nice bound on latency. And we have some good tools,
both at the library level and binaries for you to do that. All right, so moving on to
what is TensorFlow Serving. So it’s a flexible,
high-performance serving system for machine learning models, and it’s
designed for your production environments. TensorFlow Serving has
three major pillars. The first are C++ libraries. These are what we use internally
to build our binaries, and they’re, you
know, all open source. And they’re very low level. They include things like how to
save and restore a TensorFlow model as well as how to load
new model versions over time. We have binaries
that incorporate all of the best practices
we’ve learned thus far out of the box, and
for the emerging ones that we’re not sure are best practices yet, there are, you know, flags so that you
can enable and try them. We also have Docker
containers, and tutorials, and code labs that let you
auto scale our binaries on Kubernetes and so on. And finally, we have a hosted
service with Google Cloud ML as well as an internal
version of that. OK, so diving into
our libraries. Our core platform is
completely generic, and what this means is let’s
say you have a current system. Maybe you’re serving on a legacy
or in-house built ML system and you want to
adopt TensorFlow, but for some transition
period of time you might want to mix that
in with your legacy system. So our core platform lets you
include any C++ class you want as a servable. The components of the
libraries are all a la carte, so every component–
if you go look at our page, every
single class we have is used by some customer
inside Google on its own and in just about
any combination that you can imagine. They’re also used and deployed. And taking the last
three bullets together, the highlight here is
that all of our APIs, by-and-large, allow you to plug
in your own implementations. So you can, you know, add support for a different model store. Maybe you’re already an existing user of ML; maybe you have hundreds of models in a database. If we provide a way to get models off a file system, you can easily extend that to support a custom database or a custom data store.
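As an illustration of that extension point, a custom source of model versions might look something like this. The class and method names here are invented for the sketch; they are not the real C++ `Source` API:

```python
import os


class Source:
    """Illustrative stand-in for a TensorFlow Serving Source: it reports
    which versions of a model exist; a manager decides when to load them."""

    def list_versions(self, model_name):
        raise NotImplementedError


class FileSystemSource(Source):
    """Monitors a base directory laid out as <base_path>/<model>/<version>/."""

    def __init__(self, base_path):
        self.base_path = base_path

    def list_versions(self, model_name):
        path = os.path.join(self.base_path, model_name)
        return sorted(int(d) for d in os.listdir(path) if d.isdigit())


class DatabaseSource(Source):
    """Hypothetical plug-in backed by a models table instead of a directory."""

    def __init__(self, rows):
        self.rows = rows  # stand-in for a real database query result

    def list_versions(self, model_name):
        return sorted(v for name, v in self.rows if name == model_name)
```

A manager can poll either source identically; only the plug-in knows whether versions came from a directory listing or a database row.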
OK, so I’m going to walk you through. Hopefully, this is not too complex. This is our libraries at a high level inside TensorFlow Serving, and they’re also arranged here exactly as they are inside the binaries that we pre-build for you. So the green boxes are
our standard, kind of, abstractions with APIs,
and the yellow boxes are plug-ins into
those abstractions. So let’s imagine that
this is a server, and you’re doing
video recommendations, and you have version one
of your model loaded. And it’s called My Model. So you have the
source, and the source is responsible for identifying new model versions that should be loaded. So we have a file-system plug-in to the source. It does exactly
what it sounds like. It monitors the
file system, sees that there’s a new
version and says, ah-ha, version two arrived, right? And now we want to load version two because it was, you know, trained on fresher
data, and it’s going to provide better
recommendations to your users. So it’s going to create a loader
of a TensorFlow SavedModel. And it’s important to note
that the loader of a SavedModel knows how to load the
SavedModel and it knows how to estimate the resources
such as RAM or GPU storage that will be used by that model. It doesn’t actually load it yet. That’s the job of the manager. So this loader’s admitted to the
manager as an aspired version, and it’s actually up to the manager to figure out when it’s ready: when it has thread scheduling available, and when it
has enough resources to load that new model
version, it’ll do so. And this is where one
really key plug-in comes into play, which
is the version policy. And it turns out that in
almost all scenarios, if you’re serving something like video
recommendations to users, you want that serving system
to always be available, right? You never want to have downtime. On the other hand, there
are use cases out there where let’s say you have
an offline, big batch pipeline that’s maybe annotating
these videos in batch, right? And it’s not
directly user facing. And let’s say that
your model is very big. You might prefer to have a
little bit of unavailability in that pipeline and save
a bunch of memory, right? So instead of loading both versions of that model at once, you can actually delete the old one and then load the new one, and just have a little hiccup in your serving. So the version policy lets you preserve availability or preserve resources. So it’s a nice thing there.
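To make the two policies concrete, here’s a small sketch. The policy names mirror the availability-preserving and resource-preserving behaviors described above, but the code is illustrative Python rather than the actual library:

```python
def version_transition(policy, old_version, new_version):
    """Return the ordered load/unload actions for swapping model versions."""
    if policy == "availability_preserving":
        # Load the new version first: no downtime, but both versions are
        # resident at once, so peak memory use is higher.
        return [("load", new_version), ("unload", old_version)]
    if policy == "resource_preserving":
        # Unload the old version first: a brief serving hiccup, but only
        # one version is ever resident.
        return [("unload", old_version), ("load", new_version)]
    raise ValueError("unknown policy: %s" % policy)
```

The only difference between the two is the order of the actions, which is exactly the availability-versus-memory trade-off in the talk.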
OK, so onto some strengths of the libraries. First and foremost, they’re optimized for very high performance and robustness. These are used in some of the largest serving systems at Google, at pretty large scale. And we do things like, you
know, ref-count accesses to your models. So from the previous slide, let’s imagine that version two of the model loads. And this is kind of getting into the details, but once version two is loaded, you can’t actually immediately unload version one, because there might still be a request pending to it. So we actually keep track, via ref-counting, of the requests in-flight on each version of your models, and only after all the requests have quiesced can we delete a version that’s no longer needed.
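That ref-counting behavior can be sketched like this; illustrative Python, not the real C++ handle machinery:

```python
class RefCountedVersion:
    """A model version that is only truly unloaded once every in-flight
    request on it has quiesced."""

    def __init__(self, version):
        self.version = version
        self.in_flight = 0
        self.unload_requested = False
        self.loaded = True

    def acquire(self):          # a request starts using this version
        self.in_flight += 1

    def release(self):          # a request finishes
        self.in_flight -= 1
        self._maybe_unload()

    def request_unload(self):   # the manager wants this version gone
        self.unload_requested = True
        self._maybe_unload()

    def _maybe_unload(self):
        # Unloading is deferred until no requests remain in flight.
        if self.unload_requested and self.in_flight == 0:
            self.loaded = False
```

The key point is that `request_unload` is a request, not a command: the actual unload happens on the last `release`.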
And last but not least, just to reemphasize: we have these nice plug-in APIs, so you can build your own sources to get models from a database, an RPC layer, a pub-sub system, kind of whatever you like. OK, so last show of hands. Who’s used TensorFlow
Serving libraries so far? Great. A few of you. So for anyone who’s
thinking of using this, you would just see
this by default. This is in our tutorials,
but for those of you who’ve used the current
libraries, what we found, internally and externally,
is that you had this– setting up all of those
libraries with the best practices took a bunch of
boilerplate and, kind of, common code. So we made a new class called
ServerCore, and what this does is it lets you declare the
set of models you want loaded and pass them to
ServerCore and say, just give me a manager of
these models, please. I don’t really care
about the details. Give me all the best
practices out of the box. And so, this should
let you delete about 200 to 300 lines of code. So give that a try. Send comments and
feature requests. OK, so this slide is
intentionally short. The binaries are
very, very simple. They take all of
the best practices from the libraries
of TensorFlow Serving and just wrap them in a gRPC layer, along with some flags, and configuration, and
monitoring, and other things you would need. Our binaries are
based on gRPC, which is Google’s high-performance,
open-source universal RPC framework. And you can extend this
as well, but this is what we provide out of the box. In terms of specific APIs, currently open source and complete with the API implementation, we have a low-level, Tensor-oriented Predict API, so this should be usable for any kind of model inference that you would need to do. Coming soon, we have regression and classification interfaces. The APIs are already on GitHub, and the implementation is out for review, so it should land very soon.
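To give a feel for the shape of a Tensor-oriented predict call, here’s an in-process sketch. The request layout, model name, and latest-version rule are all illustrative; the real API is defined by protocol buffers over gRPC:

```python
# Each model maps version numbers to a callable over named input tensors.
MODELS = {
    "my_model": {
        1: lambda inputs: {"scores": [sum(row) for row in inputs["x"]]},
        2: lambda inputs: {"scores": [2.0 * sum(row) for row in inputs["x"]]},
    },
}


def predict(request):
    """Dispatch a predict request to the named model, defaulting to the
    latest loaded version when none is pinned."""
    spec = request["model_spec"]
    versions = MODELS[spec["name"]]
    version = spec.get("version") or max(versions)
    return {"model_version": version,
            "outputs": versions[version](request["inputs"])}
```

A caller that doesn’t pin a version automatically tracks the newest one, which is the continuous-deployment pattern described earlier in the talk.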
All right, now I’m going to move on to some challenges that we’ve seen, along with the best practices for how to solve them. So the first one: this was a great story around isolation and latency. So this is a graph
of latency with– each of those spikes
is approximately daily, around noon Mountain View time. So this is for a
large Google customer serving several thousand
queries per second. And right around,
give or take, noon, there was this latency spike
at the 99.9th percentile. That put that serving
customer outside of their SLA. So we looked into
this and it turned out that there was a
fantastic engineer we’d been working
closely with, really liked to push new experimental
models right before or after he went to lunch. Who knew? And it– actually, it
threw us for a loop because it did move forward
and backward by an hour or two every day. And so we didn’t find any
automated jobs doing this, but we did find the problem: loading a new model could have a pretty big latency impact on your existing
models running inference. So if you dive in a
little bit deeper, it turns out that
the default– and I think it’s a very good
default for a TensorFlow– is to optimize for throughput. That’s what most of us
want in most situations. So you don’t have to go set a
flag that says, please give me throughput. But a side effect of that default is that pretty much all the models and all the sessions on your task will get access to all of the threads. So again, a very good default,
but for the specific case where you have many models and many versions over time, what
you really want to do is isolate the
loading threads away from your inference threads. So we added– you know,
it’s a pretty fancy slide, but we added just one flag. You can set the number
of loading threads, and we typically set it to
somewhere around one or two. And I’ll show you
the after slide. So you can see much more detail,
but the main point from here is that the y-axis
dropped by 10x. So the latency spikes now are
roughly a hundred milliseconds. This is completely inside
the customer’s SLA. Customer’s happy. We have a best practice now
that all of you can use. The next challenge– and
I hinted at this earlier in the talk– is about
batching, and handling asynchronous requests,
and getting the efficiency of mini-batching at serve time. So if you look at the
little traffic on the right, you have the green,
blue, and orange boxes. And these represent
your user queries arriving at different
times, kind of falling down towards your server. And what you want
to do, ideally, to get that
mini-batching-like performance is kind of wait for
some period of time. Take a few requests,
put them together, and run one graph computation. At the same time,
you really want to keep a strict upper
bound on the latency you’re willing to wait. So we’ve done quite a
bit of investment here. Our batching is available as a library as well as via a flag for the server. And there are a lot of interesting nuances here that we didn’t even realize we were getting into when we first started doing this. As just one example, let’s imagine that you had two models on the same machine. What you might actually want to do is schedule their batches back to back instead of having them overlap and contend for threads. So we have the concept of a shared batch scheduler: each model has its own queue, and they cooperate via the shared scheduler.
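The core grouping logic, closing a batch when it fills up or when its oldest request has waited too long, can be sketched offline over timestamped requests. All names and thresholds here are made up for illustration:

```python
def schedule_batches(requests, max_batch_size, max_wait):
    """Group (arrival_time, model, payload) requests into per-model batches.

    A model's open batch is flushed when it reaches max_batch_size, or when
    a later request finds it has been open for longer than max_wait.
    """
    open_batches = {}  # model -> [open_time, payloads]
    closed = []
    for arrival, model, payload in sorted(requests):
        if model in open_batches and arrival - open_batches[model][0] > max_wait:
            closed.append((model, open_batches.pop(model)[1]))  # waited too long
        batch = open_batches.setdefault(model, [arrival, []])
        batch[1].append(payload)
        if len(batch[1]) == max_batch_size:
            closed.append((model, open_batches.pop(model)[1]))  # batch is full
    for model in sorted(open_batches):  # flush leftovers at end of trace
        closed.append((model, open_batches[model][1]))
    return closed
```

A real scheduler does this with threads and wall-clock timers, of course; the sketch only shows the size-or-deadline trigger that bounds tail latency.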
As a teaser for later: sequence models make this really interesting, given that they’re typically not just a feed-forward network. And the sequences can be of varying length, so it gets very tricky and challenging to batch these together. Eugene is up in a couple of talks; he’s going to go into much more detail about batching and sequences. A few areas of emerging tech. So Jonathan was
mentioning a SavedModel. We’re quite excited to announce
this and encourage everybody to adopt it. So SavedModel is the
universal serialization format for TensorFlow models. It’s included in TensorFlow 1.0 and also in TensorFlow
Serving as of right now, live on the repo. There are two marquee
features of SavedModel. So the first one is support
for multiple MetaGraphs. And for folks who have been,
you know, mostly doing training, this– you might not
know why you’d want this, but once you hear more details
you’ll see at serve time it’s really important to
have different MetaGraphs. And the use case here is let’s
say you were training a model, and you went to
save it for serving. So you typically want to
remove things that are training-specific, like input queues and dropout layers. You really don’t want those in
your production serving models. A few people have tried; you get really bad results, or no results at all, if requests just hang on a queue. Another thing you might want to do is transform your serving graphs: for example, you might want a quantized graph for serving on a GPU or a TPU. And what the multiple-MetaGraph support does is let you have as many MetaGraphs as you want, and you can store and access them by simple tags. So you can tag one with Serve, one with Train, one with Serve and GPU or TPU, and so on.
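The tag mechanism itself is simple enough to sketch as a toy registry. In TensorFlow 1.0 you would actually write these MetaGraphs with `tf.saved_model`’s SavedModelBuilder; this Python stand-in, with made-up graph values, is only to show the tag-based lookup:

```python
class SavedModelSketch:
    """Toy registry mapping tag sets to MetaGraphs, as in a SavedModel."""

    def __init__(self):
        self.meta_graphs = []  # list of (frozenset_of_tags, graph)

    def add_meta_graph(self, tags, graph):
        self.meta_graphs.append((frozenset(tags), graph))

    def load(self, tags):
        """Return the MetaGraph whose tag set matches exactly."""
        wanted = frozenset(tags)
        for tag_set, graph in self.meta_graphs:
            if tag_set == wanted:
                return graph
        raise KeyError("no MetaGraph tagged %s" % sorted(wanted))
```

One saved model can then carry a pruned serving graph, the full training graph, and a quantized GPU variant side by side, each selected purely by tags.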
The next concept is SignatureDefs. A SignatureDef defines the signature of a computation supported by a TensorFlow graph. And this is really important. So if you look at the
slide, most people probably can right away–
especially for humans. We’re good at this. You can read input
classes and scores and probably figure out what’s
going on with this graph. But if you handed this
graph without those labels to a serving system,
it would probably have no idea what to
do with it, right? With these graphs, you can
feed or fetch from just about any node in most graphs. So how would you identify where your input goes and where your output comes out? That’s what SignatureDefs do. So in this case, they specify that the middle node, on the left, is where you want to feed in your TensorFlow Example; the top-right node is your string Classes; and the bottom right is your floating-point Scores for, say, a classification.
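Conceptually, a signature is just a named mapping from logical input and output names to graph node names. The node names in this sketch are invented to match the example above:

```python
# Illustrative classification signature: logical names -> graph tensor names.
CLASSIFY_SIGNATURE = {
    "inputs": {"examples": "tf_example:0"},
    "outputs": {"classes": "index_to_string:0", "scores": "TopKV2:0"},
}


def resolve(signature, fetched_values):
    """Map raw per-node results back to the signature's logical output names."""
    return {logical: fetched_values[node]
            for logical, node in signature["outputs"].items()}
```

The serving system never needs to understand the graph; it only needs the signature to know which nodes to feed and fetch, and what to call the results.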
Building onto SignatureDefs, we’re adding support for Multi-headed Inference. This is one of the top feature requests we’ve had from the community. Is anyone familiar with
Multi-headed inference? OK. Very few. This is going to be
really popular and more common as people productionize
and deploy machine learning in their products. So it’s really,
really common that you might start with one model,
maybe a click-through model, to do predictions on, say,
your video recommendations. But over time, you
might decide, you know, we’d also like to train a model
for conversions or maybe people who watch the whole video. And quite often people
would go about this by actually training a second model. But you’re training on almost all the same data: all of your data pre-processing and, quite often, all of your hidden layers are the same, and then at serve time you’re doing a lot of redundant computation. You’re parsing your input, preprocessing it, doing the same hidden-layer computation, and really only the output layers change. So our Multi-headed Inference support builds on signatures and lets you specify multiple signatures together and say: run these together in one request. And it will do that.
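The saving comes from computing the shared trunk once per request and then running only the requested heads. A toy sketch, with made-up trunk and head functions:

```python
def shared_trunk(features):
    """Stand-in for shared parsing, preprocessing, and hidden layers."""
    return [2.0 * x for x in features]


HEADS = {  # illustrative output layers, one per signature
    "click_through": lambda hidden: sum(hidden),
    "conversion": lambda hidden: max(hidden),
}


def multi_headed_inference(features, head_names):
    hidden = shared_trunk(features)  # computed once, shared by every head
    return {name: HEADS[name](hidden) for name in head_names}
```

Serving the two heads as separate models would run `shared_trunk` twice per request; the multi-headed form runs it once, which is where the bandwidth, CPU, and latency savings come from.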
Multi-headed Inference will save you the operational overhead and complexity of deploying multiple models. It’ll save you bandwidth, CPU, and latency. So we’re really excited, and that should land sometime in the next week or two. All right, so wrapping up. And again, Eugene is going
to go into much more depth on sequence models. I just wanted to
highlight that there are many flavors
of sequence models. They are generally
very expensive to serve both in compute
cost and latency, and we’re investigating,
specifically, batching, padding, and
unrolling strategies to make them more efficient
and effective for all of you to serve. All right, so for
anybody who’s interested, we really warmly welcome
collaboration, pull requests, feature requests,
bugs, and so on. We sync all of our
code to Github weekly. We have a developer
on call, and it includes facilitating your pull
requests, answering questions, comments, and ideas, and so on. To get in touch, you can reach
us at [email protected] So here are a bunch of links
for how you can get started. I’ll let you read these
offline, and please do get in touch with your
questions and feedback. All right, so thanks very
much and up next is Ashish. [APPLAUSE] [MUSIC PLAYING]

