From Research to Production with TensorFlow Serving (Google I/O ’17)

Good morning, Google I/O. I’m Noah Fiedel, and I’m
gonna speak with you today about how to go from
research to production with TensorFlow Serving. So I’m gonna start by sharing
a few stories from my past and observing industry. So the first one,
show of hands, who’s taken a commercial flight
in the last 10 years? All right, everyone’s awake
and you’ve taken flights, this is good. You might be shocked
to know that circa 2005 I was working on the
flight planning software that plans the majority of
the world’s flights. This does things like
compute how much fuel to take on the plane that
you probably care about. But before 2005 that system
did not use source control, and for those of you who
know what that means, that’s pretty scary. It does now and
I raise this just to show how even though we
know what best practices are we publish them, and blog about
them, and talk about them, it sometimes takes
a while for them to be adopted across industry. Fast forward to 2010. I was at a startup, and we had
a mobile photo sharing app. And we tried to do
all the right things. We had source control,
continuous integration using Jenkins, and one day
we decided to add machine learning to our app. We trained up a model that
could do face detection and auto-crop to faces. This is really great. Our users loved it. Our investors really
loved it in 2010. But for machine learning,
what are source control and continuous integration? These best practices
don’t really exist. So I trained that model
on my workstation. And if I got sick or if
my workstation crashed, no more training for us. We’d have to start over. So here we are in 2017. And at talks like this and other
conferences around the world, we have all kinds of great
tools and best practices for machine learning. So we have TensorFlow
and a whole ecosystem around it and many other tools. But there are many areas
still to be defined. So just as one example
of many, what’s a continuous integration test
look like for machine learning? So you’re deploying
models every day. When you start doing
this, you might make a test that
runs some sample data from today through your model,
and it works, but not all the time. What happens when your
user’s behavior changes? What happens if
they look at bigger objects, or different
kinds of objects, or the distribution changes? So there are many of these
things we are figuring out. And I’m gonna make
a bold statement, but I think machine
learning is 10 to 20 years behind the state of the art
in software engineering. And so we have a ways to go. One of the ways, and there are many,
where TensorFlow Serving is helping is by making it easy
to safely and robustly push out multiple versions of
your models over time, as well as the
ability to roll back. And so this seems
pretty simple, but we’ve seen people inside and
outside Google that don’t have that capability. And there are many other things
that are part of this as well. All right, so here’s the
agenda for today’s talk. For starters, TensorFlow Serving
is a flexible high performance serving system for
machine learned models, and it’s designed for your
production environments. And before I dive into the
details of TensorFlow Serving, I’m going to describe what is
serving for those of you who might not be familiar. It’s pretty simple. Serving is how you
apply a machine-learned model after you’ve trained it. Many of the talks
on machine learning, both in academia
and industry, are focused on the training side. But the serving side is kind of left
as an exercise to the reader, so that’s where we come in. On the right side of
this slide, you’re hopefully all familiar
with the orange boxes here. It you’re at a
TensorFlow talk, you should be pretty
familiar with these. You have a pile of data. You have somebody doing the
role of a data scientist. And you’re training
model, and you’ll use something like
TensorFlow to do that. Now you have your application
on the right side. And for the sake of
example, let’s say that your ranking videos. So you have an app that
users come home from work and they want to look at some
fun videos and just relax. So your application has a
list of candidate videos, and your model is sitting
over there on disk. How are you actually
gonna apply it? So the really straightforward
and common answer is you’re gonna
have an RPC server. So this server is going
to take your model, it’s going to load
it off of disk, and it’s gonna make it
query-able by your application. So your application
can give a list of videos, along
with their features, and the server will reply back
with maybe the probability of each of those videos
being clicked on. All right, so moving on
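To make that request/response flow concrete, here’s a toy sketch in plain Python. All the names are invented, and the stand-in “model” is just an arithmetic rule; a real server would run a trained TensorFlow graph behind an RPC interface:

```python
# Toy sketch of the serving flow: the "model" is a stand-in for a real
# trained graph; it maps a video's features to a click probability.
def toy_model(features):
    # Pretend the model learned: more watch history -> higher click
    # probability, longer videos -> lower click probability.
    score = 0.5 + 0.1 * features["watch_history"] - 0.05 * features["length_min"]
    return max(0.0, min(1.0, score))  # clamp to a valid probability

def handle_rank_request(candidates):
    """What the RPC handler does: score every candidate video."""
    return {video: toy_model(feats) for video, feats in candidates.items()}

request = {
    "cat_video": {"watch_history": 3, "length_min": 2},
    "lecture":   {"watch_history": 0, "length_min": 60},
}
response = handle_rank_request(request)
```

In the real system this handler would sit behind gRPC and run a TensorFlow session instead of `toy_model`, but the shape of the exchange is the same: candidates with features in, per-candidate probabilities out.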
to some goals for serving, and in particular for serving
machine learned models. So the first one– and this
is where serving differs quite a bit from what we’re used to in
talking about training of ML– is that requests are
coming in all the time asynchronously from users. You can’t control them, you’re
not reading them off of disk. But whenever a user sits down on
their couch and loads your app, that’s when you’re
gonna get a request. So we want to answer
these requests online, at very low latency, with
consistently low latency for all of your users. The next ones are also subtle
departures from training. So the first is that you might
want to have multiple models served at the same time. So when many groups start
with machine learning, they’ll have one model that
might, as this example goes, serve rankings for videos. But now what happens when
week two comes along, you have a model it’s
working on production, but now you want to
launch a new one. You’re not sure if
it’s gonna work, you might run an experiment. So now you want to run two
models and maybe have a couple different experiments
you want to run. So that’s really common as well. And lastly, who’s familiar
with mini-batching? OK, just a few of us. So I’m gonna describe
it so that we’re familiar with this cause it’s an
important part of both training and serving. So when you take a
neural network graph, and you process data
through it, there’s some overhead that happens
with all the nodes of the graph and scheduling the
work across them. And to get good efficiency
almost all of the training libraries out there will produce
what’s called a mini-batch. So instead of putting one video
through your trainer at a time, it’ll come up with
batches of 32, 64, 128. Push them through
the graph together, and you get massive
throughput improvements by doing that, in particular
on GPUs and TPUs and so on. So we aim to achieve that
efficiency of mini-batching, but at serve time when all
the requests are coming in asynchronously, and you
don’t have a nice neat 32 size batches off of disk. And the multiple model support
and efficiency, they’re neat challenges on their own. But what makes them
particularly interesting is doing those while also
maintaining those standards for low latency all the time. All right, so I just wanted
to say again, and throw this up there for you all to
read, that TensorFlow Serving is a flexible, high
performance serving system for machine-learned models. And it’s really designed for
your production environments. TensorFlow Serving has
about three major pillars. The first one is a
set of C++ libraries. These include standard support
for the saving and loading of TensorFlow models. They’re pretty
straightforward, something you’d want to do with
a serving system. We also have a
generic core platform, and this is truly generic. It’s not tied to TensorFlow,
although TensorFlow has first class
plugins for everything you might want to do. So let’s say you have a
legacy ML system in house and you might want to mix and
match for some period of time with, say, TensorFlow. You can do that. You can write adapters that
plug in different ML platforms, and run them in
TensorFlow Serving. Building on top of our
libraries, we have binaries. And these incorporate
all of the best practices that we’ve learned out of the
box, so they make it easy. We have reference Docker
containers and tutorials and code labs for running on
Kubernetes with auto scaling and so on. And the third pillar is a hosted
service with Google Cloud ML, as well as our internal instance
that many Google products are using. Lastly, I wanted to highlight
that across all three of these different offerings
your models are portable. We worked really
hard to make sure that one model format will work
in any of these environments. So you can take your
model, try it on a binary. You want to do something custom,
try it in a library and so on. You can seamlessly
migrate back and forth. All right, so super
excited to share for the first time that
we’re used very, very broadly across Google. These are just some of our
key customers inside Google. For the first time
I can say that we have over 700
projects inside Google using TensorFlow
Serving, so woo-hoo. [APPLAUSE] Thank you. All right, so I’m gonna dive in
a little bit to the libraries. So I mentioned the
generic core platform. I wanted to mention also
that the components are all a la carte, and this means
that you can mix and match them to suit your needs. So if you’re doing
something really advanced– and we have some incredibly
advanced internal customers– you can mix and match
these components and use just the ones that
accomplish your needs. You don’t have to buy the
entire set of libraries. You can take the ones you want. Let’s see. There’s the batcher for
inference performance. This gives us that mini-batching
performance, but at serve time. And lastly, you can
plug-in different sources if you have different storage. So maybe you have a cloud
storage of your choice. You can write a
plug-in for that. All right. So we’re gonna go through
a pretty big slide here. I’ll try and make this
digestible for you. And this is zooming
into the libraries as they would exist inside
our server in the binary. So say your server’s
up and running, and it’s serving your video
ranking model, version one. And it’s cruising along. Everything’s working well. And say you’ve trained version
two and you want to deploy it. So I’m gonna walk you
through how that would work. So on the bottom right
of the slide here, you’ll see the green source
API with the yellow orange file system plug-in. It’s a really straight
forward plugin. It simply monitors
a file system, observes the new
versions of your model, in this case version two. And the source is
going to omit a loader. In this case, it’s going to
admit a loader of a TensorFlow saved model. It’s important to note that
the loader doesn’t immediately load that TensorFlow graph. It keeps track of the metadata. It can estimate the
RAM requirements and other resources
used by the model. And then it can load the
model when it’s asked to. And this is really an
important differentiator between a straightforward
model server that you might build yourselves. So this loader that’s
very lightweight, is emitted over to the manager. Now the manager
actually knows the state of the server, how
much RAM is available, and other resources such
as GPU or TPU memory. And only when it’s
safe to do so is it gonna ask the loader
to go ahead and load version two of the model. So let’s say that the
manager has plenty of memory, it goes ahead and loads
version two of the model. You might think
that the first thing it’s going to do immediately
is let’s unload version one. We don’t need it anymore. But there’s another
important detail here. I see some smiles, so you
probably know what this is. Say you have client request
threads on the top left here. And they’re actually
still processing some queries on model one. So what you actually
have to do is keep track. In this case, we
have a handle on top of that TensorFlow saved model. And it keeps track,
with a high-performance ref-counted pointer mechanism,
of exactly how many clients are still outstanding
processing requests. Only when all of those
requests have quiesced, then the manager will go
ahead and unload version one. All right, so I’m going
to cover some strengths of the libraries. I’ll just highlight a few
aspects of this slide. The one I wanna highlight
here is the fast model loading on server start up. This is another
really subtle detail, but it’s really helped
a lot of our users. So let’s say you’re
starting up a server. There’s a couple of reasons
you might want to do this. Let’s say running an experiment
on your work station. You don’t want to wait
a long period of time. You want that to load
as quickly as you can. Another example where it’s even
more critical is auto scaling. So say your users all
get home at 6 o’clock. They all sit on the couch,
pull out your application, and send you a big traffic
spike of video ranking requests. Using a system like Kubernetes,
you’re gonna want to auto scale as quickly as possible,
and you want those servers to start up quickly. So at the same time, most
of the time with servers you only want maybe one
or two of your threads to do model loading. You want all of your
threads and CPUs to be performing
inference for your users. But, say, on a 16 CPU machine,
why not use all of those cores and all of those threads
from model loading. It’s just a small optimization
in one of many that we do to make it
easier to auto scale and run your actual
models in production. Let’s see, we use the
read-copy-update pattern, which is a high
performance pattern for doing concurrent memory
access of these models. I mentioned the ref-counted
pointers and the simple plugin interfaces. So it’s really easy to extend
to support your own data store, cloud storage, or even
a database of your models. All right. Another show of hands, who
has used TensorFlow Serving’s libraries? OK, a few advanced users. So especially for you folks, but
also for anybody who’s thinking that you might want
to use our libraries, definitely take a
look at ServerCore. So what we observed
was that our libraries were low level, very powerful,
and very configurable. But for most people
you really wanted sane and
sensible defaults. You wanted to load
some set of models. You wanted to load new
versions over time. And so we made the
class ServerCore, which does this for you, and
it’s configuration driven. If you move to it
from our libraries you can remove about
300 lines of code and just have a nice little
config, so give that a try. All right, so moving
on to our binaries, and this is what we
recommend for most users. I mentioned a few
times there are things like sensible defaults,
how many threads to use, and so on. The binaries come out of
the box with those enabled, so you only need to
specify the minimum amount of configuration. They’re very easy to configure. I’ll show you on a
coming soon slide on how easy it is
to launch these. And they’re based on gRPC,
which is Google’s open source high performance RPC framework. All right, so hopefully you
can read these code samples. The line at the top, this is
how you build the model server. It’s a one liner with bazel. The second line is
how you’re gonna run the server
for a single model in just a one line command,
no config files needed, just three flags. So you’re gonna specify the
port, the name of your model, and the path to
the model on disk. And we
call this the base path because you might have
multiple versions over time. So over time you can just
drop more and more versions in that directory and
they’ll be loaded. And lastly, we have the
command to run the model server with a config file, and this
could have as many models as you want, as many as
will fit on your server. All right, so this is great. We have a server. How do we actually talk to it? I’m gonna speak a little bit
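Since the slides themselves aren’t visible in a transcript, here is a reconstruction of those three commands as they appeared in the TensorFlow Serving docs of that era; treat the model name and paths as invented, and note that exact flag names can differ between versions:

```shell
# Build the model server (the bazel one-liner):
bazel build //tensorflow_serving/model_servers:tensorflow_model_server

# Run it for a single model -- no config file, just three flags:
tensorflow_model_server \
  --port=9000 \
  --model_name=video_ranker \
  --model_base_path=/models/video_ranker

# Or run it with a config file listing as many models as will fit:
tensorflow_model_server --port=9000 --model_config_file=/models/models.config
```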
about our inference API’s that are supported by
the model server. So the first one
is called Predict. It’s very, very
flexible and powerful. It’s tensor oriented, you
can specify one or many input tensors and output tensors. And so you can do basically
anything you can do, if you’re familiar with
a TensorFlow session. If you’ve been playing
with that in, say, Python. You can do just about anything
there with a predict API. Now moving up a
layer of abstraction, we’ve observe that the vast
majority of machine learning used in production
is actually doing classification and regression. We have two APIs,
Regress and Classify. And these use
TensorFlow’s standard structured input format,
called TensorFlow Example. It’s a protocol message. It’s feature oriented, so
you can have different values for different features. And a nice thing
about these APIs is you actually don’t have to
think about tensors at all. So if you’re new to
machine learning, you just want to try something
out, try code lab, or even if you want to deploy in
production which we see many production users inside
Google using these APIs, go ahead and try the
high level APIs first. Next up, and I’ll talk
about MultiInference a bit later in the talk. MultiInference allows you to
combine multiple regressions and classifications
into a single RPC, and that has some
really cool benefits. All right, here’s a
total toy model for you, and this is just to show what
the syntax looks like for the APIs I’ve talked about, in case
you haven’t seen them before. So we have a model
spec, and this specifies the name of the model. And this is because your server
could have multiple models running at the same time. We’ll have one feature. In this case, the key
for it is measurements. We have three
floating point values, and we have a structured
result. Again, you don’t have to inspect
tensors you’ll actually get a structured result
called regressions, which has one score. So just an example. All right, so I’m gonna move on
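In text-proto form, the request and response for that toy model might look roughly like this. This is a sketch following the shape of the tensorflow_serving/apis protos; the model name is invented and the field details should be treated as illustrative, not authoritative:

```proto
# RegressionRequest (sketch)
model_spec { name: "example_model" signature_name: "serving_default" }
input {
  example_list {
    examples {
      features {
        feature {
          key: "measurements"
          value { float_list { value: [1.0, 2.0, 3.0] } }
        }
      }
    }
  }
}

# RegressionResponse (sketch): a structured result, no tensors to unpack
result {
  regressions { value: 0.9 }
}
```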
to some of our key challenges that for the most
part we’ve solved, but we still have
quite a ways to go. OK, so this is the story
of isolation and latency. So from anybody who’s ever
looked at a latency graph before, you can probably
tell that those spikes are really bad. You don’t want to
have those, right? Those spikes were happening
pretty much Monday through Friday, around
12:00 or 1 o’clock. Does anyone have an idea? Just shout it out. What caused those
latency spikes? AUDIENCE: Lunch time. SPEAKER 1: Lunch time. Strongly correlated. We have this fantastic engineer
we were working with on one of our largest client teams. They were serving
multiple models. These were large
multi gigabyte models. And they were serving several
hundred thousand queries per second on a
fleet of servers. And what was happening is
this great engineer, really a close collaborator,
he would go to lunch. But right before
he went to lunch, he would push a new
version of the model. So you can probably
pretty quickly figure out what’s going on here. So every time a new model–
a particularly large one, about 5 gigabytes–
was loading, inference would slow down at the
upper end of percentile. You might wonder why
is this going on. Once we figured out the cause it
was pretty easy to figure out. So for most of
machine learning you have one model in a server at
a time, whether your training or serving until now. And so most systems are
optimized around throughput, not latency. And so all available threads are
available for any computation that’s ready to run. In this case model loading
is parallelizable in TensorFlow. And so TensorFlow would gladly
use all the threads available and go load that new model
starving inference of threads. So here’s the after
slide, and definitely note that the right side
of this slide, the axis, just dropped by 10x. So we went from over a second
in terms of our latency spikes to about a tenth of a second. This fully met the
needs of the customer. They’re serving SOA. They’re happy. And the way that we
achieved this was we added an option
to TensorFlow that lets you explicitly specify
the number of threads to use for any
call to your graph. And in this case,
we wired them up so there’s one thread pool
for model loading and one for inference, so
pretty straightforward. By default, since
all of users want to use all the threads for
inference all the time, we actually only need users to
configure the number of threads for loading. So usually specify one or two,
and then you’re good to go. This is going to be
moving on to batching, something I mentioned
earlier in the talk. This is a really key challenge
to get great performance and throughput. And so let’s look at the
right side of the slide. You have these Tetris blocks
that are falling down. So imagine these are
requests from your users. Maybe those blocks
represent individual videos that are being ranked
by your server. And their different height
represents the time at which they arrive at your server. So what we do is we wait for a
very small and tunable period of time. And we wait for a few queries. We aggregate them
together, and stitch them into one set of tensors that we
feed into the TensorFlow graph. So this enables you to
get very efficient use out of GPUs and TPUs that have
really good batch parallelism. One external paper,
the spin paper, saw that moving to a
batched inference on GPU led to a 25 times speed up. I’m also I excited to
share that on CPU we’ve seen for models up to a 50%
improvement in throughput. So it’s a tiny
little tuning knob that can save you quite a
bit of your inference costs. Another thing is that
once you’re actually doing batching on top of
custom hardware, hardware accelerators, and GPUs,
you’ll typically only want to have one unit of work
being done on that chip or device at a time. And so one thing
we noticed is when you’re going to have multiple
models and multiple versions but perhaps one
piece of hardware, you’re gonna need to
schedule the work carefully across those models. We have a shared batch scheduler
that lets you do just that. Lastly, the batching
capabilities are all available
in both library form as well as they’re
configurable in the binary. So you can access them very
easily in either approach. Now we’re going
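That small, tunable waiting window can be sketched in pure Python. This is only an illustration of the idea; the real shared batch scheduler is C++ and far more sophisticated:

```python
import queue
import time

def gather_batch(requests, max_batch_size, window_ms):
    """Collect requests until the batch is full or the window expires."""
    deadline = time.monotonic() + window_ms / 1000.0
    batch = []
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                      # the tunable window has expired
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                      # window expired with nothing new
    return batch

requests = queue.Queue()
for r in ["video_a", "video_b", "video_c"]:   # requests already waiting
    requests.put(r)
batch = gather_batch(requests, max_batch_size=32, window_ms=5)
```

Everything in `batch` would then be stitched into one set of tensors and pushed through the graph together, which is where the GPU and TPU throughput wins come from.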
to dive into what happens inside one of
these session run calls. All right, so hopefully you can
parse this utilization graph. There are two different kinds
of utilization I’m showing here. So the first is
the blue dash line. This is that
utilization of your CPU. And you’ll notice it’s not
used through the whole period of this request. The orange line represents
the GPU or TPU utilization. So walking through, we’ll start
with some single threaded CPU bound work, like, say,
parsing of your input. It’s really common, pretty much
all models are gonna do this. Then you might do
something else that’s single threaded, like do a vocab
lookup or an embedding lookup. OK, you’re done
preprocessing your data. Now you’re gonna shift
over and do computation on the GPU or TPU. So your CPU basically
goes to idle. Your TPU or GPU is maxed
out for some period of time. It returns, and
then again you’re gonna some post
processing on CPU. And the reason I show this
is that, naively, if you just ran one inference at a time
on this set of CPU and GPU, you’re gonna be
vastly under utilizing both pieces of hardware. So one of the ways
that we solve this is by having multiple
threads, very, very common in serving systems. And what you can
do is you can have a good number of
threads and limit how many are running
at the same time. And you can make it so
that you’re constantly using your GPU and TPU, and
that most of the time you’re using your CPU. But the key takeaway
from this slide is that there’s this
queuing time that happens at the beginning
of each request. So once you enable batching,
I mentioned earlier there’s this tunable
window of time where you’re going to wait for a few
requests to come together, and then you’ll process them. This totally makes sense
for the GPU and TPU work. But for the work done on CPU,
there’s no point in waiting. All this does is add some
latency to that work. All right, so we have
some challenges here. There’s the scheduling
and utilization of both the hardware
accelerator and CPU. There’s the issue of
saturating one resource. So maybe you’ve
tuned your model,
but only half of your GPU. That’s not ideal,
and vice versa. I just mentioned
queuing latency. So you really don’t want
CPU single threaded work to wait to be batched together. All that does is
add some latency. And the last one,
for sequence models where you have input
data of variable length, and the computation
will have variable cost, it’s really challenging to batch
these pieces of work together. One of the common approaches is
to add padding to the shorter sequences so they all match. But this means that
your GPU and TPU will be doing work over
padding, which is wasteful. So there’s challenging
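To see how much compute padding can waste, here’s a small sketch (names invented): it pads a batch of variable-length sequences to the longest one and reports the fraction of the batch that is padding:

```python
def pad_batch(sequences, pad_value=0):
    """Pad sequences to the longest one and report the wasted fraction."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_value] * (max_len - len(s)) for s in sequences]
    real = sum(len(s) for s in sequences)
    waste = 1.0 - real / (max_len * len(sequences))
    return padded, waste

padded, waste = pad_batch([[1, 2, 3, 4], [5, 6], [7]])
# 7 real tokens in 12 slots: about 42% of the batch is padding.
```

With in-graph batching applied only inside the while loop of a sequence model, that padding work disappears, which is why the early results look promising.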
trade offs there. So we think we know
the way forward. I’m gonna share with you some
work that we’re currently doing. It’s ongoing, but it’s still
in the experimental phase. We are moving batching
inside the graph. And we think this
is going to be huge for throughput and performance,
in particular for challenging models. Moving batching
inside the graphs enables you to surgically
batch the specific sub graphs that benefit the most
from your custom hardware, your CPUs, and so on. For sequence models where you
have things like a while loop with a single step of code
being run inside the while loop, you can actually run
the batching only inside that while loop. So it’s really neat. You’re never doing batched
inference or batched computation over padding. And early results look
super promising here. Another area is for complex
models that might have, for example, a sequence
to sequence model. Maybe it has an encode
phase and a decode phase that have different costs. This would let you batch those
sub graph’s separately and tune them separately. All right, next up I’m going
to share some new technology that we recently released
as part of TensorFlow 1.0. The main piece of tech
here is SavedModel. So SavedModel is the
serialization format for TensorFlow models. It’s available for
very broad use. It’s not specific to serving. You can use it for training. You can use it
for offline evals, and you can use it for
other interesting areas I’ll get into. SavedModel has two new
features to TensorFlow. The first is that you
can in one saved model have multiple
MetaGraphs that share the same assets such as
vocabs and embeddings, as well as the same
set of trained weights. You might ask what
would I want that for, I’ve just been training models. Well, it turns out that
the MetaGraph for serving is very, very commonly different
than the one for training. And now let’s say you want to
have a third MetaGraph that’s gonna be used on GPU. That’s also something you
would want to customize. So the multiple
MetaGraph support lets you have a MetaGraph
for training, one for serving on CPU, one for serving on GPU. All right, the second
major feature of SavedModel is SignatureDef. A SignatureDef
defines the signature of a computation supported
by a TensorFlow graph. You can think of it like a
function in any programming language. It has named and typed,
inputs and outputs. So as humans, we’re really
good at looking at that graph, you can probably all guess what
the nodes in that graph are. I guess if we
polled us, all of us would figure out
that the middle node on the left side of this graph
is where you feed the input. But if you want
to take your model and have a whole ecosystem of
tools such as serving systems that can interpret those
graphs that they had never seen before, you
would need to annotate that graph in some way. That’s exactly what
SignatureDef does. So in this case, the
middle node on the left is where you feed your inputs. The top right node
outputs a set of classes. And the bottom right
node outputs the scores. All right, so building
on top of SignatureDef is Multi-headed inference. This is also known as Multi-task
and Multi-tower inference. So another show of hands, who’s
familiar with Multi-headed, Multi-task inference? Oh wow, OK. So this is pretty new. There’s some emerging research
and great work going on here. I think you’ll be really
excited about this. So let’s take that
example earlier. You’re serving video rankings. Your users are using your app. You get lots of
content creators. You’re successful now. And so users are drawn
to your platform, and maybe some
not so good people are drawn to your platform. Maybe they’re
uploading click-bait. We’ve seen this in many places. So your first model,
the orange model, computes the click through rate. And these bad guys are training
models optimized to get clicks. You might decide, well,
let’s define a new metric and train on that. Let’s call it conversion rate. And it’s going to
track users watching the majority of a video. So we’ll call that CVR. And now we’re gonna go
train a new model for this. But wait a second, all of the
careful feature engineering
that we did to get our data ready for
the first model apply to the second model. It’s very, very likely,
and in many cases it’s a certainty
that those features are usable in both contexts. You can actually train one
model for multiple objectives, in this case CTR and CVR. I’ve listed some infrastructure
benefits on this slide. And there’s some really, really
big infrastructure benefits. Your inputs are only sent
once over the network, so that’s already
a pretty big one. You only have to parse the
data in your model once. You only have to compute
your hidden layers one time. So this is gonna save you
bandwidth, CPU, latency, and even ram overhead
on your servers. This is great, I could
stop right there. There’s another really
important attribute of these, why
they’re becoming more and more exciting in the
research that’s going on. So you’re likely in this example
to have many more clicks than
train these models you have much more
label training data for your CTR objective than
for your conversion objective. So what happens
when you actually train one model for
multiple objectives? So we’ve actually seen some
early promising results where in one case
we were able to see a 20% improvement in
a key quality metric by moving an existing
separate model into a multi-objective model
with another objective. So there’s wins here on
infrastructure and on quality. Multi-headed inference is
available in the TensorFlow estimator API. So you can train models with
multiple objectives today, and you can serve them as well. So we’re really
excited about this. All right, I wanted to
show really quickly, what does a Multi-headed
inference request look like. So it’s pretty simple. You can specify one or
more inference tasks. They each have the
name of a model as well as which signature
to use, so really simple. OK, all of this power
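In text-proto form, a multi-headed request along those lines might look roughly like this. It’s a sketch after the shape of tensorflow_serving/apis; the model and signature names are invented, and the field layout should be treated as illustrative:

```proto
# MultiInferenceRequest (sketch): several tasks, one RPC, shared input.
tasks {
  model_spec { name: "engagement" signature_name: "ctr" }
  method_name: "tensorflow/serving/classify"
}
tasks {
  model_spec { name: "engagement" signature_name: "cvr" }
  method_name: "tensorflow/serving/regress"
}
input {
  example_list {
    examples {
      features {
        feature {
          key: "measurements"
          value { float_list { value: [1.0, 2.0, 3.0] } }
        }
      }
    }
  }
}
```

The infrastructure win is visible right in the message: the input features appear once, no matter how many heads you query.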
and functionality sounds really great. From early adopters
we also heard this is challenging
to get right. I might want to do things
like inspect a model and see what’s actually
going on with it, run a sample query, look
at the metadata, and so on. We’re introducing the saved
model command line tool, and it lets you do all
these things with a model. So here’s the first thing you can
do with the SavedModel CLI. You can look at what
MetaGraphs are in the model, so pretty simple. So in this case, we
have a SavedModel. It has two MetaGraph’s,
one is for serving and one is for serving on GPU. Now let’s say we wanted to
look at the serving MetaGraph and see its metadata. In this case, the MetaGraph
contains two signatures. One is called classify x to y. The other one is
called serving default. So serving default is a
documented key and it’s a constant in TensorFlow. And what this does is
it says, if the user has not specified
which signature to use, just use that one. And in many cases
for people getting started you have just
one signature, so it’s really easy to get going. All right, now say you
wanted to actually look at the input and output
tensors in the serving default signature. So you add on one
more flag to the cli, and you’ll see the inputs
and I skipped the outputs for this slide here. We can see there’s one
input tensor called x, it’s of type float. Note the method
name on the bottom. The method name here
is tensorflow/serving/predict. This is kind of like a type
in a programming language. It’s informing another system,
in this case TensorFlow Serving, that this
signature was intended for use in the predict API. We have similar ones for
TensorFlow Serving regress and classify, but you
can override these. Maybe you have an in house
offline evaluation system. You can make your
own method names and check for them
in your models. All right, and lastly, let’s say
you would like to run a graph. You have some input data, maybe
you’re debugging the model, or you just want to
try it out for fun. I’ll highlight here
all of the params are the same, except
instead of show we’ve asked to run the model. We also have two ways
of expressing input that you can mix and match. So maybe you have a NumPy file. You can actually specify
the path to the NumPy file, and it will just be read. You can also specify
a Python expression, and it’ll be interpreted. All right, so in
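Collected in one place, the CLI invocations just walked through look like this. The paths are invented and the flags are as documented for the TensorFlow 1.x `saved_model_cli`, so double-check them against your installed version:

```shell
# List the MetaGraphs (tag-sets) in a SavedModel:
saved_model_cli show --dir /models/ranker/2

# List the signatures in one MetaGraph:
saved_model_cli show --dir /models/ranker/2 --tag_set serve

# Show the input and output tensors of one signature:
saved_model_cli show --dir /models/ranker/2 \
    --tag_set serve --signature_def serving_default

# Run the graph, mixing a NumPy file input with a Python expression:
saved_model_cli run --dir /models/ranker/2 \
    --tag_set serve --signature_def serving_default \
    --inputs 'x=/tmp/x.npy' --input_exprs 'y=[[1.0]]'
```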
closing I wanted to say that collaboration
is very, very welcome on this project. We sync our internal repository
with GitHub about weekly. We have a developer
on call rotation that includes facilitating
your pull requests, answering questions, and so on. We have lots of open
ended research problems, so feedback is encouraged on
APIs, techniques, anything I’ve talked about, and any
challenges you have as well. You can reach us at
[email protected] For some links on
how to get started, just search for TensorFlow
Serving, very easy to find. We have a great
Kubernetes tutorial. This will let you launch
an inception model server. And it’s gonna run an
auto scale for you. We have Google
cloud ML, as well as I mentioned our mailing list. And you can also use our hashtag
on Stack Overflow, #tensorflow. All right, let’s
go to questions. Thanks very much. [APPLAUSE] [MUSIC PLAYING]

