Strategies to Edit Production Data – Julie Qiu | #LeadDevLondon 2018

Hi, everyone. My name is Julie Qiu. I spend
a lot of time working with databases. Most of you, as Meri mentioned, might have run across the issue that I’m going to talk about today, which is that, at some point, we all need to edit data in production. And there are a lot of ways that we can go about doing this editing, some of which are a little bit better than others, and I’m going to share with you the evolution of processes that I have helped develop at my company. Hopefully,
the strategies I’m about to share make editing production data safer at your organisation
too. A little bit about me: I’m an engineering manager at Spring. For those of you who haven’t
heard of Spring before, it’s not just a Java framework. Spring is a fashion and commerce
marketplace that integrates with thousands of different brands. At Spring, I lead the
catalogue team which is the engineering team that builds out our product catalogue. This
means I spend a lot of time thinking about things like: how does data get ingested into
our system? Over the last few years, more often than I would probably like to admit, I have found myself behind the SQL prompt needing to make edits to production data. And yet, we are all told that editing data in production is really bad practice. Why are
we doing it anyway? For me, the first reason is often that the internal tools that I needed
are just not available. I would have used a better method but it doesn’t exist. The
second is that I ran into an edge case. Sometimes, we actually do take the time to prioritise
building out a user interface. But, it doesn’t allow us to do all of the things we want to
do or need to be able to do. Lastly, sometimes, we need to make a time-sensitive change, and
jumping behind the SQL prompt is often just the fastest and easiest solution. It is really
easy just to get behind the SQL prompt and execute a query. When we think about it, there’s
not much of a problem with it, either. Let’s say I want to update the products table and
rename the first product in the table to Julie’s Product, and I write a query like this in SQL, and, before running it, get a co-worker to look over my shoulder and
spot-check me. Let’s just say this time it’s Friday night – hypothetically! [Laughter].
And someone on the marketing team comes up to me and says, “Julie, I need you to make
this change. We promised the brand we would do it before Monday, and, if you don’t, they’re
going to pull themselves off our platform and they’re one of our highest selling brands.”
Of course, I go ahead and I write the query, and, when I’m done, I look around me and I
realise all the engineers have already left for the weekend. At this point, I could Slack
someone, but I don’t really want to bother anyone. And, I’m thinking to myself: I’ve
been on the catalogue team forever, I’ve run queries similar to this a million times before,
what is the worst thing … [Laughter]. What is the worst thing that can happen! [Laughter].
So, I go ahead, I run this query, and, at this point, I decide I’m going to grab a drink of water, Slack the marketing team, and let them know I ran it. I come back, my bag’s ready,
and I look down at my laptop, and I realise this is what I actually ran. [Laughter]. For
those of you who aren’t familiar with SQL, the rest of the audience is laughing probably
because they know that I’ve just accidentally updated every single product in our database
to be named “Julie’s product”. [Laughter]. Based on the laughter in the room, I’m guessing it’s because many of you have disaster stories similar to the situation I’ve just described. The key here is that these mistakes don’t happen because we’re all bad engineers; they happen because mistakes happen. Having the ability to make any edit that you want sets you up to make this kind of mistake. So, today, I want to talk to you about some strategies to make editing production data safer for your organisation. We’re going to talk through five different strategies: raw SQL with a spreadsheet, running scripts locally, running scripts on an existing server, a task runner, and building out a fully fledged script runner service. We will walk through
the strategies and run through the amount of effort it takes for the strategy to be
implemented at your organisation, and we will also talk about the benefits that you receive
in exchange for this effort. For each strategy, we will talk about three things: how the strategy works, with an example of how to implement it; what is great about it; and what is not so great. The examples we’re going to talk through today will involve a Python stack and a SQL database, but these strategies can work for any stack of your
choice. Let’s dive in. The first strategy we’re going to talk about is pretty simple.
It doesn’t require any new code to be written, or infrastructure to be built. Rather, it
is simply developing a process for making sure that we get spot-checked when we’re making
edits to production. So, what does this process look like? Let’s go back to the query I had
intended to run for the marketing team. I want to update the product with id equals
1 to be Julie’s product. I ran the query without any supervision, which we all saw was a pretty dangerous situation to be in. So, what we started to do at my company is maintain a Google spreadsheet to record manual SQL queries run against production. This spreadsheet lets us collaborate and review each other’s queries before they’re executed. This gave
us a checklist of what actually needs to be done before running a query against production.
Here’s what that checklist looks like. So the first step is I’m going to add a record
to this spreadsheet. Some information I fill out includes my name, the date, a description,
the query I actually want to run, as well as jotting down the reviewer. The reviewer
will look at my query and either approve it or request changes. Similar to code reviews, after a couple of rounds of back and forth, the next step is for the reviewer to actually approve the query. Once you’ve got those two thumbs up, the last step is to go ahead and run the query. In this step, we ask everyone to do their manual edits inside a transaction, so that, if they made a mistake halfway through, it is really easy to roll back.
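Concretely, that last step amounts to starting a transaction, running the approved query, checking how many rows were affected, and only then committing, otherwise rolling back. Here is a minimal sketch of that pattern in Python, assuming a PostgreSQL database accessed through psycopg2; the driver, connection details and table are illustrative rather than the exact setup described here.

    import psycopg2

    # Connection details are placeholders; in practice this points at production.
    conn = psycopg2.connect("dbname=catalogue host=prod-db.internal")
    try:
        with conn.cursor() as cur:
            # The reviewed and approved query from the spreadsheet.
            cur.execute(
                "UPDATE products SET name = %s WHERE id = %s",
                ("Julie's Product", 1),
            )
            # Spot-check before committing: a missing WHERE clause shows up
            # here as millions of affected rows instead of one.
            if cur.rowcount != 1:
                raise RuntimeError(f"expected 1 row, got {cur.rowcount}")
        conn.commit()
    except Exception:
        conn.rollback()   # a mistake halfway through is easy to undo
        raise
    finally:
        conn.close()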
So, what is great about this system? Well, the first is that it is really easy to implement and get up and running. You can do this before the conference even ends today.
All it really takes is making a spreadsheet and sending an email to your team. But, the
effects are really powerful. This is because having the spreadsheet gave us an audit trail.
So, after we implemented this process, we went from having ad hoc queries we hoped people
were getting checked to having a log of everything that was being run against production, and who had looked at it. This audit trail not only gave us information about who was running the queries and what they were running, but also why they were running these queries. This has led us to have a lot of discussions about areas where we should maybe be investing in internal tools, or overhauling or rebuilding a system entirely. Lastly, this system helps to promote the right behaviours. It helps because it doesn’t only encourage everyone to be careful, it also teaches engineers the right way to do things. So, if we think about the journey that we will go on in this talk, right now, we have a strategy that gives us the ability to edit data. It requires very little upfront investment, and we’ve also implemented a manual version
of a code review process, and an audit trail. This is great because, at your average start-up,
raw SQL edits are probably never going to completely go away. We still use this spreadsheet
to this day, but this process makes it so that we are empowering engineers to avoid
making mistakes, and making sure they get spot-checked. Introducing this process also makes running raw SQL queries slightly more painful for people,
so they’re encouraged to build better tools. What are some things we’re not getting with
this system? Well, the first is that it can still be easy to make mistakes. We’re assuming
that people are copying and pasting queries from the spreadsheet directly to the SQL prompt
but that might not be true. Even so, errors can happen. The audit trail also has to be maintained manually, which opens up room for error. After all, there’s nothing preventing
someone from running an unchecked query. Lastly, while it is relatively easy to update, say,
one product to Julie’s product, it can be difficult to execute long and complex logic
with SQL. If your query takes a really long time to run, it’s going to time out. For example, let’s go back to this query. What if, on that Friday, they hadn’t asked me to update just the one product? What if they said, “Hey, Julie, there’s a store on the Spring platform called Julie’s store, with a bunch of products that don’t have a name. Can you name them all Julie’s product?” We can see how the queries become complex quickly, so we want
something better than raw SQL. Let’s talk about our next strategy which will help us
run these long and complex queries. Our next strategy is to write scripts and run them
locally on our machines. Here’s what that process looks like. To run a script, we will start by writing it, converting the SQL logic we would have wanted to write into code in the programming language of our choice. When I’m writing my scripts, I like to add an argument parser so scripts can be reusable in the future, especially if I plan to use the logic again, say, to rename product 2 to Julie’s other product. Next, we’re going to form a connection to the production database. You can do this using a VPN or an SSH gateway. Lastly, go ahead and run the script from your terminal. In Python, this is what my command looks like, and I often like to write my scripts in such a way that I can add a dry-run flag, so that I can preview the results before committing changes to production.
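Here is a rough sketch of what such a script, and the command to run it, might look like, with an argument parser and a dry-run flag. The table, column and psycopg2 driver are illustrative assumptions rather than the exact code described here.

    # Usage: python rename_product.py --product-id 1 --name "Julie's Product" --dry-run
    import argparse
    import psycopg2

    def main():
        parser = argparse.ArgumentParser(description="Rename a single product.")
        parser.add_argument("--product-id", type=int, required=True)
        parser.add_argument("--name", required=True)
        parser.add_argument("--dry-run", action="store_true",
                            help="Preview the change without committing it.")
        args = parser.parse_args()

        # The connection would be made to production, reached over a VPN or an
        # SSH gateway; the connection string here is a placeholder.
        conn = psycopg2.connect("dbname=catalogue host=localhost")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE products SET name = %s WHERE id = %s",
                    (args.name, args.product_id),
                )
                print(f"Rows affected: {cur.rowcount}")
            if args.dry_run:
                conn.rollback()   # preview only: undo the change
            else:
                conn.commit()
        finally:
            conn.close()

    if __name__ == "__main__":
        main()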
So, what do we like about this strategy? Well, the first thing is that scripts, unlike raw SQL, can be reused, so you can write the script just once and pass in different arguments to run it again in the future. Additionally, if you need to
manipulate the outputs of the script, it’s easy to pipe it into a text file. Lastly,
writing a script gives you access to the rest of the code in your repository, so you can
import functions from your common code or reuse logic that you have already written
before. Looking at our journey now, we have a strategy that gives us both the ability to edit and the ability to make edits with somewhat complex logic. At this point,
we haven’t needed to set up any infrastructure, so the upfront investment is still relatively
low, although a bit higher than just running queries with SQL. There are some things that
we don’t get with this process, though. The first is that we don’t have code review being strictly enforced, so somebody can write a script, connect to the database from their laptop, and run it without any form of code review. The outputs of the script are also only available on the user’s machine, and they’re going to go away as soon as we exit out of the terminal. Lastly, we can run into network connectivity issues. The script’s
going to stop running if, say, the internet dies out, or the user just closes their computer.
It kind of begs the question: how do we go about running really long scripts? For example,
what if I wanted to run this query many times? In fact, at Spring, we currently have over
50 million products in our database, each of which has many different attributes and
variants. What if I wanted to update a field on all 50 million products? This is actually a common situation that I run into on projects for the catalogue team, and I’ve often had to run migrations on the entire products table.
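Such a migration script usually ends up processing the table in batches, committing as it goes, so that progress is visible and each transaction stays short. A simplified sketch, with an invented column, batch size and psycopg2 as the driver:

    import psycopg2

    BATCH_SIZE = 10_000  # rows per transaction; tune for your database

    conn = psycopg2.connect("dbname=catalogue host=localhost")
    with conn.cursor() as cur:
        cur.execute("SELECT max(id) FROM products")
        max_id = cur.fetchone()[0]

    last_id = 0
    while last_id < max_id:
        with conn.cursor() as cur:
            # Update one batch of products at a time, in primary-key order.
            cur.execute(
                "UPDATE products SET needs_reindex = TRUE "
                "WHERE id > %s AND id <= %s",
                (last_id, last_id + BATCH_SIZE),
            )
        conn.commit()  # keep each transaction short
        last_id += BATCH_SIZE
        print(f"Processed ids up to {last_id} of {max_id}")

    conn.close()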
When I first started working at Spring, what I would do is I would just come into work really, really early, plug in my laptop, and let the script run for the day. [Laughter]. I would wait all day, and only 25 per cent of the products table had been processed by 8 pm. I talked to my manager at the time about workaround strategies. We sat there thinking: wouldn’t it be nice if I just had a computer that was already set up with all the information I needed, that had all the configs I had on my laptop, except this computer was just on all the time? Conveniently, we have several EC2 instances that do exactly that. That’s how our websites stay up and running
all the time. The next strategy we’re going to talk about is how to run scripts on an
existing server. Here’s how this will work. Similar to running scripts locally, I would
start by writing my migration scripts. Because we are running the script on a server, the
next step is to get the script on to the server. There are a lot of different strategies for
doing this: you could deploy, fetch, or copy the code, or upload and transfer the script. We often transfer code before deploying an application, so I would often copy the script on to the remote server using SCP. I would often use Jenkins for the server because
we run our automated tasks on it like our deployments. After the code exists on the
server, the last step is to SSH in and run a script inside a session. In general, this
method of running scripts on Jenkins has worked well for me and gotten me through a lot of
major projects. That was until, one day, this message showed up in our engineering Slack channel. [Laughter]. Our on-call engineer happened to notice that Jenkins was down. She pinged the web developer, Justin, to check. Of course, I also checked, because I’m a responsible engineer. I confirmed: yes, Justin, it is down, you should really look into that! [Laughter]. Then I went back to doing my work. A minute or two later, I had a realisation, as most of you
could have guessed, I think I shut down Jenkins. My hunch was right. It turns out that Jenkins
was down because the script that I was running on it was using up all the CPU, and, as a
result, none of the other engineers could deploy anything. It was a pretty easy fix, though. All I had to do was reboot the instance to bring Jenkins back up. That said, running scripts on an existing server is still a good strategy for running long scripts, especially if you
monitor it a little bit more closely than I did and a little downtime is not the worst
thing. But it was a lesson learned for me that we should think about putting in better
infrastructure. Before we talk about improvements, let’s talk about what is great about running
scripts from a server. So the main thing is that you can now run long scripts, or really
any scripts, without having to worry about your computer dying on you. You can run your
scripts inside a session, and then just come back and check on it later. Also, because
we’re running our scripts on a remote server, we now have the benefit of a much more reliable
network connection. Lastly, you’re getting all these benefits without having to actually put in any new infrastructure. So, when we look at this strategy in comparison to the others that we’ve seen, the main benefit that we have now is that we can run really long scripts without worrying about our laptop shutting off. The strategy has relatively low investment costs; there is some DevOps configuration involved, but it is still a pretty easy system to rely on. What doesn’t this strategy get us, though? As we can see, the
strategy of running scripts can affect the resources on that server. In my case, we used up too much CPU, and you can also run into cases where you use too much memory. Second, needing to copy the script on to the remote server, SSH into that server, start a session, and make sure that session keeps running, just to run a script, is really not the most user-friendly experience, and it opens up a lot of room for error. Lastly, we still
don’t have a persistent audit trail. Sure, there are logs available inside the session,
but they’re probably going to get lost as soon as the session ends, and most likely,
only the engineer that started it has access. We can do better. Let’s talk about a strategy
next that is more user friendly for our engineers and also gives us a persistent audit trail.
The next strategy we’re going to discuss is using a task runner. A task runner is something
that lets us automate the tasks associated with running the script. This includes things like SSHing into a server, setting up a virtual environment in the case of Python, or getting a copy of the code. All those repetitive things we were doing before, we would now do using a task runner. Best of all, logging is built in. The way that we did it was to run it on Jenkins. Jenkins lets you write code and register arbitrary tasks, and the Jenkins build page provided us with a free user interface. So, similar to the last strategy, we’re going to go ahead and write the script. We’re going to get the script reviewed, tested and merged. And then, lastly, instead of SSHing into the server to run the script, all we have to do is use the Jenkins interface and input arguments.
To run a script, you go to the job, select the script path, select the arguments you want to use, and then click build. Jenkins takes care of all the annoying things we had to do before, like setting up the virtual environment, connecting to the remote database and running the script with all the arguments we provided. Best of all, in this case, you can just sit there and watch the output happen.
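Under the hood, a job like this is mostly automating the same steps we used to do by hand. Below is a sketch of the kind of wrapper a task-runner job might execute; the paths and commands are hypothetical, not the actual Jenkins configuration described in this talk.

    # A simplified task-runner wrapper: the CI job calls this with the script
    # path and the arguments entered in the build form.
    import subprocess
    import sys

    REPO = "/var/lib/jenkins/workspace/catalogue"   # hypothetical checkout location
    VENV = REPO + "/.venv"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, cwd=REPO, check=True)

    def main(script_path, script_args):
        run(["git", "fetch", "origin"])
        run(["git", "checkout", "origin/master"])    # only reviewed, merged code gets run
        run(["python3", "-m", "venv", VENV])         # fresh virtual environment
        run([VENV + "/bin/pip", "install", "-r", "requirements.txt"])
        run([VENV + "/bin/python", script_path] + list(script_args))  # output lands in the build log

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2:])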
This strategy is great because it’s the first one we’ve seen that gives us a persistent audit log. These logs make it easier to monitor the
progress of the script which is really useful for long scripts as well as lets us trace
back what has happened. Because the first step of our task runner is to actually fetch the latest version of master, all the code being run by the task runner needs to have been code-reviewed and merged, so this means we are now able to enforce those practices, and it also gives us the ability to run automated tests before scripts are run. Lastly, we can now run our scripts from a user interface, as opposed to having to SSH into a server, which minimises room for error. We’ve come pretty far on our journey now. At this point, we have a process for code review, for running automated tests, and for keeping an audit trail, and we also have a user interface. Setting up a task runner did take a bit of time, but,
when I think about the amount of time that our team of 50 engineers has been able to
save because we don’t need to do things like manually copy code on to a remote server and
make sure it’s the right version, or figure out how to correct our mistakes when we have
absolutely no logs to look at, the upfront investment kind of seems pretty worth it.
There are still some reasons, though, that this isn’t exactly what we want. First off,
inputting command line arguments every time you want to run a script is kind of annoying, and it also makes managing credentials a little bit hard. This is particularly true if
you have a lot of credentials. For us, this was the main driver to look into better solutions.
As our company grew, after the task runner launched, I found that one engineer was still
running really long scripts by copying them on to an existing server. And this was simply
because the script he needed to run had over 30 credentials, so it was easier to put them
on a server somewhere, and then pipe them into the script whenever he had to run it.
There is also not a clear separation of environments here. So, because we’re just piping in command-line
arguments, it’s easy to accidentally say, “Use credentials for production,” and then
run them on the dev database, so that’s not great. Lastly, there is no system in place
for us to verify the arguments being passed in. So, for example, say I’m writing a script, and I want to update product 1 to Julie’s product. No one is there to check that I’m updating product 1 rather than product 2, or that I’m spelling “Julie” correctly. All these annoying problems I was having,
they really got me thinking. We have all these problems we are dealing with – configuration
management, logging, separating environments – haven’t we solved all these problems before?
After all, all of our existing applications do this. So why had I been trying to reinvent
the wheel instead of just building something with the tools I had already been working
with? So I did. And I decided to use our existing tools to build an application to run scripts
on, our own internal script-running service, and this will be architected like the rest
of our Python applications. This is what it looks like. Each environment had access to credentials that were stored centrally and shared across the EC2 instances, and it connected to the database in its respective environment. As far as the user is concerned, the steps
for running the script are really similar to those of the task runner, except for one small change. The first part is the same: the user writes the script and gets it code-reviewed, tested and merged as before. But, lastly, the key difference here is that the user no longer needs to input a series of arguments. All they have to do is pick the environment they want to run the script in, and they will have all the credentials that they need. So here’s what it looks like to run a script with this UI. The user goes to the user interface, types in the file path of the script they want to run, selects the environment, and then just clicks “build”. All of the command line arguments we had to enter manually before now live as environment variables on the script runner server.
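In other words, each script runner box already knows which environment it belongs to and carries the credentials for it, so a script just reads its configuration from the environment. A simplified sketch of that idea, with hypothetical variable names:

    import os
    import psycopg2

    def get_connection():
        # On the script runner instances these variables are provisioned per
        # environment (dev, staging, production), so the same script runs
        # anywhere without the user pasting credentials into a build form.
        environment = os.environ["SCRIPT_RUNNER_ENV"]
        database_url = os.environ["DATABASE_URL"]
        print(f"Connecting to the {environment} database")
        return psycopg2.connect(database_url)

    if __name__ == "__main__":
        conn = get_connection()
        conn.close()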
This brings us to the key benefits of the script runner service, the first of which is centralised
configuration management. We have a system for our scripts that is a lot more organised
than just selecting a bunch of command line arguments. The second is that the strategy
allowed us to maintain clear separation of environments, and, lastly, the system is by
far the most user friendly of all the strategies that we’ve seen. The amazing thing, though,
about the script runner service is that it doesn’t have to stop there. There’s even more
that we can do. For example, if there are several engineers who need to run scripts at the same time, we can just parallelise and scale our instances. If we want to preview the effects of our scripts on the database before committing changes, we can build in preview-mode functionality, prompting people to notice if they are passing in the wrong argument or spelling something wrong. In the end, it’s really up to you to customise
your own version of script runner. So we’ve seen today a spectrum of tools that you can
use for editing production data. So some of you might be wondering now: which strategy
should I use? Well, ultimately, that depends on the needs of your team and your willingness
to invest in the upfront cost of building the tool. If your team is small, then speed
might be important and making mistakes once in a while might not be the worst thing. In
that case, running edits by simply using raw SQL or local scripts from the command line might be enough and might be the best strategy for you. But for larger teams, features like auditing can be really important. You can’t have 50 engineers jumping
into a Google spreadsheet every time someone wants to run a SQL query in production. You
might want to have a more centralised way to manage configs because there are just too
many otherwise. In that case, building out a script runner service might make sense for
your needs. But the key here is that, as you think about growing fast, hiring, and building out the best engineering teams possible, make sure that you’re also thinking about upgrading the infrastructure and internal tools that are needed to support them and allow them to do their best work. Which brings me to some of the key learnings I had in the process of developing tools for running scripts, which is that, when you’re
building tools for your team, it’s important not only to think about safety, but also think
about speed and usability. After all, that is why we are all getting behind the SQL prompt
anyway. It is not because we don’t know it’s the wrong thing to do. And, in the same way
that it is important to think about the end-user when building out your consumer applications,
it’s also important to think about making something usable for your engineers. After
all, engineers are also people. On the flip side, a lot of times, we will find ourselves looking at something and saying, “This is not ideal. I really should not do this.” And then, two minutes later, we will turn around and just do it anyway. You should invest the effort to stop doing that early on. Ideally, before
you change the names of 50 million products to Julie’s product! Thank you! [Applause].
