Towards a Secure Path to Production   Felix Hammerl

Everybody’s comfy? Great. So, hi everybody. My name is Felix. I have been a
Thoughtworker since 2016. I’m a developer by trade and security
is my pet topic in its various shapes and forms. So this talk will
be a bit more than half an hour. That’s a lot of
context to go through. So we’ll have more than enough
time for questions and answers. Also, you can catch
me in the crowd later. People have told me that
I’m kind of hard to overlook. Well, anyway. So for some time now
cybersecurity has been all the rage. Despite loads of
security requirements, however, most applications
and frameworks are still riddled with
vulnerabilities. Frameworks are not being maintained. Components aren’t updated
and so on and so forth. Right. So feature pressure
always trumps the will to dig through tech debt. So let’s have a look at how
to formulate the narrative to reasonably build security
into software engineering and to make it the first-class
citizen that it deserves to be. All right. So news always comes first
and then comes the shock. So a lot of companies learn
about successful attacks on their infrastructure
mostly from news coverage. Even though we got better, the mean time to discovery
is still frighteningly high. So 99 days was the latest
number that I found for 2016, down from 416 days in 2012. So things are improving, but
since this is the mean, you know, in some
cases this means it’s also in the range of
years, or they just never discover it. Let’s keep in mind that the
number of attacks is rising. The technical complexity
on the attackers’ side, on the other hand, declines. The tools are getting better. We have tools like Burp Suite, we have the Metasploit Framework,
and so on and so forth. So the tools are getting
better both on the defenders’ and on the attackers’ side. But you know, any machine,
including light bulbs, baby monitors,
dishwashers, so anything that is reachable
from the internet is under scrutiny at
any given point in time. So while attacks were isolated
events just a few years back, they’ve become a
regular thing now, and cyber attacks or incidents
are having a substantial impact on a company’s stock price. And while we have yet to observe
a major tech company going bankrupt after a
hack, the effects are tangible even
for top management. Examples include Target
and Equifax recently. So anyway, once a security
hole is discovered, then usually digital
forensics are called in. Environments are being walled
off, external penetration testers roll in,
and it’s actually not surprising
for the investigation teams to discover that the
people on the ground often knew in advance
what was wrong. And while the team
members are often aware of the things that
need fixing, things do get lost in translation sometimes. So this anti-pattern
initially mentioned is called the security sandwich. So framed by requirements and
audits and penetration tests lies the agile
development process, full of its inherent
uncertainties and learnings. We discover
things on a daily basis, and we reprioritise. But often the
development process itself is largely
ignorant towards security. So instead of short-circuiting
here and looking for simple solutions, let’s take a step back
and investigate a counter-proposal to the security
sandwich. Let’s have a look at how
to create a shared responsibility for security. So the three movements that have
radically transformed the tech landscape over the
last couple of years were design thinking, with
the objective fitness for use, the lean
startup methodology, with the objective product-market
fit, and the DevOps culture, with the objective
responsiveness. The common theme here is that they
include the objectives early on in the development
process, all three of them, rather than tacking them
on as an afterthought. It’s really hard
to design a system well when it’s
already built. So now it’s easy to see why
it’s orders of magnitude harder to effectively retrofit
security instead of baking it into the development process. Right, if you want to refer to
it in marketing-heavy words, you can call it SecDevOps
or DevSecOps or any permutation of these
three things, or shifting security left. So these are the words that
you’re typically hearing here. So now that we’ve spoken
about the problem. Let’s dive into what
can be done about it. So many risk
estimation frameworks use a version of the so-called
calculus of negligence which tries to quantify
exposure by looking at impact and severity of abuse. There is nothing inherently
wrong with these models, as they are widely used to,
you know, quantify and analyse exposure and potential
liability of a company. As an example for those familiar
with risk estimation models, Open FAIR’s model of quantifying
inherent risk works pretty much along those lines. However, even
though these models are not directly actionable
and produce little guidance as to where and how
to implement controls and to mitigate risks,
understanding exposure can inform the
software delivery lifecycle and can give
context to putting security practices into
place. By the way, I will refer to the software
delivery lifecycle from now on as SDLC because
it’s less of a tongue twister. So this is my favourite quote
when it comes to security because it perfectly sums up why executives have stopped
listening to security experts a long time ago. So your job is to facilitate
the business to operate in an as-assured-
as-possible manner, given the actual
mission of the business and providing that
context for people that aren’t security
professionals as well as for
those that are: Here’s how important
this thing is in the grand scheme of things. So let’s break down how we can
go about creating this context. So in order to understand
where to focus your efforts and communicating
that to management you need to understand which
assets your application is handling and why
they are valuable. So assets are generally valuable
goods of physical or immaterial nature, such as production
machines, order data, financial transactions,
or personally identifying information. Assets are inspected
in the context of the software to be built in order to
understand how the system might be attacked to get a
hold of those assets, as well as how, when and
where to defend them. So also you need to
know what the impact is, should a risk
materialise and your efforts to protect an asset fail. So in that sense
assets are the targets for both deliberate and
negligent threats from inside or outside your organisation. So let’s start with
understanding an asset’s value. Value is created through
transformation or indexing or contextualising or just
display of physical goods or real world behaviour. Examples include
e-commerce systems where a customer orders goods
or fleet management solutions or warehouse solutions,
user interaction, analytics. You get the idea. So another class
of value creation in that sense might be digital
services such as streaming services, where the entirety and
the integrity of the user experience is also an asset. Your job is to assure that
this value creation based upon the asset can be realised. This is why we’re here. So the goal of a risk
profiling exercise is to identify and understand
potential attacks, as well as unforeseen
or negligent failure to an organisation in the
context of your systems. To this end
understanding an asset gives you insight into
the corresponding security goals, which derive from
either legal or regulatory requirements or
business requirements. So the security goals are, and you
might have heard about them from other places: the first one is
confidentiality, a component of privacy that protects
our data from disclosure to unauthorised parties. The second one is integrity
of information. It refers to the condition
where information is kept accurate and consistent
unless authorised changes are made. You know, information only
has value if it is correct, hence we’re protecting
it from being modified. And the third one is
availability. Availability is the key for any information
system to serve its purpose. Information must be available. It must be obtainable,
it must be usable when needed. Compromising
availability has become very common nowadays. We all have heard
about denial of service attacks. So these three security goals
confidentiality, integrity, and availability are
also commonly referred to as the CIA triad. It’s pretty easy
to remember that. So once you know about
your security goals that you have to keep
in place, the next thing is to think about what happens
when a security goal is broken. That’s also referred to
as a disaster scenario. So disaster scenarios carry
impact to your business and these risks
basically constitute your application’s risk profile.
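As a rough sketch of what such a risk profile could look like in code, the following Python snippet keeps one disaster scenario per asset and broken security goal. The class and function names, the impact scale and the example scenario descriptions are assumptions made for illustration; they are not part of the talk.

```python
from dataclasses import dataclass

# The three security goals named in the talk (the CIA triad).
GOALS = ("confidentiality", "integrity", "availability")


@dataclass
class DisasterScenario:
    asset: str
    broken_goal: str
    description: str
    impact: str  # hypothetical scale: "small", "medium" or "large"


def risk_profile(scenarios):
    """Group disaster scenarios by asset and by the security goal that is broken."""
    profile = {}
    for scenario in scenarios:
        if scenario.broken_goal not in GOALS:
            raise ValueError(f"unknown security goal: {scenario.broken_goal}")
        profile.setdefault(scenario.asset, {})[scenario.broken_goal] = scenario
    return profile


def missing_goals(profile, asset):
    """Goals for which nobody has described a disaster scenario yet."""
    return [goal for goal in GOALS if goal not in profile.get(asset, {})]


# Illustrative entries only; a real profile comes out of the joint exercise.
scenarios = [
    DisasterScenario("order fulfilment system", "availability",
                     "orders cannot be processed", "large"),
    DisasterScenario("order fulfilment system", "integrity",
                     "orders are shipped to the wrong address", "medium"),
]
profile = risk_profile(scenarios)
print(missing_goals(profile, "order fulfilment system"))
```

A list like the one returned by `missing_goals` can serve as a prompt for the next conversation with the business stakeholders, since it shows which goals have not been discussed for an asset yet.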
Disaster scenarios will serve as the basis for your threat
model, which allows you to understand ahead of
time where attacks are likely to come from, what an
attack is likely to look like and what can go
wrong at that point. It allows us to
rationally discuss in which ways
attack patterns can harm our businesses. So for the SDLC this means
that our business stakeholders actually have to sit down
with the delivery team. So this joint
task is a first step in creating shared
responsibility for security between the business
and the delivery team. So roughly speaking
in a risk profiling exercise you would have
each participant come up with a number of assets that
are handled by the application. So for every asset
you would then ask, you know, what happens
if confidentiality, integrity, or
availability is broken, and you can assign a
monetary value to that. It might not be easy to come
up with that number, but it often comes in useful
in later discussions when you’re basically arguing
why a security measure is important to build in. In the example of an e-commerce
shop, an order fulfilment system is an asset, right? So non-availability
of this asset results in a revenue
loss of x euros per hour. Sometimes it’s harder
to quantify them. You can also go with, like, small,
medium, large impact, whatever works for you in that case. Anyway, so gathering these
assets in asset libraries allows us to make security
relevant decisions within an application
development based on security goals by creating a common
language between the business and the delivery team. And so we’ve spoken
about risk profiling, but the problem is, even though
understanding our exposure is nice and is helpful, these risk models still do
not readily yield action items for how to improve our
overall security posture and which controls to
implement and exactly where to implement them. So this is done
in a second step. So that second step is called
a path to production. The path to production
answers the question of, quote unquote,
where security happens during software development. So let’s take a look at
how our teams create value. The idea of a path to
production is adapted from lean value stream mapping. Maybe some in the room
have heard about that. OK, well, you’ve
heard about it now. This is used in the
manufacturing industry to improve production
throughput. Cost of delay is something that comes
from this discipline, right? So a path to production
visualises the many steps that the teams take. Agile SDLCs will often
roughly look like this. Stakeholders get
together, they discuss things, they discuss requirements
ideas or needs or demands or whatever. And that stuff is
usually covered in protocols or reports. These inform decisions by
other groups and committees down the road and at some point. This stuff comes
together in an epic. So an epic roughly outlines
scopes for delivery work. And then this is the point
where the team comes in. These epics are then
broken down into stories, which are scoped, which also
have acceptance criteria. And these stories are then
developed one at a time and deployed into production. And every activity
in the context of application development
in a path to production is registered and brought into
a logical and chronological order. In software development, this path is typically
constantly fed with new proposals or
insights, and every team member is working at some
point in the path to production at
any given point in time. If you’ve done such an exercise
of mapping that out, at the end you will be left with
a pretty big piece of paper that shows the reality
of the team in an abstract way, but accurately and believably. The reason we’re doing this
in the context of security is that security work starts
with the first discussion of features, right, long
before a story appears on a Kanban board, way
before that time. And paths to production create a
shared responsibility across the entire
team, including the business
stakeholders that are involved in the initial discussions. And it creates empathy for the
work the other team members do. And that is crucial
for establishing shared context, shared
understanding and shared responsibility. So let’s
take a path to production
and walk through it. So let’s take a look
at the analysis phase. We typically break down high
level business requirements into scoped chunks
of requirements. So security
considerations are set in this phase at high
leverage, and the inputs for these security
considerations come from, as I said
before, the asset library. We have the security
goals, we have our disaster
scenarios, and that informs any decisions taken here. So threat modelling allows
us to proactively identify potential issues in
the technical design of an application. So it’s a good
practice to understand cross-functional
requirements that need to be scoped into
a story and incorporated into the technical
architecture. Threat models tie the delivery and assurance
work within the team back to the management layer
of the organisation based on the assets we
talked about earlier. So the interesting part here
is that whatever you come up with in a threat
modelling exercise, any countermeasures you
decide on are not decided upon in isolation in the team. They need to be in
agreement with the business side of your organisation. The four typical courses
of action to any findings are accept, avoid, transfer
or mitigate a threat. All of the above strategies
are equally valid. And please also note
that it is equally valid to mitigate the risk as it
is to mitigate the outcome. Good threat model makes that
context clear either way. So any actions you
agree upon need to be signed off by the
business, typically represented by the product owner, and then followed up on and
implemented by the team. So I strongly advise giving
security considerations and threat modelling
a dedicated space in your path to production. This can be done, for
example, in the form of a definition of ready. So if not if this doesn’t work
for you adding threat modelling and security to your
definition of ready, you can also run
recurring sessions. But please make
sure that you run them regularly say every
two weeks, every four weeks. It’s more important to do
threat modelling regularly, then having a threat
modeling session every quarter,
every half a year. All right. I’ve spoken a lot about
threat modelling. Let’s dive into
what we can actually do there. I’ll go first into the
agile threat modelling part in the middle, then
into the scenario-based threat modelling and then I’ll give
a quick note about exploratory methods like attack trees. So as for the agile part,
when we’re modelling threats, our goal is to find
the highest-value security work we can do and
get that into the team’s backlog right away. We do this by
applying a time box, so that we threat
model little and often, and we capture different
parts of the system every time we do this, and we try
different zoom levels. So over time all these
different perspectives and zoom levels help us to get
a good overview of the system and threat modelling
becomes a part of your continuous agile
development process. One thing that I
really like doing, and I brought a bunch of these
toolkits here, so I have ten of them here, it’s first come,
first served afterwards, is brainstorming with
STRIDE. Who in this room has heard about
STRIDE? 1, 2, 3, 4, 5. Cool. So
brainstorming based on STRIDE is quick and flexible to
extend your ways of working and to understand
what can go wrong. Basically STRIDE is an acronym
for spoofed identity, tampering with input, repudiation of
action, information disclosure, denial of service and
escalation of privilege. So these are basically ways
in which your application, your systems can be broken. That’s a pretty popular toolkit.
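A minimal sketch of how the STRIDE categories could be turned into brainstorming prompts for a single story or feature. The function name and the example feature are hypothetical and only serve as an illustration.

```python
# STRIDE categories, keyed by the letter they contribute to the acronym.
STRIDE = {
    "S": "Spoofed identity",
    "T": "Tampering with input",
    "R": "Repudiation of action",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Escalation of privilege",
}


def brainstorm_prompts(feature):
    """Return one 'what can go wrong?' question per STRIDE category."""
    return [
        f"{letter} - {category}: how could this break '{feature}'?"
        for letter, category in STRIDE.items()
    ]


# Hypothetical story taken from an e-commerce backlog.
for prompt in brainstorm_prompts("checkout with a saved payment method"):
    print(prompt)
```

The six questions are only a starting point for a time-boxed session; the findings themselves still have to come from the team.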
So you would investigate the current iteration’s
functionality delta or, if you integrate it
into your definition of ready, the
functionality delta in that story, to understand
how that functionality can be attacked or otherwise
broken. Heads up: many threat modelling
frameworks advise you to name a threat actor. In my personal experience, it is rather hard to come up
with a believable threat actor and it doesn’t really
yield better results. Basically, unless you’re likely
to be attacked by the NSA, you know, just keep it
rational. And should you actually be afraid of
being attacked by the NSA, I truly wonder why you’re
sitting here, because you should know that stuff already. But jokes aside, the next step
is basically to map and order the findings according
to their impact. So you’ve done threat modelling. You have your findings. What do we tackle first? Right, it’s about getting
the highest value out of this exercise. So just take a moment
to reflect on what’s going to happen to the
business in case this actually came to pass. So you have written
down the context for that in the asset library. I’m referring to
that pretty often. So are you going
out of business? Are you going to spend a day
recovering your database? Do you lose a
competitive advantage? Or is it possible that
your elaborate disaster scenario is just maybe a
minor nuisance at best? So anyway, the top
threats can then be picked up as additional
acceptance criteria to be added to a user story,
security debt you can track on a radiator in your
team space, changes to your team’s definition
of done, time-boxed spikes, epics to implement them, whatever.
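To illustrate the ordering step, here is a small sketch that ranks findings by their business impact and picks the top ones for the backlog. The impact scale, the names and the example findings are assumptions for illustration, not something prescribed in the talk.

```python
from dataclasses import dataclass

# Hypothetical impact scale; a lower rank means the finding is tackled first.
IMPACT_RANK = {"large": 0, "medium": 1, "small": 2}


@dataclass
class Finding:
    threat: str
    impact: str  # taken from the matching disaster scenario in the asset library


def top_threats(findings, limit=2):
    """Order findings by impact and return the first `limit` of them.

    sorted() is stable, so findings with equal impact keep the order
    in which the team wrote them down.
    """
    ranked = sorted(findings, key=lambda finding: IMPACT_RANK[finding.impact])
    return ranked[:limit]


# Illustrative findings from a made-up threat modelling session.
findings = [
    Finding("verbose error pages leak stack traces", "small"),
    Finding("order API accepts unauthenticated writes", "large"),
    Finding("basket can be flooded with requests", "medium"),
]
for finding in top_threats(findings):
    print(f"{finding.impact}: {finding.threat}")
```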
Should this agile threat modelling part not be for you, maybe because it doesn’t
work for you or your team is not experienced enough to do that, you can also run
tabletop exercises, and this is the
scenario-based approach. So tabletop exercises
offer inexperienced teams a way to understand security risk. A team would be
basically confronted with a couple of
disaster scenarios and would list all the
necessary countermeasures. So this technique draws
on pre-existing knowledge maybe from former
engagements or something and tries to map that
to the current tech stack and to the
current architecture. So you look at things like
boundaries and interfaces with other services,
with other systems. Those might break. You look at
recovering from unavailability, or you look at recovering
from unexpected data loss, or graceful degradation
because third-party services aren’t there. All of these things, right? Just be creative
when you come up with these ways in which
your system can fail. There is a third part, that
is the exploratory methods, like attack trees. Those are recommended if you have a critical component
in the context of high-risk, high-yield assets or if
you’re doing digital forensics. Attack trees are
basically a method of analysing the
security of a system top down. So you start from your
well-observed result or your disaster scenario, and then you drill
down into your system. They are very labour intensive. They require expert knowledge. They have limited payoff. And during conversations
with security practitioners, a common theme was that attack
trees often lend themselves to waterfall, big-design-up-front type of thinking,
typically in high-assurance
environments, and actually they should be avoided
by agile teams. But I named them here
for completeness’ sake, because if you want to have
a discussion with a security practitioner, these
things will come up, and maybe you want
to say no, then, you know, use something
else. Anyway, enough with the analysis phase. Let’s dive into development. So many security controls
in the development phase can be just part of your
automated CI/CD practices to contribute to your
overall system stability. Examples like test pyramids
and feature toggles are pretty much well known, so I will not speak
about them yet. But a seldom-discussed topic
is scanning of dependencies. However, dependencies make up
the vast majority of your code that’s executed at runtime. So your code is probably
like 3, 5% or whatever of what you’re running, and the rest
comes in from, I don’t know, basically half of GitHub. Anybody who has ever done
an npm install basically knows that they
basically have half of GitHub on their machine. So most likely you’ll be
using semantic versioning for floating dependencies,
within which new releases can be automatically integrated. So the expectation for
floating version ranges is that upon the next build
you get bug fixes automatically pulled in from
upstream, and other non-breaking changes and
improvements just float in. This is a reasonable
assumption, but by now we’ve learned that this
doesn’t really work. But it still has
some sort of value, because most developers work
exclusively fix-forward. They are not backporting
to previous releases. I have not seen that in any
modern tool chain, basically exclusively fix-forward here. So the result of blindly
integrating untested releases is basically broken
builds, broken runtimes, non-deterministic builds
and the works-on-my-machine problem. I think we’re all familiar
with that at this point. So there’s new tooling around a sane way to integrate new
versions and fixed versions. So tools like
Greenkeeper, for example, integrate new versions
and bug fixes automatically if the test suite executed
by your CI is green. Fitness for production is then the second step, and that’s usually
tested through, for example, a blue-green
deployment or, you know, some deployment that
can easily be discarded or rolled back in
case the integration of the new dependency didn’t go so well. That practice is also often
referred to as canary build or canary deployment. Should you not be using
blue-green deployments, just, you know, make sure that
you have an automatic rollback in place. And yeah, I think what I want
to get to here is that it requires some
setup to automatically pull the new versions in. But having the latest
patched versions of a library is actually like halfway
there in terms of security. Let’s look at what
happens if time goes on. So time goes on, new
vulnerabilities are discovered, and we need to have a
safeguard for the growing number of vulnerabilities in
our libraries and frameworks in our existing builds,
in our deployed versions, right? So a prominent
example of that is Equifax’s Apache
Struts vulnerability. That thing stayed
unpatched for months and they knew that it was there. So tools like, for example, OWASP
Dependency-Check, some of you may be familiar with that, or npm
audit give you the ability to scan dependencies
for so-called common vulnerabilities and exposures. It was exploits a while ago,
they changed the naming here. These security
vulnerabilities are published in well-known public databases. For example, there is the
NIST National Vulnerability Database. There are the ones run by the
MITRE Corporation or other CVE numbering authorities. There are also things
like HackerOne. You know, these, what are
they called, ethical hacker platforms, I think they’re
called now. Anyway, so there are a couple
of places where vulnerabilities are published. You just want to know if
the versions you’re using show up there, because if
they do, you need to fix them or you need to upgrade
them, depending on what the path to resolution is here. Same as libraries
and frameworks, interestingly, also containers
and runtime environments are subject to having
vulnerabilities, and they also need to
be inspected regularly. So tools like Clair or JFrog
Xray, anybody heard of those? Yes, I see a couple
of heads nodding. Awesome. So these allow
you to scan the layers in your containers for
CVEs, for published common vulnerabilities and exposures. So scanning of
dependencies and components is pretty much
universally accepted as a good practice
at this point. So a lot of tools come
with that out of the box. Also on top of that, you have
scanners like Snyk or Twistlock or Aqua that hook
into your CI/CD pipelines and into your
production environments. And these tools can
give you sort of a constant vulnerability
feed, basically, so that you know what’s
deployed in your system and where vulnerabilities
are likely to hide. Heads up: it’s a good practice
to continuously check whether your
dependencies are outdated or have vulnerabilities. I typically
recommend once a day. My recommendation is not to
do that in your pipeline, but to do that out of band in
a cron job or something. Because the problem
is, if you have that on your path to
production and your deployment pipeline is
blocked because there was just a vulnerability discovered
or a new version was just published, and you need to
ship a hotfix to production, that might not be the best
of places to be in. It’s also not uncommon
for deployment pipelines of microservices to
just not be touched for a couple of months
on end, because, you know, the service is not broken,
so we don’t really need to fix it. So if the pipeline isn’t
run and you do these checks on your pipeline,
you wouldn’t know about any CVEs published
while the pipeline is not run. All right. So this gives us a
pretty good picture of the things that we can do
during development. Specifically about the things that can be
automated during development. Let’s dive a bit into live
support. At any point in time, you want to be able to visualise
the health of your system in a dashboard. So this is not
about response times or disk capacity or network
throughput or all that stuff. But about the ability of your
system to serve the business right. So your system is built
alongside a user journey or different points of focus
for the business, for which it delivers value. So you need to basically put
all of these important points, all of these points
of interaction where you create value
for the business, onto a dashboard, and you want
to be able to monitor them simultaneously, one aspect per
tile. For example, you want to know the
number of interactions with a shopping basket
in an e-commerce scenario, because should that number
suddenly drop to zero, you know there is some
bug in your system that you’re not detecting
and you’re no longer creating value for the business. So you’ve just nixed
an asset. The basis for any kind of insight into
your production environment is structured logging,
and I’m hoping I’m preaching to the choir here.
For that, instead of logging a string in natural
language, you’re logging an indexable data
structure, for example JSON or something like that. So basically, these log
outputs of your containers are then
picked up by a collector, aggregated in a
central place and indexed, so that in the
event of an incident you can have pre-formulated
searches ready to dive into what went wrong and
where something went wrong. That’s pretty cool. And we have basically
all the information, all the insights from
the different services in the same place. So the next step is to be
able to correlate things across services. And this is typically
done by a trace ID. A trace ID is just
an identifier that is tacked onto an
incoming request and passed on
with every request through a decentralised system, so that basically in the event
that something went wrong, you can just use this
identifier that you also put into your log messages and
correlate all the interactions that happened along the path
of that request being handled. So if somewhere down in your
system, something breaks, you can just pull up the
entire chain of interactions and how your system
reacted to that. Once we have that, we know
what our systems are doing. We have an ability to
understand how things happen across the system. So the next
big piece in the puzzle is alerting or rather,
in a broader sense, distinguishing nominal
from erroneous behaviour. So having dashboards,
for example, that show the
behaviour of your system from a business
perspective allows you to learn over
time what nominal and what erroneous
behaviour is going to look like. Think about the
pilots in a cockpit having, like, all of
these different dials that they use to pilot the
plane safely. Sending out an alert comes in handy
when you understand that the operating conditions
for your services have changed. So you have to know
what normal looks like so that you understand
when your system is operating outside of normal, so that you can then send a
ping: hey, something is wrong, something needs fixing. It’s about understanding when
your system operates outside of safe parameters.
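The monitoring ideas from this part of the talk (structured log records, a trace ID to correlate them, and an alert when a business metric leaves its normal range) can be sketched roughly as follows. The function names, the field names, the three-standard-deviation threshold and the sample numbers are assumptions chosen for illustration; real setups would typically rely on a log collector and a monitoring system instead.

```python
import json
import statistics
import uuid


def log_event(service, message, trace_id, **fields):
    """Build one structured log line as JSON instead of a free-text string."""
    record = {"service": service, "message": message, "trace_id": trace_id, **fields}
    return json.dumps(record, sort_keys=True)


def correlate(log_lines, trace_id):
    """Collect all records that belong to one request, across services."""
    records = (json.loads(line) for line in log_lines)
    return [record for record in records if record.get("trace_id") == trace_id]


def outside_nominal(history, current, tolerance=3.0):
    """True if `current` is more than `tolerance` standard deviations from the mean.

    The spread is floored at 1.0 so that a perfectly flat history does not
    turn every tiny change into an alert (an arbitrary choice for this sketch).
    """
    mean = statistics.mean(history)
    spread = max(statistics.pstdev(history), 1.0)
    return abs(current - mean) > tolerance * spread


# One request passing through two hypothetical services, plus an unrelated one.
trace_id = uuid.uuid4().hex
lines = [
    log_event("gateway", "request received", trace_id),
    log_event("basket", "item added", trace_id, items=2),
    log_event("basket", "item added", uuid.uuid4().hex, items=1),
]
print(len(correlate(lines, trace_id)))

# Made-up basket interactions per minute; a sudden drop to zero is not nominal.
basket_history = [120, 131, 118, 125, 127]
print(outside_nominal(basket_history, 0))
print(outside_nominal(basket_history, 126))
```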
When you receive an alert you need to act. So let’s take a look
at how emergency first responders are working.
Standard operating procedures are things they use. And that’s what they use
when things get hectic. It’s pretty much a concise
playbook or a checklist that contains information
that is necessary to deal with incidents. So they are not intended
to be checklists that dumb down a problem to the point
that it can be handled by just about anyone. But, you know, they
contain procedures for, say, restarting
services, information on where to look for
logs and procedures for handling data in the
event of an incident. So standard operating
procedures are not a service manual, but rather
think of them as sort of the binder that the pilots, I’m using the pilot
analogy very often here, like the emergency
binder that pilots have in a plane, that they look
at: oh, this thing went wrong. Wait, what do we do here? What’s
our next step here? How do we combat
it? It allows you to be able to react to incidents
or outages, largely independent of seniority,
tenure or experience. That’s a really
valuable thing to have. Apart from using those standard
operating procedures in live support, they can also be
tested in tabletop exercises. You know, this will
make sure that your standard
operating procedures are, first of all, commonly
understood, state of the art and actionable for the team. So you would just
basically present your team with, you know,
something went wrong. Our system is down. What do we do? Where do we start? Right. Now
that we have that in place, we understand what
our system is doing. We understand how parts of
our system interact together. We have the ability to react
when our system operates outside of safe parameters. There is one last thing that
really needs to be addressed. And that’s tech debt. Set aside some
resources to aggressively pay off tech debt because this is
the most common source of bugs. A security incident is
nothing other than a bug. So it helps to have a one-person
rotating firefighter role, and that person is
dedicated to that role. This person does
not pick up stories while he or she is on that role. So when there is no actual fire
to fight and things work well, this person can pick up
things from the tech debt wall and work on these
little things that rarely make it into a story. But have a large impact on the
safe operation of your system. Emphasis really being one
person because nothing really drives familiarity
with a code base like having to shepherd a system
through a busy day when you’re new to the whole team right. I think we’ve all
been there at some point. I would also say do not pair
junior and senior people on this because it is likely
that the senior person will be and stay in that driver’s seat. So the junior person
is actually you know picking up some
things, but you learn most when you have hands on,
when you’re actually doing that stuff. This person can, of
course pull in resources if he or she is not that
familiar with that thing they’re fixing right now. But experiencing that moment
of panic when you just see, oh things go red. And I’m in the driver’s
seat is a valuable lesson. I think everybody has to
experience that at some point. So if that
is a rotating role, everyone has to
pick it up at some point. This forces the team to
take collective ownership of the code and all parts of
the code, independent of how ugly they are or who authored it
or when they were written or you know whatever. Everybody has to
pick up that role. Everybody has to make
sure things work. That’s it from my side. If this talk got you all
fired up for security, please note that many teams
will break new ground here. So in order for the
investment to pay off over an extended period of time, it’s beneficial to work closely
with your business stakeholders on this. Only if the team understands
their work in the larger context of the business can
measures around security be explained,
be justified and be agreed upon. This requires the business’s
willingness to invest in this. That’s just the
precursor for this. So yeah. The last thing I want to say
is go forth and build security and I’ll be taking
your questions now.

