<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Posts tagged with “Devops” on Mark van Lent’s weblog</title>
  <updated>2022-11-10T00:00:00+00:00</updated>
  <link rel="self" type="application/atom+xml" href="https://markvanlent.dev/tags/devops/index.xml" hreflang="en"/>
  <id>tag:markvanlent.dev,2010-04-02:/tags/devops/index.xml</id>
  <link rel="alternate" type="text/html" href="https://markvanlent.dev/tags/devops/" hreflang="en"/>
  <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
  <rights>Copyright (c) Mark van Lent, Creative Commons Attribution 4.0 International License.</rights>
  <icon>https://markvanlent.dev/favicon.ico</icon>
  <entry>
    <title type="html"><![CDATA[All Day DevOps 2022]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2022/11/10/all-day-devops-2022/" type="text/html" />
    <id>https://markvanlent.dev/2022/11/10/all-day-devops-2022/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="costs" />
    <category term="devops" />
    <category term="failure" />
    <category term="metrics" />
    <category term="observability" />
    <category term="sre" />
    <category term="terraform" />
    
    <updated>2022-11-10T21:32:11Z</updated>
    <published>2022-11-10T00:00:00Z</published>
    <content type="html"><![CDATA[<p>Today was the 7th <a href="https://www.alldaydevops.com/">All Day DevOps</a>. Just like
<a href="/2021/10/28/all-day-devops-2021/">last year</a> the organizers managed to get 180
speakers, spread over 6 tracks, to inform, teach and entertain us. As usual I
made some notes of the talks that I attended.</p>
<p><img src="/images/all_day_devops_2022_banner.webp" alt="All Day DevOps 2022 banner"></p>
<h2 id="keynote-the-rise-of-the-supply-chain-attack--sean-wright">Keynote: The Rise of the Supply Chain Attack &mdash; Sean Wright</h2>
<p>Dependency managers (like Apache Maven) changed the way software is being
developed: they make it much easier to use dependencies in your application
(you don&rsquo;t have to fetch them manually yourself). As a result, building your own
libraries is now less common, since using an available (open source) library is
much easier.</p>
<p>To get a feel for the scale: NPM serves 2.1 trillion downloads (requests) per
year. This obviously includes automatic builds, but the growth over the last
couple of years has been exponential nonetheless. And using third party
libraries can be a very good thing. Good libraries can be much more secure and
have fewer bugs than some library developed in house.</p>
<p>Types of supply chain attacks:</p>
<dl>
<dt>Dependency confusion</dt>
<dd>Register the name of a private package (which you know is used by a company)
in a public repository. If the package manager searches for the package in the
public repository <em>before</em> the private one, the malicious package is used.</dd>
<dt>Typo squatting / masquerading</dt>
<dd>Use misspelled or &ldquo;familiar&rdquo; names in the hopes someone uses the wrong package.</dd>
<dt>Malicious package</dt>
<dd>Malicious payload in a package that promises (and perhaps even delivers)
useful functionality.</dd>
<dt>Malicious code injection</dt>
<dd>Compromise of an existing package (e.g. by compromising the build system or
repository).</dd>
<dt>Account takeover</dt>
<dd>Malicious actor can log in as the original author.</dd>
<dt>Exploiting vulnerabilities in packages</dt>
<dd>Using a known or unknown vulnerability in a library.</dd>
<dt>Protestware</dt>
<dd>For example a legitimate library with notes protesting against the war in
Ukraine.</dd>
</dl>
<p><em>After this Sean unfortunately dropped from the session.</em></p>
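<p>To make the typo squatting entry from the list a bit more concrete, here is a
minimal Python sketch (not from the talk) that flags a requested package name
when it closely resembles, but does not equal, a name on an allowlist. The
allowlist, the function name <code>possible_typosquat</code> and the similarity
cutoff of 0.8 are assumptions for illustration only.</p>
<pre><code class="language-python">from difflib import get_close_matches

# Hypothetical allowlist of packages the project really depends on.
KNOWN_PACKAGES = ["requests", "numpy", "pandas", "django"]


def possible_typosquat(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return the known package that `name` resembles, or None.

    An exact match is the real package, so it is never flagged.
    """
    if name in known:
        return None
    matches = get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for candidate in ["requests", "reqeusts", "djanga", "flask"]:
    print(candidate, "->", possible_typosquat(candidate))
</code></pre>
<p>A check like this only catches look-alike names; it does nothing against the
other attack types in the list.</p>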
<h2 id="taking-control-over-cloud-costs--amir-shaked">Taking Control Over Cloud Costs &mdash; Amir Shaked</h2>
<p>Reasons to analyze your cloud costs:</p>
<ul>
<li>Costs translate to profit</li>
<li>Usage is directly tied to costs (and if you don&rsquo;t monitor properly, your spend
could increase unexpectedly)</li>
<li>Be in control: by monitoring you can determine what impact certain actions
have on your business</li>
<li>Be intelligent: by knowing what you spend where, you can connect it to your
business KPIs</li>
<li>Save money</li>
</ul>
<p>When you scale (hyper growth), you&rsquo;ll want to know whether your costs grow
faster than the number of users/sales/etc. You don&rsquo;t want to end up in a
situation where more success means <em>less</em> profit.</p>
<p>Things to look for in your environment:</p>
<ul>
<li>Unused resources</li>
<li>Over commitment</li>
<li>Under commitment</li>
<li>Price per unit we care about (e.g. users, sales, etc)</li>
<li>Bugs</li>
</ul>
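<p>The &ldquo;price per unit&rdquo; item and the warning about costs growing faster than
usage can be sketched in a few lines of Python. This is my own illustration,
not something Amir showed; the monthly figures and function names are made up.</p>
<pre><code class="language-python">def unit_cost(total_cost, units):
    """Cost per unit we care about (e.g. per user or per sale)."""
    return total_cost / units


def growth(previous, current):
    """Relative growth between two periods, e.g. 0.25 for +25%."""
    return (current - previous) / previous


def costs_outpace_usage(prev_cost, cost, prev_units, units):
    """True when spend grows faster than the unit it should track."""
    return growth(prev_cost, cost) > growth(prev_units, units)


# Hypothetical monthly figures: (cloud cost in dollars, active users).
months = [(10_000, 5_000), (12_000, 6_000), (18_000, 7_000)]

for (prev_cost, prev_users), (cost, users) in zip(months, months[1:]):
    flag = costs_outpace_usage(prev_cost, cost, prev_users, users)
    print(f"cost per user: {unit_cost(cost, users):.2f}, outpacing: {flag}")
</code></pre>
<p>In the second comparison the costs grow by 50% while the number of users
grows by roughly 17%, which is the situation you want to be alerted about.</p>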
<p>Best practices you can arrange from day one:</p>
<ul>
<li>Export billing data and create e.g. dashboards</li>
<li>Structure and organize resources for fine-grained cost reporting</li>
<li>Configure policies on who can spend and who has administrator permissions</li>
<li>Have a culture of &ldquo;showback&rdquo; in your organization. Make your employees aware of
the costs, have budgets and alerts.</li>
</ul>
<p><img src="/images/all_day_devops_2022_amir_shaked.webp" alt="Amir Shaked with the takeaways from his talk"></p>
<h2 id="beyond-monitoring-the-rise-of-observability-platform--sameer-paradkar">Beyond Monitoring: The Rise of Observability Platform &mdash; Sameer Paradkar</h2>
<p>Customer experience is important for your revenue. You need to deliver your
service and make sure you know how well you&rsquo;re able to do so. Observability
helps you find the needle in the haystack, identify the issue and respond before
your customers are affected.</p>
<p>Pillars of observability:</p>
<dl>
<dt>Metrics</dt>
<dd>Numeric values measured over time. Easy to query and retained for longer periods.</dd>
<dt>Logs</dt>
<dd>Records of events that occurred at a particular time (plain text, binary,
etc). Used to understand the system and what&rsquo;s going on.</dd>
<dt>Traces</dt>
<dd>Represent the end-to-end journey of a user request through the subsystems of
your architecture.</dd>
</dl>
<p>Relevant key performance indicators (KPIs) and key result areas (KRAs):</p>
<ul>
<li>Customer experience</li>
<li>Mean time to repair (MTTR)</li>
<li>Mean time between failure (MTBF)</li>
<li>Reliability and availability</li>
<li>Performance and scalability</li>
</ul>
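<p>As a rough illustration of how MTTR and MTBF could be derived from incident
records, here is a small Python sketch. It assumes the common definitions
(total downtime divided by the number of incidents, and total uptime divided by
the number of failures); the incident data and the reporting period are
hypothetical and were not part of the talk.</p>
<pre><code class="language-python">from datetime import datetime, timedelta

# Hypothetical incidents as (start of outage, service restored).
incidents = [
    (datetime(2022, 11, 1, 10, 0), datetime(2022, 11, 1, 10, 30)),
    (datetime(2022, 11, 5, 14, 0), datetime(2022, 11, 5, 15, 30)),
]
# Hypothetical reporting period of ten days.
period_start = datetime(2022, 11, 1, 0, 0)
period_end = datetime(2022, 11, 11, 0, 0)


def total_downtime(incidents):
    """Sum of the outage durations."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean time to repair: total downtime divided by incident count."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, period_start, period_end):
    """Mean time between failures: total uptime divided by failure count."""
    uptime = (period_end - period_start) - total_downtime(incidents)
    return uptime / len(incidents)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, period_start, period_end))
</code></pre>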
<p>An observability platform gives you more visibility into your system&rsquo;s health
and performance. It allows you to discover unknown issues. As a result you&rsquo;ll
have fewer problems and blackouts. You can even catch issues in the build phase
of the software development process. The platform helps understand and debug
systems in production via the data you collected.</p>
<p><img src="/images/all_day_devops_2022_sameer_paradkar.webp" alt="Sameer Paradkar shows a reference architecture of an observability solution"></p>
<p>AIOps applies machine learning to the data you&rsquo;ve collected. It&rsquo;s the next stage
of maturity. Its goal is to create a system with automated functions, freeing up
engineers to work on other things. Automating remediation of issues can also
greatly reduce response time and mean time to repair. This means that the
customer experience is restored faster (or is never degraded to begin with).</p>
<h2 id="failure-is-not-an-option-its-a-fact--ixchel-ruiz">Failure Is Not an Option. It&rsquo;s a Fact &mdash; Ixchel Ruiz</h2>
<p>Failure can cause a deep emotional response: we can get depressed and it can
make us physically sick. On one side of the spectrum, a failure can cause harm
to other people. On the other side we could embrace failure and make things safe to
fail. Failure in IT on the project level is quite common. Failure can also
happen on a personal level.</p>
<p><img src="/images/all_day_devops_2022_ixchel_ruiz.webp" alt="Ixchel Ruiz talks about three types of errors: preventable, complexity related and intelligent"></p>
<p>Not all failures are created equal. There are three types:</p>
<dl>
<dt>Preventable</dt>
<dd>These are the &ldquo;bad&rdquo; failures. There&rsquo;s no reason to allow these.</dd>
<dt>Complexity related</dt>
<dd>These failures are due to the inherent uncertainty of work. Usually it is a
particular combination of needs, people and problems that causes them.</dd>
<dt>Intelligent</dt>
<dd>The &ldquo;good&rdquo; failures, since they provide new knowledge.</dd>
</dl>
<p>Steps to learn from failures:</p>
<ul>
<li>Recognize</li>
<li>Understand the cause</li>
<li>Extract lessons to prevent future failure</li>
<li>Share the information</li>
<li>Practice for the next failure in a safe and controlled setting</li>
</ul>
<p>Increase the return by learning from every failure, sharing the lessons and
reviewing the pattern of failures. Do note that none of this can happen in an
environment without psychological safety. You need to feel safe to discuss your
failures, doubts or questions to be able to learn.</p>
<h2 id="accurate-metrics--christina-zeller-and-marcus-crestani">Accurate Metrics &mdash; Christina Zeller and Marcus Crestani</h2>
<p>A manufacturing process was monitored and there was a nice dashboard to show
whether there were any problems. However, at a certain moment there was a
problem, but the dashboard was still claiming everything was fine.</p>
<p><img src="/images/all_day_devops_2022_christina_zeller.webp" alt="Christina Zeller explaining the problem that the monitoring system did not detect a problem"></p>
<p>What was going on? Unreliable network? Erratic monitoring system? Flawed
collection of metrics? It turned out to be <em>all of them</em>.</p>
<p>Sampling rates, retention policy and network issues can cause missing
measurements in your time series database. This missing information can cause an
apparent drop in the failure rate if you are unlucky enough that the missing
samples are the failed ones. So you <em>think</em> everything is fine, but there is
something wrong in your environment.</p>
<p>Modelling metrics differently can help. One possible improvement is to have a
duration <em>sum</em> in your metrics instead of just the duration. If you now miss a
sample, the sum will still indicate that there has been a failure.</p>
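<p>A toy Python simulation (my own sketch, not the speakers&rsquo; code) shows the
difference between the two ways of modelling the metric. The durations and the
pattern of lost samples are invented; the slow fourth step stands in for the
failure.</p>
<pre><code class="language-python"># Durations (milliseconds) of five production steps; the fourth is far too slow.
durations_ms = [1000, 1100, 900, 30000, 1000]
# Hypothetical collection results: False means the sample never arrived.
collected = [True, True, True, False, True]

# Model 1: expose only the latest duration.
latest_samples = [d for d, ok in zip(durations_ms, collected) if ok]

# Model 2: expose a running sum of all durations.
sum_samples = []
running_sum = 0
for duration, ok in zip(durations_ms, collected):
    running_sum += duration
    if ok:
        sum_samples.append(running_sum)

print(max(latest_samples))                # 1100: the slow step is invisible
print(sum_samples[-1] - sum_samples[-2])  # 31000: the slow step is in the increase
</code></pre>
<p>With the running sum, the jump between the last two collected samples still
reveals that something took far too long, even though the sample taken at the
time of the failure was lost.</p>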
<p>Using histograms is even better since you place values in buckets, e.g. failures
in our case. A disadvantage however is that the metrics creation system must now
also know what qualifies as a success or failure.</p>
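<p>To illustrate the bucket idea, here is a minimal histogram in Python. It is a
sketch under my own assumptions and not the implementation from the talk: the
class name, the bucket bounds and the rule that anything above 5 seconds counts
as a failure are all hypothetical.</p>
<pre><code class="language-python">from bisect import bisect_left


class DurationHistogram:
    """Counts observations per bucket; the counts only ever increase."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra bucket for observations above the highest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, duration):
        # bisect_left returns the index of the first bound >= duration.
        self.counts[bisect_left(self.bounds, duration)] += 1


# Hypothetical bounds in seconds; here anything above 5.0 counts as a failure.
histogram = DurationHistogram([1.0, 2.0, 5.0])
for duration in [0.8, 1.0, 1.5, 30.0, 0.9]:
    histogram.observe(duration)

print(histogram.counts)      # [3, 1, 0, 1]
print(histogram.counts[-1])  # 1 observation in the overflow (failure) bucket
</code></pre>
<p>Because the bucket counts never decrease, a lost sample does not erase the
failure: the next sample that does arrive still contains it. The chosen bounds
are exactly the place where the metrics code has to know what a failure is.</p>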
<p>Takeaways:</p>
<ul>
<li>Always collect metrics</li>
<li>Check that your metrics match reality</li>
<li>Always consider missing data</li>
<li>For duration metrics always use histograms</li>
</ul>
<p>After they improved their metrics, the monitoring system matched the actual
state of the manufacturing environment again.</p>
<p>Tip: look into <a href="https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)">hexagonal
architecture</a>,
also known as &ldquo;ports and adapters architecture.&rdquo;</p>
<h2 id="why-is-it-always-dns-tls-and-bad-configs--philipp-krenn">Why Is It Always DNS, TLS, and Bad Configs? &mdash; Philipp Krenn</h2>
<p><img src="/images/all_day_devops_2022_philipp_krenn.webp" alt="Philipp Krenn compares DNS, TLS and bad config with the three main characters in the Harry Potter series: when something bad happens, at least one of them is always involved"></p>
<p>DNS, TLS and bad config are where failures are waiting to haunt us when we least
expect it. We need to have a tool to find the issues in our system early on.
Health checks can be this essential tool to alert us.</p>
<p>You are able to narrow down where the issue is if you structure your health
checks like this:</p>
<dl>
<dt>Outside the network (different provider)</dt>
<dd>This allows you to detect issues with e.g. DNS, the uplink, a firewall, a load
balancer, service availability and latency.</dd>
<dt>On the network (different AZ)</dt>
<dd>If you are on the network, you could see issues with the network, firewall,
TLS, service availability and latency. You can compare your measurements from
outside with what you see on the inside.</dd>
<dt>On the instance</dt>
<dd>Again, by comparing the local point of view with the outside, you can detect
issues with service availability, proxy vs service, latency and dependencies
(e.g. a database that is slow or that the service cannot reach).</dd>
</dl>
<p>Examples of tests you can use:</p>
<ul>
<li>Ping a host</li>
<li>Set up a TCP connection</li>
<li>Do an HTTP request</li>
<li>POST data to a service</li>
<li>Synthetic monitoring where you simulate button clicks</li>
</ul>
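<p>Two of these tests, the TCP connection and the HTTP request, are easy to
sketch with the Python standard library. This is a minimal illustration and not
something from the talk; the function names and the default timeout of 3 seconds
are my own choices.</p>
<pre><code class="language-python">import socket
import urllib.error
import urllib.request


def tcp_check(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be set up in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, timeout=3.0):
    """True if an HTTP GET on url returns a 2xx status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status // 100 == 2
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, TLS errors,
        # non-2xx responses raised as HTTPError, and malformed URLs.
        return False
</code></pre>
<p>Running the same two checks from outside the network, on the network and on
the instance itself, and then comparing the results, is what helps to narrow
down where the problem is.</p>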
<p>Health checks are cheap to run and give you a fast overview. They <strong>do not</strong>
replace observability and only tell you something is broken, not what is going
on. Start simple and only add synthetics when/where needed since they are more
complex.</p>
<h2 id="comprehending-terraform-infrastructure-as-code-how-to-evolve-fast--safe--anton-babenko">Comprehending Terraform Infrastructure as Code: How to Evolve Fast &amp; Safe &mdash; Anton Babenko</h2>
<p>By using <a href="https://registry.terraform.io/namespaces/terraform-aws-modules">Terraform AWS modules</a>
you&rsquo;ll have to write less Terraform code yourself, compared to using the
AWS provider resources directly.</p>
<p><img src="/images/all_day_devops_2022_anton_babenko.webp" alt="Anton Babenko compares writing all Terraform code yourself in one big book with using terraform-aws-modules which results in a smaller book"></p>
<p>For example: if you write your own Terraform code from scratch for an example
infrastructure with 40 resources, you&rsquo;ll need about 200 lines of code. Once you
introduce variables, that code base will grow to 1000 lines of code. When you
then split it up into modules, you&rsquo;ll need even more code.</p>
<p>Instead, if you use <code>terraform-aws-modules</code> you&rsquo;ll have more features than your
own modules would, and you&rsquo;ll only need about 100 lines of code.</p>
<p>Questions with regard to <code>terraform-aws-modules</code>:</p>
<dl>
<dt>Why?</dt>
<dd>Understandability is an important issue. To help with these questions (like
&ldquo;why is it done this way?&rdquo;) there are autogenerated documentation and diagrams. A
visualization of the dependencies also helps.</dd>
<dt>Does it work?</dt>
<dd>The project focusses on static analysis (using <code>tflint</code> and
<code>terraform validate</code>) and running Terraform on examples. Anton thinks that testing
code should be for humans (so HCL is better than Go).</dd>
<dt>How to get feedback quicker?</dt>
<dd>Options: run Terraform locally instead of in CI/CD pipeline, simplify the
code, and restrict surface area of your code using policies, guardrails, etc.
Split up your Terraform code, with different states, to speed up Terraform
runs.</dd>
</dl>
<p>Most important word for this talk: <strong>understandability</strong>.</p>
<blockquote>
<p>50-67% of time for software projects spent on maintenance.</p></blockquote>
<p>Some useful links Anton listed:</p>
<ul>
<li><a href="https://github.com/terraform-community-modules">terraform-community-modules</a></li>
<li><a href="https://github.com/terraform-aws-modules">terraform-aws-modules</a></li>
<li><a href="https://github.com/antonbabenko/terraform-aws-devops">Various links to Anton&rsquo;s Terraform, AWS, and DevOps projects</a></li>
<li><a href="https://github.com/antonbabenko/pre-commit-terraform">Collection of git hooks for Terraform</a></li>
<li><a href="https://serverless.tf">Doing serverless with Terraform</a></li>
<li><a href="https://www.terraform-best-practices.com/">www.terraform-best-practices.com</a></li>
<li><a href="https://www.youtube.com/channel/UCGH0yYPvlCN1VjSFMGVmFgQ">Your weekly dose of #Terraform by Anton</a></li>
<li><a href="https://weekly.tf/">Terraform weekly</a></li>
<li><a href="https://twitter.com/antonbabenko">Anton on Twitter (@antonbabenko)</a></li>
<li><a href="https://lepiter.io/feenk/developers-spend-most-of-their-time-figuri-9q25taswlbzjc5rsufndeu0py/">Developers spend most of their time figuring the system out</a></li>
<li><a href="https://diagrams.mingrammer.com/">Diagram as Code</a></li>
<li><a href="https://github.com/contentful-labs/terraform-diff">terraform-diff</a></li>
<li><a href="https://github.com/terraform-compliance/cli">terraform-compliance</a></li>
<li><a href="https://localstack.cloud/">LocalStack</a></li>
</ul>
<h2 id="keynote-journey-to-auto-devsecops-at-nasdaq--benjamin-wolf">Keynote: Journey to Auto-DevSecOps at Nasdaq &mdash; Benjamin Wolf</h2>
<p>Why DevOps? It can be pretty complicated to explain, even though it is an
obvious choice for Nasdaq. For most people Nasdaq is a stock exchange but it is
actually a global technology company. Sure, they run the stock exchange, but
also provide capital access platforms and protect the financial system.</p>
<p>Nasdaq develops and delivers solutions (value) for their users. They manage and
operate complex systems. It has been around for a while. So again, why DevOps?
The answer is: to get better at their practices.</p>
<ul>
<li>They want to deliver solutions (value) to their users <strong>efficiently, reliably and safely</strong>.</li>
<li>They want to manage and operate complex systems <strong>efficiently, reliably and safely</strong>.</li>
</ul>
<p>Years ago they had manually configured static servers and the development teams
were growing. They automated software deployment to a point where the product
owners could trigger the deployment, and even pick which branch to deploy. This
was an important <strong>first evolution</strong>. They had a &ldquo;DevOps team&rdquo; to handle this
automation.</p>
<p>The <strong>second evolution</strong> for Nasdaq was moving from a data center to the cloud,
using infrastructure as code (IaC). The question they asked themselves was what
to do first: migrate to cloud or get their data center infrastructure 100%
managed via IaC? They made the ambitious decision to do both at once.</p>
<p>By turning your infrastructure into code, you can create and destroy the
environment as many times as you like. And this was welcome: after about 2100
times they &ldquo;got it right&rdquo; and were able to move the production environment over
to the cloud. Without IaC this would not have gone as flawlessly as it
did.</p>
<p>The cloud and IaC brought them:</p>
<ul>
<li>Efficiency: maintenance, patching, capacity</li>
<li>Reliability: self-healing, immutable</li>
<li>Safety: immutability, destroy/create</li>
</ul>
<p>Over time the DevOps team started to handle a lot more work. The team consisted
of system administrators, but they were required to work as developers (make code
reusable, use git, etc). The DevOps team started to complain about being
overloaded and they became a bottleneck since a lot of development teams came to
them with problems (failing builds, cloud questions).</p>
<p>On the other side, the development teams started to complain because they were
dependent on the DevOps team, but that team had become the bottleneck. And &ldquo;just&rdquo;
scaling up the DevOps team would not solve the problem.</p>
<p>Where the second evolution was about the technology, the <strong>third evolution</strong> was about
efficiency. They moved to a &ldquo;distributed DevOps&rdquo; model. Developers were
empowered: access to logs and metrics, training (cloud, Terraform, Jenkins). By
creating a central observability platform, developers could get insight into what
is going on, without the need to have access to the production environment.</p>
<p>This resulted in more deployments and enhanced reliability of the deployments
because of the observability platform.</p>
<p><img src="/images/all_day_devops_2022_benjamin_wolf_1.webp" alt="Benjamin Wolf about their third evolution"></p>
<p>A year or three later, new cracks appeared. Standards were diverging because
teams were allowed to pick their own path (libraries, databases, pipelines,
Terraform code, etc). It also led to practices that needed to be fixed (e.g.
lack of replication). Standardizing this led to quite a burden at the start of a
project: lots of basic stuff to set up.</p>
<p>Developers needed to be experts in a lot of technology, from JavaScript and the
JavaScript framework in use, via multiple .NET versions to Terraform and other
deployment related tech. An easy way to solve this would have been to flip back
to the previous situation with a single team responsible for deployment and
such. But this would basically mean recreating the bottleneck.</p>
<p>The ownership itself was not the problem. The efficiency was, because of the
boilerplate needed for teams: stuff you want to do the same way across the teams.
They wanted to empower the development teams, but also give them the standards
for databases, messaging, etc. Instead of copying a template and having teams
diverge afterwards, they looked into packaging to also make it easier to update
afterwards.</p>
<p>This led to <strong>evolution four</strong> with marker files, packages, code generators and
auto-devops pipelines. The pipeline looks at the markers (&ldquo;hey, this is a .NET
app&rdquo;) and can then apply a standard pipeline. Nasdaq&rsquo;s code generators create the
boilerplate for the teams so within two minutes of starting with a new
application you&rsquo;re able to write code to solve your business problem, instead of
having to create boilerplate code yourself first.</p>
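<p>The marker file idea could look roughly like the following Python sketch.
The talk did not show the actual marker files or pipeline names, so everything
in <code>MARKERS</code>, as well as the function name <code>select_pipeline</code>,
is hypothetical.</p>
<pre><code class="language-python">from pathlib import Path

# Hypothetical marker files and pipeline names; the real ones used at Nasdaq
# were not shown in the talk.
MARKERS = {
    ".dotnet-app": "dotnet-standard-pipeline",
    ".node-app": "node-standard-pipeline",
}


def select_pipeline(repo_dir):
    """Return the standard pipeline for the first marker file found."""
    root = Path(repo_dir)
    for marker, pipeline in MARKERS.items():
        if (root / marker).is_file():
            return pipeline
    # No marker: the repository has not opted in to a standard pipeline.
    return None


print(select_pipeline("."))
</code></pre>
<p>The appeal of this design is that a team opts in by adding a single file,
while the pipeline definition itself stays in one central, updatable place.</p>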
<p>The developers can get up and running quickly, but in a safe way.</p>
<p>The development teams are all DevOps teams now, but Nasdaq also has a
specialized team for the complex areas (hardware, networking, etc). There is
also a &ldquo;developer experience&rdquo; team that focusses on the tools for the
developers, like the code generators.</p>
<p><img src="/images/all_day_devops_2022_benjamin_wolf_2.webp" alt="Benjamin Wolf about their fourth evolution"></p>
<p>Current status with regard to our three key areas:</p>
<ul>
<li>Efficiency: code generators, CLI, package setup</li>
<li>Reliability: package standards, pipelines</li>
<li>Safety: code scanning, package scanning, disaster recovery preconfigured</li>
</ul>
<h2 id="varieties-of-incident-response--kurt-andersen">Varieties of Incident Response &mdash; Kurt Andersen</h2>
<p>No matter how reliable our systems are, they are never 100% &mdash; an incident can
always happen. When the pager goes off, the first step is to recruit a response
team. This team can then observe what is going on. They need to figure out what
this means (orient themselves) and decide what to do. And finally they can act
to resolve the problem. (The OODA loop.)</p>
<p><img src="/images/all_day_devops_2022_kurt_andersen_1.webp" alt="Kurt Andersen about basic incident response"></p>
<p>Getting the right people involved can be hard for the technical responder; they
themselves might want to dive into the technical stuff first. This is where the
&ldquo;incident commander&rdquo; role comes in. The incident commander will recruit the team
to get the right people involved, coordinate who does what and handle
communication with people outside of the response team. (The latter can also be
handled by a dedicated &ldquo;comms lead&rdquo; if needed.)</p>
<p>But how does the incident commander get involved?</p>
<p>A fairly standard approach for an on-call system will be to have a tiered model:
tier 1 (NOC), tier 2 (people generally familiar with the system) and tier 3 (the
experts of the system having the issue). The problem with this model: where does
the incident commander come from? Tier 1? If so, can the incident commander
follow through if the issue is handed over to the next tier?</p>
<p>Another model (&ldquo;one at a time&rdquo;): team A gets involved, decides it is not their
responsibility, hands over to team B, which kicks it to team C, etc. Where does
the incident commander come from in this model?</p>
<p>The aforementioned models only work in the simplest cases. They share a few big
problems: handoffs are hard and there is no ownership, which results in loss of
context. To mitigate this, some teams have an &ldquo;all hands&rdquo; approach where
everyone is paged and everyone swarms into the incident response. However, most
people on the call (or in the war room) cannot contribute. This leads to a
mentality of &ldquo;how quickly can I get out of here?&rdquo;</p>
<p>Yet another approach is an Incident Command System (ICS), which comes from
emergency services. In this approach the alert goes to the incident commander
who then involves the team. While this works in some organizations, in tech it&rsquo;s
usually a bit too regimented.</p>
<p>The ICS morphed to an &ldquo;adaptive ICS&rdquo; where the technical team has more autonomy,
but the incident commander is still involved. This system can be scaled up to
where there&rsquo;s an &ldquo;area commander&rdquo; role which coordinates separate teams (via
their respective incident commanders).</p>
<p><img src="/images/all_day_devops_2022_kurt_andersen_2.webp" alt="Kurt Andersen with an image of the &ldquo;response trio&rdquo;"></p>
<p>Summarizing the roles of the parties in the &ldquo;response trio&rdquo;:</p>
<ul>
<li>incident commander: coordination</li>
<li>tech team: nitty-gritty to solve the problem</li>
<li>comms lead: maintain contact with stakeholders</li>
</ul>
<p>Each role will perform their own OODA loop from their own perspective.</p>
<p>But we started the story in the middle. We need to get back to the beginning and
ask the question &ldquo;why is the pager making noise?&rdquo; Perhaps the first question one
should ask is: &ldquo;is this something actionable?&rdquo; If it is not or if it is something
you can handle in the morning, perhaps you do not have to respond in the middle
of the night.</p>
<blockquote>
<p>Cut down the noise and focus on the signal.</p></blockquote>
<h2 id="the-e-stands-for-enablement---modern-sre-at-pagerduty--paula-thrasher">The E Stands for Enablement - Modern SRE at PagerDuty &mdash; Paula Thrasher</h2>
<p>PagerDuty had an idea what SRE meant: they were enablers. You can hit them up on
Slack and they help you out with a problem. Having an SRE team initially reduced
the total minutes in incident in a year. But when PagerDuty grew further, the
number went up again. Oops.</p>
<p><img src="/images/all_day_devops_2022_paula_thrasher.webp" alt="Paula Thrasher about agreeing on who is responsible for what"></p>
<p>The &ldquo;get well&rdquo; project required teams to have the following:</p>
<ul>
<li>Fast rollbacks: all rollbacks have to happen within 5 minutes. The SRE team
provided guidelines to help teams achieve this.</li>
<li>Canary deploys: teams must test changes (canary) first. The SRE team enabled
this via tooling.</li>
<li>Product limits: reasonable limits. The SRE team made this possible via
telemetry and monitoring.</li>
</ul>
<p>Results:</p>
<ul>
<li>Lower number of minutes in major incident</li>
<li>Lower mean time to resolve</li>
<li>More &ldquo;incidents averted&rdquo;: four minor incidents were resolved before they
caused real harm.</li>
</ul>
<p>Important elements that made this &ldquo;get well&rdquo; project possible:</p>
<ul>
<li>Clear, measurable goals</li>
<li>Enabled by tools and templates</li>
<li>Gave teams a path to do it</li>
<li>Experts were available to coach</li>
<li>The teams owned the implementation themselves</li>
<li>The teams were held accountable</li>
</ul>
<p>PagerDuty used <a href="https://backstage.io">Backstage</a> as an internal &ldquo;one stop shop&rdquo;
developer portal with documentation and insights. It also integrates with the
development systems.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[All Day DevOps 2021]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2021/10/28/all-day-devops-2021/" type="text/html" />
    <id>https://markvanlent.dev/2021/10/28/all-day-devops-2021/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="chaos engineering" />
    <category term="conference" />
    <category term="costs" />
    <category term="culture" />
    <category term="devops" />
    <category term="infrastructure as code" />
    <category term="terraform" />
    
    <updated>2021-12-09T21:14:34Z</updated>
    <published>2021-10-28T00:00:00Z</published>
    <content type="html"><![CDATA[<p><a href="https://www.alldaydevops.com/">All Day DevOps</a>, the free online DevOps
conference that goes on for 24 hours, was held for the 6th time today. With a
total of 180 speakers spread over 6 tracks, there&rsquo;s even more content than <a href="/2019/11/06/all-day-devops/">the
last time I attended</a>. These are the notes I
took during the day.</p>
<figure><img src="/images/all_day_devops_2021_keynote.png"
    alt="First slide of the kick-off session with Derek Weeks opening the day"><figcaption>
      <p>Derek Weeks opening the day for All Day DevOps</p>
    </figcaption>
</figure>

<h2 id="keynote-cams-then-and-now-and-the-path-forward--thomas-vikingops-krag">Keynote: CAMS Then and Now and the Path Forward &mdash; Thomas &ldquo;VikingOps&rdquo; Krag</h2>
<p>When people think about DevOps, they think of CAMS, which stands for:</p>
<ul>
<li><strong>C</strong>ulture</li>
<li><strong>A</strong>utomation</li>
<li><strong>M</strong>easurement</li>
<li><strong>S</strong>haring</li>
</ul>
<p>The term CAMS was formalized by John Willis and he wrote down his idea in the
article <a href="https://www.chef.io/blog/what-devops-means-to-me">What Devops Means to Me</a>.
This talk will focus on the most important aspect: culture, and specifically
organizational learning.</p>
<h3 id="culture">Culture</h3>
<p>Culture as described by Simon Sinek:</p>

  <figure>

<blockquote >
A group of people with a common set of values and beliefs. When we&rsquo;re
surrounded by people who believe what we believe something remarkable happens.
Trust emerges.
</blockquote>

  <figcaption>
    &mdash;Simon Sinek
  </figcaption>
  </figure>


<p><img src="/images/all_day_devops_2021_organizational_learning_thomas_krag.png" alt="Thomas Krag talks about organizational learning"></p>
<p>The 5 disciplines from <a href="https://en.wikipedia.org/wiki/Learning_organization">organizational learning</a>:</p>
<dl>
<dt>Shared vision</dt>
<dd>Have people play the game not just because of the rules, but because they feel
responsible for the game.</dd>
<dt>Mental Models</dt>
<dd>Challenging your own assumptions and understanding how they can influence
your actions. You need to self-reflect on your own beliefs.</dd>
<dt>Personal Mastery</dt>
<dd>This is about owning yourself, learning and achieving your personal goals. And
to do the latter you first need to define what is important for yourself.</dd>
<dt>Team learning</dt>
<dd>Effective teamwork leads to results that persons cannot achieve on their own.
And individuals working as a team can learn faster than they would by themselves.</dd>
<dt>Systems Thinking</dt>
<dd>Looking into patterns that emerge inside your organization from a
holistic viewpoint instead of just looking at your own team.</dd>
</dl>
<h3 id="automation">Automation</h3>
<p>Automation is about automating the right things. You don&rsquo;t want to do it just to
automate things, but it has to fit in the system. So you have to start with
culture before you think about automation. The ultimate goal is
<a href="https://www.gitops.tech/">GitOps</a> where everything happens via a pull request.</p>
<h3 id="measurement">Measurement</h3>
<p>Measurement started out with a focus on the tooling. But
<strong>measuring how you work</strong> is as important as measuring your infrastructure.</p>
<p>Important key metrics, from <a href="https://www.devops-research.com/research.html">DORA</a>:</p>
<ul>
<li>deployment frequency (how often do you release to production)</li>
<li>lead time for changes (the amount of time for a change to get into production)</li>
<li>time to restore service (how long does it take to recover from a failure in
production)</li>
<li>change failure rate (the percentage of deployments causing a failure in
production)</li>
</ul>
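<p>As a rough illustration of how these four metrics could be derived from deployment records, here is a minimal Python sketch. The <code>Deployment</code> record and its fields are hypothetical (they do not come from the talk or from DORA), and restore time is assumed to start counting at the moment of deployment.</p>

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record; a real setup would read this from CI/CD and incident data.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments, period_days):
    """Return the four key metrics for a list of Deployment records."""
    if not deployments:
        raise ValueError("need at least one deployment")
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Assumption: the failure starts when the faulty change is deployed.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": total / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / total,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
        "change_failure_rate": len(failures) / total,
    }


t0 = datetime(2021, 10, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_at=t0 + timedelta(hours=9)),
]
metrics = dora_metrics(sample, period_days=1)
```

<p>With this sample data, half of the deployments caused a failure, which would be a clear signal to look at the process rather than at the tooling.</p>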
<h3 id="sharing">Sharing</h3>
<p>Share how you are doing (in) DevOps. This is how we ended up here now.
Share how you are improving your own organization. But also share information
<em>within</em> your organization: documentation, videos, presentations,
<a href="https://devopsdays.org/open-space-format/">open spaces</a>,
<a href="https://leancoffee.org/">lean coffee sessions</a>.</p>
<h3 id="three-ways">Three ways</h3>
<p>CAMS originated in 2010. The
<a href="https://itrevolution.com/the-three-ways-principles-underpinning-devops/">three ways</a>
are principles that came out of the book
<a href="https://itrevolution.com/the-phoenix-project/">The Phoenix Project</a>. They are:</p>
<ul>
<li>1st way: create flow (systems thinking)</li>
<li>2nd way: feedback loops</li>
<li>3rd way: experimentation, risks, learning</li>
</ul>
<p>If we map these to CAMS:</p>
<ul>
<li>Culture: 3rd way</li>
<li>Automation: 1st way</li>
<li>Measurement: 2nd way</li>
<li>Sharing: 3rd way</li>
</ul>
<p>Working on your culture is as important as doing your actual work.</p>
<p>So what&rsquo;s next? CAMS is still as applicable as 10 years ago. It has always been
important, but it was only put into words in 2010. We need to continue sharing
to get the full value out of it.</p>
<h2 id="gamification-of-chaos-testing--bram-vogelaar">Gamification of Chaos Testing &mdash; Bram Vogelaar</h2>
<p>We think of <a href="https://en.wikipedia.org/wiki/Usain_Bolt">Usain Bolt</a> as the record
breaking athlete, but he&rsquo;s also the person that worked really hard to get there.
It takes a lot of time and effort to become good at something. Pilots and firemen
spend most of their time training and not doing what you expect them to do; just
to make sure they perform well under pressure. Also note that pilots use a lot
of checklists to prevent mistakes.</p>
<p>We should do the same: train for when our platform is in an error state. We
should not just be able to detect it, but also solve the problem.</p>
<p>Chaos engineering is <q>the discipline of experimenting on a distributed system
in order to build <strong>confidence</strong> in the system&rsquo;s capability to
withstand turbulent conditions <strong>in production</strong>.</q> This
practice started at Netflix with <a href="https://github.com/Netflix/chaosmonkey">Chaos Monkey</a>.</p>
<p><img src="/images/all_day_devops_2021_chaos_engineering_ecosystem_bram_vogelaar.png" alt="Bram Vogelaar showing an image of the vast CNCF cloud native landscape with a whole section of chaos engineering products"></p>
<p>We need to become comfortable with experimenting. Have game day exercises and
analyze what happened, to improve your training. Do not just focus on the result
of the exercise itself, but also ask questions like &ldquo;was it the right
experiment?&rdquo;</p>
<p>Now that we use containers, add sidecar containers with tools to get metrics or
detect errors. Or to do chaos engineering e.g. with
<a href="https://github.com/Shopify/toxiproxy">Toxiproxy</a>.</p>
<p>Since checklists are boring, we can use gamification to spice things up.
Celebrate failure, and learn from it!</p>
<blockquote>
<p>Living in the year 3000: breaking production on purpose on Saturdays and have
the system remedy the problem itself.</p></blockquote>
<p>Convince management that failure is <em>normal</em> and <em>expected</em> behaviour. Promising
100% uptime is not realistic. Large, complex systems will always be in a
(somewhat) degraded state.</p>
<p>Let engineers be scientists to deal with this complex environment. Give them
training, allow them to do tests (experiments), which results in having valid
monitoring that leads to actionable alerts. Get the engineers in a state where
they are comfortable with failures.</p>
<p>(<a href="https://www.slideshare.net/attachmentgenie">Slides</a>)</p>
<h2 id="watch-your-wallet-cost-optimizations-in-aws--renato-losio">Watch Your Wallet! Cost Optimizations in AWS &mdash; Renato Losio</h2>
<p>Each AWS service has its own price components. It&rsquo;s a complex subject. Even a
simple service like a load balancer has multiple components. It looks simple
with the &ldquo;$0.008 per LCU-hour&rdquo; price tag, but now you have to figure out what an
LCU-hour is. Then you learn it has four dimensions that are measured: number of
new connections per second, active connections, processed bytes and rule
evaluations. Good luck predicting the costs.</p>
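<p>To make the LCU-hour a bit more tangible, here is a small Python sketch of the calculation. The per-LCU capacities below are my assumption of what AWS documents for an Application Load Balancer, as is the rule that only the dimension with the highest usage is billed; check the current pricing page before relying on any of these numbers.</p>

```python
# Assumed capacity of a single LCU per dimension (not taken from the talk).
LCU_CAPACITY = {
    "new_connections_per_second": 25,
    "active_connections_per_minute": 3000,
    "processed_gb_per_hour": 1.0,
    "rule_evaluations_per_second": 1000,
}
PRICE_PER_LCU_HOUR = 0.008  # the price tag quoted above, in USD


def lcus_consumed(usage):
    """Return the LCUs for one hour: the highest of the four dimensions."""
    return max(usage[name] / capacity for name, capacity in LCU_CAPACITY.items())


def hourly_lcu_cost(usage):
    """Return the LCU part of the bill for one hour, in USD."""
    return lcus_consumed(usage) * PRICE_PER_LCU_HOUR


usage = {
    "new_connections_per_second": 50,       # 2 LCUs
    "active_connections_per_minute": 3000,  # 1 LCU
    "processed_gb_per_hour": 4.0,           # 4 LCUs, the dominant dimension
    "rule_evaluations_per_second": 500,     # 0.5 LCU
}
cost = hourly_lcu_cost(usage)
```

<p>The hard part in practice is not this arithmetic, but knowing the four usage numbers before the bill arrives.</p>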
<p>This presentation only sticks to the basics since cost optimization is such a
big topic.</p>
<p>To start to manage/reduce your costs, you need to give your DevOps team access
to billing and cost information. If you cannot measure your costs, you cannot manage them. Note
that you do not need an excessive amount of tags for cost management. First you
need to figure out what you are going to change and how it&rsquo;s going to affect the
bill for your company.</p>

  <figure>

<blockquote >
When savings can be measured, they can be recognized, and <strong>cost efficiency
projects become exciting opportunities</strong>. As of early 2021, the most viewed
dashboard at Airbnb is a dashboard of AWS costs.
</blockquote>

  <figcaption>
    &mdash;Anna Matlin, Airbnb
  </figcaption>
  </figure>


<p>What patterns can we avoid? In most organizations, the most expensive parts of your
bill will be:</p>
<ul>
<li>Compute</li>
<li>Storage</li>
<li>Data transfer</li>
</ul>
<p>So we will dive into these subjects.</p>
<h3 id="compute">Compute</h3>
<p>Tips to reduce costs:</p>
<ul>
<li>Avoid fixed IP addresses where possible. This is not so much about the costs
of the IP address itself, but mostly because architectures that require a
fixed IP tend to be complex and more expensive.</li>
<li>Use Graviton (ARM) instances. They have a better price to performance ratio.
You can also migrate managed services (RDS, Lambda functions) to Graviton.</li>
<li>Check out Lightsail. It&rsquo;s less flexible than using EC2 instances, but cheaper.</li>
</ul>
<h3 id="data-transfer">Data transfer</h3>
<p>We usually forget to take data transfer costs into account upfront. It&rsquo;s also a
complex subject and there are a lot of considerations to make.</p>
<p><img src="/images/all_day_devops_2021_aws_data_transfer_costs_renato_losio.png" alt="Renato Losio shows an image with all kind of ways data transfer can cost you money"></p>
<p>Tips:</p>
<ul>
<li>Multi AZ is always good, but are 3 zones always better than 2 zones?</li>
<li>Multi region is easy to do nowadays, but expensive with regard to data
transfer. Ask yourself if you really need it.</li>
<li>If you want to use multiple cloud providers always think about the data
transfer costs. Perhaps the storage costs are lower at provider X, but if you
have to transfer data, you&rsquo;ll probably pay an egress cost.</li>
<li>Using a CDN (CloudFront) might be cheaper than paying for the egress data
transfer otherwise.</li>
</ul>
<h3 id="storage">Storage</h3>
<p>Initially it was simple: there were only two storage classes. Currently there
are 6 different classes with their own prices and characteristics.</p>
<p><img src="/images/all_day_devops_2021_s3_storage_classes_renato_losio.png" alt="Renato Losio shows an image with the various S3 storage classes and their use cases"></p>
<p>The best option is to use lifecycle rules to move data between different
classes. Note that in a lifecycle policy you cannot filter the objects based
on an extension (e.g. <code>*.jpg</code>) but instead you need to think &ldquo;from left to
right.&rdquo; So you need to think upfront about the prefixes you are going to want to
use in your bucket.</p>
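<p>A sketch of what &ldquo;thinking from left to right&rdquo; means in practice. The dictionary below follows the shape that, as far as I know, boto3 expects for a bucket lifecycle configuration; the rule name, the prefix and the number of days are made up for this example. The helper function is purely illustrative and only mimics the prefix matching.</p>

```python
# Hypothetical lifecycle configuration: rules select objects by key prefix.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-images",
            "Filter": {"Prefix": "images/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def rules_for_key(key, configuration):
    """Return the IDs of the enabled rules whose prefix matches the object key."""
    return [
        rule["ID"]
        for rule in configuration["Rules"]
        if rule["Status"] == "Enabled"
        and key.startswith(rule["Filter"].get("Prefix", ""))
    ]


matched = rules_for_key("images/2021/cat.jpg", lifecycle_configuration)
not_matched = rules_for_key("uploads/2021/cat.jpg", lifecycle_configuration)
```

<p>The second key also ends in <code>.jpg</code>, but since it does not start with <code>images/</code> the rule does not apply to it.</p>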
<p>Managed services can offer you automatic and manual backups, which can be great.
But what is the cost of that? Check how much retention you need for example.</p>
<p>With regard to EBS: for most use cases <code>gp3</code> is better and cheaper than <code>gp2</code>
(except for very large volumes). Note that you can change the EBS volume type
without stopping the machine. In most cases you can have a 20% cost saving
without affecting your performance.</p>
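<p>The 20% and the &ldquo;except for very large volumes&rdquo; caveat can be illustrated with a bit of arithmetic. All prices and limits in this Python sketch are assumptions on my part (roughly the list prices of a US region at the time of writing), and it only looks at IOPS and ignores throughput.</p>

```python
# Assumed prices in USD per month; verify against the current AWS price list.
GP2_PRICE_PER_GB = 0.10
GP3_PRICE_PER_GB = 0.08
GP3_INCLUDED_IOPS = 3000
GP3_PRICE_PER_EXTRA_IOPS = 0.005


def gp2_baseline_iops(size_gb):
    """gp2 performance scales with size: assumed 3 IOPS per GB, 100 to 16000."""
    return max(100, min(16000, 3 * size_gb))


def gp2_monthly_cost(size_gb):
    return size_gb * GP2_PRICE_PER_GB


def gp3_monthly_cost(size_gb, iops=GP3_INCLUDED_IOPS):
    """gp3 charges for storage plus any IOPS above the included baseline."""
    extra_iops = max(0, iops - GP3_INCLUDED_IOPS)
    return size_gb * GP3_PRICE_PER_GB + extra_iops * GP3_PRICE_PER_EXTRA_IOPS


def saving_when_matching_iops(size_gb):
    """Fraction saved by moving to gp3 while keeping at least the gp2 IOPS."""
    needed_iops = max(GP3_INCLUDED_IOPS, gp2_baseline_iops(size_gb))
    old = gp2_monthly_cost(size_gb)
    new = gp3_monthly_cost(size_gb, needed_iops)
    return (old - new) / old


small_volume_saving = saving_when_matching_iops(500)
large_volume_saving = saving_when_matching_iops(5000)
```

<p>Under these assumptions the 500 GB volume saves the full 20%, while the 5000 GB volume saves noticeably less because the extra IOPS have to be paid for separately.</p>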
<h3 id="aws-changes-quickly">AWS changes quickly</h3>
<p>After each AWS re:Invent, your deployment is probably outdated with regard to cost
optimizations. Examples:</p>
<ul>
<li>Use <code>gp3</code> instead of <code>gp2</code> for your EBS volumes.</li>
<li>The instance type <code>m6</code> might be more interesting than the <code>m5</code> or <code>m4</code> you may
currently be using.</li>
<li>Dublin was the cheapest region in Europe, but for most things Stockholm is
cheaper than Dublin at the moment.</li>
</ul>
<p>So keep up to date with the offerings.</p>
<h2 id="keynote-call-for-code-with-the-linux-foundation-contributing-to-tech-for-good-even-if-youre-not-techincal--daniel-krook-and-demi-ajayi">Keynote: Call for Code with The Linux Foundation: Contributing to Tech-for-Good Even if You&rsquo;re Not Technical &mdash; Daniel Krook and Demi Ajayi</h2>
<p>Call for Code is a multi year program launched in 2018 to address humanitarian
issues and help bridge potential solutions. Last year the global challenge was
around climate change and a track was added for the social and business impact
of the COVID-19 pandemic.</p>
<p>It&rsquo;s not just about generating ideas to take on the issues. It should
eventually also lead to an adopted open source solution that is sustainable.</p>
<p>The 14 projects discussed today can be found on the <a href="https://linuxfoundation.org/projects/call-for-code/">Call for Code page on the Linux Foundation website</a>.
You can also read about them via the <a href="https://developer.ibm.com/callforcode/">IBM developer site</a>.
GitHub is central for how they iterate on the features. The related organizations are:</p>
<ul>
<li><a href="https://github.com/call-for-code">Call for Code with the Linux Foundation</a></li>
<li><a href="https://github.com/call-for-code-for-racial-justice">Call for Code for Racial Justice</a></li>
</ul>
<p>The key takeaway for this session is to make us help improve how the projects do
DevOps. The goal is to ensure that everyone can contribute to the projects, can do so
with confidence and the projects can be deployed with speed.</p>
<p>The Call for Code for Racial Justice open source projects are categorised in three pillars:</p>
<ul>
<li>Police reform and judicial accountability:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/fairchange">Fair Change</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Open-Sentencing/">Open Sentencing</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Incident-Accuracy-Reporting-System">Incident Accuracy Reporting System (IARS)</a></li>
</ul>
</li>
<li>Diverse representation:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/TakeTwo">Take Two</a></li>
</ul>
</li>
<li>Policy and legislative reform:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Legit-Info/blob/main/README.md">Legit-Info</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Five-Fifths-Voter">Five Fifths Voter</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Truth-Loop">Truth Loop</a></li>
</ul>
</li>
</ul>
<p>Demi talked about each of these projects in the program and their tech stacks. You
can read more about them on the
<a href="https://developer.ibm.com/callforcode/racial-justice/">Call for Code for Racial Justice section</a>
on the IBM developer site.</p>
<p>Daniel in turn talked about other Call for Code projects:</p>
<ul>
<li><a href="https://clusterduckprotocol.org/">ClusterDuck</a></li>
<li><a href="https://pyrrha-platform.org/">Pyrrha</a></li>
<li><a href="https://www.isac-simo.net/">ISAC-SIMO</a></li>
<li><a href="https://openeew.com/">OpenEEW</a></li>
<li><a href="https://github.com/Call-for-Code/Liquid-Prep">Liquid Prep</a></li>
<li><a href="https://github.com/Call-for-Code/DroneAid">DroneAid</a></li>
<li><a href="https://github.com/Rend-o-matic">Rend-o-matic</a></li>
</ul>
<p>Even if you cannot code, there are numerous ways you can contribute, e.g.
conducting user research, writing/reviewing documentation, doing design work, or
advocacy like speaking at conferences.</p>
<p><img src="/images/all_day_devops_2021_call_for_code_non_technical_contributions_demi_ajayi.png" alt="Demi Ajayi talking about non-technical ways to contribute to Call for Code"></p>
<p>There are multiple ways to get involved:</p>
<ul>
<li>Via the virtual community (<a href="https://callforcode.org/slack">join slack</a>, join events)</li>
<li>Directly via the projects on GitHub</li>
<li>Conduct outreach (recruit friends, host events, etc)</li>
</ul>
<h2 id="devsecops-culture-laughing-through-the-failures--chris-romeo">DevSecOps Culture: Laughing Through the Failures &mdash; Chris Romeo</h2>
<p>Chris shared 10 DevSecOps failures and talked about how to change the culture
and turn these failures into successes.</p>
<h3 id="1-name-and-brand">#1 Name and brand</h3>
<p>This quote nicely sums it up:</p>

  <figure>

<blockquote cite="https://twitter.com/jvehent">
<p>Can we stop the {Sec}Dev{Sec}Ops{Sec} naming foolishness?</p>
<p>Just call it DevOps and focus on making security a natural part of building stuff.</p>

</blockquote>

  <figcaption>
    &mdash;Julien Vehent, <cite><a href="https://twitter.com/jvehent">@jvehent</a></cite>
  </figcaption>
  </figure>


<p>To change the culture:</p>
<ul>
<li>Embrace security and make it part of DevOps</li>
<li>Teach this new definition to everyone involved with your project (developers,
testers, product managers, executives, etc)</li>
<li>Create a &ldquo;DevSecOps&rdquo; swear jar ;-)</li>
</ul>
<h3 id="2-the-infinity-graph">#2 The infinity graph</h3>
<p><img src="/images/all_day_devops_2021_infinity_loop_chris_romeo.png" alt="Chris Romeo shows the often used infinity symbol"></p>
<p>The problem is that nothing ever gets done with this infinity graph. It&rsquo;s not an
accurate representation.</p>
<p>The solution is to talk about pipelines instead and integrating security into
them. Code review should include security, vulnerability scanning should be part
of the pipeline, etc. Ban the infinity graph.</p>
<h3 id="3-security-as-a-special-team">#3 Security as a special team</h3>
<p>Creating a specific security team is the <em>opposite</em> of what DevOps is about. It&rsquo;s about
working together. Security isn&rsquo;t a specialty, it is the responsibility of
everybody. This requires knowledge and expertise.</p>
<p>The other way around is also true: teach security people to code. They don&rsquo;t
have to become great coders, but it would be nice if they can review code and
make suggestions to make things more secure.</p>
<h3 id="4-vendor-defined-devops">#4 Vendor defined DevOps</h3>
<p>Sometimes we let vendors define what DevOps and security are for us, via the
products that they offer. It would be better to find the best of breed outside
of the offering of your cloud provider. Take a vendor independent approach and
determine what DevOps means to <strong>you</strong>.</p>
<h3 id="5-big-company-envy">#5 Big company envy</h3>
<p>Looking at the big companies can be discouraging. You most likely have not
invested the same time in it as e.g. Netflix, Etsy, etc. have. So while you
won&rsquo;t be at the same level, don&rsquo;t see this as an excuse to give up. Do the
DevOps that you do. Don&rsquo;t fixate on the top of the class. Get on that path and
make incremental progress.</p>
<p>Use the <a href="https://owasp.org/www-project-devsecops-maturity-model/">OWASP DevSecOps Maturity Model</a> to create a roadmap.</p>
<h3 id="6-overcomplicated-pipelines-and-doing-everything-now">#6 Overcomplicated pipelines and doing everything now</h3>
<p>This can be a complicated subject (see for example the
<a href="https://blog.sonatype.com/2020-devsecops-reference-architecture">DevSecOps Reference Architecture</a>
from Sonatype).</p>
<p><img src="/images/all_day_devops_2021_sonatype_reference_architecture_chris_romeo.png" alt="Chris Romeo shows the Sonatype DevSecOps reference architecture"></p>
<p>But keep it simple! Start with a small subset of security tools. Everybody
related to your project should be able to explain the build pipeline. Don&rsquo;t try
to solve all problems immediately. Take a phased approach.</p>
<h3 id="7-security-as-gatekeeper">#7 Security as gatekeeper</h3>
<p>Security might want to slow down the pipeline and act as a gatekeeper. Don&rsquo;t say
&ldquo;<em>no</em>,&rdquo; but &ldquo;<em>yes, if&hellip;</em>&rdquo; For example: &ldquo;<em>yes</em>, that would be a great feature <em>if</em> you
enable multifactor authentication.&rdquo;</p>
<p>Practice empathy. Both security people for developers, but also the other way
around.</p>
<h3 id="8-noisy-security-tools">#8 Noisy security tools</h3>
<p>You buy a tool and enable every option to &ldquo;get your money&rsquo;s worth.&rdquo; The result
is 10,000 JIRA tickets of things that need to be fixed. This does not help.</p>
<p>It would be better to tune the tools and not waste time with security findings
that do not matter.</p>
<p>Start with a minimal policy focussing on the largest issue. Developers will then
start to trust the tool and <em>then</em> you can slowly increase the policy.</p>
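<p>A minimal Python sketch of such a policy that starts strict and is widened later. The severity names and the shape of a finding are hypothetical; real tools have their own formats and configuration options.</p>

```python
# Hypothetical severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def findings_to_report(findings, minimum_severity):
    """Keep only the findings at or above the minimum severity of the policy."""
    threshold = SEVERITY_ORDER.index(minimum_severity)
    return [
        finding
        for finding in findings
        if SEVERITY_ORDER.index(finding["severity"]) >= threshold
    ]


findings = [
    {"id": "F-1", "severity": "critical"},
    {"id": "F-2", "severity": "low"},
    {"id": "F-3", "severity": "high"},
]

# Start with the minimal policy; widen it once developers trust the tool.
initial_report = findings_to_report(findings, "critical")
later_report = findings_to_report(findings, "high")
```

<p>The point is that the threshold is a deliberate choice that changes over time, not a default that came with the tool.</p>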
<h3 id="9-lack-of-threat-modelling">#9 Lack of threat modelling</h3>
<p>Your scanning tools cannot find business logic flaws; there&rsquo;s no pattern to it.</p>
<p><img src="/images/all_day_devops_2021_threat_modelling_chris_romeo.png" alt="Chris Romeo shows the condescending Wonka meme with the text &ldquo;DevOps is too quick for threat modeling you say? &hellip; Thousands of business logic flaws beg to differ&rdquo;"></p>
<p>Perform threat modelling outside of the pipeline. It should be done when new
feature assignments go out.</p>
<h3 id="10-vulnerable-code-in-the-wild">#10 Vulnerable code in the wild</h3>
<p>There are lots of vulnerabilities in open source software; this is a supply
chain problem.</p>
<p>To improve this: embed software composition analysis (SCA) in <strong>all</strong> your
pipelines. Set the SCA policy to fail when a vulnerability is detected. If you
filter out a vulnerability (e.g. because there&rsquo;s no fix yet), make sure the
filter will not be active forever.</p>
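<p>The idea of a filter that does not stay active forever could look something like the Python sketch below. The vulnerability identifiers and the structure of the ignore list are made up; most SCA tools have their own way to express an expiry date.</p>

```python
from datetime import date


def blocking_vulnerabilities(detected, ignores, today):
    """Return the vulnerabilities that should break the build.

    `ignores` maps a vulnerability id to the last day the ignore is valid.
    """
    blocking = []
    for vulnerability_id in detected:
        expires = ignores.get(vulnerability_id)
        # Not ignored at all, or the ignore has expired: break the build.
        if expires is None or today > expires:
            blocking.append(vulnerability_id)
    return blocking


detected = ["VULN-1", "VULN-2"]            # hypothetical scan result
ignores = {"VULN-2": date(2021, 12, 31)}   # no fix available yet, revisit later

before_expiry = blocking_vulnerabilities(detected, ignores, date(2021, 11, 1))
after_expiry = blocking_vulnerabilities(detected, ignores, date(2022, 1, 15))
```

<p>Once the expiry date has passed, the ignored vulnerability breaks the build again, which forces someone to check whether a fix has become available.</p>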
<h3 id="successes">Successes</h3>
<p>The 10 DevOps successes we can distil from the failures above:</p>
<ol>
<li>Just call it DevOps</li>
<li>Pipelines</li>
<li>Security for everyone</li>
<li>Embrace <strong>your</strong> DevOps</li>
<li>Be content with <strong>your</strong> DevOps</li>
<li>Simple and staged pipeline</li>
<li>Security as trusted partner</li>
<li>Tuned and valuable security tools</li>
<li>Threat modelling for everyone</li>
<li>Breaking the build for vulnerabilities</li>
</ol>
<h3 id="key-takeaways">Key takeaways</h3>
<ul>
<li>Hopefully the real impact of DevOps (when looking back 50 years from now) is
going to be security culture related.</li>
<li>Some of the failures were outside of your control, but you can change them.</li>
<li>Everybody has to code, choose <strong>your</strong> DevOps, keep it simple, lower the
noise, add the best practices we discussed, do some threat modelling, break
the build for vulnerabilities.</li>
<li>Embrace and laugh at the failures.</li>
</ul>
<h2 id="keynote-managing-risk-with-service-level-objectives-and-chaos-engineering--liz-fong-jones">Keynote: Managing Risk with Service Level Objectives and Chaos Engineering &mdash; Liz Fong-Jones</h2>
<p>Besides being a principal developer advocate at Honeycomb, Liz is also a member
of the platform on-call rotation. Honeycomb deploys with confidence up to 14
times a day, every day of the week&mdash;so also on Fridays. How do they manage to
(mostly) meet their Service Level Objectives (SLOs) while also scaling out their
user traffic?</p>
<p>Their confidence recipe:</p>
<ul>
<li>Quantify the amount of reliability.</li>
<li>Be able to identify risk areas that might prevent them from fulfilling their
targets.</li>
<li>Test to verify that the assumptions about the systems are correct.</li>
<li>Respond to that feedback to address the problems found via those experiments
or through natural outages.</li>
</ul>
<h3 id="how-to-measure-reliability">How to measure reliability?</h3>
<p>You need to know how broken is &ldquo;too broken.&rdquo; You don&rsquo;t have to alert on all
problems when working at scale. You need to measure success of the service and
define SLOs. These are a way to measure and quantify your reliability.</p>
<p>Honeycomb&rsquo;s job is to reliably ingest telemetry, index it, store it safely and
let people query it in near-real-time. Honeycomb&rsquo;s SLOs measure the things
that their customers care about.</p>
<p>For example, they have set an SLO that the homepage needs to load quickly
(within a few hundred milliseconds) 99.9% of the time. User queries need to
run successfully &ldquo;only&rdquo; 99% of the time and are allowed to take up to 10 seconds.
On the other hand: ingestion needs to succeed 99.99% of the time since they
only have one shot at it.</p>
<blockquote>
<p>Services are not just 100% down or 100% up (most of the time).</p></blockquote>
<p>These metrics help Honeycomb make decisions about reliability and product
velocity. If the service is down too much, they need to invest in reliability
(since having features that cannot be used does not add value). On the other
hand: if they exceed the SLO, they can move faster.</p>
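<p>The trade-off can be expressed as an error budget. This is a minimal Python sketch; the event counts are invented and only the SLO targets are the ones mentioned above.</p>

```python
def error_budget(slo_target, total_events, failed_events):
    """Summarise how much of the error budget is left for a period."""
    allowed_failures = (1 - slo_target) * total_events
    remaining = allowed_failures - failed_events
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        # Budget left: ship features. Budget burnt: work on reliability.
        "focus": "features" if remaining > 0 else "reliability",
    }


# Hypothetical traffic numbers for one period.
ingest = error_budget(slo_target=0.9999, total_events=1_000_000, failed_events=250)
queries = error_budget(slo_target=0.99, total_events=1_000_000, failed_events=2_500)
```

<p>Note that the ingest path has ten times fewer failures than the query path in this example, yet it is the one that is out of budget because its target is much stricter.</p>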
<h3 id="recipe-for-shipping-reliably-and-quickly">Recipe for shipping reliably and quickly</h3>
<p>Practices used by Honeycomb:</p>
<ul>
<li>Code is instrumented. Think beforehand what a success or failure in
production would look like.</li>
<li>Functional and visual testing using libraries that create snapshots.</li>
<li>They design for feature flag deployment (only making a feature available to a
small percentage of users initially).</li>
<li>Practice automated integration plus human review. (There&rsquo;s an SLO for how
long the tests are allowed to take. Code reviews are high priority tasks
since you are blocking someone else by not reviewing their code.)</li>
<li>The main branch is safe to release any time.</li>
<li>Automatically roll out the changes.</li>
<li>After the deployment the engineers have to observe what is happening with
their code in production. Only after they confirm that everything is okay,
they can go home. So you are free to deploy on Friday at 19:30 as long as you
are willing to stick around until your code is deployed and you have confirmed
it is not causing issues.</li>
</ul>
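<p>Making a feature available to a small percentage of users is often done by hashing the user into a stable bucket. The Python sketch below shows that general technique; it is not how Honeycomb or any specific feature flag product implements it, and the flag and user names are made up.</p>

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percentage):
    """Decide whether a flag is on for a user, stable across calls.

    The user is hashed into a bucket from 0 to 99; the flag is on when the
    bucket falls below the rollout percentage.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return rollout_percentage > bucket


# Hypothetical flag: start with 10% of the users, then widen to 50%.
early_users = [u for u in range(200) if is_enabled("new-query-engine", str(u), 10)]
later_users = [u for u in range(200) if is_enabled("new-query-engine", str(u), 50)]
```

<p>Because the bucket only depends on the flag and the user, everyone who had the feature at 10% still has it at 50%, and setting the percentage back to 0 turns it off for everybody.</p>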
<p>For infrastructure Honeycomb also uses infrastructure as code practices:</p>
<ul>
<li>They can use CI and feature flags for their infrastructure.</li>
<li>They can automatically provision fleets if needed.</li>
<li>They can automatically quarantine certain paths to keep the main fleet from
crashing or do performance profiling.</li>
</ul>
<h3 id="chaos-engineering">Chaos Engineering</h3>
<p>Left-over error budget is used for chaos engineering experiments. This is
something where you go test a hypothesis. You need to control the percentage of
users affected by it and be able to revert the impact you are causing.</p>
<blockquote>
<p>Chaos engineering is engineering. It&rsquo;s not pure chaos.</p></blockquote>
<p>This works well for stateless things, but how does it work for stateful things?
In the case of the Honeycomb infrastructure, they make sure to only restart one
server or service at a time. They do not introduce too much chaos to reduce the
likelihood that something goes catastrophically wrong.</p>
<p>Two reasons why you will want to do these experiments at 3 PM and not at 3 AM:</p>
<ul>
<li>You want to test at peak traffic instead of low traffic, since the latter
could give you a false sense of security in situations when everything may
still look normal even though this would have been a problem in a high traffic
situation.</li>
<li>When doing things in the afternoon, there are more people available to deal
with something than there would be in the middle of the night.</li>
</ul>
<p>With the experiments they measure if they had an impact on the customer
experience. If they cause a change, does the telemetry reflect this? (Is the
node indeed reported as being offline, for example?) When you fix things, you
need to repeat the experiment and make sure the change indeed fixed the issue.</p>
<p><img src="/images/all_day_devops_2021_chaos_engineering_liz_fong-jones.png" alt="Liz Fong-Jones: Not every experiment succeeds. But you can mitigate the risks."></p>
<p>When you burn the error budget, the <a href="https://sre.google/sre-book/table-of-contents/">SRE book</a>
states that you should freeze deploys. Liz disagrees. If you freeze deploys, but
continue with feature development, the risk of the next deployment only
increases. Instead, Liz advocates for using the team&rsquo;s time to work on
reliability (i.e. change the nature of the work instead of stopping work).</p>
<blockquote>
<p>Fast and reliable: pick both!</p></blockquote>
<p>You don&rsquo;t have to pick between fast and reliable. In a lot of ways fast <strong>is</strong>
reliable. If you exercise your delivery pipelines every hour of every day,
stopping becomes the anomaly instead of deploying.</p>
<h3 id="takeaways">Takeaways</h3>
<ul>
<li>By designing the delivery pipeline for reliability, Honeycomb can meet their
SLOs.</li>
<li>Feature flags can reduce the blast radius and keep you within your SLO.</li>
<li>And when they cannot: there are other ways to mitigate the risks.</li>
<li>By discovering risks at 3 PM and not 3 AM, you improve the customer experience
since the system is more resilient.</li>
<li>If something does go catastrophically wrong, remember that the SLO is a
guideline not a rule. SLOs are for managing predictable-ish unknown-unknowns
and not things that are completely outside of your control.</li>
<li>We are all part of sociotechnical systems. Customers, engineers and stakeholders alike.</li>
<li>Outages or failed experiments are learning opportunities, not reasons to fire someone.</li>
<li>SLOs are an opportunity to have discussions about trade-offs between stability and speed.</li>
<li>DevOps is about talking to each other and talking to our customers.</li>
</ul>
<h2 id="common-pitfalls-of-infrastructure-as-code-and-how-to-avoid-them--tim-davis">Common Pitfalls of Infrastructure as Code (And how to avoid them!) &mdash; Tim Davis</h2>
<p>We start with the basics: what is Infrastructure as Code (IaC)? With the advent
of cloud providers, you no longer use hardware, but a UI to stand up
infrastructure. This led to <a href="https://en.wikipedia.org/wiki/Shadow_IT">shadow IT</a>
since developers ran off with a credit card to provision what they needed
themselves, instead of using slow, internal IT systems.</p>
<p>Developers, however, would rather write code and use developer methodologies than click
through a UI. This is where IaC started. With it you can create and manage
infrastructure by writing code.</p>
<p><img src="/images/all_day_devops_2021_pitfalls_tim_davis.png" alt="Tim Davis explains that IaC combines the pitfalls from both infrastructure and code"></p>
<p>What are the pitfalls? The bad news: you get all the pitfalls of infrastructure
and all the pitfalls of code. But you&rsquo;ve probably already got a lot of
experience with those issues and teams to handle them. You just use a different
methodology.</p>
<p>The first pitfall is not fostering the communication between the groups that
have experience and tools.</p>
<h3 id="infrastructure-pitfalls">Infrastructure pitfalls</h3>
<p>Which framework/tool do you pick? There are basically two categories:
multi-cloud or cloud agnostic tools on the one hand (like
<a href="https://www.terraform.io/">Terraform</a> and <a href="https://www.pulumi.com/">Pulumi</a>)
and cloud specific tools on the other (like
<a href="https://aws.amazon.com/cloudformation/">CloudFormation</a>). Note that for example
with Terraform and Pulumi you still have to rewrite code when switching from one
cloud provider to another, but at least the tool is familiar.</p>
<p>Security is a huge thing. You still need to know how to design your VPC, IAM
policies, security policies, etc. You still need to communicate with all the
teams that have the experience. It&rsquo;s not just Dev and Ops. With tools like
<a href="https://github.com/accurics/terrascan">Terrascan</a> and
<a href="https://www.checkov.io/">Checkov</a> you can shift-left the security aspect
instead of trying to bolt it on afterwards.</p>
<h3 id="code-pitfalls">Code pitfalls</h3>
<p>The biggest issue is with default values. If you use the UI, there are a
lot of boxes that may be blank or have stuff in them. Some of the boxes can be
left blank, for some you need to specify what you want. The UI is going to
yell at you; if you use IaC things may be less in your face.</p>
<p>You don&rsquo;t want to deploy something with an open policy.
<a href="https://www.openpolicyagent.org/">Open Policy Agent</a> can really help you to make sure you
stay within your allowed parameters. For instance you can write a policy to make
sure you only use a specific region, don&rsquo;t deploy an open S3 bucket or that
you only use certain sizes of EC2 instances.</p>
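<p>Open Policy Agent policies are written in Rego. To show the kind of checks meant here without introducing another language, this is the same idea sketched in plain Python. The resource format, the allowed region and the allowed instance types are all assumptions for the sake of the example.</p>

```python
# Hypothetical guardrails; in OPA these would live in a Rego policy.
ALLOWED_REGIONS = {"eu-west-1"}
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}
PUBLIC_ACLS = {"public-read", "public-read-write"}


def violations(resource):
    """Return a list of human readable policy violations for one resource."""
    problems = []
    if resource.get("region") not in ALLOWED_REGIONS:
        problems.append("region not allowed")
    if resource.get("type") == "s3_bucket" and resource.get("acl") in PUBLIC_ACLS:
        problems.append("bucket is publicly accessible")
    if (
        resource.get("type") == "ec2_instance"
        and resource.get("instance_type") not in ALLOWED_INSTANCE_TYPES
    ):
        problems.append("instance type not allowed")
    return problems


open_bucket = {"type": "s3_bucket", "region": "eu-west-1", "acl": "public-read"}
big_instance = {"type": "ec2_instance", "region": "us-east-1", "instance_type": "m5.24xlarge"}
fine_instance = {"type": "ec2_instance", "region": "eu-west-1", "instance_type": "t3.micro"}
```

<p>Running checks like these against the planned changes in the pipeline means a wrong default gets caught before anything is deployed.</p>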
<p>If you hard code certain values in Terraform or other IaC tools, you might need
to copy/paste a lot of code if you want to create e.g. a test, acceptance and
production environment. To mitigate these <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY (don&rsquo;t repeat
yourself)</a> issues you can
for instance use
<a href="https://www.terraform.io/docs/language/modules/index.html">Terraform modules</a> or
<a href="https://terragrunt.gruntwork.io/">Terragrunt</a>.</p>
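<p>The principle behind modules is the same as behind functions in any other language: define the shared structure once and only pass in what differs. A Python sketch of that idea, with made-up names and values (this is not Terraform syntax):</p>

```python
def environment(name, instance_type, instance_count):
    """Describe one environment; the shared structure lives only here."""
    return {
        "name": name,
        "network": {"cidr": "10.0.0.0/16", "subnets": 3},
        "instances": [
            {"name": f"{name}-web-{index}", "type": instance_type}
            for index in range(1, instance_count + 1)
        ],
        "tags": {"environment": name, "managed_by": "iac"},
    }


# Only the differences between the environments are spelled out.
environments = {
    "test": environment("test", "t3.micro", 1),
    "acceptance": environment("acceptance", "t3.small", 2),
    "production": environment("production", "m6g.large", 4),
}
```

<p>A change to the network layout now has to be made in one place instead of three.</p>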
<p>State size can become a problem. If you want to, you can put all of your
infrastructure in the same state file (which is where your tool stores the state
of the infrastructure). However it means the tool will have to check <em>all</em>
resources in the state file to detect if it needs to do something. To mitigate
this, you can again use Terraform modules. It helps with performance and makes
the codebase more manageable.</p>
<h2 id="wrap-up">Wrap-up</h2>
<p>The conference is still ongoing while I publish this post. However, this is it
for me for All Day DevOps for this year. I learned new things and got inspired.</p>
<p>Thanks to the organizers, moderators and speakers for hosting another great
event.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays oNLine 2021]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2021/06/29/devopsdays-online-2021/" type="text/html" />
    <id>https://markvanlent.dev/2021/06/29/devopsdays-online-2021/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="devops" />
    
    <updated>2021-11-10T20:39:28Z</updated>
    <published>2021-06-29T00:00:00Z</published>
    <content type="html"><![CDATA[<p>Since <a href="https://en.wikipedia.org/wiki/COVID-19_pandemic">COVID-19</a> is still
around, devopsdays Amsterdam was an online event again. These are the notes I
took while watching the talks.</p>
<p><img src="/images/devopsdays2021_logo.jpg" alt="devopsdays oNLine logo"></p>
<h2 id="how-cognitive-biases-and-ranking-can-foster-an-ineffective-devops-culture--kenny-baas--evelyn-van-kelle">How cognitive biases and ranking can foster an ineffective DevOps culture &mdash; Kenny Baas &amp; Evelyn van Kelle</h2>
<p>In practice, not all team members are treated equally although we might think
they are.</p>
<h3 id="how-to-make-sure-everyone-said-what-has-to-be-said">How to make sure everyone said what has to be said?</h3>
<p>In each team ranking takes place. It determines who takes the lead, who
expresses an opinion, etc.</p>
<p>We need to make sure ranking is made explicit. Your job title or position in the
org chart is explicit ranking. But things like your gender, skin colour and level
of charisma are part of the implicit ranking.</p>
<p>The result of this ranking can be a situation like this: a woman on the team
just explained something on a certain topic, but questions about it are addressed
to the white man in the team.</p>
<p>We all have a list of traits we associate with higher rank. So if someone ticks
more of those boxes, they are seen as higher in rank. You can also rank yourself
lower compared to others. As a result you might not express your opinion because
someone you regard as having a higher rank said something that conflicts with
what you wanted to say.</p>
<p>The person with the higher rank should be aware of their rank and &ldquo;share it.&rdquo;
Ask yourself the question &ldquo;what <strong>don&rsquo;t</strong> I know?&rdquo; in a discussion. Ask the
question the other (with a lower rank) is afraid to ask.</p>
<blockquote>
<p>Own, play and share your rank.</p></blockquote>
<h3 id="how-can-we-create-and-include-new-insights">How can we create and include new insights?</h3>
<p>We use cognitive bias to make decisions. Usually this helps us, but sometimes it
works against us. What we know hinders us from taking on a new perspective and
it limits us in our thinking. We get stuck in what we know
(<a href="https://en.wikipedia.org/wiki/Functional_fixedness">functional fixedness</a>).</p>
<p><img src="/images/devopsdays2021_bias.jpg" alt="We all have biases: should the toilet paper be over or under?"></p>
<p>Be aware of your biases and try to break free from them.</p>
<h3 id="who-makes-decisions-and-how-to-get-everyone-onboard-with-the-decision">Who makes decisions and how to get everyone onboard with the decision?</h3>
<p>Have fewer &ldquo;corporate&rdquo; meetings and more &ldquo;campfires&rdquo; where we share ideas and talk
about the why.</p>
<p>There are 4 ways to go into a decision:</p>
<ul>
<li>Idea</li>
<li>Suggestion</li>
<li>Proposal</li>
<li>Command</li>
</ul>
<p>If you have a proposal or command, be open about it: &ldquo;this is what we are going
to do, what does it take to go along with it?&rdquo; This is better than pretending it
is completely open for discussion.</p>
<h2 id="getting-started-embrace-your-inner-child--rain-leander">Getting Started: Embrace Your Inner Child &mdash; Rain Leander</h2>
<p>Learning to ride a bike when you are older is harder than when you are a child.
This is not so much a physical thing, but we do not want to be embarrassed when
we are older. This goes for everything new we do: we want to be the
best&mdash;instantly. However you <em>will</em> fall down and fail. You <em>will</em> make mistakes.
Embrace that, be brave and learn.</p>
<p>Bravery is something you have to develop. It is hard! And it is not a <em>lack</em> of
fear, it is moving forward <em>despite</em> fear.</p>
<p>The steps to become more brave that worked for Rain:</p>
<ul>
<li>Embrace vulnerability</li>
<li>Admit you have fear</li>
<li>Do it anyway</li>
<li>When you fall down, get back up</li>
</ul>
<blockquote>
<p>Play!</p></blockquote>
<p>Playing is instrumental to life. No one can tell you how to play. Do what is fun
for you. And when it stops being fun, just stop doing it. Book time in your
agenda to play, <strong>each day</strong>! (And napping counts.)</p>
<p>Explore! Some of the things Rain uses to stay curious and explore:</p>
<ul>
<li>Listen</li>
<li>Ask questions</li>
<li>Avoid assumptions</li>
<li>Listen</li>
<li>Embrace learning as fun</li>
<li>Develop a beginner&rsquo;s mind</li>
<li>Listen</li>
</ul>
<p>You are a unicorn. Your experience, your background, your education, etc. make
you unique.</p>
<p>Be willing to fall down, and get back up. Be brave. Play! Conquer the world.</p>
<p>(The original blog post for this talk:
<a href="http://groningenrain.nl/getting-started-embrace-your-inner-child/">Getting Started: Embrace Your Inner Child</a>)</p>
<h2 id="prevent-heroism-how-to-work-today-to-reduce-work-tomorrow--quintessence-anx">Prevent Heroism: How to Work Today to Reduce Work Tomorrow &mdash; Quintessence Anx</h2>
<p>We might be familiar with &ldquo;the angry sysadmin.&rdquo; It is the person who is highly
skilled, has lots of knowledge, gets asked all the questions and works overtime to get
their own work done. This is basically the description of the character Brent
from the book <a href="https://itrevolution.com/the-phoenix-project/">The Phoenix Project</a>.</p>
<p>Being a Brent-like person is stressful and may result in anger, resentment and
strained relationships. Brent-like persons typically have issues with things
like time management, maintaining focus and
<a href="https://en.wikipedia.org/wiki/Allostatic_load">allostatic load</a> (where you&rsquo;re too tired to
be awake and too stressed to sleep). But there are business implications as well:
diminishing returns, high turnover, reputational damage and declining
financial performance.</p>
<p>Instead of having&mdash;or being&mdash;a Brent-like person, we would rather have no silos and
no vacuums (the latter is where you, for instance, get no response to
questions).</p>
<p>Steps to get out of this situation and do things differently:</p>
<ul>
<li>Brainstorm the work (every task done during the day, planned or unplanned)</li>
<li>Determine which &ldquo;stream&rdquo; each task is in: reactive or proactive</li>
<li>Look for patterns</li>
<li>Switch to &ldquo;upstream mode&rdquo; to prevent work in the future (short term/long term
vs permanent/temporary matrix)</li>
<li>Plan and prioritize work: upstream work is more important</li>
<li>Do the work</li>
<li>Iterate (what works, what does not, what can be improved)</li>
</ul>
<p>Upstream work is the proactive and preventative work that reduces the need for
reactive, or downstream, work.</p>
<p>But how does one measure things? Things to keep in mind:</p>
<ul>
<li>What are your needs?</li>
<li>Learn how to measure them</li>
<li>Establish baselines</li>
<li>Know your capacity</li>
<li>Track the rate of change</li>
<li>Set goals</li>
<li>Iterate</li>
</ul>
<p><img src="/images/devopsdays2021_metrical_thoughts.jpg" alt="Some example metrics to track: SLA compliance, question volume, change failure rate, unplanned work rate"></p>
<p>Slides and additional resources can be found at
<a href="https://noti.st/quintessence/TFyFzN/prevent-heroism-how-to-work-today-to-reduce-work-tomorrow">https://noti.st/quintessence</a></p>
<h2 id="real-world-continuous-delivery-learn-adapt-improve--michiel-rook">Real-world Continuous Delivery: Learn, Adapt, Improve &mdash; Michiel Rook</h2>
<p>This talk is a real life case study, focussing on the deployment of the platform
of a customer of Michiel.</p>
<p>There was a pretty long release check list with manual steps. Especially when
under pressure, steps were forgotten or done incorrectly. Ideally they released
every two weeks (after each sprint). It took a team 2&ndash;3 days of manual work.
Time not spent on features.</p>
<p>To improve this situation, the following goals were set:</p>
<ul>
<li>Reduce costs by letting people do more valuable work</li>
<li>Faster feedback and ability to do more experiments</li>
<li>Reduce <a href="https://sre.google/sre-book/eliminating-toil/#toil-defined">toil</a></li>
</ul>
<p>For continuous delivery to work, you have to make things small. This is a key
difference between high and low performing teams (see
<a href="https://itrevolution.com/book/accelerate/">Accelerate</a>).</p>
<p><img src="/images/devopsdays2021_small_releases.jpg" alt="Small releases mean realizing value faster"></p>
<p>Although the platform was a monolith, they deferred splitting it up. If they
had broken down the monolith into smaller pieces, they would have had more
manual steps for each release and made the situation worse. So the strategy was
to automate first and then split.</p>
<h3 id="phase-1">Phase 1</h3>
<p>The teams increased the release cadence. They switched to a weekly release cycle
instead of a release every two weeks (on good weeks). Releasing more often makes
problems visible faster.</p>
<p>The release process was also automated. They created a pipeline to build and
deploy the platform. In this phase there was still a manual step between the
deployment to acceptance and production. They didn&rsquo;t trust the system enough
yet.</p>
<p>They also made changes in their way of working. The big one: pair programming.
This is superior to code reviews, for example because you also discuss code that
does <em>not</em> get written and you have architecture discussions.</p>
<p>Other changes:</p>
<ul>
<li>A &ldquo;do not leave broken windows&rdquo; mentality. This is from the book
<a href="https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/">The Pragmatic Programmer</a>,
which states: <q>neglect
accelerates the rot faster than any other factor.</q></li>
<li>Measure how they were doing in terms of their continuous delivery journey.
The book <a href="https://leanpub.com/measuringcontinuousdelivery">Measuring Continuous Delivery</a>
has useful information about this.</li>
</ul>
<h3 id="phase-2">Phase 2</h3>
<p>In the next phase they had a robot press the button to deploy to production.
This required zero downtime deploys. They solved this by doing rolling updates
(a load balancer in front of a couple of backend servers and then upgrade one
server at a time until they are all up-to-date).</p>
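<p>As a rough illustration of that loop (not the setup from the talk), here is a small shell sketch. The three helper functions and the server names are hypothetical placeholders that only print what a real step would do:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh">#!/bin/sh
# Placeholder helpers: each one only prints the step a real implementation would perform.
drain_server()   { echo "drain $1"; }    # stop sending traffic to the server
upgrade_server() { echo "upgrade $1"; }  # install the new release on it
restore_server() { echo "restore $1"; }  # add it back to the load balancer

rolling_update() {
    # Handle one server at a time; stop at the first failure so the
    # servers that were not touched yet keep running the old version.
    for server in "$@"; do
        drain_server "$server" || return 1
        upgrade_server "$server" || return 1
        restore_server "$server" || return 1
    done
}

rolling_update web1 web2 web3
</code></pre></div><p>Because only one server is out of the pool at any moment, the others keep serving traffic during the deploy.</p>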
<p>Pipeline failures come in all forms, e.g.:</p>
<ul>
<li>Flaky tests</li>
<li>Timeouts</li>
<li>Network stability issues</li>
<li>External dependencies in tests</li>
</ul>
<p>If you cannot trust a test, remove it. Otherwise you&rsquo;ll probably work around it,
which is more dangerous.</p>
<p>This level of automation also meant they treated pipeline failures as priority 1
issues (which warrants extreme feedback to notify people of a broken build:
lights, sounds).</p>
<h3 id="phases-3">Phase 3</h3>
<p>In the third phase they made it faster:</p>
<ul>
<li>Scale vertically and horizontally</li>
<li>Parallelize tests</li>
</ul>
<h2 id="the-adjacent-possible-evolution-innovation--catastrophe--jason-yee">The Adjacent Possible: Evolution, Innovation &amp; Catastrophe &mdash; Jason Yee</h2>
<p>The four cornerstones of resilience:</p>
<ul>
<li>Monitoring: knowing what to look for</li>
<li>Responding: knowing what to do</li>
<li>Learning: knowing what has happened</li>
<li>Anticipating: knowing what to expect</li>
</ul>
<p>The term &ldquo;adjacent possible&rdquo; comes from biology in the context of evolution.
Evolution requires a chain of changes to happen.</p>
<p><img src="/images/devopsdays2021_adjacent_possible.jpg" alt="Adjacent possible: evolution by one small change after another"></p>
<p>Complex systems usually do not have a single root cause, only contributing
factors. Perhaps there&rsquo;s an &ldquo;adjacent possible&rdquo; of failure. We cannot anticipate
all failures, but perhaps we can explore them.</p>
<p><img src="/images/devopsdays2021_known_unknown.jpg" alt="Known knowns, known unknowns and unknown unknowns"></p>
<p>How do you explore adjacent possible failures?</p>
<p>To move from known knowns to known unknowns, we can use chaos engineering
(<q>thoughtful, planned experiments to improve our understanding of how systems
work (and fail), so that we can improve them</q>). This way we can explore the
known unknowns and these then become the known knowns. From that standpoint,
the original unknown unknowns become the known unknowns.</p>
<p>Scientific process:</p>
<ul>
<li>Observe (what is normal)</li>
<li>Hypothesize (how do we think it will react to failure)</li>
<li>Experiment (inject failure)</li>
<li>Analyze (what did actually happen)</li>
<li>Share knowledge</li>
</ul>
<p>Explore the adjacent possible!</p>
<h2 id="minimal-viable-presentations--mark-smalley">Minimal viable presentations &mdash; Mark Smalley</h2>
<p>There are four important points to take into consideration when creating a
presentation:</p>
<dl>
<dt>Accept that it is not about you</dt>
<dd>It is the audience&rsquo;s presentation. What you want to tell might not be what
they want to hear.  Kill your darlings.</dd>
<dt>Anticipate silent questions</dt>
<dd>These questions are:
<ul>
<li>Huh? What is he talking about? (Solution: be clear about the self evident stuff)</li>
<li>Really? (Solution: be clear about the evidence)</li>
<li>So? How is this relevant for me? What&rsquo;s in it for me?</li>
<li>What&rsquo;s next?</li>
</ul>
</dd>
<dt>Think of learning objectives</dt>
<dd>Think about what the audience should <strong>know</strong>, <strong>believe</strong>, <strong>be able to</strong> do and <strong>feel</strong> after the
presentation.</dd>
<dt>Dare to choose your ideal audience</dt>
<dd>Accept that the presentation will not be for the majority of the people
listening.</dd>
</dl>
<figure><img src="/images/devopsdays2021_learning_objectives.jpg"
    alt="Learning objectives of this talk"><figcaption>
      <p>Mark Smalley applies the advice given in his talk to this talk itself</p>
    </figcaption>
</figure>

<h2 id="engineering-metrics-that-matter--dan-lines">Engineering Metrics That Matter &mdash; Dan Lines</h2>
<p>Before we dive in, let&rsquo;s start with some
<a href="https://en.wikipedia.org/wiki/Performance_indicator">KPI</a>s you want to avoid:</p>
<ul>
<li>Lines of code</li>
<li>Number of commits</li>
<li>Individual stack ranking</li>
<li>Anything with story points</li>
</ul>
<p>So what <em>do</em> you want to measure? What will improve your process?</p>
<dl>
<dt>Couple pull request size with cycle time (how much time from coding to deployment)</dt>
<dd>What you&rsquo;ll see is that bigger PRs take longer to review and
thus increase cycle time.</dd>
<dt>Deployment frequency</dt>
<dd>How often can we get new value into the hands of customers?</dd>
<dt>Idle time</dt>
<dd>Waiting for e.g. a release or review. Waiting means context switches which
means less productivity.</dd>
<dt>Code churn</dt>
<dd>How much code are we reworking in a short period of time? High code churn
indicates delays and quality issues.</dd>
<dt>Review depth</dt>
<dd>The amount of feedback per PR.</dd>
<dt>Mean time to restore</dt>
<dd>How long it takes to fix an incident in production.</dd>
<dt>Investment profile</dt>
<dd>New functions for our customers vs bug fixing, backend or non-functional work.</dd>
</dl>
<h2 id="devops-is-no-walk-in-the-park--sabine-wojcieszak">DevOps is no Walk in the Park &mdash; Sabine Wojcieszak</h2>
<p>In a park you are safe, you do not need a map, can wander where you want and you
need no preparation. DevOps is not like that.</p>
<p>The term CALMS has been coined by John Willis to explain DevOps. It stands for:</p>
<ul>
<li><strong>C</strong>ulture</li>
<li><strong>A</strong>utomation</li>
<li><strong>L</strong>ean</li>
<li><strong>M</strong>easurement</li>
<li><strong>S</strong>haring</li>
</ul>
<p>Most people talk a lot about the automation and some about the measurements
part, but very few about sharing and even fewer about culture.</p>
<p>There is a similarity with Agile where the term began as a niche but became
hyped and then mainstream, where people talk about Agile without knowing what it
is. We should prevent this from happening with DevOps.</p>
<p><img src="/images/devopsdays2021_agile_mainstream_devops.jpg" alt="Agile: from niche to mainstream. Will the same happen with DevOps?"></p>
<p>It&rsquo;s no longer business <em>and</em> IT, but IT <em>is part of</em> business. One could even
say IT <em>is</em> business.</p>
<p>DevOps is <strong>not</strong> cherry picking what we like (automation) or only picking the
low hanging fruit and calling it &ldquo;DevOps.&rdquo; It needs a holistic approach. But
this needs a lot of ingredients. In the right mix.</p>
<p>You should always ask yourself <em>why</em> you are doing DevOps. Do you think it is
cheaper because of the level of automation? Because everyone is doing it? To
deliver better software?</p>
<p>You need to understand the whole approach.</p>
<p>DevOps means to Sabine: <q>deliver valuable products with better quality sooner to
customers/users.</q> This means we need to know what is valuable for our customers,
what our customers think of as quality, etc.</p>
<p>The term CI/CD in the context of DevOps traditionally stands for continuous integration
/ continuous delivery. But it could also stand for continuous improvement /
continuous development, if we think of products and people. But we can also talk
about continuous improvement and continuous discovery of opportunities and
outcomes.</p>
<p>You cannot buy DevOps&mdash;neither with tools, nor with certifications. But it&rsquo;s
not free either: you will fail along the way, for example.</p>
<p>Some anecdotes of where it went wrong:</p>
<ul>
<li>&ldquo;We are now the DevOps department.&rdquo; Which means there is still a silo with a
lack of communication.</li>
<li>&ldquo;We have a DevOps team that is setting up the pipelines.&rdquo; Again no
communication, no transparency.</li>
<li>&ldquo;We have a linter but it gave too many warnings so we turned it off.&rdquo; No
understanding about why it is done, no trust that it helps to grow, sign of
fear to make mistakes.</li>
<li>&ldquo;We are measured by the number of developed features, but not allowed to talk
to sales.&rdquo; It&rsquo;s a feature factory, measuring the wrong things. Again also not
enough communication and silos.</li>
<li>&ldquo;IT has not reported any problems.&rdquo; Be open to the obvious: if customers
complain, there is a problem. Ask yourself what is actually being monitored.
Obviously not business-relevant metrics, otherwise the problem would have been
detected by the company instead of by the customers.</li>
</ul>
<p>DevOps is more than automation, it is CALMS.</p>

  <figure>

<blockquote >
If you don&rsquo;t get the C then don&rsquo;t bother with the A, L, M or S.
</blockquote>

  <figcaption>
    &mdash;John Willis
  </figcaption>
  </figure>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Continuous deployment for this website using GitLab CI/CD, SSH, Docker and systemd]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2021/03/02/continuous-deployment-for-this-website-using-gitlab-ci-cd-ssh-docker-and-systemd/" type="text/html" />
    <id>https://markvanlent.dev/2021/03/02/continuous-deployment-for-this-website-using-gitlab-ci-cd-ssh-docker-and-systemd/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="blog" />
    <category term="docker" />
    <category term="devops" />
    <category term="gitlab" />
    <category term="ssh" />
    <category term="systemd" />
    
    <updated>2021-11-27T22:20:54Z</updated>
    <published>2021-03-02T00:00:00Z</published>
    <content type="html"><![CDATA[<p>Almost two years ago I <a href="/2019/04/10/new-blog-backend/#the-future">wrote</a> that
ideally I would not have to log in to my VPS to update this website. Well, that
moment has finally arrived.</p>
<p>A couple of weeks ago I decided to pursue <a href="https://en.wikipedia.org/wiki/Continuous_deployment">continuous
deployment</a> for this site.
Not because it is such a hassle to deploy a new version myself and also not
because it is needed that often, but because I wanted to explore the concept.</p>
<p>To summarize the most relevant parts of what I wrote in 2019:</p>
<ul>
<li>The source code for this blog is hosted on <a href="https://gitlab.com/markvl/blog">GitLab</a>.</li>
<li>Whenever I push a commit, GitLab CI/CD builds a new Docker image.</li>
<li>Once the build is done I SSH into the VPS this site is hosted from, pull the
new image and use it to run this site.</li>
</ul>
<p>That last, manual step is now automated. The hardest part was figuring out a
method I was happy with; that is: not putting the keys to the kingdom in GitLab.
Not that I distrust GitLab, but if someone were to get access to my GitLab
account, I would not want them to <em>also</em> have unlimited access to the VPS.</p>
<p>The solution I ended up with consists of three parts:</p>
<ul>
<li>Configuration on GitLab to trigger a deployment.</li>
<li>A user on the VPS so the GitLab job can log into the VPS.</li>
<li>A monitoring service on the VPS to redeploy when a trigger is detected.</li>
</ul>
<h2 id="gitlab-configuration">GitLab configuration</h2>
<p><em>(This part is basically a summary of what David Négrier wrote in his article
<a href="https://thecodingmachine.io/continuous-delivery-on-a-dedicated-server">Continuous delivery with GitLab, Docker and Traefik on a dedicated server</a>.
If you want a more detailed explanation, I can highly recommend reading his
article.)</em></p>
<p>Before we can get into the job that I added to my pipeline, we need to prepare
some things. Starting with adding a couple of
<a href="https://docs.gitlab.com/ee/ci/variables/index.html">GitLab CI/CD variables</a>:</p>
<ul>
<li><code>SSH_HOST</code> and <code>SSH_PORT</code>: the SSH client needs to know
how to connect to the VPS.</li>
<li><code>SSH_USER</code> and <code>SSH_PRIVATE_KEY</code>: the SSH client needs authentication
information.</li>
<li><code>SSH_KNOWN_HOSTS</code>: the public SSH key of the server (<code>SSH_HOST</code>) so we can add
it to the <code>known_hosts</code> file and prevent
<a href="https://en.wikipedia.org/wiki/Man-in-the-middle_attack">man-in-the-middle attacks</a>. I got this
value by running <code>ssh-keyscan &lt;hostname&gt;</code> on my laptop and pasting the output
in GitLab.</li>
</ul>
<p>In a moment we&rsquo;ll see how these variables get used.</p>
<p>Since I want to use the digest of the Docker image that is built in this
pipeline, I&rsquo;ve added an artifact to store the digest so we can access it later
on:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl"><span class="nt">build_image</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">script</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="l">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">docker image ls --filter &#34;label=org.label-schema.vcs-url=https://gitlab.com/markvl/blog&#34; --filter &#34;label=org.label-schema.vcs-ref=$CI_COMMIT_SHA&#34; --format &#34;{{.Digest}}&#34; &gt; image-sha.txt</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">artifacts</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">paths</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">image-sha.txt</span><span class="w">
</span></span></span></code></pre></div><p>(The filters are probably not really necessary, but just in case there are
multiple images present, I want to be reasonably sure that I&rsquo;ve picked the right
one.)</p>
<p>Now finally the job that triggers the deployment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl"><span class="nt">deploy_image</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l">deploy</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">only</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">refs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">main</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">services</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">alpine:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">script</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">apk add --no-cache openssh</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">mkdir ~/.ssh</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">echo &#34;$SSH_KNOWN_HOSTS&#34; &gt;&gt; ~/.ssh/known_hosts</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">chmod 644 ~/.ssh/known_hosts</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">echo &#34;$SSH_PRIVATE_KEY&#34; &gt; ~/.ssh/private_key</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">chmod 600 ~/.ssh/private_key</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c"># add ssh key stored in SSH_PRIVATE_KEY variable to the agent store</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">eval $(ssh-agent -s)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">ssh-add ~/.ssh/private_key</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">ssh -p $SSH_PORT $SSH_USER@$SSH_HOST $(cat image-sha.txt)</span><span class="w">
</span></span></span></code></pre></div><p>Most of the code is just to get SSH working. All the magic happens in the last
line. Note that, in contrast to David&rsquo;s article, I don&rsquo;t actually execute commands
on my VPS; instead I only send one string.</p>
<p>(For the full <code>.gitlab-ci.yml</code> file see <a href="https://gitlab.com/markvl/blog/-/blob/32f876382900b6d4f25af988e7efdde3e17e4b52/.gitlab-ci.yml">the GitLab repo for this site</a>.)</p>
<p>Now every time the pipeline is run on the <code>main</code> branch, the digest of the
freshly built Docker image is sent to my VPS.</p>
<h2 id="vps-ssh-configuration">VPS SSH configuration</h2>
<p>On the VPS we need to make sure that the GitLab job can SSH into the machine.</p>
<p>The first step is to create a user to be used by GitLab (the <code>SSH_USER</code> variable
I mentioned above). Next we need to make sure that the <code>SSH_PRIVATE_KEY</code> stored
in GitLab can be used to log in. To make this possible <em>and</em> to mitigate the
risks of the SSH key in GitLab getting abused, I have added the following
content to the file <code>~/.ssh/authorized_keys</code> of the new user:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">command=&#34;/home/&lt;username&gt;/.ssh/ssh_commands.sh&#34;,no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding &lt;public key&gt; &lt;comment&gt;
</span></span></code></pre></div><p>Using the <code>command</code> option is an idea I got from
<a href="https://serverfault.com/a/803873/25920">a ServerFault answer</a>
and Mauricio Tavares&rsquo; article
<a href="https://unixwars.blogspot.com/2014/12/getting-sshoriginalcommand.html">Getting the SSH_ORIGINAL_COMMAND</a>.
In my setup the <code>ssh_commands.sh</code> script stores the original command (here:
the digest) in a file called <code>deployment.raw</code>.</p>
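<p>The script itself is not shown here. A minimal sketch of such a forced-command script could look like the following. The function name and the default location of <code>deployment.raw</code> are assumptions; <code>SSH_ORIGINAL_COMMAND</code> is the environment variable in which sshd passes along whatever the client asked to run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh">#!/bin/sh
# Hypothetical sketch, not the actual ssh_commands.sh used for this site.

# Assumed location of the trigger file; override via DEPLOYMENT_FILE.
TARGET="${DEPLOYMENT_FILE:-$HOME/deployment.raw}"

store_original_command() {
    # $1 is the command line the SSH client sent (here: an image digest).
    if [ -z "${1:-}" ]; then
        return 1
    fi
    # Write to a temporary file and rename it, so a watcher never
    # sees a partially written trigger file.
    printf '%s\n' "$1" > "$TARGET.tmp"
    mv "$TARGET.tmp" "$TARGET"
}

# The command is only stored, never executed.
store_original_command "${SSH_ORIGINAL_COMMAND:-}" || echo "no command received, nothing stored"
</code></pre></div><p>Because of the <code>command</code> option, this script runs no matter what the client asks for, so the most a leaked key can do is write a string to that file.</p>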
<h2 id="vps-monitoring-service">VPS monitoring service</h2>
<p>To actually deploy the new image, we need just one more piece in this puzzle: a
script to pull and use the Docker image.</p>
<p>I&rsquo;ve opted for a systemd unit to monitor for the existence of the
<code>deployment.raw</code> file by adding the file
<code>/etc/systemd/system/blog-deployment.path</code> (note the &ldquo;<code>.path</code>&rdquo; at the end of the
filename):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Unit]</span>
</span></span><span class="line"><span class="cl"><span class="na">Description</span><span class="o">=</span><span class="s">Blog deployment path monitor</span>
</span></span><span class="line"><span class="cl"><span class="na">Wants</span><span class="o">=</span><span class="s">blog.service</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Path]</span>
</span></span><span class="line"><span class="cl"><span class="na">PathExists</span><span class="o">=</span><span class="s">/&lt;path&gt;/&lt;to&gt;/deployment.raw</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Install]</span>
</span></span><span class="line"><span class="cl"><span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</span></span></code></pre></div><p>This systemd unit configuration file is accompanied by the following service
file (<code>/etc/systemd/system/blog-deployment.service</code>):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Unit]</span>
</span></span><span class="line"><span class="cl"><span class="na">Description</span><span class="o">=</span><span class="s">Blog deployment service</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Service]</span>
</span></span><span class="line"><span class="cl"><span class="na">Type</span><span class="o">=</span><span class="s">oneshot</span>
</span></span><span class="line"><span class="cl"><span class="na">ExecStart</span><span class="o">=</span><span class="s">/usr/local/bin/deploy_blog.sh</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Install]</span>
</span></span><span class="line"><span class="cl"><span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</span></span></code></pre></div><p>In the <code>deploy_blog.sh</code> script I do things like reading the <code>deployment.raw</code>
file, checking its content, downloading the new Docker image, checking it and
restarting this website with the new image.</p>
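<p>The script is not included in this post. A minimal sketch of the steps just listed could look like this; the trigger file location, the image name and the digest check are assumptions, and how <code>blog.service</code> ends up using the pulled image is left out:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh">#!/bin/sh
# Hypothetical sketch of the deployment steps, not the real deploy_blog.sh.

# Assumed values: both the trigger file location and the image name are placeholders.
TRIGGER="${TRIGGER:-/var/lib/blog/deployment.raw}"
IMAGE="${IMAGE:-registry.example.com/blog}"

is_valid_digest() {
    # Accept only "sha256:" followed by exactly 64 lowercase hex characters.
    printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

deploy() {
    # Only the first line is read, so extra lines can never reach docker.
    digest="$(head -n 1 "$TRIGGER")"
    # Remove the trigger first so the path unit does not fire again for the same file.
    rm -f "$TRIGGER"
    if ! is_valid_digest "$digest"; then
        echo "ignoring unexpected content in trigger file"
        return 1
    fi
    # Pull the image by digest, then restart the site.
    docker pull "$IMAGE@$digest" || return 1
    systemctl restart blog.service
}

# Only act when a trigger file is present.
if [ -f "$TRIGGER" ]; then
    deploy
fi
</code></pre></div><p>Removing the trigger file early matters in this sketch: a <code>PathExists=</code> watch keeps matching for as long as the file exists, so leaving it in place would presumably start the service again right after it finishes.</p>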
<h2 id="summary">Summary</h2>
<p>To recap my continuous deployment solution:</p>
<ul>
<li>I push a commit to the <code>main</code> branch of the repo of this site.</li>
<li>GitLab CI/CD builds a new Docker image and sends its digest to my server.</li>
<li>My server watches for the existence of the digest file and uses it as a
trigger to deploy the new version of this website.</li>
</ul>
<p>And now that I&rsquo;ve written this down, I&rsquo;m going to commit this article to Git,
push it to GitLab and then sit back and wait (im)patiently for my website to
update itself. ;-)</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays oNLine 2020]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2020/07/09/devopsdays-online-2020/" type="text/html" />
    <id>https://markvanlent.dev/2020/07/09/devopsdays-online-2020/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="devops" />
    
    <updated>2021-11-10T20:38:54Z</updated>
    <published>2020-07-09T00:00:00Z</published>
    <content type="html"><![CDATA[<p>This year, due to the <a href="https://en.wikipedia.org/wiki/COVID-19_pandemic">COVID-19 pandemic</a>,
the Dutch devopsdays are combined into a single-day online event.</p>
<p><img src="/images/devopsdays2020_logo.jpg" alt="devopsdays oNLine logo"></p>
<p>These are my notes from the keynote sessions.</p>
<h2 id="why-change-matters-now-more-than-ever-the-importance-of-devops-in-this-new-world--michael-ducy">Why change matters now more than ever: the importance of DevOps in this new world &mdash; Michael Ducy</h2>
<p>This year has challenged us in many ways. Many people got sick, many people have
died. Our world has been turned upside-down. Some responses to this have been
good, some bad. We&rsquo;re more divided, communities have been shattered.</p>
<p>DevOps and digital transformation came about due to the rise of cloud and
mobile. The relationship between businesses and consumers has changed. Businesses
interact with us differently than before. DevOps is a response to serve digital
transformation.</p>
<p>DevOps is about community and empathy. Digital transformation is really
about capitalism, about finding new ways to exploit capital and resources. This
is an interesting contrast since DevOps is to support digital transformation.</p>
<p>Are the digital possibilities we have nowadays, which have helped us in these
challenging times (Netflix, ordering food online, etc.), a force for good?
Or do they divide the people that are fortunate enough to be able to use them
and those that are not? What does digital transformation do to serve the
underprivileged, the poor, the oppressed?</p>
<p>Be very aware that there are <em>many</em> people that are being oppressed. Many
systems in our world are built upon the &ldquo;technical debt&rdquo; of not treating people
as equals. We intentionally do not go into the bad neighbourhoods in our cities.
We avoid them (at least in part) to not be confronted with the conditions that
people from those parts of the city have to live in. We don&rsquo;t want to be
reminded about what we are <em>not</em> doing to help these people out, while enjoying
our own lives.</p>
<p>In the DevOps community we talk about diversity and inclusion. We have the
option to close a number of gaps. It is not just about the difference in pay
between genders. It is also about us, privileged people, using our voices to
lift people up. We must invest in lifting up the lowest and destroy our systems
of oppression that were set up hundreds of years ago.</p>
<blockquote>
<p>Think about how you can use your voice to destroy systems of oppression.</p></blockquote>
<p>There is a lot that is built into society and culture that needs to be changed.
We have an opportunity to lead the change.</p>
<h2 id="from-waterfall-to-agile-microsofts-not-so-easy-evolution-into-the-world-of-devops--abel-wang">From waterfall to agile. Microsoft&rsquo;s not-so-easy evolution into the world of DevOps &mdash; Abel Wang</h2>
<p>The definition of DevOps used by Microsoft:</p>

  <figure>

<blockquote >
DevOps is the union of people, process and products to enable continuous
delivery of value to our end users.
</blockquote>

  <figcaption>
    &mdash;Donovan Brown
  </figcaption>
  </figure>


<p>The first lesson Microsoft had to learn in their evolution from shipping a boxed
software product to a service: if it hurts, don&rsquo;t avoid it but do it more often.
By doing it more often, you get better at it over time until eventually it stops
hurting.</p>
<p>The way you organize your company is very important. You don&rsquo;t want different
teams competing with each other, you want them to collaborate, to be one
company.</p>
<p>To support that idea Microsoft started to use one engineering system instead of
allowing each team to pick their own tools. Now Microsoft is using their own
tools (Azure DevOps services: Azure Boards, Azure Pipelines, etc).</p>
<p>Microsoft&rsquo;s definition of done:</p>
<ul>
<li>Live in production</li>
<li>Collecting telemetry</li>
</ul>
<p>To get there, dramatic changes in the team structure were needed. Previously
there were basically three teams: program management, development and testing.
There was a toxic culture between the development and the QA teams. QA was seen
as a second class citizen.</p>
<p>They transitioned to an engineering team where the development and QA teams
became one team: QA had to learn the codebase and the developers needed to learn
how to test. Another transition they made was to create a &ldquo;feature team,&rdquo; which
combined engineering, ops and program management. This makes it easier for
customers to communicate with Microsoft since one team is responsible for one
feature, instead of different teams, depending on where in the development phase
a feature was at a certain time.</p>
<p>Engineers are allowed to pick a different team to work in each year. This makes
them very happy even though only about 20% of the developers actually change to
a different team.</p>
<p>Another change is that every team has the same sprint cadence now. All sprints
start on the same day and have the same duration (3 weeks). Sprints start with
an email to the engineers and end with a summary of what has been accomplished
plus a short video. At the end of every 4 sprints, all team leads (of the teams
that are related to each other) have a planning meeting. Every 6 months the 12
month strategy is reviewed.</p>
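<p>The cadence arithmetic is easy to check: four sprints of three weeks put a planning meeting every twelve weeks. A minimal sketch follows; the start date passed in and the function name are made up for illustration.</p>

```python
from datetime import date, timedelta

SPRINT_LENGTH = timedelta(weeks=3)  # sprint duration mentioned in the talk
SPRINTS_PER_PLANNING = 4            # planning meeting after every 4 sprints


def sprint_schedule(first_day: date, count: int) -> list[tuple[int, date, bool]]:
    """Return (sprint number, start date, ends with planning meeting) per sprint."""
    schedule = []
    for number in range(1, count + 1):
        start = first_day + (number - 1) * SPRINT_LENGTH
        ends_with_planning = number % SPRINTS_PER_PLANNING == 0
        schedule.append((number, start, ends_with_planning))
    return schedule
```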
<p>Leadership wanted to micromanage, but didn&rsquo;t have enough context. Providing them
the right context to make sound decisions took too much time. In the end it was
decided that leadership was no longer allowed to look at the backlog and only
operate at the 6-month and 12-month planning level.</p>
<p>The teams also introduced a new rule: if you find a bug, you have to fix it in
that sprint. Teams are only allowed to carry four bugs to the new sprint. This
is called the &ldquo;bug cap.&rdquo; If you are over this number, you are not allowed to
build new features the next sprint but first have to reduce the number of open
bugs. This is not meant as a punishment, but as a way to keep quality at a more
stable level.</p>
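<p>The bug cap is simple enough to express as a rule. This is only a sketch of the rule as described in the talk; the function name and return values are my own.</p>

```python
BUG_CAP = 4  # maximum number of bugs a team may carry into the next sprint


def next_sprint_focus(open_bugs: int, bug_cap: int = BUG_CAP) -> str:
    """Decide what a team works on next sprint under the bug cap rule."""
    if open_bugs > bug_cap:
        return "reduce bugs"
    return "build features"
```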
<p>What to measure? Be very careful because people tend to want to game the system.
Is code coverage important for your bonus? Be prepared to have 100% code
coverage without actually improving the quality of the code. It is better to
measure outcome, e.g. how much time it takes to fix a bug once found. Metrics
are <strong>not</strong> used to punish or reward engineers&mdash;they are used to <em>help</em> the
engineers and answer questions like &ldquo;do you need training on the new framework?&rdquo;</p>
<p>How do you get your developers to do the right thing? Make them feel the pain of
their own code if they don&rsquo;t. Make them part of the call to solve a live site
incident for example.</p>
<h2 id="securing-your-devops-transformation--april-edwards">Securing your DevOps transformation &mdash; April Edwards</h2>
<p>(For the record: April also works for Microsoft and uses the same definition for
DevOps as Abel.)</p>
<p>Why is DevOps important? Because your competition is also using it. It results
in high performance companies that deploy more often, have a lower change
failure rate and respond to the market faster (shorter time between commit and
deployment).</p>

  <figure>

<blockquote >
95% of cloud breaches occur due to human errors such as configuration mistakes
</blockquote>

  <figcaption>
    &mdash;Gartner Inc.
  </figcaption>
  </figure>


<p>We are constantly sacrificing security for speed. Things that hinder security:</p>
<ul>
<li>Manual processes that can be automated</li>
<li>Friction between teams</li>
<li>A &ldquo;we&rsquo;ll do it later&rdquo; approach</li>
</ul>
<p>Ops and dev teams don&rsquo;t always have the same goals. We need to align that and
own a feature together.</p>
<p><em>(Due to a technical issue the talk could not be completed as planned and April
had only a few moments to wrap up.)</em></p>
<p>The main takeaway: &ldquo;shift left&rdquo; because most defects are introduced during the code
writing phase. Address these things earlier and tackle security during
development. Add security checks to your pipelines.</p>
<h2 id="bizdevops-bridging-the-dominant-divide-between-business-and-it--henk-van-der-schuur">BizDevOps: bridging the dominant divide between business and IT &mdash; Henk van der Schuur</h2>
<p>There shouldn&rsquo;t be a divide between the business and IT. Both sides should cross
over to the other side. IT needs someone from business, and business should take
more ownership over tech decisions and understand what they mean. Both sides
should be part of one team. It should be a partnership.</p>
<p>You can start a project with a &ldquo;pressure cooker session&rdquo; to get a good start. Such
a session consists of a number of phases:</p>
<ul>
<li>Prepare: what is this (really) about?</li>
<li>Ideate: do we completely understand the problem?</li>
<li>Sketch: is this what you mean and what should be changed?</li>
<li>Prototype: what are the most important requirements and who is the audience?</li>
<li>Decide: was value delivered? what&rsquo;s next?</li>
</ul>
<p>If this session ends with a &ldquo;go&rdquo; all disciplines should be involved: business,
operations and development.</p>
<p><img src="/images/devopsdays2020_henk_van_der_schuur.jpg" alt="The interaction between business, operations, development and product owner."></p>
<p>DevOps needs a sequel. Let&rsquo;s repeat the trick and involve the business. Get
the business, operations, developers and product owners into the same room. The
business wants to get involved and wants us to realise that it&rsquo;s about
delivering value. Technology is everyone&rsquo;s business.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[All Day DevOps]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/11/06/all-day-devops/" type="text/html" />
    <id>https://markvanlent.dev/2019/11/06/all-day-devops/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="aws" />
    <category term="conference" />
    <category term="culture" />
    <category term="devops" />
    <category term="observability" />
    <category term="infrastructure as code" />
    <category term="serverless" />
    <category term="sre" />
    <category term="tools" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-11-06T00:00:00Z</published>
    <content type="html"><![CDATA[<p><a href="https://www.alldaydevops.com/">All Day DevOps</a> is an online conference
which lasts for 24 hours. With 150 sessions across 5 tracks, there&rsquo;s enough
content to consume.</p>
<p><em>(As with all my recent conference notes: these are just notes, not complete summaries.)</em></p>
<p><img src="/images/all_day_devops_2019.png" alt="All Day DevOps"></p>
<h2 id="adidoescode-where-devops-meets-the-sporting-goods-industry--fernando-cornago">#adidoescode Where DevOps Meets The Sporting Goods Industry &mdash; Fernando Cornago</h2>
<p>Adidas is a big company, about 57,000 employees. Software is also having an
impact on sports, e.g. registering your performance, etc. Is Adidas already a
software company? Definitely not, but they should start behaving as if they are.
They are building their applications 10,000 times a day.</p>
<p>New book from Gene Kim (the author of
<a href="https://itrevolution.com/the-phoenix-project/">The Phoenix Project</a>):
<a href="https://itrevolution.com/the-unicorn-project/">The Unicorn Project</a>.
Expected end of November 2019.</p>
<p>The platform you need to be able to accelerate your business should be:</p>
<ul>
<li>on demand and self-service</li>
<li>scalable and elastic</li>
<li>pay per use (which helps with the efficiency mindset)</li>
<li>transparent and measurable (observability)</li>
<li>open source + inner source</li>
</ul>
<p>To improve sales, Adidas&rsquo; platform can run where their customers are (globally)
and can scale up if needed and scale down again when possible.</p>
<p>Making use of data is key. The data is seen as a baton in a relay race and makes
its way through the teams. A large part of the time spent creating a service
goes into looking at the data. The data is used to train their AI systems.</p>
<p>You want to deliver value and make sure that things are working. Testing needs
to be done on the physical level (the actual products that are being sold), but
also on the virtual level: the IT platform also needs to perform and work as
expected.</p>
<p>At Adidas they went from having a couple of &ldquo;DevOps engineers&rdquo; to having a
DevOps culture in all teams. This removed bottlenecks and allowed them to grow
faster. They have substantially reduced costs and manual testing and improved
drastically on time to production.</p>
<h2 id="multi-cloud-day-to-day-devops-power-tools--ronen-freeman">Multi Cloud &ldquo;day-to-day&rdquo; DevOps Power Tools &mdash; Ronen Freeman</h2>
<p>About the relation between SRE (Site Reliability Engineering) and DevOps: SRE is
an implementation of DevOps.</p>
<p>What defines a great SRE:</p>
<ul>
<li>Quality
<ul>
<li>Be curious
<ul>
<li>Stay up to date</li>
<li>Attend events</li>
<li>Tinker with things</li>
</ul>
</li>
<li>Adapt to change (change <strong>will</strong> happen, anticipate it, monitor it, adapt
to it and prepare for the next change)</li>
<li>Learn: <q>learn fast so you may promptly begin again</q></li>
<li>Laziness: automate, but ask yourself the question whether it will
actually save time or not</li>
</ul>
</li>
<li>Support
<ul>
<li>Googling is an art form</li>
<li>You&rsquo;ll learn the answer itself but also discover things in the process</li>
<li>There are numerous forums (for example: Stack Overflow, GitHub, Slack,
etc)</li>
<li>Have a daily standup</li>
<li>Pass on knowledge</li>
</ul>
</li>
<li>Tools: there are a number of areas to consider:
<ul>
<li>Code: Bash</li>
<li>Package: Docker</li>
<li>Deliver: <a href="https://concourse-ci.org/">Concourse</a></li>
<li>Platform:
<ul>
<li>Infrastructure: Kubernetes (<q>relatively simple to get started with</q>);
<a href="https://cloud.google.com/kubernetes-engine/">GKE</a> is the most mature</li>
<li>Configuration tool: Terraform</li>
</ul>
</li>
<li>Monitor: <a href="https://cloud.google.com/products/operations">Google Stackdriver</a>
(since already running on GKE)</li>
</ul>
</li>
</ul>
<p>There&rsquo;s a demo app:
<a href="https://github.com/Darillium/addo-demo">https://github.com/Darillium/addo-demo</a></p>
<h2 id="how-to-run-smarter-in-production-getting-started-with-site-reliability-engineering--jennifer-petoff">How to Run Smarter in Production: Getting Started with Site Reliability Engineering &mdash; Jennifer Petoff</h2>
<p>Software engineering is focussed on designing and building, not on running the
application.</p>
<p>There are multiple places in organizations where friction occurs:</p>
<ul>
<li>Business vs development: this is addressed by Agile</li>
<li>Development vs operations: this friction is addressed by DevOps</li>
</ul>
<p>SRE is a bridge between business, development and operations.</p>
<p>Four key SRE principles:</p>
<ul>
<li>SRE needs Service Level Objectives (SLOs) with consequences</li>
<li>SREs must have time to improve the world</li>
<li>SRE teams must be able to regulate their workload</li>
<li>Failure is an opportunity to learn (having a blameless culture)</li>
</ul>
<p>Some terminology:</p>
<ul>
<li>Service Level Objectives (SLOs): a goal you want to achieve. For example, you
may want to set an uptime SLO, but typically you want to dig deeper, like
99.99% of HTTP requests should succeed with a &ldquo;200 OK&rdquo; response. You want to
measure something your user cares about.</li>
<li>Service Level Agreements (SLAs): these are &ldquo;handshakes between companies,&rdquo;
contractual guarantees.</li>
<li>Error budget: the gap between your SLO and perfect reliability. E.g. the
allowed downtime to still match your 99.9% uptime SLO.</li>
<li>Error budget policy: this policy describes what you do when you have spent
your error budget. Example: no new features until within your error budget
again.</li>
</ul>
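<p>To make the error budget concrete: with a 99.9% uptime SLO over 30 days you can be down for roughly 43 minutes. The sketch below shows the arithmetic; the 30-day window and the function names are assumptions for the sake of the example.</p>

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_exhausted(downtime_minutes: float, slo: float, window_days: int = 30) -> bool:
    """True when the error budget policy should kick in."""
    return downtime_minutes >= error_budget_minutes(slo, window_days)
```

<p>An error budget policy could then be as small as &ldquo;if <code>budget_exhausted(...)</code> is true, no new features this sprint.&rdquo;</p>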
<blockquote>
<p>100% is the wrong reliability target for basically everything</p></blockquote>
<p>SRE is about balancing between reliability, engineering time, delivery speed and
costs.</p>
<p>Note that you <em>can</em> have an error budget policy without even hiring a single
SRE. You can already start with an SLO.</p>
<p>SLOs and error budgets are a necessary first step if you want to improve the
situation.</p>
<p>When talking about making tomorrow better than today, you need to address toil
(work that needs to be done that is manual, repetitive, automatable and does not
add value). If your SRE team is spending time on toil, they are basically an ops
team. Instead you want them to make the system more robust and improve the
operability of the system.</p>
<p>One way to be able to have your SRE team regulate their workload is to put them
in a different part of the organization.</p>
<p>If you want to start with SRE, start with SLOs and let your SRE team own the
most critical system first. Don&rsquo;t make them responsible for all systems at once.
Let them deal with the most important systems first. Once they have that under
control (removed toil, etc), they can take on more work.</p>
<p>You need leadership buy-in for every aspect of SRE work.</p>
<p>Aim for reliability and consistency upfront. SRE teams should consult on
architectural discussions to get to resilient systems.</p>
<p>SRE teams can benefit from automation:</p>
<ul>
<li>Eliminate toil</li>
<li>Capacity planning (auto scaling instead of forecasting)</li>
<li>To fix issues: if you can write a playbook, you can automate it.</li>
</ul>
<p>SRE teams embrace failure:</p>
<ul>
<li>Setting SLO less than 100%</li>
<li>Blamelessness at all levels</li>
<li>Learning from failure</li>
<li>Make the postmortems available so other teams can also learn from them</li>
</ul>
<p>Resources (books):</p>
<ul>
<li><a href="https://sre.google/sre-book/table-of-contents/">Site Reliability Engineering</a></li>
<li><a href="https://sre.google/workbook/table-of-contents/">The Site Reliability Workbook</a></li>
<li><a href="https://www.oreilly.com/library/view/seeking-sre/9781491978856/">Seeking SRE</a></li>
</ul>
<h2 id="everybody-gets-a-staging-environment--yaron-idan">Everybody Gets A Staging Environment! &mdash; Yaron Idan</h2>
<p>How can you give an isolated and secure staging environment for each developer?
If you follow the path Yaron&rsquo;s company (Soluto) took, prepare for a rocky
transition, getting used to new terminology when moving to Kubernetes and users
that can get upset at first.</p>
<p>Their first step: automate the entire process. They realized they needed
onboarding documentation and templates. They used Helm for the templates.</p>
<p>Deploy feature branches to the production environment. When developers open a
pull request, the CI/CD tool adds a link to where this PR is deployed. This
empowers the developers and provides a sense of security. This accelerated the
way the developers built features and the adoption of the platform by the
company.</p>
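<p>One building block of such a setup is turning a branch name into something that can serve as a Kubernetes namespace and hostname. The sketch below is my own guess at how that could look, not Soluto&rsquo;s implementation; the <code>pr</code> prefix and the <code>staging.example.com</code> domain are placeholders.</p>

```python
import re

MAX_LABEL_LENGTH = 63  # upper bound for a DNS label, which namespace names must respect


def namespace_for_branch(branch: str, prefix: str = "pr") -> str:
    """Turn a Git branch name into a string usable as a Kubernetes namespace."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    if not slug:
        raise ValueError("branch name contains no usable characters")
    return f"{prefix}-{slug}"[:MAX_LABEL_LENGTH].rstrip("-")


def preview_url(branch: str, base_domain: str = "staging.example.com") -> str:
    """Hypothetical URL the CI/CD tool could post on the pull request."""
    return f"https://{namespace_for_branch(branch)}.{base_domain}"
```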
<p>The benefits:</p>
<ul>
<li>Code is fit for production during entire life cycle of the feature branch.</li>
<li>Makes performance testing easier (after all: it&rsquo;s deployed on the production environment).</li>
<li>Easier to share things with the stakeholders and thus improves collaboration.</li>
<li>You can deploy an entire set of microservices.</li>
</ul>
<p>The tools used for their implementation:</p>
<ul>
<li>GitHub (VCS)</li>
<li>Docker (code as configuration)</li>
<li>CodeFresh (CI)</li>
<li>Helm (CD)</li>
<li>Kubernetes (production platform)</li>
</ul>
<p>You can swap the aforementioned tools with other tools, but you must be able to
automate them and make sure that they work with the push of a single button.</p>
<p>GitHub repository for the demo:
<a href="https://github.com/yaron-idan/staging-environments-example">https://github.com/yaron-idan/staging-environments-example</a></p>
<h2 id="devops-to-the-next-level-with-serverless-chatops--jan-de-vries">DevOps to the Next Level with Serverless ChatOps &mdash; Jan de Vries</h2>
<p>What DevOps is about:</p>
<ul>
<li>Shipping code</li>
<li>Adding value</li>
<li>Do this continuously</li>
</ul>
<p>It&rsquo;s <strong>not</strong> about:</p>
<ul>
<li>Adding additional load to your team</li>
<li>Lowering quality</li>
<li>More support calls</li>
<li>Slow response times when there are errors</li>
</ul>
<p>One option is to send emails if something is wrong. But these emails are slow
(knowing that there was an issue <em>yesterday</em> is too late) and they generate a lot
of noise. Monitoring dashboards are nice, but you have to actively look at them
to detect if there&rsquo;s something wrong.</p>
<p>A better solution is using chat applications (like Microsoft Teams or Slack).</p>
<p>Your first step is to emit events when something fails (in Azure you can send
such events to <a href="https://azure.microsoft.com/en-us/services/event-grid/">Event Grid</a>).
Next, you have to subscribe to these events/topics. You can use an Azure
Function for this and then have it post an actionable message to Teams.</p>
<p>These messages can have buttons to take immediate action (e.g. post to a web API
like an Azure Function) to recover from the problem. The output from the recovery
action can also send events, so you can get another notification in your chat
application about the result.</p>
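<p>The flow (failure event in, actionable chat message out) could be sketched like this. Note that the event fields and the message layout are invented for illustration; they match neither the Event Grid event schema nor the actual Teams message format.</p>

```python
def build_chat_message(event: dict, recover_url: str) -> dict:
    """Translate a failure event into a chat message with one action button."""
    return {
        "title": f"Failure in {event['source']}",
        "text": event.get("detail", "No details provided"),
        # Pressing the button would POST to the (hypothetical) recovery endpoint.
        "actions": [
            {"label": "Recover", "method": "POST", "url": recover_url},
        ],
    }
```

<p>The function that subscribes to the failure events would call something like this and post the result to the chat application.</p>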
<p>GitHub repository with demo code: <a href="https://github.com/Jandev/ServerlessDevOps">https://github.com/Jandev/ServerlessDevOps</a></p>
<h2 id="building-modular-infrastructure-in-code--fergal-dearle">Building Modular Infrastructure in Code &mdash; Fergal Dearle</h2>
<p>Why use infrastructure as code (IaC)?</p>
<ul>
<li>Fergal is a fan of immutable infrastructure and the pets vs cattle analogy (which
goes further than just servers)</li>
<li>Everything as code</li>
<li>You don&rsquo;t have to log into a server via SSH to fix things</li>
<li>Better testability and maintainability</li>
</ul>
<p>The <a href="https://itrevolution.com/book/the-devops-handbook/">DevOps handbook</a> (part
III, chapter 9) talks about spinning up a production-like environment on demand.
The only way to do this is to use IaC.</p>
<p>First thing is understanding your infrastructure (VPC, networking, DNS,
services, database). Look at patterns like requestor/broker. Use these patterns
to create modular stacks.</p>
<p>Patterns:</p>
<ul>
<li>Singleton stack pattern: one stack per instance. But this is actually an
anti-pattern because you are mixing configuration and infrastructure.</li>
<li>Multiheaded stack pattern: multiple stacks in single project, config built in.</li>
<li>Template stack pattern: single stack, config separate, multiple instances from
that template</li>
</ul>
<p>Stack templates &amp; stack instances:</p>
<ul>
<li>Reusable template that implements building blocks</li>
<li>Appropriate tags</li>
<li>Must be able to uniquely identify instances (namespacing)</li>
</ul>
<blockquote>
<p>Stay clear of monoliths</p></blockquote>
<p>Monoliths are stacks with everything in them. This is one end of the spectrum.
The other end is to have one service per stack. You&rsquo;ll probably want to end
up somewhere in between, where you have a component stack (e.g. with a
requestor/broker combination). And then you can combine those stacks.</p>
<p>Static stack inputs: environment names, app names. You should externalize this
config from your stack, e.g. by feeding these parameters via the command line or
by using the <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html">SSM Param Store</a>
and <a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html">Secrets Manager</a>.</p>
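<p>To illustrate the template stack pattern with externalized config: the template itself knows nothing about environments, and every instance gets its name and tags from parameters passed in from outside. This is a language-agnostic idea sketched in Python; the class and the naming scheme are assumptions, not something shown in the talk.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackTemplate:
    """A reusable stack definition that holds no environment-specific values."""

    component: str

    def instantiate(self, app: str, environment: str) -> dict:
        """Create a stack instance description from externally supplied parameters."""
        return {
            # The combination of app, environment and component namespaces the instance.
            "name": f"{app}-{environment}-{self.component}",
            "tags": {"app": app, "environment": environment, "component": self.component},
        }
```

<p>Calling <code>StackTemplate("network").instantiate("shop", "staging")</code> and the same with <code>"production"</code> yields two uniquely named instances of one template.</p>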
<p>(Note to self: Fergal mentioned <a href="https://github.com/Sceptre/sceptre">Sceptre</a>.
Check what this is about.)</p>
<p>Common problems:</p>
<ul>
<li>Drift can happen, especially if you also make manual changes (don&rsquo;t!)</li>
<li>Updates are not forward compatible</li>
<li>Region incompatibilities</li>
</ul>
<p>Best practices:</p>
<ul>
<li>Only use template stacks</li>
<li>Use layered and component stacks</li>
<li>Externalize config</li>
<li>Test, test, test!</li>
<li>Only use automation, no manual steps</li>
</ul>
<h2 id="crossing-the-river-by-feeling-the-stones--simon-wardley">Crossing The River By Feeling The Stones &mdash; Simon Wardley</h2>
<p>In <a href="https://en.wikipedia.org/wiki/The_Art_of_War">The Art of War</a> Sun Tzu
discusses five fundamental factors which you should consider:</p>
<ul>
<li>Purpose</li>
<li>Landscape</li>
<li>Climate</li>
<li>Doctrine</li>
<li>Leadership</li>
</ul>
<p>John Boyd developed the <a href="https://en.wikipedia.org/wiki/OODA_loop">OODA loop</a>,
which cycles through the following:</p>
<ul>
<li>Observe</li>
<li>Orient</li>
<li>Decide</li>
<li>Act</li>
</ul>
<p>You can combine these cycles:</p>
<p><img src="/images/all_day_devops_2019_simon_wardley1.png" alt="Simon Wardley combining Sun Tzu&rsquo;s five factors and John Boyd&rsquo;s OODA loop"></p>
<p>By moving we learn, as long as we can observe the environment.</p>
<p>Maps are ways to:</p>
<ul>
<li>explore</li>
<li>communicate</li>
<li>share understanding</li>
<li>de-personalize</li>
</ul>
<p>Graphs are not maps. Identical graphs can be laid out differently in space. On a map,
space has meaning; in a graph it does not. And because space has meaning, we can
explore, e.g. by asking ourselves the question &ldquo;what if we go in <em>that</em>
direction?&rdquo;</p>
<p>What makes a map?</p>
<ul>
<li>Anchor (compass)</li>
<li>Position (places, relative to the compass)</li>
<li>Consistency of movement (moving down and to the right means going south-east,
assuming &ldquo;up&rdquo; is north on this map)</li>
</ul>
<p>In business there are a lot of maps, e.g. a systems diagram. But almost
everything in business we call a map actually is a graph. Which means we don&rsquo;t
have a sense of the landscape.</p>
<p>How do you create a map for a business? Start with the anchors, e.g. business
and customers. Work from there to get all related components and express them in
a chain of needs.</p>
<p>But how do you position those components? We can use &ldquo;visibility&rdquo; as an analogy
of distance. You can use this to create a &ldquo;partial ordered chain of needs.&rdquo;</p>
<p>And what about movement? You can describe movement as a chain of changes. But
how do you do <em>that</em>? You can order components based on their evolutionary
stage. Are they in the genesis stage, do we use custom built components, do we
have standard products or can we consider it a commodity?</p>
<p>By combining these we can create a map of components by using the position
(visibility) on the y-axis and the movement (evolution) on the x-axis.</p>
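<p>As a data structure, such a map boils down to components with two coordinates. A small sketch follows, with the four evolution stages taken from the talk; the visibility scale and the example values are made up.</p>

```python
from dataclasses import dataclass

# Evolution stages, ordered left to right on the x-axis.
STAGES = ["genesis", "custom built", "product", "commodity"]


@dataclass
class Component:
    name: str
    visibility: float  # y-axis: 1.0 is directly visible to the customer, 0.0 is hidden
    stage: str         # x-axis: one of STAGES

    def position(self) -> tuple[int, float]:
        """Return (x, y) coordinates for plotting the component on the map."""
        return (STAGES.index(self.stage), self.visibility)


# Illustrative values only; they are not read off any map shown in the talk.
tea_shop = [
    Component("cup of tea", 0.9, "product"),
    Component("kettle", 0.4, "custom built"),
    Component("power", 0.1, "commodity"),
]
```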
<figure><img src="/images/all_day_devops_2019_simon_wardley2.png"
    alt="Simon Wardley showing a map"><figcaption>
      <p>Simon Wardley shows a map of a fictional company selling cups of tea</p>
    </figcaption>
</figure>

<p>You can add intent (evolutionary flow) to your map, e.g. move a component from
custom built to using a commodity.</p>
<p>By using such a map, you can de-personalize discussions since now you talk about
the map, not a story.</p>
<p>Simon told a story of a company that wanted to invest in robots to do manual
labour. After putting their business in a map, the problem became clear: they
were using custom built racks and as a result needed to customize their servers
because they would not fit their non-standard rack. Because the market had changed
since the company started, they could switch to commodity (cloud) and as a
result also did not need robots.</p>
<p>But keep this in mind:</p>
<blockquote>
<p>All maps are imperfect representations of a space</p></blockquote>
<h2 id="deploying-microservices-to-aws-fargate--ariane-gadd">Deploying Microservices to AWS Fargate &mdash; Ariane Gadd</h2>
<p>At KPMG they moved a project from using Amazon EC2 to AWS Fargate. They also
replaced CloudFormation with Terraform for standardization.</p>
<p>Why managed container services and not EKS?</p>
<ul>
<li>Immutable deployments</li>
<li>Cluster provisioning handled by Fargate</li>
<li>Only pay for what you execute</li>
<li>No OS patching</li>
<li>Smaller attack surface</li>
</ul>
<p>They used <a href="https://github.com/fabfuel/ecs-deploy">ECS deploy</a> to deploy
containers to Fargate.</p>
<p>Why was the customer okay with this change?</p>
<ul>
<li>They were already familiar with Docker</li>
<li>Fargate has integrations with ECS and ECR.</li>
<li>Overhead of managing EC2 vs cost of AWS Fargate</li>
</ul>
<p>Fargate runs in your own VPC so you own the network, etc.</p>
<p>The microservices for this project are placed behind an application load
balancer (ALB) using AWS Web Application Firewall (WAF) and TLS termination.
Logs are sent to AWS CloudWatch Logs. They used Anchore for end-to-end container
security and compliance.</p>
<p>Benefits of AWS for KPMG:</p>
<ul>
<li>Cost reduction (automation tools were already in place, they already used
Lambda for account creation and had reserved instances)</li>
<li>Speed of delivery</li>
<li>Allows collaboration with developers (automation, DevOps).</li>
<li>70% of the engineers were already AWS certified.</li>
</ul>
<h2 id="automate-everyday-tasks-with-functions--sean-odell">Automate Everyday Tasks with Functions &mdash; Sean O&rsquo;Dell</h2>
<p>Common use cases for serverless applications are things like web applications
(such as static websites), data processing, Amazon Alexa (skills) and chatbots.</p>
<blockquote>
<p>Not every workload should run in a serverless fashion</p></blockquote>
<p>Organizational and application context are relevant. Serverless is just another
option. You pick what is right for your use case as serverless is not a silver
bullet.</p>
<p>AWS Lambda in a nutshell: there is an event source (e.g. a state change in data
or a resource), which triggers a function, which in turn interacts with a service.</p>
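<p>In Python that chain looks roughly like the sketch below. The <code>handler(event, context)</code> signature is the usual convention for Python functions on Lambda; the event fields and the <code>notify</code> helper are hypothetical stand-ins.</p>

```python
def notify(message: str) -> dict:
    """Stand-in for the call to a downstream service (hypothetical)."""
    return {"delivered": True, "message": message}


def handler(event, context):
    """Entry point invoked when the event source fires."""
    # The "resource" and "state" keys are a made-up event shape for illustration.
    resource = event["resource"]
    state = event["state"]
    return notify(f"{resource} changed to {state}")
```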
<p>GitHub repository with examples: <a href="https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions">https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions</a></p>
<p>(Note to self: check <a href="https://docs.amplify.aws/">Amplify</a>)</p>
<h2 id="beyond-dev--ops--patrick-debois">Beyond dev &amp; ops &mdash; Patrick Debois</h2>
<p>The DevOps pipeline usually starts at the backlog. But what happens before? And
how does that influence people outside the engineering department?</p>
<p>Patrick joined a startup five years ago. He fixed the DevOps pipeline. After a
while the organization wanted to sell a product.</p>
<p>The book <a href="https://www.amazon.com/Machine-Radical-Approach-Design-Function/dp/1626342245">The Machine</a>
by Justin Roff-Marsh describes &ldquo;the agile version of sales.&rdquo;</p>
<p>Sales had more in common with IT than Patrick expected. The sales pipeline
looked like a Kanban board. Salespeople are also on call, especially if you do
international sales. If we can deliver faster, this increases the chance of a
sale. Salespeople are actually mind readers; this reminded Patrick of how IT
people feel about tickets in the backlog: &ldquo;what is meant by this ticket, what
do the customers actually want?&rdquo;</p>
<p>Thinking about the life of the project after it has been delivered, in terms of
maintenance and operational costs, is important. Doing this upfront is
beneficial for the project.</p>
<p>The biggest impact sales had on IT was the never-ending hammer of &ldquo;can you make
things simpler?&rdquo; Sales can help you with making your design less complex.
Together you can determine what actually adds value for the customer.</p>
<p>Having publicly available documentation really helps and makes it easier to sell
the product. Getting the customers on Slack also helped. This showed that the
company was willing to listen and help. This also helps sales.</p>
<p>After sales, the marketing department got focus. Patrick has seen the most
interesting automation within the marketing department (e.g. email sending).
Marketing has a lot of fancy metrics and tools to measure stuff; did they invent
metrics? The IT department can help marketing by providing information.
Conferences are also a marketing instrument. Marketing also has CALMS; they do
very similar things.</p>
<p>We&rsquo;ve got sales and marketing covered, but for a lot of things we still need
budgets. To budget things, you also need to estimate things. Finance knows what
it is to be on the receiving end of tickets where you have to guess what is
going on.</p>
<p>Sometimes you need to buy things. Patrick looked into agile procurement. Lots of
stuff has to be figured out: is it a fit? is there a way to be partners?</p>
<p>Serverless is a manifestation of servicefull, where you use a lot of services,
so you can focus on your domain instead of having to build everything yourself.</p>
<p>With cloud services there doesn&rsquo;t have to be communication initially: you read
the website, check the documentation, pull out your credit card and start with a
proof of concept. Communication is still important, though. You have to treat
your suppliers with respect.</p>
<p>HR: how does your company support hiring people? Have people join the team for
some actual work to see if they are a good fit or not.</p>
<p>Do you know what <a href="https://en.wikipedia.org/wiki/Pair_programming">pair programming</a>
is? Well, <a href="https://en.wikipedia.org/wiki/Mob_programming">mob programming</a> is a whole
other level of collaborating. (Note that it&rsquo;s more than just having one person
typing and a bunch of other people commenting&mdash;it has a lot of patterns going
on.) Teams can get very enthusiastic about this and can be very productive the
first week. But if they are not careful, the team is exhausted the next week.
You need to develop a sustainable pace.</p>
<p>Patrick learned a lot from talking to the different departments. They are solving
problems similar to the ones IT faces.</p>
<p>Reading tip:
<a href="https://www.amazon.com/Company-wide-Agility-Beyond-Budgeting-Sociocracy/dp/154467287X">Company-wide Agility with Beyond Budgeting, Open Space &amp; Sociocracy</a>
by Jutta Eckstein and John Buck.</p>
<h2 id="the-open-source-observability-toolkit--mickey-boxell">The Open Source Observability Toolkit &mdash; Mickey Boxell</h2>
<p>Some characteristics of a modern application:</p>
<ul>
<li>Microservices</li>
<li>Distributed</li>
<li>Multiple programming languages</li>
<li>Scalable</li>
<li>Ephemeral</li>
</ul>
<p>This brings new challenges compared to the old situation with a monolithic
application running on a single machine. You still want to address threats to
customer satisfaction. But the old debugging solutions we used for a monolith
will not work.</p>
<p>Observability means designing and operating a more visible system. This includes
systems that can explain themselves without you having to deploy new code. The
tools help you understand the difference between a healthy and unhealthy system.</p>
<blockquote>
<p>Modern, distributed systems <strong>will</strong> experience failures.</p></blockquote>
<p>Observability takes a holistic approach that recognizes the impact an issue has
on the business as a whole. Outages affect the whole business, which should be
able to react. Observability gives tools to address issues when they arise.</p>
<p>External outputs:</p>
<ul>
<li>Logs</li>
<li>Metrics</li>
<li>Traces</li>
</ul>
<p>These outputs don&rsquo;t make your system more observable, but they can be part of
the holistic approach. They can help you with e.g. a root cause analysis.</p>

  <figure>

<blockquote >
Monitoring tells you whether a system is working, observability lets you ask
why it isn&rsquo;t working.
</blockquote>

  <figcaption>
    &mdash;Baron Schwartz
  </figcaption>
  </figure>


<p>Key SRE concepts:</p>
<ul>
<li>Service Level Indicators (SLIs; what are you measuring), Service Level
Objectives (SLOs; what should the values be), Service Level Agreements (SLAs;
defines the planned reaction if the objectives are not met).</li>
<li>Subset of Google&rsquo;s golden SRE signals: RED (Rate, Errors, Duration).</li>
<li>Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR)</li>
</ul>
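<p>To make the SLI/SLO relationship concrete, here is a minimal sketch (my own,
not from the talk). It assumes availability as the indicator and a made-up target
of 99.9%. The &ldquo;error budget&rdquo; helper is an addition that is common in SRE
practice but was not part of these notes.</p>
<pre><code class="language-python">def availability_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent (negative when overspent).

    Assumes slo_target is below 1.0; a 100% target leaves no budget to divide.
    """
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    slo_target = 0.999  # hypothetical objective: 99.9% of requests succeed
    print(f"SLI: {sli:.4f}, SLO met: {sli >= slo_target}")
    print(f"error budget remaining: {error_budget_remaining(sli, slo_target):.0%}")
</code></pre>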
<p>Users are comfortable with a certain level of imperfection. A site that is a bit
slow sometimes is less annoying than a shopping cart that sometimes loses its
content. Use this to determine what you&rsquo;ll spend your time on first.</p>
<h3 id="logging">Logging</h3>
<p>Logging is the most fundamental pillar of monitoring. Logging records events
that took place at a given time. It is supported by most libraries because it is
such a fundamental thing.</p>
<p>You only get logs if you actually put them in your code. You also have to make
sure you don&rsquo;t lose them, for instance by aggregating them.</p>
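<p>A small sketch of what &ldquo;putting logs in your code&rdquo; could look like, using
only the Python standard library. Emitting one JSON object per line is my own
assumption here; it tends to make aggregation easier, but the talk did not
prescribe a format. The logger name is made up.</p>
<pre><code class="language-python">import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")  # "checkout" is a made-up service name
    log.info("order %s placed", 42)
</code></pre>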
<p>Tools:</p>
<ul>
<li><a href="https://www.fluentd.org/">Fluentd</a>: scrape, process and ship logs</li>
<li><a href="https://www.elastic.co/elasticsearch/">Elasticsearch</a>: a data store</li>
<li><a href="https://www.elastic.co/kibana/">Kibana</a>: interact with your logs</li>
</ul>
<h3 id="metrics">Metrics</h3>
<p>Metrics are a numeric aggregation of data describing the behavior of a component
measured over time. They are useful to understand typical system behavior.</p>
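<p>As an illustration of &ldquo;a numeric aggregation measured over time,&rdquo; the sketch
below groups raw request records into one-minute windows and derives the RED
signals mentioned earlier. The record layout and the rule that a status of 500 or
higher counts as an error are assumptions for the example; a tool like Prometheus
does this kind of aggregation for you.</p>
<pre><code class="language-python">from collections import defaultdict


def red_per_minute(requests):
    """Aggregate (timestamp_seconds, status_code, duration_seconds) tuples
    into Rate, Errors and Duration per one-minute window."""
    windows = defaultdict(list)
    for timestamp, status, duration in requests:
        windows[int(timestamp // 60)].append((status, duration))

    summary = {}
    for minute, samples in sorted(windows.items()):
        durations = [duration for _, duration in samples]
        summary[minute] = {
            "rate": len(samples),  # requests in this minute
            "errors": sum(1 for status, _ in samples if status >= 500),
            "avg_duration": sum(durations) / len(durations),
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data: two requests in minute 0, one in minute 1.
    sample = [(0, 200, 0.1), (30, 500, 0.3), (61, 200, 0.2)]
    for minute, values in red_per_minute(sample).items():
        print(minute, values)
</code></pre>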
<p>Tools:</p>
<ul>
<li><a href="https://prometheus.io/">Prometheus</a>: scrape data, store the data and query
the data</li>
<li><a href="https://grafana.com/">Grafana</a>: visualize the metrics, can aggregate data
from different sources</li>
</ul>
<p>Alerts are notifications to indicate that a human needs to take action.</p>
<h3 id="tracing">Tracing</h3>
<p>Tracing is capturing a set of causally related events. Helpful for debugging.
Example: each request has a global ID and if you insert this ID as metadata at
each step, you can trace what happens.</p>
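<p>The example with the global ID can be sketched in a few lines. This is a toy,
in-memory version with made-up step names; real tracing libraries also record
timing and parent/child relations, which are left out here.</p>
<pre><code class="language-python">import uuid


def record_span(trace, trace_id, step):
    """Append one event, tagged with the global ID of the request."""
    trace.append({"trace_id": trace_id, "step": step})


def handle_request(trace):
    """Simulate a request passing through three made-up services."""
    trace_id = uuid.uuid4().hex  # the global ID for this request
    for step in ("gateway", "orders", "payments"):
        record_span(trace, trace_id, step)
    return trace_id


def events_for(trace, trace_id):
    """Reconstruct what happened to a single request."""
    return [event["step"] for event in trace if event["trace_id"] == trace_id]


if __name__ == "__main__":
    collected = []
    first = handle_request(collected)
    handle_request(collected)  # a second, unrelated request
    print(events_for(collected, first))
</code></pre>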
<p>Tools: <a href="https://www.jaegertracing.io/">Jaeger</a> and <a href="https://zipkin.io/">Zipkin</a>
can visualize and inspect traces.</p>
<p>The challenge is that tracing is hard to retrofit into an existing application.
You need to instrument all components. Service meshes can make it easier to
retrofit tracing.</p>
<h3 id="service-meshes">Service meshes</h3>
<p>A service mesh is a configurable infrastructure layer for microservice
applications. It can monitor and control the traffic in your environment. It can
be a great approach for observability. You will get less information from a
service mesh compared to adding tracing to all of your components, but on the
other hand it takes less effort to implement than instrumenting your existing
code base.</p>
<p>Tools:</p>
<ul>
<li><a href="https://istio.io/">Istio</a></li>
<li><a href="https://linkerd.io/">Linkerd</a></li>
<li><a href="https://traefik.io/traefik-mesh/">Maesh</a><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li><a href="https://kuma.io/">Kuma</a></li>
<li><a href="https://www.consul.io/">Consul</a></li>
</ul>
<p>Using service meshes won&rsquo;t eliminate the need to instrument your code, but it
can make your life simpler.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Called Traefik Mesh these days.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays Ghent 2019: random notes]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/11/02/devopsdays-ghent-2019-random-notes/" type="text/html" />
    <id>https://markvanlent.dev/2019/11/02/devopsdays-ghent-2019-random-notes/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="devops" />
    <category term="opinion" />
    <category term="tools" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-11-02T00:00:00Z</published>
    <content type="html"><![CDATA[<p>Where my <a href="/2019/10/29/devopsdays-ghent-2019-day-one/">previous</a>
<a href="/2019/10/30/devopsdays-ghent-2019-day-two/">articles</a> were focussed on
the notes I took of the talks, this article is a mix of random notes and
observations I made throughout the conference.</p>
<p><img src="/images/devopsdaysghent2019_random_notes.jpg" alt="Belgian chocolates: &ldquo;dev&rdquo;, &ldquo;heart&rdquo;, &ldquo;ops&rdquo;, &ldquo;days&rdquo;"></p>
<h2 id="the-conference">The conference</h2>
<p>First of all: this devopsdays conference&mdash;once again&mdash;inspired me. It
refreshed my desire to make the world (or at least a part of it) a better place.</p>
<p>A big change in the format was that each speaker only had a 15 minute time slot.
This was done to allow more people to speak. While the organization certainly
succeeded in that regard, I felt that most talks were too short, that is: I
would have loved them to last longer.</p>
<p>It felt like the conference was less technical. Perhaps this was because of the
length of the talks, which meant the speakers could not go as deep into a
subject as with a 30-minute talk? Perhaps it was because there were no workshops
where I usually get (more than) my dose of technical deep dives? Perhaps it was
just me?</p>
<p>Note that this is <strong>not</strong> a complaint about the speakers. Each and every one of
them did a great job and had an inspiring talk. I really appreciate them taking
the stage and sharing their knowledge, experience and opinions with us. Thank
you all!</p>
<h2 id="sponsors">Sponsors</h2>
<p>I talked to a number of sponsors and would like to do more research on certain
products. Examples:</p>
<ul>
<li><a href="https://cloud.google.com">Google Cloud</a>: so far my focus has been on AWS and
Azure, but Google&rsquo;s cloud offering might also be interesting.</li>
<li><a href="https://logz.io/">Logz.io</a>: instead of managing our own Elastic stack and
metrics collection + visualization, we might be able to use logz.io for things
we want to run in the cloud. Apparently you can also specify that you want
your data to stay within the EU, which is a plus.</li>
<li><a href="https://www.pulumi.com/">Pulumi</a>: So far I have used Ansible, Terraform and
CloudFormation for my infrastructure-as-code needs. Pulumi allows you to code
in e.g. Python instead of YAML or a domain specific language.</li>
<li><a href="https://rancher.com/">Rancher</a>: although the projects I&rsquo;m involved in do not
use Kubernetes at the moment, they might in the future. And then it might be
useful to see what Rancher can do for us.</li>
<li><a href="https://www.sonatype.com/">Sonatype</a>: detect components with a known
vulnerability in your stack.</li>
</ul>
<h2 id="other-conferences">Other conferences</h2>
<p>During the ignites on the second day, a few other conferences were
mentioned:</p>
<ul>
<li><a href="https://www.deliveryconf.com/">Delivery Conf</a>: 21 &amp; 22 January 2020, Seattle, WA, USA</li>
<li><a href="https://archive.fosdem.org/2020/">FOSDEM</a>: 1 &amp; 2 February 2020, Bruxelles, Belgium</li>
<li><a href="https://cfgmgmtcamp.eu/ghent2020/">Configuration Management Camp</a>: 3&ndash;5 February 2020, Ghent, Belgium</li>
</ul>
<h2 id="home-lab">Home lab</h2>
<p>One of the open spaces I joined was about home labs. I only have a small setup
at home, but I did get a few ideas out of this session that may be worth
checking out.</p>
<ul>
<li><a href="https://nextcloud.com/">Nextcloud</a></li>
<li>Have my Synology NAS back up stuff to AWS Glacier</li>
<li><a href="https://www.plex.tv/">Plex</a>; perhaps I can use this to make my DVD collection
(which is now gathering dust) accessible again?</li>
</ul>
<p>But the most important question, in my opinion, that was asked during this
session: <q>do you have a disaster recovery plan for your home lab?</q></p>
<p>What if I&rsquo;m not around and something breaks down, like the WiFi, how does my
family get things up and running again? Spoiler: they probably won&rsquo;t. And this
is not their fault. I made things more complex than needed (from their
perspective that is) and failed to provide instructions on how to solve issues.</p>
<h2 id="miscellaneous">Miscellaneous</h2>
<p>On meetings:</p>
<ul>
<li><a href="https://agilecoffee.com/leancoffee/">Lean Coffee</a>: an interesting, lightweight meeting format</li>
<li><a href="https://medium.com/@agileGeorge/power-start-your-meetings-fcdeab634bf3">POWER start</a>:
a meeting facilitation technique to increase the effectiveness of meetings</li>
</ul>
<p>On toil:</p>
<ul>
<li>Read up on the subject of toil: Google&rsquo;s SRE book, chapter 5:
<a href="https://sre.google/sre-book/eliminating-toil/">Eliminating toil</a></li>
<li>Learn to better recognize toil and record it (what is it, what does it cost
in time, energy, etc. and what is the frequency)</li>
<li>Make time to eliminate toil</li>
<li>Read up on, and think about, <a href="https://www.hashicorp.com/resources/what-is-mutable-vs-immutable-infrastructure">immutable infrastructure</a>
and see if we can use it to remove toil</li>
</ul>
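<p>One possible way to record toil along those lines (what it is, what it costs
and how often it occurs) is a tiny data structure like the one below. The field
names and the choice to rank by minutes per month are my own assumptions, not
something from the SRE book.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass
class ToilEntry:
    """One recurring manual task: what it is, its cost and its frequency."""

    description: str
    minutes_per_occurrence: float
    occurrences_per_month: float

    @property
    def minutes_per_month(self):
        return self.minutes_per_occurrence * self.occurrences_per_month


def rank_by_cost(entries):
    """Most expensive toil first, as a starting point for what to eliminate."""
    return sorted(entries, key=lambda entry: entry.minutes_per_month, reverse=True)


if __name__ == "__main__":
    # Made-up examples of toil.
    entries = [
        ToilEntry("rotate certificates by hand", 45, 1),
        ToilEntry("restart stuck worker", 5, 30),
    ]
    for entry in rank_by_cost(entries):
        print(entry.description, entry.minutes_per_month)
</code></pre>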
<p>Random, unrelated stuff:</p>
<ul>
<li>Look at <a href="https://pypi.org/project/cfn-lint/">CloudFormation Lint</a></li>
<li>Read the <a href="https://www.weave.works/technologies/gitops/">GitOps</a> article by Weaveworks</li>
</ul>
<p>Something that was said during an open space about testing infrastructure-as-code:</p>
<blockquote>
<p>You deploy infrastructure not to have that infrastructure, but to run an application on top of.</p></blockquote>
<p>I don&rsquo;t recall who mentioned it, when and what the context was, but
<a href="https://www.reddit.com/r/devops">r/devops</a> might be fun to read.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays Ghent 2019: day two]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/10/30/devopsdays-ghent-2019-day-two/" type="text/html" />
    <id>https://markvanlent.dev/2019/10/30/devopsdays-ghent-2019-day-two/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="culture" />
    <category term="devops" />
    <category term="incident response" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-10-30T00:00:00Z</published>
    <content type="html"><![CDATA[<p>Devopsdays Ghent is a two-day event. These are the notes I took on the second
day of the ten-year anniversary edition of devopsdays.</p>
<p><img src="/images/devopsdaysghent2019_welcome2.jpg" alt="A picture of &ldquo;Vooruit&rdquo;, the devopsdays Ghent location"></p>
<h2 id="devops-beyond-dev-and-ops--patrick-debois">DevOps beyond dev and ops &mdash; Patrick Debois</h2>
<p>This is a story about busting silos and making new friends.</p>
<p>Patrick retired from organizing devopsdays five years ago. His next challenge
was helping a company with their DevOps journey.</p>
<p>After about a year, DevOps was a thing at the company. The development and
operations teams worked together and Patrick started to look for the next
bottlenecks.</p>
<p>For a lot of things they used services instead of building things themselves
(monitoring, etc). Patrick calls this &ldquo;servicefull&rdquo;: use a lot of services.
Make the suppliers of the services your friends.</p>
<p>There was internal DevOps and collaboration, but they did not have the same
communication with their suppliers. Communication works differently there. As a
customer they used documentation, which is a way of communication from the
supplier to the customer. To communicate back, they went to conferences, used
Slack and reported findings. Suppliers welcomed this feedback since it is
apparently hard to get. This feedback might even help developers make a case for
an issue with their managers since there is now proven customer demand.</p>
<p>Patrick went to other departments of their own company next:</p>
<dl>
<dt>Marketing</dt>
<dd>Your biggest fans: each new feature means something new they can talk about.</dd>
<dt>HR</dt>
<dd>They have their own SLA: paying employees which, by the way, was something
that could be automated.</dd>
<dt>Sales</dt>
<dd>These people have enormous pressure from the customer, the speed needed in
deployment is the speed needed by sales. They also pressure engineering to
make things more simple and not over-architect things.</dd>
<dt>Legal</dt>
<dd>They are like ops: nobody sees them unless there is a problem.</dd>
<dt>Finance</dt>
<dd>In a way they are testers: if there&rsquo;s no money in the bank, you have
failed. Although they play in the shadows, they enable you to do the job you
like.</dd>
</dl>
<p>Unfortunately they could not convince enough customers to buy the product they
made. The company shut down, but it was an awesome experience.</p>
<p><img src="/images/devopsdaysghent2019_patrick_debois.jpg" alt="Patrick Debois explains every department in a company plays their own, important role in success"></p>
<p>DevOps was <em>not</em> the bottleneck. You need the rest of the company; every part is
needed to succeed.</p>
<p>By the way: the reason Patrick used horses on the slides? He learned that
unicorns do not exist.</p>
<h2 id="pipelines-to-production-detangling-the-devops-web-of-lies--jody-wolfborn-and-kimball-johnson">Pipelines to Production: Detangling the DevOps Web (of Lies) &mdash; Jody Wolfborn and Kimball Johnson</h2>
<p>Creating a DevOps team does not solve your problems. You need trust, which
you&rsquo;ll have to earn. You can start earning that trust by automating things and
showing that the automation is reliable and helps with the repetitive stuff.</p>
<p>You cannot just throw tools at it. You&rsquo;ll have to find the right tools and
keep evaluating your tool set.</p>
<p>By deploying small, iterative changes, and tracking and versioning them, it is
easier to find out where the problem lies. Would you rather debug all changes
made in a month, or just a small set of changes?</p>
<p><img src="/images/devopsdaysghent2019_jody_wolfborn_and_kimball_johnson.jpg" alt="Jody Wolfborn and Kimball Johnson role playing"></p>
<p>You don&rsquo;t want to react to issues, you want to proactively monitor production to
detect broken stuff.</p>
<p>Also make small, iterative changes to the processes. Include the rest of the
organization.</p>
<blockquote>
<p>DevOps is about communication. Breakdowns in DevOps are breakdowns in
communication.</p></blockquote>
<p>Fix breakdowns by communicating your needs, but mostly by <em>listening</em> to the
needs from others and responding with empathy.</p>
<h2 id="different-paths-same-place---human-case-studies--michael-ducy">Different paths, same place - human case studies &mdash; Michael Ducy</h2>
<p>The people are what makes this community so special. They have had a tremendous
impact on Michael. There&rsquo;s a diversity of &hellip; all the things: culture, language,
background, etc.</p>
<p><img src="/images/devopsdaysghent2019_michael_ducy.jpg" alt="Michael Ducy talking about diversity"></p>
<p>Background is special in this list. It&rsquo;s not something you are &ldquo;born into.&rdquo; Some
people have a background of drug addiction, have been in prison or are high
school dropouts. That does not mean they cannot be successful.</p>
<p>Have empathy: we don&rsquo;t come from the same place and have not had the same
opportunities.</p>
<blockquote>
<p>Never criticize a person until you&rsquo;ve walked a mile in their shoes.</p></blockquote>
<h2 id="the-perfect-storm---how-we-talk-about-disasters--george-miranda">The Perfect Storm - How We Talk About Disasters &mdash; George Miranda</h2>
<p>Highly visible outages can cause a lot of reputational damage to a company. You
should contain the damage and prevent it from getting worse.</p>
<p>It is important to have a plan beforehand. Your plan should include the way
things are coordinated between the different departments in case of a severe
technical incident. This plan should be well documented and practised regularly.</p>
<p>When PagerDuty had a visible, multi-hour outage, there was a lot of chaos in
the business instead of calm. While the engineering department was used to
having and solving these kinds of issues (albeit not on this scale), for the
majority of the company this was new&mdash;which led to chaos. Engineering needed
to share how to deal with these kinds of technical problems with departments like
marketing, legal, PR, finance, etc.</p>
<p><img src="/images/devopsdaysghent2019_george_miranda.jpg" alt="George Miranda about all departments playing a role in incident response"></p>
<p>Most important takeaway: look at an incident response system, see for example
the cut-down version of the <a href="https://response.pagerduty.com/">PagerDuty Incident Response process</a>.
(Disclaimer: George works for PagerDuty.)</p>
<p>In case of a severe incident, the whole company should know what is happening in
engineering. For example: the marketing department should stop campaigns, sales
people should not try to live demo the system, etc. There needs to be a broad
cross-functional response to the incident.</p>
<p>We often talk about a <a href="https://en.wikipedia.org/wiki/T-shaped_skills">T-shaped individual</a>:
a person with broad knowledge but also a (single) deep expertise. When you put a
lot of these individuals together, each person still has their own speciality,
but the group as a whole covers a much larger area. With that in mind: DevOps
does not really remove all silos, but we have learned to communicate better.</p>
<p>Where do we go from here? The fundamentals will stay the same, but the scope
will be different. This is an opportunity to spread what we have learned as
engineering to other parts of the company (and vice versa).</p>
<p>Engineering practices that make their way to other departments:</p>
<ul>
<li>Chaos engineering</li>
<li>Blameless postmortems</li>
<li>Adoption of retrospectives (also after practising an outage)</li>
<li>Visiting other silos</li>
</ul>
<p>There is an opportunity for us to learn more from the business. They are curious
about learning and adopting practices from us. We should also be open and
curious to learn from them.</p>
<p>More resources:</p>
<ul>
<li><a href="https://business-response.pagerduty.com/">PagerDuty Business Incident Response</a></li>
<li><a href="https://www.pagerduty.com/ops-guides/">PagerDuty Ops Guides</a></li>
</ul>
<h2 id="sysadmin-to-devops-where-we-came-from-and-where-we-are-going--bryan-liles">Sysadmin to DevOps: Where we came from and where we are going &mdash; Bryan Liles</h2>
<p>We used to be able to save the day with some C or C++ code. In the nineties dev
and ops belonged together. Somehow these two grew apart. Then we had a situation
where parties were frustrated with each other: Why don&rsquo;t developers understand
networking or Unix/Linux? Why don&rsquo;t sysadmins understand software that is more
complicated than something in a shell script?</p>
<p><img src="/images/devopsdaysghent2019_bryan_liles.jpg" alt="Bryan Liles about frustrations between ops and dev"></p>
<p>If we break the problem down, we have people that write software that solves a
(business) problem. That software most likely runs on a server somewhere. You
need a team to understand the full stack. We thought we had dev <em>plus</em> ops. But
developers did not think about ops. Instead we had dev <em>versus</em> ops. With DevOps
we are back where we started.</p>
<p>The good parts of DevOps:</p>
<ul>
<li>automating and removing manual failures, infrastructure as code</li>
<li>empathy</li>
<li>collaboration</li>
<li>&ldquo;little a&rdquo; agile (not the certifiable, &ldquo;big A&rdquo; agile)</li>
</ul>
<p>The bad parts:</p>
<ul>
<li>DevOps Engineers &ndash; you can <em>do</em> DevOps, not <em>have</em> DevOps</li>
<li>SRE</li>
</ul>
<p>Our progression:</p>
<ul>
<li>We learned how to automate all the things. (But we take ourselves too
seriously.)</li>
<li>We created new synergies (DevSecOps).</li>
</ul>
<p>We&rsquo;ve been doing this for 10 years now. It&rsquo;s safe to say DevOps is no longer a
fad. Dev{*}Ops might still be a fad. Time will tell.</p>
<h2 id="devops-distributed--bridget-kromhout">DevOps, distributed &mdash; Bridget Kromhout</h2>
<p>If you look at the number of devopsdays events, you can see quite a jump between
2015 and 2016. This is because in 2015 documentation was put up on the
devopsdays website. Information on how to run such an event was made more
accessible to others.</p>
<p><img src="/images/devopsdaysghent2019_bridget_kromhout.jpg" alt="Bridget Kromhout showing a graph with the number of devopsdays events per year"></p>
<p>The main takeaway is that you can grow much more than you thought you could, if
you put in the effort to allow others to do what you can do.</p>
<p>If you worry about the semantics of the word &ldquo;DevOps,&rdquo; you are focusing on
the wrong thing. Worry about impact instead.</p>
<h2 id="ignites">Ignites</h2>
<p>Ignites are short talks. Each speaker provides twenty slides which automatically
advance every fifteen seconds.</p>
<h3 id="metatalk-an-ignite-about-what-ive-learned-giving-ignites--jason-yee">Metatalk: An ignite about what I&rsquo;ve learned giving ignites &mdash; Jason Yee</h3>
<p><img src="/images/devopsdaysghent2019_jason_yee1.jpg" alt="Jason Yee talks about ignites"></p>
<p>Some tips about ignite talks:</p>
<ul>
<li>Don&rsquo;t introduce yourself</li>
<li>No title slide, tell them directly what you want to tell, get straight to it</li>
<li>The big challenge is the timing; don&rsquo;t wait for your slides but keep talking
and they will eventually catch up</li>
<li>Don&rsquo;t memorize your script, but get a feel for it.</li>
<li>You <em>will</em> feel nervous, but that is okay. Everybody wants you to succeed.</li>
<li>Remember the letters C,A,L,M and S: This spells &ldquo;SCAML&rdquo;: <strong>S</strong>preading
<strong>C</strong>ulture <strong>A</strong>s <strong>M</strong>essy, heavily indented text fi<strong>L</strong>es.</li>
</ul>
<h3 id="how-observability-is-not-killing-monitoring--blerim-sheqa">How Observability is not killing Monitoring &mdash; Blerim Sheqa</h3>
<ul>
<li>Monitoring: black box</li>
<li>Observability: look inside the box</li>
</ul>
<blockquote>
<p>Observability is not a replacement, it is an addition to your monitoring</p></blockquote>
<h2 id="the-end">The end</h2>
<p>And that&rsquo;s it for the devopsdays Ghent 2019 talks. Well&hellip; almost.</p>
<p>There&rsquo;s one sheet I want to share. It&rsquo;s not a great picture and out of context
here (it&rsquo;s one of the sheets Jason used in his ignite talk), but I think it&rsquo;s a
great way to close off.</p>
<figure><img src="/images/devopsdaysghent2019_jason_yee2.jpg"
    alt="One of the slides of Jason Yee&#39;s ignite"><figcaption>
      <p>Jason Yee with an ancient DevOps proverb: You don&rsquo;t just <strong>do</strong> DevOps, you must <strong>feel</strong> DevOps</p>
    </figcaption>
</figure>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays Ghent 2019: day one]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/10/29/devopsdays-ghent-2019-day-one/" type="text/html" />
    <id>https://markvanlent.dev/2019/10/29/devopsdays-ghent-2019-day-one/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="conference" />
    <category term="culture" />
    <category term="devops" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-10-29T00:00:00Z</published>
    <content type="html"><![CDATA[<p>This year I was lucky enough to attend devopsdays for a second time. This time
the conference was held in the beautiful Ghent, Belgium.</p>
<p>I&rsquo;ve been to devopsdays Amsterdam
<a href="/2016/06/29/devopsdays-amsterdam-2016-workshops/">a</a>
<a href="/2017/06/28/devopsdays-amsterdam-2017-day-zero-workshops/">couple</a>
<a href="/2018/06/27/devopsdays-amsterdam-2018-workshops/">of</a>
<a href="/2019/06/26/devopsdays-amsterdam-2019-workshops/">times</a>, but I never
went to one in a different city. Because ten years ago the
<a href="https://legacy.devopsdays.org/events/2009-ghent/">first devopsdays</a> was
organized in Ghent, I thought this was a great moment to visit the &ldquo;birth place&rdquo;
of devopsdays.</p>
<p><img src="/images/devopsdaysghent2019_welcome.jpg" alt="The number of devopsdays event grew from 1 event in 2009 to 80+ in 2019"></p>
<p>These are the notes I made during/after the talks.</p>
<h2 id="enterprise-devops-10-years-on-what-went-wrong-and-what-went-right--nigel-kersten">Enterprise DevOps 10 years on: What went wrong? and what went right? &mdash; Nigel Kersten</h2>
<p>Some enterprises are <a href="https://en.wikipedia.org/wiki/Cargo_cult_programming">cargo culting</a>
DevOps. They have shiny, polished processes and tools which are not in place out
of experience, but out of an idea that that is how you are supposed to do DevOps.</p>
<p><img src="/images/devopsdaysghent2019_nigel_kersten.jpg" alt="Nigel Kersten talks about the &ldquo;OpsDev&rdquo; team as an example of things going wrong"></p>
<p>Processes often are like scar tissue&mdash;adhesions formed after trauma. That is:
something went wrong in the past and a process was created to prevent that
thing from going wrong again. Unfortunately these processes are not always
reviewed to see if they still match a world that has changed since then.</p>
<p>Most companies are not changing enough. Teams, and perhaps departments, can be
successful in adopting DevOps, but often it does not scale up to the whole
company.</p>
<p>So what went wrong? When you look at the CALMS framework (which stands for
culture, automation, lean, measurement and sharing), culture and sharing are
hard. This leaves you with automation, lean and measurement. But:</p>
<blockquote>
<p>DevOps should be more than optimizing and automating builds!</p></blockquote>
<p>The things we build need to be more accessible and more easily shareable.</p>
<h2 id="why-were-bad-at-sharing-devops--joshua-zimmerman">Why we&rsquo;re bad at sharing DevOps &mdash; Joshua Zimmerman</h2>
<p>What does DevOps mean? Often the picture of several blindfolded people examining
(parts of) an elephant is shown to demonstrate that nobody is right.</p>
<p><img src="/images/devopsdaysghent2019_joshua_zimmerman.jpg" alt="Joshua Zimmerman shows the picture of several blindfolded people examining (parts of) an elephant"></p>
<p>But this approach misses the point. Language changes all the time. Words might
lose some of their original meaning and start meaning something else. &ldquo;Awful&rdquo;
originally was used to describe something worthy of respect or fear. Nowadays it
is used to make clear something is extremely bad.</p>
<p>Arguing about the definition of DevOps does not help us reach consensus and also
does not teach <em>others</em> what we want to convey about DevOps.</p>
<h3 id="teaching">Teaching</h3>
<p>Education can be used as a tool of oppression&mdash;forcing your ideas onto others.
One way of doing this is by being selective in what we teach and what we omit.
But <em>how</em> we teach can also make a difference.</p>
<p>The traditional approach in tech uses one-way broadcasts: the teacher fills up
the student with knowledge. But people are not empty vessels; we already have
experience and knowledge.</p>
<p>Instead of teaching being one-way communication, you can also explore solutions
together with the student (problem-posing education). You can work together on a
problem and find truth. Students can contribute to the knowledge.</p>
<p>Ways to improve sharing concepts like DevOps:</p>
<ul>
<li>Focus on concepts over language. Set a context and move on.</li>
<li>Pose more problems and offer fewer solutions. Show the problems more clearly
instead of focussing on the solutions.</li>
<li>Create more spaces for collaborative learning.</li>
</ul>
<p>Reading tip: <a href="https://en.wikipedia.org/wiki/Pedagogy_of_the_Oppressed">Pedagogy of the Oppressed</a>
by Paulo Freire.</p>
<h2 id="everything-i-need-to-know-about-devops-i-learned-in-the-marines--ken-mugrage">Everything I need to know about DevOps I learned in The Marines &mdash; Ken Mugrage</h2>
<p>Disclaimer: there are things in Marine boot camp that are counterproductive
when applied to DevOps, like instructors yelling at recruits.</p>
<p>Words matter a lot. In the Marines they have very specific words for things. For
example: a <em>bed</em> is a <em>rack</em>, a <em>hat</em> is a <em>cover</em>, a <em>toilet</em> is <em>the head</em>. In
organizations it&rsquo;s similar: a &ldquo;unit test&rdquo; means something specific. A
shared, well-defined vocabulary makes communication more effective.</p>
<p>Boot camp is all about creating habits. For example: recruits have to sit
cross-legged on a concrete floor in a &ldquo;classroom.&rdquo; Later on in the course it becomes
clear that this position is a stable position to fire a gun. By the time the
recruits get to actually work with a gun, they are already used to the position.</p>

  <figure>

<blockquote >
I&rsquo;m not a great programmer. I&rsquo;m just a good programmer with great habits
</blockquote>

  <figcaption>
    &mdash;Kent Beck
  </figcaption>
  </figure>


<p>Having muscle memory, ingraining habits, means the quality goes up.</p>
<p>A different level of risk means a different level of testing. This does not have
to imply that deployments take longer. While the developers are running their
functional tests, you can also already run additional tests that are required to
promote that code to the next stage. So by the time the developers know that the
functional tests pass, we also know whether there are security or performance
issues that need to be addressed.</p>
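<p>A rough illustration of that idea (a sketch, not something shown in the talk): the Python snippet below runs a functional, a security and a performance check side by side and only promotes the build when all of them pass. The three check functions and the names <code>run_checks</code> and <code>can_promote</code> are hypothetical placeholders; a real pipeline would call actual test tooling there.</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor


def functional_tests():
    # Hypothetical placeholder: a real pipeline would call a test runner here.
    return True


def security_scan():
    # Hypothetical placeholder for a security check.
    return True


def performance_check():
    # Hypothetical placeholder for a performance check.
    return True


def run_checks(checks):
    """Run all checks concurrently and return a mapping of name to result."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def can_promote(results):
    """Promote to the next stage only when every check passed."""
    return all(results.values())


results = run_checks({
    "functional": functional_tests,
    "security": security_scan,
    "performance": performance_check,
})
print("promote:", can_promote(results))
</code></pre>
<p>Because the checks run concurrently, the total waiting time is roughly that of the slowest check instead of the sum of all of them.</p>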
<p><img src="/images/devopsdaysghent2019_ken_mugrage.jpg" alt="Ken Mugrage talks about risks and levels of testing"></p>
<p>The military is not all about command and control. High level objectives are set
by the highest in command and these objectives get more detailed in lower
levels. In business: people share &ldquo;corporate goals,&rdquo; but the product team
decides what the best implementation is.</p>
<p>Takeaways:</p>
<ul>
<li>Words have meaning</li>
<li>Train like you play, play like you train</li>
<li>Diversity and inclusion make us stronger</li>
</ul>
<h2 id="supporting-devops-with-technology-procurement--ramon-van-alteren">Supporting devops with technology procurement &mdash; Ramon van Alteren</h2>
<p>Procurement is about obtaining stuff. In a modern context this means answering
questions like &ldquo;which cloud vendor?&rdquo; and &ldquo;why <em>that</em> cloud vendor?&rdquo;</p>
<p>Spotify was using their own data centers with compute, network and storage.
Because provisioning new hardware takes time (e.g. weeks to order a new server),
you start managing buffers and flow to be able to fulfil a demand.</p>
<p>Moving to the cloud puts you into a transactional mode: you pay directly for
what you use. Moving to the cloud in a way where every team can spin up their
own infrastructure also means there is no longer a single place of costs.
However, cost is a factor in how you design and run systems. By decentralizing
this place of costs, each team has to consider them.</p>
<p>And don&rsquo;t forget that rules like writing off infrastructure do not work. In
the cloud, compute is not &ldquo;free&rdquo; anymore after a number of years. When a project
has a budget for infrastructure, you have to take into account that in a cloud
setting you have to keep paying once the work on the project is finished and it
goes into maintenance mode (and the costs shift from engineering to an operations
department).</p>
<p>Lessons:</p>
<ol>
<li>Create a safety net. Predicting costs is not easy. Allow people to
experiment, but make it safe for them. (E.g. alert on spikes in costs.)
<img src="/images/devopsdaysghent2019_ramon_van_alteren.jpg" alt="Ramon van Alteren about creating a safety net for your engineers"></li>
<li>Look at global cost optimization before local. E.g. look at using reserved
instances instead of having a team of developers optimize a small thing while
they also could be adding value to the company.</li>
<li>Have an explicit frame of reference. Make it clear whether costs are high or not.</li>
<li>Most importantly: cost reduction should not be a goal! Otherwise you end up
with a cheap, very specific service; you&rsquo;d rather focus on value for users.</li>
</ol>
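<p>The &ldquo;alert on spikes in costs&rdquo; safety net from the first lesson could be sketched like this in Python. This is an assumption about what such a check might look like, not what Spotify actually uses: the function name <code>is_cost_spike</code>, the <code>factor</code> threshold and the sample numbers are all made up for illustration.</p>
<pre><code class="language-python">from statistics import mean


def is_cost_spike(daily_costs, factor=1.5):
    """Return True when the latest daily cost exceeds `factor` times the
    average of the preceding days."""
    history = daily_costs[:-1]
    if not history:
        # Not enough data to compare against.
        return False
    return daily_costs[-1] &gt; factor * mean(history)


print(is_cost_spike([100, 110, 90, 400]))
print(is_cost_spike([100, 110, 90, 120]))
</code></pre>
<p>With these made-up numbers the first call flags a spike and the second does not. A team can experiment freely and still get a warning before a surprise ends up on the invoice.</p>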
<h2 id="you-cant-buy-devops--julie-gunderson">You can&rsquo;t buy DevOps &mdash; Julie Gunderson</h2>
<p>You cannot buy it, but people will try to sell it to you.</p>
<blockquote>
<p>You cannot change culture with a credit card.</p></blockquote>
<p>Cultural changes rarely work top-down. Instead start small and prove its usefulness.</p>
<p>DevOps is not things like:</p>
<ul>
<li>Moving to the cloud</li>
<li>Forming a &ldquo;DevOps team&rdquo;</li>
<li>Getting rid of the Ops team</li>
<li>A prescriptive checklist</li>
</ul>
<p>DevOps is about culture, people over processes.
<a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/">Research by Google</a>
about what makes a successful team, shows that psychological safety fosters a
culture of empowerment. Having psychological safety means you can embrace failure
and learn from your mistakes.</p>
<p><img src="/images/devopsdaysghent2019_julie_gunderson.jpg" alt="Julie Gunderson about psychological safety"></p>
<p>We should stop talking about &ldquo;a root cause,&rdquo; but instead discuss &ldquo;contributing
factors.&rdquo; With our complex systems, there often no longer is a single root
cause. There are always contributing factors that allowed e.g. a person to push
the button that took down the whole company.</p>
<p>Some practices you can adopt:</p>
<ul>
<li>Configuration management</li>
<li>Automation</li>
<li>Use the right tools (the actual users of the tools should be able to decide
<em>which</em> tools to use)</li>
<li>Working in small branches instead of massive integration (this makes it easier
to roll back if needed)</li>
<li>Testing</li>
</ul>
<p>You can start small and prove success. This can affect the entire organization.</p>
<h2 id="what-mma-taught-me-about-working-in-tech--daniel-maher">What MMA taught me about working in tech &mdash; Daniel Maher</h2>
<p>Mixed martial arts (MMA) is about mixing techniques.</p>
<p>Fundamentals win fights: there are some basic things (tools) you have to learn.
Learn about the individual tools and how to put them together. But don&rsquo;t be
dogmatic about fundamentals; they are just tools. What matters is how you <em>use</em>
the tools.</p>
<p>There are more ways than one to throw a jab. It&rsquo;s the diversity that makes you
better. You have to embrace diversity. MMA is all about the mixing, right? How
does this relate to tech? You come up with new ideas by surrounding yourself with
people who are different and have different experience, knowledge, etc. Fresh eyes
and fresh ideas.</p>
<blockquote>
<p>Tech needs diversity to succeed</p></blockquote>
<p><img src="/images/devopsdaysghent2019_daniel_maher.jpg" alt="Daniel Maher explains that diverse teams are better"></p>
<p>By having a diverse team, you&rsquo;ll be prepared for things you would not have
thought about by yourself.</p>
<h2 id="admitting-that-hard-problems-are-hard--sasha-rosenbaum">Admitting that hard problems are hard &mdash; Sasha Rosenbaum</h2>
<p>Have you ever been at a meeting where your boss explains this horribly complex
architecture he wants for a project? We reward complexity. (Resume driven
development: use tech because it looks good on your resume, not because you
actually needed it.) There is a cost to admitting ignorance and pushing back on
this idea.</p>
<p><img src="/images/devopsdaysghent2019_sasha_rosenbaum.jpg" alt="Sasha Rosenbaum about resume driven development"></p>
<p>We should stop pretending that:</p>
<ul>
<li>distributed systems are easy (running Kubernetes is not easy, it is error prone and hard)</li>
<li>people can be available 24/7</li>
<li>unicorn companies don&rsquo;t have technical debt</li>
<li>that new product will solve all of your problems</li>
</ul>
<p>There are costs to complexity:</p>
<ul>
<li>time to learn</li>
<li>time to build</li>
<li>maintenance</li>
<li>increased chances of failure</li>
</ul>
<p>This costs us money and productivity.</p>
<p>We use heuristics to navigate the world. And we need to, because otherwise we
would not last very long. The problem is that our current assumptions are not
very good (e.g. &ldquo;women and persons of color are not very smart&rdquo;). These
stereotypes are persistent and actually hurt people. With the same resume, male
candidates get 6&ndash;14% more interviews than female candidates, and candidates
with white-sounding names get 50% more callbacks than candidates with
black-sounding names.</p>
<p>A consequence is also that the cost of admitting ignorance and errors is much
higher for women, people of color and junior engineers.</p>
<p>Whether a team is effective (according to the same
<a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/">Google study</a>
that was mentioned by Julie Gunderson) does not depend on background, personality
types or skills, but on psychological safety.</p>
<p>But psychological safety does not <em>just</em> happen. It has to be built by leaders.</p>
<ul>
<li>Make it safe to express opinions</li>
<li>Make it safe to admit mistakes (e.g. admit mistakes yourself as a leader)</li>
<li>Realise that you do not know everything</li>
</ul>
<p>This isn&rsquo;t easy. This is a hard problem. But it is going to be worth solving it.</p>
<p>Reading tip: <a href="https://www.amazon.com/Mindset-Psychology-Carol-S-Dweck/dp/0345472322">Mindset: The New Psychology of Success</a>
by Carol S. Dweck.</p>
<h2 id="ignites">Ignites</h2>
<p>Ignites are short talks. Each speaker provides twenty slides which automatically
advance every fifteen seconds.</p>
<h3 id="the-danger-of-devops-certifications--ken-mugrage">The danger of DevOps Certifications &mdash; Ken Mugrage</h3>
<p>You cannot certify empathy. Most certifications only certify attendance, not
actual knowledge or experience.</p>
<p>DevOps is about learning from failure, but companies that hire you to &ldquo;implement
DevOps&rdquo; because you have a &ldquo;DevOps certificate&rdquo; are not going to give you the
time to learn. If you fail, the CEO gets to say &ldquo;see, DevOps doesn&rsquo;t work (for
us).&rdquo; This can even be a danger for your career.</p>
<p>Training, courses, etc. are useful. And if you want to get a certificate, that is
also fine. But it&rsquo;s <strong>not</strong> the thing that will make your career.</p>
<p>Certification on tools is ok. You cannot certify culture.</p>
<p>Some companies will want certification, fine, but understand that it&rsquo;s not the key to
your success.</p>
<h3 id="4-ways-to-help-people-in-their-careers--emanuil-tolev">4 ways to help people in their careers &mdash; Emanuil Tolev</h3>
<ul>
<li>Coaching: shorter term relationship to learn a specific skill</li>
<li>Mentoring: longer, exploratory relationship</li>
<li>Sponsorship: improve how others will think of you (potential, skills)</li>
<li>Career counselling: less formal version of the previous ways</li>
</ul>
<p>The stories we tell about ourselves are powerful. Melancholy and contemplation
are good things. You should prevent self-reinforcing feelings of doubt though.</p>
<p>For the coaches, mentors, etc: you cannot take the journey of others, but you
can help them with their journey.</p>
<h3 id="serverless-changes-nothing--ant-stanley">Serverless changes nothing &mdash; Ant Stanley</h3>
<p>Serverless does not bring new ideas. Things like the <a href="https://12factor.net/">12 factor app</a>,
the <a href="https://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a> or microservices
already existed. Operational practices are also the same, often even the same
tools are used.</p>
<p>If you want to succeed in serverless, you need to adopt a DevOps way of working
<em>first</em>; do the cultural transformation <em>first</em>. You cannot do serverless
without DevOps.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Podcasts I listen to — 2019 edition]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/10/13/podcasts-i-listen-to-2019-edition/" type="text/html" />
    <id>https://markvanlent.dev/2019/10/13/podcasts-i-listen-to-2019-edition/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="aws" />
    <category term="devops" />
    <category term="observability" />
    <category term="podcast" />
    <category term="python" />
    <category term="security" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-10-13T17:08:00Z</published>
    <content type="html"><![CDATA[<p>Most of the time I use my commute to listen to podcasts. Because they reflect my
interests at this point in time, I thought it would be nice to share my current
list.</p>
<p>The last time I posted my selection of podcasts was back in
<a href="/2012/11/12/podcasts-i-listen-to/">2012</a>. Since then, a lot has changed:</p>
<ul>
<li>Only two out of the six podcasts on the 2012 list are still in my feed. But
there are quite a few new podcasts to compensate.</li>
<li>My commute has reduced from six hours/week to about four.</li>
<li>Out of necessity I no longer listen to all episodes of all podcasts&mdash;I just
do not have enough time to listen to them. But all podcasts in the list below
have interesting episodes, so unsubscribing from them is also not an option.
(I know, first world problems&hellip;)</li>
</ul>
<p>My podcast feed with the reason why I listen to each show:</p>
<ul>
<li><a href="https://www.lastweekinaws.com/">AWS Morning brief</a>: I use this podcast to
quickly get up to speed on AWS related developments. Oh, and I like the snarky
way Corey Quinn summarises the news.</li>
<li><a href="https://www.redhat.com/en/command-line-heroes">Command line heroes</a>: a fun
and educational podcast about topics that interest me as a developer. For
example season 2 was focussed on open source development and season 3 was all
about (the history of) programming languages.</li>
<li><a href="https://darknetdiaries.com/">Darknet Diaries</a>: interesting stories about
<q>the dark side of the internet.</q> I love the stories and the way
Jack Rhysider tells them.</li>
<li><a href="https://freakonomics.com/">Freakonomics Radio</a>: a completely different show
because it is not technology focussed. Stephen Dubner discusses topics I often
don&rsquo;t know much about or have not thought about much. I think the episodes are
not only educational but also fun to listen to.</li>
<li><a href="https://www.heavybit.com/library/podcasts/o11ycast/">O11ycast</a>: this one I
recently discovered. Since I&rsquo;m interested in the subject of observability,
this podcast gives me new ideas to think about.</li>
<li><a href="https://www.realworlddevops.com/">Real World DevOps</a>: I really like Mike
Julian&rsquo;s interviews with people from the tech industry about DevOps related
topics. Unfortunately there has not been a new episode for a few months; I
hope Mike will start podcasting again in the future.</li>
<li><a href="https://gimletmedia.com/shows/reply-all">Reply All</a>: a brilliant show. I had
lost track of the hosts, PJ Vogt and Alex Goldman, after they stopped with the
TLDR podcast. Unfortunately I did not search for them, otherwise I would have
found Reply All a lot sooner! I listen to this podcast primarily to be
amused; learning something from it is a nice side effect though.</li>
<li><a href="https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/">Screaming in the cloud</a>: another podcast
from Corey Quinn which I listen to, to learn more about cloud related topics
from the interviews on this show.</li>
<li><a href="https://twit.tv/shows/security-now">Security Now</a>: I listen to this podcast
for the security related news and an occasional deep dive into a technical
topic.</li>
<li><a href="https://shoptalkshow.com/">Shop Talk Show</a>: although these days I&rsquo;m more into
infrastructure and automation, I like listening to this show to keep in
touch with web development.</li>
<li><a href="https://talkpython.fm/">Talk Python to me</a>: more or less the same as with
Shop Talk Show: I like this podcast for its (Python) development related topics.</li>
<li><a href="https://www.tech45.eu/">Tech45</a>: a podcast I&rsquo;ve listened to for quite some years
now and one of the few shows I want to listen to every single episode of. I
enjoy the way the panel discusses the (tech related) news each week.
Informative and &ldquo;gezellig.&rdquo;</li>
<li><a href="https://soundcloud.com/user-98066669">The privacy, security &amp; OSINT show</a>:
the most recent addition to my list. I have only listened to a few episodes so far,
but the topics seem interesting. We&rsquo;ll see if this one sticks or not.</li>
</ul>
<p>And there you have it: my list of podcasts as of October 2019. (Also available in
<a href="/files/podcasts-2019.opml">OPML format</a>.)</p>]]></content>
  </entry>
</feed>
