<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Posts tagged with “Infrastructure as Code” on Mark van Lent’s weblog</title>
  <updated>2026-01-31T00:00:00+00:00</updated>
  <link rel="self" type="application/atom+xml" href="https://markvanlent.dev/tags/infrastructure-as-code/index.xml" hreflang="en"/>
  <id>tag:markvanlent.dev,2010-04-02:/tags/infrastructure-as-code/index.xml</id>
  <link rel="alternate" type="text/html" href="https://markvanlent.dev/tags/infrastructure-as-code/" hreflang="en"/>
  <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
  <rights>Copyright (c) Mark van Lent, Creative Commons Attribution 4.0 International License.</rights>
  <icon>https://markvanlent.dev/favicon.ico</icon>
  <entry>
    <title type="html"><![CDATA[FOSDEM 2026]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2026/01/31/fosdem-2026/" type="text/html" />
    <id>https://markvanlent.dev/2026/01/31/fosdem-2026/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="ansible" />
    <category term="conference" />
    <category term="docker" />
    <category term="infrastructure as code" />
    <category term="git" />
    <category term="python" />
    <category term="security" />
    
    <updated>2026-02-01T14:54:22Z</updated>
    <published>2026-01-31T00:00:00Z</published>
    <content type="html"><![CDATA[<p>January is already almost over, so time for <a href="https://fosdem.org/2026/">FOSDEM</a>,
the yearly <q>free event for software developers to meet, share ideas and
collaborate</q> in Brussels. <a href="/2025/02/01/fosdem-2025/">Last year</a> I
focussed on the Go track; this year I selected a mix of security- and
Python-related talks to attend.</p>
<h2 id="streamlining-signed-artifacts-in-container-ecosystems--tonis-tiigi">Streamlining Signed Artifacts in Container Ecosystems &mdash; Tonis Tiigi</h2>
<p>It&rsquo;s possible to sign Docker images, but at the moment most are actually not
signed. Also, users should understand what the signature is protecting and what
it&rsquo;s <em>not</em> protecting. We should not want signing just to tick a box on the
security checklist, but because of the security it adds. And we need something
simple: integrated with existing tools, and without slowing those tools down.</p>
<p>Buildkit powers &ldquo;<code>docker build</code>&rdquo; but is not limited to Dockerfiles. It&rsquo;s high
performance, can build complex builds and has caching.</p>
<p>A modern build is a graph of images, Git repositories, local files, etc. The
results are images, binaries, archives.</p>
<figure><img src="/images/fosdem2026_tonis_tiigi.jpg"
    alt="Photo of Tonis Tiigi explaining the graph that is modern software building"><figcaption>
      <p>Tonis Tiigi explaining that builds of modern software are a complex graph</p>
    </figcaption>
</figure>

<p>We need Supply-chain Levels for Software Artifacts (SLSA) provenance: what has
actually happened in the build? What was the build config? Et cetera. It&rsquo;s useful to
figure out how an artifact was built.</p>
<p>Buildkit does not sign images by default. GitHub has <a href="https://docs.github.com/en/packages/managing-github-packages-using-github-actions-workflows/publishing-and-installing-a-package-with-github-actions#publishing-a-package-using-an-action">an example in the
documentation</a>
to run a build with Buildkit and generate an artifact. It claims to generate an
<q>unforgeable statement</q>. But if your GitHub credentials are
leaked and the attacker can get their hands on the temporary signing key, they can
use it to sign their own artifacts.</p>
<p>Docker created the <a href="https://github.com/docker/github-builder">github-builder</a>
repository. It contains reusable GitHub Actions to securely build images. If you
use this, your images are signed to prove that they were built from a certain
repository, using the configured build steps. Where Buildkit (among other
things) provides isolation, <code>github-builder</code> provides signing context. It also
protects against build dependency leaks.</p>
<p>So that takes care of the signatures, but how do you verify them?</p>
<ul>
<li>The command &ldquo;<code>docker inspect</code>&rdquo; now shows verified signatures</li>
<li>You can manually verify it with <a href="https://github.com/sigstore/cosign">cosign</a></li>
<li>You can also use sigstore/policy-controller for Kubernetes</li>
</ul>
<p>Buildx also includes experimental Rego (Open Policy Agent) policy support. This
means you can write a matching policy for <code>Dockerfile</code>, e.g. <code>Dockerfile.rego</code>,
which is then automatically loaded. All build sources now need to pass policy
for the build to continue (images, Git repositories, URLs, etc).</p>
<p>You can do very complex stuff in the policies. As a simple example, Tonis showed:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rego" data-lang="rego"><span class="line"><span class="cl"><span class="kd">package</span><span class="w"> </span><span class="nx">docker</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">allow</span><span class="w"> </span><span class="kd">if</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nx">input</span><span class="o">.</span><span class="nx">image</span><span class="o">.</span><span class="nx">repo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;org/app&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nf">docker_github_builder_tag</span><span class="p">(</span><span class="nx">input</span><span class="o">.</span><span class="nx">image</span><span class="o">,</span><span class="w"> </span><span class="s2">&#34;org/app&#34;</span><span class="o">,</span><span class="w"> </span><span class="nx">input</span><span class="o">.</span><span class="nx">image</span><span class="o">.</span><span class="nx">tag</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>This policy should make sure that the image can only be built from this
repository and that the image tag should match the Git tag.</p>
<p>Summary:</p>
<ul>
<li>No reason not to sign</li>
<li>Not all signatures are equal</li>
<li>Software pulling packages should verify pulled content</li>
</ul>
<p><a href="https://fosdem.org/2026/schedule/event/HJAJTU-streamlining_signed_artifacts_in_container_ecosystems/">Link to the conference page</a></p>
<h2 id="sequoia-git-making-signed-commits-matter--neal-h-walfield">Sequoia git: Making Signed Commits Matter &mdash; Neal H. Walfield</h2>
<p>Version control systems (also known as VCSs) track the following:</p>
<ul>
<li>Changes to the code</li>
<li>Authorship</li>
<li>Other metadata</li>
<li>Commit message</li>
</ul>
<p>But the author can be faked: the metadata is set by the author, including the
author&rsquo;s name. After a quick &ldquo;<code>git config</code>&rdquo; command you can commit as anyone you
want, for example <a href="https://en.wikipedia.org/wiki/Linus_Torvalds">Linus Torvalds</a>.
Sure, GitHub could see that the committer (the one pushing the commit) and
author are different. However, this is not necessarily bad because we might
simply want to give proper attribution to the author of the commit.</p>
<p>And in theory the forge might also be compromised, or someone may have gotten
permission to push to the project.</p>
<p>To prevent impersonations, we can cryptographically prove who the author is by
signing the commits. But now the problem shifts to the certificates. Because
anyone can create a key with any name (again, for example Linus) attached to it.
So what does a signed commit mean now?</p>
<p>How can we be sure that the author is who they say they are? There are ways:</p>
<ul>
<li>You could talk to the developer to verify</li>
<li>You could go to <a href="https://en.wikipedia.org/wiki/Key_signing_party">key signing parties</a></li>
<li>You can use a central authority that you trust (e.g.
<a href="https://keys.openpgp.org/">keys.openpgp.org</a>, the Linux developer keyring,
the <code>distributions-gpg-keys</code> package, or, if you trust GitHub, use
<code>github.com/&lt;username&gt;.gpg</code>)</li>
</ul>
<p>You can use the following command to show the Git log and the signatures on them:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">git log --show-signature
</span></span></code></pre></div><p>But now you need to actually check that the signatures are indeed made by the
certificates you trust.</p>
<p>It&rsquo;s up to the maintainers of the software to curate a list of contributors and
track when contributors join and leave (yes, there is a temporal element as
well). This is hard. Maintainers need tooling. And you would want to detect
unauthorized commits (impersonation, a malicious forge, a machine in the middle
or for instance when a project is given to a new maintainer by a forge/registry).</p>
<p>What does the solution look like?</p>
<ul>
<li>Clear semantics</li>
<li>The project itself maintains signing policy</li>
<li>Third party uses maintainers&rsquo; policy to authenticate project</li>
<li>Verification, not attestation: do not rely on any external authority</li>
</ul>
<p>(Note that the maintainers can still be socially engineered to include the key
of an attacker in their policy. So they still have to be careful about who is
added to the policy.)</p>
<p>Sequoia git provides:</p>
<ul>
<li>Specification</li>
<li>Config</li>
<li>Tooling</li>
</ul>
<p>With <a href="https://gitlab.com/sequoia-pgp/sequoia-git">Sequoia git</a> (which is part of
the <a href="https://sequoia-pgp.org/">Sequoia PGP project</a>) you can have a signing
policy in an <code>openpgp-policy.toml</code> file in the project&rsquo;s Git repository. It
specifies users, their keys and their capabilities. You can use <code>sq-git</code> to help
maintain this file.</p>
<p>For instance to add user Alice and then describe the current policy, you can use
the following commands:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">sq-git policy authorize alice --committer &lt;cert&gt;
</span></span><span class="line"><span class="cl">sq-git policy describe
</span></span></code></pre></div><p>A commit is &ldquo;authenticated&rdquo; if at least one parent commit says the commit is
acceptable (via the policy). To verify that there is an authenticated path from
the current state back to a certain commit we trust, use this command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">sq-git log --trust-root &lt;sha of trusted commit&gt;
</span></span></code></pre></div><p>Projects may have contributions from others that are not included in the policy.
To maintain an authenticated path when accepting the contribution, a trusted
author needs to merge the contribution via a merge commit that <em>is</em>
authenticated. (You may need to pass &ldquo;<code>--no-ff</code>&rdquo; to the merge to make sure
there is a merge commit though.)</p>
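<p>To make the idea of an authenticated path concrete, here is a toy model in
Python. It is only a sketch and not how <code>sq-git</code> is implemented: the
<code>Commit</code> class, the <code>is_authenticated_path</code> function and the
signer names are hypothetical, signatures are reduced to plain names, and each
commit simply carries the set of signers its policy authorizes.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Toy commit: real commits carry OpenPGP signatures, not plain names."""
    sha: str
    signer: str          # who signed this commit (assumed already verified)
    policy: frozenset    # signers authorized by the policy stored in this commit
    parents: list = field(default_factory=list)


def is_authenticated_path(head, trust_root_sha):
    """Return True if some path from the trust root to head is authenticated.

    A commit counts as authenticated via a parent when that parent's policy
    authorizes the commit's signer.
    """
    if head.sha == trust_root_sha:
        return True
    return any(
        head.signer in parent.policy
        and is_authenticated_path(parent, trust_root_sha)
        for parent in head.parents
    )


maintainers = frozenset({"alice"})
root = Commit("c0", "alice", maintainers)
fix = Commit("c1", "alice", maintainers, [root])
# Mallory is not in the policy, so her commit is not authenticated by itself.
patch = Commit("c2", "mallory", maintainers, [fix])
# Alice merges the contribution with a merge commit (--no-ff) that she signs.
merge = Commit("c3", "alice", maintainers, [fix, patch])

print(is_authenticated_path(patch, "c0"))  # False
print(is_authenticated_path(merge, "c0"))  # True
```

<p>The merge commit is accepted because one of its parents (<code>c1</code>) has a
policy that authorizes Alice, which is why the explicit merge commit matters.</p>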
<p><a href="https://fosdem.org/2026/schedule/event/KFSUCW-sequoia-git/">Link to the conference page</a></p>
<h2 id="an-endpoint-telemetry-blueprint-for-security-teams--victor-lyuboslavsky">An Endpoint Telemetry Blueprint for Security Teams &mdash; Victor Lyuboslavsky</h2>
<p>With open source we can inspect something that is broken, we can change the
defaults. With security we are used to the opposite; it&rsquo;s a black box. We are
not used to owning the data. The data exists on the endpoints, but ownership is
transferred to a different team. How can we add more security in a way engineers
understand and can use?</p>
<p>Victor presents a blueprint with the following layers:</p>
<ul>
<li>Endpoint agents</li>
<li>Control layer</li>
<li>Ingestion, streaming &amp; storage</li>
<li>Detection</li>
<li>Correlation, intelligence and response</li>
</ul>
<p>The value is not in the layers themselves, but the boundaries. For example, the
ingestion should move the data reliably but should not care which tool collected
it. This makes them loosely coupled.</p>
<p>For endpoint agents Victor suggests
<a href="https://github.com/osquery/osquery">osquery</a> which allows basic questions about
endpoints. Data is structured and consistent. It aligns with open source values.
(Alternatives: scripts &amp; cron, log shippers like filebeat or tools like auditd
or Event Tracing for Windows.)</p>
<p>Controlling the data (the next layer) means that you want to have:</p>
<ul>
<li>Central config</li>
<li>Live queries</li>
<li>Consistent schemas</li>
</ul>
<p><a href="https://github.com/fleetdm/fleet">Fleet</a> (disclaimer: Victor works there) is
built to manage <code>osquery</code> at scale and is a good candidate for this layer.</p>
<p>The control layer needs to work hand-in-hand with the ingestion layer. The ingestion
layer moves data to downstream systems. E.g. <a href="https://github.com/vectordotdev/vector">Vector</a> or
<a href="https://www.elastic.co/logstash">Logstash</a> can be used here.</p>
<blockquote>
<p>Ingestion isn&rsquo;t where you get clever. It&rsquo;s where you get reliable.</p></blockquote>
<p>Streaming decouples producers from consumers and e.g. allows replay. Note that this
is an optional step and it would come <em>after</em> ingestion, not <em>in place of</em> it.
For instance <a href="https://kafka.apache.org/">Apache Kafka</a> can be used in this
layer. Ingestion absorbs the mess. Streaming preserves flexibility.</p>
<p>The storage layer is where telemetry becomes durable. It&rsquo;s about being able to
ask hard questions later. Examples of useful tools:
<a href="https://github.com/ClickHouse/ClickHouse">ClickHouse</a>,
<a href="https://www.elastic.co/elasticsearch">Elasticsearch</a> (which is better at text
search) and <a href="https://github.com/apache/iceberg">Iceberg</a> (which is slower for
active investigation).</p>
<p>For the detection layer you might want to use
<a href="https://github.com/SigmaHQ/Sigma">Sigma</a>. It provides portability. Rules are
translated to native SQL running on ClickHouse. Intent (Sigma signatures)
becomes execution (SQL query to get the data).</p>
<p>Finally the correlation layer: <a href="https://github.com/grafana/grafana">Grafana</a>
can be used for correlation and visualisation. Grafana can query ClickHouse.
Grafana also has alerting.</p>
<p>Note that response isn&rsquo;t just about automation. It&rsquo;s also about pausing and asking
better questions. The correlation layer should focus on enabling humans to act.</p>
<p>Open endpoint telemetry is <strong>not</strong> an &ldquo;EDR killer&rdquo;. It does not replace it. It adds
diversity and complements other tools. It provides a second set of eyes.</p>
<p><a href="https://fosdem.org/2026/schedule/event/HYXTPH-endpoint-telemetry-blueprint/">Link to the conference page</a></p>
<h2 id="the-bakery-how-pep810-sped-up-my-bread-operations-business--jacob-coffee">The Bakery: How PEP810 sped up my bread operations business &mdash; Jacob Coffee</h2>
<p>Python loads imports eagerly by default. This leads to memory bloat and cold
start issues. Explicit lazy imports (see
<a href="https://peps.python.org/pep-0810/">PEP 810</a>) only import a module when it&rsquo;s
first accessed, not when the import statement is executed.</p>
<p>Lazy import is scheduled to be included in Python 3.15 and looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">lazy</span> <span class="kn">from</span> <span class="nn">bar</span> <span class="kn">import</span> <span class="n">foo</span>
</span></span></code></pre></div><p>The design principles applied are that lazy imports are:</p>
<ul>
<li>Explicit</li>
<li>Local</li>
<li>Granular</li>
</ul>
<p>When parsing the Python code, a proxy module is created. Only when the module is
actually used is the proxy transparently replaced by the real module. You will
not always see improvements, so do not blindly replace all imports with lazy
imports.</p>
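<p>The <code>lazy</code> keyword is not available in current Python releases, but the
deferred-loading idea can be tried today with <code>importlib.util.LazyLoader</code>
from the standard library. Note that this is a different, older mechanism than
PEP 810, shown here only as a sketch of the concept. The helper name
<code>lazy_import</code> is made up for this example and <code>colorsys</code> is just an
arbitrary standard library module.</p>

```python
import importlib.util
import sys


def lazy_import(name):
    """Register a module whose code only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy module, does not run its code yet
    return module


colorsys = lazy_import("colorsys")
# The real import happens here, on first use of an attribute.
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # (0.0, 1.0, 1.0)
```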
<p>PEP 810 also eliminates the need for <code>TYPE_CHECKING</code> guards. (See the <a href="https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING">typing
docs</a>, in
short: importing a module that is expensive and only contains types used for
type checking in an &ldquo;<code>if TYPE_CHECKING:</code>&rdquo; block.) It also helps with faster test
discovery and collection, lower memory usage, and reduced cold start slowness in
e.g. AWS Lambda functions, CLI applications, etc.</p>
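<p>For reference, this is roughly what such a guard looks like today. It is a
minimal sketch: the function <code>total</code> is made up for illustration, and
<code>decimal</code> merely stands in for a module that is expensive to import.</p>

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers follow this branch; at runtime TYPE_CHECKING is False,
    # so the import below is never executed.
    from decimal import Decimal


def total(prices: "list[Decimal]") -> "Decimal | int":
    # The annotations are strings, so Decimal does not need to exist at runtime.
    return sum(prices)


print(total([1, 2, 3]))  # 6
```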
<p>Meta (with Cinder) saw a 70% startup time reduction and 40% memory savings.
PySide has a 35% startup improvement.</p>
<p>About CLI tools: when using lazy imports you might notice the difference when
using <code>--help</code>. There&rsquo;s no need to load all dependencies to just output the help
text of a tool.</p>
<p>Some notes:</p>
<ul>
<li>Import time side effects (e.g. logging configuration, DB connections) are also
delayed!</li>
<li>Type checkers need to be updated</li>
<li>Import errors move to first use (so at runtime, not at launch). Keep that in
mind when debugging</li>
<li>It&rsquo;s not always faster, so profile your application before migrating and see
where you can potentially benefit</li>
<li>Document your lazy imports!</li>
<li>You cannot do lazy imports in functions</li>
</ul>
<p>Circular imports are probably still a problem, but they just show up later.</p>
<p><a href="https://github.com/JacobCoffee/breadctl">Link to the repo for this talk</a></p>
<p><a href="https://fosdem.org/2026/schedule/event/HAAABD-the_bakery_how_pep810_sped_up_my_bread_operations_business/">Link to the conference page</a></p>
<h2 id="modern-python-monorepo-with-uv-workspaces-prek-and-shared-libraries--jarek-potiuk">Modern Python monorepo with <code>uv</code>, <code>workspaces</code>, <code>prek</code> and shared libraries &mdash; Jarek Potiuk</h2>
<p>Jarek is, besides his other roles, the number 1 Apache Airflow contributor. The
<a href="https://github.com/apache/airflow">Apache Airflow repo</a> is the monorepo he
talks about today. There is also a series of blog posts about this topic: see
<a href="https://medium.com/apache-airflow/modern-python-monorepo-for-apache-airflow-part-1-1fe84863e1e1">part 1</a>,
which links to the other parts.</p>
<p>Airflow drove early requirements for
<a href="https://docs.astral.sh/uv/concepts/projects/workspaces/">uv workspaces</a>. They now
manage 120+ distributions seamlessly with it. It allows them to combine
distributions to work together in a workspace, and to import from one
distribution in another one.</p>
<p>The project shares a single virtual environment used by <code>uv</code> in the root of the project.
If you run &ldquo;<code>uv sync</code>&rdquo; from the top level you get everything. If you run it in a
subdirectory (e.g. <code>airflow-core</code>) you only get what is needed for that
distribution.</p>
<p>Benefits of the <code>uv</code> workspaces:</p>
<ul>
<li>Isolated</li>
<li>Explicit</li>
<li>Flexible</li>
</ul>
<p><a href="https://hatch.pypa.io/1.12/">Hatch</a> has (or will have, at the time of writing)
largely compatible workspaces.</p>
<p>However <a href="https://pre-commit.com/">pre-commit</a> became a bottleneck. They needed
to run 170+ pre-commit hooks <strong>on every commit</strong>.
<a href="https://github.com/j178/prek">Prek</a> is a drop-in replacement for pre-commit and
works fantastically. It is optimized for speed and monorepos.</p>
<p>Airflow uses symlinked shared libraries (where a shared lib is also a
distribution). The Hatchling build backend needs to replace links with physical
copies during packaging. They use Prek to maintain consistency.</p>
<p><code>uv sync</code> detects conflicts between merged requirements files, and Prek hooks
enforce relative imports in shared code to prevent cross coupling issues (IIRC).</p>
<p><a href="https://fosdem.org/2026/schedule/event/WE7NHM-modern-python-monorepo-apache-airflow/">Link to the conference page</a></p>
<h2 id="pyinfra-because-your-infrastructure-deserves-real-code-in-python-not-yaml-soup--loïc-wowi42-tosser">PyInfra: Because Your Infrastructure Deserves Real Code in Python, Not YAML Soup &mdash; Loïc &ldquo;wowi42&rdquo; Tosser</h2>
<p>Loïc is a Frenchman (which, as he himself states, means he <strong>must</strong> have
opinions) and not a YAML fan to put it mildly. That is: YAML as a programming
language, e.g. how it is used in <a href="https://github.com/ansible/ansible">Ansible</a>.</p>
<figure><img src="/images/fosdem2026_loic_tosser.jpg"
    alt="Photo of Loïc Tosser showing a complex Ansible task in YAML"><figcaption>
      <p>Loïc Tosser demonstrating what happens when you ask a config file to be a programming language</p>
    </figcaption>
</figure>

<p><a href="https://pyinfra.com/">PyInfra</a> is an infrastructure as code library to write
Python code which is then translated to shell scripts to run on the target
hosts. So, in contrast to Ansible, you do not need Python on the target. The
target machine only needs SSH and a POSIX shell. You can also configure Docker
containers with PyInfra.</p>
<blockquote>
<p>If it has SSH, PyInfra can talk to it.</p></blockquote>
<p>PyInfra has idempotent operations and built-in diff checking. Declarative
infrastructure with actual code and not YAML. You can use inventory from
Terraform, Coolify or any API.</p>
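<p>The idea behind idempotent operations with diff checking can be sketched in
plain Python. This is a toy model, not PyInfra&rsquo;s API: the function
<code>plan_packages</code> and the <code>apt-get</code> commands it emits are assumptions
made for illustration. It compares the desired state with the current state and
only produces shell commands for the difference, so planning again after
applying the result yields nothing to do.</p>

```python
def plan_packages(installed, desired):
    """Return the shell commands needed to get from installed to desired."""
    missing = sorted(set(desired) - set(installed))
    return [f"apt-get install -y {name}" for name in missing]


installed = {"openssh-server"}
desired = {"openssh-server", "nginx", "git"}

commands = plan_packages(installed, desired)
print(commands)  # ['apt-get install -y git', 'apt-get install -y nginx']

# Pretend the commands ran on the host, then plan again: nothing left to do.
installed |= desired
print(plan_packages(installed, desired))  # []
```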
<p>You can leverage the entire Python packaging ecosystem. Slack integration? Just
use the right Python package.</p>
<p>PyInfra is not only a CLI tool, you can also use it as a library.</p>
<p>PyInfra is 10 times faster than Ansible, uses 70% less code, has proper code
reuse via <code>import</code> and proper loops instead of <code>with_items</code>. It can have actual
unit tests and can scale to thousands of servers. Also you no longer have error
messages stating that <q>the error appears to be in &hellip; <strong>but may be
elsewhere in the file</strong> &hellip;</q> (looking at you Ansible). PyInfra has
clear error messages without having to specify <code>-vvvv</code> and wading through
hundreds of lines of output.</p>
<p>The suggested migration path:</p>
<ul>
<li>Start small, one playbook at a time</li>
<li>Use your IDE for autocomplete and refactoring</li>
<li>Leverage Python&rsquo;s standard library and the ecosystem with all its packages</li>
<li>Sleep better because you don&rsquo;t have to debug at 3 AM.</li>
</ul>
<p>Is PyInfra production ready? Yes! It has a stable API, is already in use in
production, it&rsquo;s actively maintained and is MIT licensed (so no commercial
entity behind it to steer its direction).</p>
<p>You can get started today with a simple &ldquo;<code>pip install pyinfra</code>&rdquo;.</p>
<p><a href="https://fosdem.org/2026/schedule/event/VEQTLH-infrastructure-as-python/">Link to the conference page</a></p>
<p>(Note from me, Mark: I found Loïc a great speaker: he has lots of energy, is
funny and can transfer his enthusiasm to the room. If the topic interests you
and the video becomes available, I would recommend watching this talk as a great
sales pitch to get started with PyInfra.)</p>
<h2 id="ducks-to-the-rescue---etl-using-python-and-duckdb--marc-andré-lemburg">Ducks to the rescue - ETL using Python and DuckDB &mdash; Marc-André Lemburg</h2>
<p>ETL stands for Extract, Transform, Load. Nowadays we usually do Extract, Load,
Transform because databases are efficient in processing.</p>
<p>DuckDB is an open source, in-process analytics database (OLAP). It is similar
to SQLite, but for OLAP workloads. It has great Python support and uses SQL as
standard query language. It&rsquo;s pip installable, column based
(<a href="https://arrow.apache.org/">Apache Arrow</a>). It&rsquo;s single writer but allows for
multiple readers, so it&rsquo;s not a distributed database.</p>
<p><a href="https://github.com/pola-rs/polars">Polars</a>&rsquo; streaming can help with processing
your data as a line-by-line stream so you don&rsquo;t have to load the whole file in
memory at once.</p>
<p>Example to load a CSV file into DuckDB extremely fast:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(...)</span><span class="w">
</span></span></span></code></pre></div><p>You can load the data into staging tables first to prepare everything and not
mess up e.g. existing data. You can then transform data in DuckDB, e.g. filter
out unneeded and duplicate data, validate data, fill in missing data, convert
data types, etc. You can do the transforms in SQL. You can even use native
integrations to write to PostgreSQL, MySQL, etc. Or worst case stream to Python.</p>
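<p>The staging-table pattern itself is easy to try out. The sketch below uses
<code>sqlite3</code> from the Python standard library as a stand-in so that it runs
without extra packages. It is not DuckDB, where the load step would be a
<code>read_csv</code> call instead. The table names, columns and sample rows are made
up.</p>

```python
import csv
import io
import sqlite3

RAW_CSV = """id,name,amount
1,alice,10.5
2,bob,
2,bob,
3,carol,7
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (id TEXT, name TEXT, amount TEXT)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")

# Extract + load: copy the raw rows into the staging table untouched.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
con.executemany("INSERT INTO staging VALUES (:id, :name, :amount)", rows)

# Transform in SQL: drop duplicates, fill in missing amounts, convert types.
con.execute(
    """
    INSERT INTO orders (id, name, amount)
    SELECT DISTINCT
        CAST(id AS INTEGER),
        name,
        CAST(COALESCE(NULLIF(amount, ''), '0') AS REAL)
    FROM staging
    """
)

result = con.execute("SELECT id, name, amount FROM orders ORDER BY id").fetchall()
print(result)  # [(1, 'alice', 10.5), (2, 'bob', 0.0), (3, 'carol', 7.0)]
```

<p>Because the raw rows only ever land in <code>staging</code>, a failed or wrong
transform does not touch the data that is already in <code>orders</code>.</p>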
<p>Guidelines:</p>
<ul>
<li>Know your queries, that is: know how your data is going to be used</li>
<li>Use the Pareto principle (80/20 rule): optimize for queries that are used
often</li>
<li>Keep a healthy balance between performance and space requirements (which are
often trade-offs)</li>
</ul>
<p>Huge datasets: use the <a href="https://github.com/duckdb/ducklake">DuckLake</a> extension.</p>
<p>To get started: &ldquo;<code>uv add duckdb</code>&rdquo;. Do some experiments and see how it works for
you.</p>
<p><a href="https://fosdem.org/2026/schedule/event/S7RELZ-ducks_to_the_rescue_-_etl_using_python_and_duckdb/">Link to the conference page</a></p>
<h2 id="my-takeaways">My takeaways</h2>
<ul>
<li>Yes, FOSDEM is crowded and you may not be able to get into every talk you want
to see in person, but it&rsquo;s still nice to be there. It&rsquo;s well organised and
there&rsquo;s a friendly atmosphere. Lots of interesting projects to see and people
to talk to. And it&rsquo;s convenient if you want to sponsor your favorite projects
by buying some merchandise.</li>
<li>It&rsquo;s worth investigating signing Docker images (in the right way) further.</li>
<li>Lazy imports look useful! Once Python 3.15 lands it&rsquo;s worth doing profiling on
the projects I work on to see if we can use those to speed things up on
startup and save some memory.</li>
<li>At work we recently decided to go for a monorepo for a project. I want to see
if/how <code>uv</code> workspaces and <code>prek</code> can help us.</li>
<li>I&rsquo;ve written a bunch of Ansible roles to configure my humble homelab and
laptop. Perhaps it&rsquo;s time to switch to PyInfra? It sounds promising and might
be worth the investment of migrating to.</li>
</ul>
<h2 id="about-the-trip">About the trip</h2>
<p><figure class="float-right"><img src="/images/fosdem2026_atomium.jpg"
    alt="Picture of the Atomium at night" width="200px"><figcaption>
      <p>The <a href="https://en.wikipedia.org/wiki/Atomium">Atomium</a> at night</p>
    </figcaption>
</figure>

Last year I drove to Brussels on Friday and stayed at the city center in the
<a href="https://cityboxhotels.com/hotels/brussels/citybox-brussels">Citybox Brussels
hotel</a> for one
night, since I had to be home on Sunday. The upside: it was just a short (15
minute?) tram ride to the FOSDEM location. Unfortunately it did mean I had to
drive home that evening.</p>
<p>This year I had more time, so I booked a room at
<a href="https://www.falkohotel.be/">Falko Hotel</a> for two nights. It&rsquo;s about a 20&ndash;30
minute drive (depending on traffic) to the <a href="https://www.interparking.be/en/parkings/brussels/toison-d-or/">parking
garage</a> I used.
And from there about 20 minutes with public transport to the Université libre de
Bruxelles.</p>
<p>Staying another night meant I had more time for sightseeing, had the time to
write this post from my notes and could drive home well rested the next day.</p>
<p>As for tech: besides a phone and laptop, I also brought along two items that
made the trip more comfortable:</p>
<ul>
<li>A <a href="https://mojogear.eu/en/products/mojogear-mini-evo-10-000-mah-power-bank-22-5w">MOJOGEAR Mini
Evo</a>
powerbank to give my phone extra juice to make it through the day. With 10,000
mAh and up to 22.5W of power it&rsquo;s more than sufficient for a day at a
conference. With its small size and less than 175 grams in weight, it&rsquo;s also
easy to carry around.</li>
<li>A <a href="https://www.gl-inet.com/products/gl-sft1200/">GL.iNet Opal (GL-SFT1200)</a>
travel router. I plug it in, hook it up to the hotel internet, start a VPN
connection and all my other devices automatically connect to it and can use
the internet without the hotel snooping on my traffic. (Not that I have an
indication that my hotel would do that, but theoretically they could if I
would not use a VPN.)</li>
</ul>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[All Day DevOps 2021]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2021/10/28/all-day-devops-2021/" type="text/html" />
    <id>https://markvanlent.dev/2021/10/28/all-day-devops-2021/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="chaos engineering" />
    <category term="conference" />
    <category term="costs" />
    <category term="culture" />
    <category term="devops" />
    <category term="infrastructure as code" />
    <category term="terraform" />
    
    <updated>2021-12-09T21:14:34Z</updated>
    <published>2021-10-28T00:00:00Z</published>
    <content type="html"><![CDATA[<p><a href="https://www.alldaydevops.com/">All Day DevOps</a>, the free online DevOps
conference that goes on for 24 hours, was held for the 6th time today. With a
total of 180 speakers spread over 6 tracks, there&rsquo;s even more content than <a href="/2019/11/06/all-day-devops/">the
last time I attended</a>. These are the notes I
took during the day.</p>
<figure><img src="/images/all_day_devops_2021_keynote.png"
    alt="First slide of the kick-off session with Derek Weeks opening the day"><figcaption>
      <p>Derek Weeks opening the day for All Day DevOps</p>
    </figcaption>
</figure>

<h2 id="keynote-cams-then-and-now-and-the-path-forward--thomas-vikingops-krag">Keynote: CAMS Then and Now and the Path Forward &mdash; Thomas &ldquo;VikingOps&rdquo; Krag</h2>
<p>When people think about DevOps, they think of CAMS, which stands for:</p>
<ul>
<li><strong>C</strong>ulture</li>
<li><strong>A</strong>utomation</li>
<li><strong>M</strong>easurement</li>
<li><strong>S</strong>haring</li>
</ul>
<p>The term CAMS was formalized by John Willis and he wrote down his idea in the
article <a href="https://www.chef.io/blog/what-devops-means-to-me">What Devops Means to Me</a>.
This talk will focus on the most important aspect: culture, and specifically
organizational learning.</p>
<h3 id="culture">Culture</h3>
<p>Culture as described by Simon Sinek:</p>

  <figure>

<blockquote >
A group of people with a common set of values and beliefs. When we&rsquo;re
surrounded by people who believe what we believe something remarkable happens.
Trust emerges.
</blockquote>

  <figcaption>
    &mdash;Simon Sinek
  </figcaption>
  </figure>


<p><img src="/images/all_day_devops_2021_organizational_learning_thomas_krag.png" alt="Thomas Krag talks about organizational learning"></p>
<p>The 5 disciplines from <a href="https://en.wikipedia.org/wiki/Learning_organization">organizational learning</a>:</p>
<dl>
<dt>Shared vision</dt>
<dd>Have people play the game not just because of the rules, but because they feel
responsible for the game.</dd>
<dt>Mental Models</dt>
<dd>Challenging your own assumptions and recognising how they can influence your
actions. You need to self-reflect on your own beliefs.</dd>
<dt>Personal Mastery</dt>
<dd>This is about owning yourself, learning and achieving your personal goals. And
to do the latter you first need to define what is important for yourself.</dd>
<dt>Team learning</dt>
<dd>Effective teamwork leads to results that persons cannot achieve on their own.
And individuals working as a team can learn faster than they would by themselves.</dd>
<dt>Systems Thinking</dt>
<dd>Looking into patterns that emerge inside your organization from a
holistic viewpoint instead of just looking at your own team.</dd>
</dl>
<h3 id="automation">Automation</h3>
<p>Automation is about automating the right things. You don&rsquo;t want to do it just to
automate things, but it has to fit in the system. So you have to start with
culture before you think about automation. The ultimate goal is
<a href="https://www.gitops.tech/">GitOps</a> where everything happens via a pull request.</p>
<h3 id="measurement">Measurement</h3>
<p>Measurement started out with a focus on the tooling. But
<strong>measuring how you work</strong> is as important as measuring your infrastructure.</p>
<p>Important key metrics, from <a href="https://www.devops-research.com/research.html">DORA</a>:</p>
<ul>
<li>deployment frequency (how often do you release to production)</li>
<li>lead time for changes (the amount of time for a change to get into production)</li>
<li>time to restore service (how long does it take to recover from a failure in
production)</li>
<li>change failure rate (the percentage of deployments causing a failure in
production)</li>
</ul>
<h3 id="sharing">Sharing</h3>
<p>Share how you are doing (in) DevOps. This is how we ended up here now.
Share how you are improving your own organization. But also share information
<em>within</em> your organization: documentation, videos, presentations,
<a href="https://devopsdays.org/open-space-format/">open spaces</a>,
<a href="https://leancoffee.org/">lean coffee sessions</a>.</p>
<h3 id="three-ways">Three ways</h3>
<p>CAMS originated in 2010. The
<a href="https://itrevolution.com/the-three-ways-principles-underpinning-devops/">three ways</a>
are principles that came out of the book
<a href="https://itrevolution.com/the-phoenix-project/">The Phoenix Project</a>. They are:</p>
<ul>
<li>1st way: create flow (systems thinking)</li>
<li>2nd way: feedback loops</li>
<li>3rd way: experimentation, risks, learning</li>
</ul>
<p>If we map these to CAMS:</p>
<ul>
<li>Culture: 3rd way</li>
<li>Automation: 1st way</li>
<li>Measurement: 2nd way</li>
<li>Sharing: 3rd way</li>
</ul>
<p>Working on your culture is as important as doing your actual work.</p>
<p>So what&rsquo;s next? CAMS is still as applicable as 10 years ago. It has always been
important, but it was only put into words in 2010. We need to continue sharing
to get the full value out of it.</p>
<h2 id="gamification-of-chaos-testing--bram-vogelaar">Gamification of Chaos Testing &mdash; Bram Vogelaar</h2>
<p>We think of <a href="https://en.wikipedia.org/wiki/Usain_Bolt">Usain Bolt</a> as the record
breaking athlete, but he&rsquo;s also the person that worked really hard to get there.
It takes a lot of time and effort to become good at something. Pilots and firemen
spend most of their time training and not doing what you expect them to do; just
to make sure they perform well under pressure. Also note that pilots use a lot
of checklists to prevent mistakes.</p>
<p>We should do the same: train for when our platform is in an error state. We
should not just be able to detect it, but also solve the problem.</p>
<p>Chaos engineering is <q>the discipline of experimenting on a distributed system
in order to build <strong>confidence</strong> in the system&rsquo;s capability to
withstand turbulent conditions <strong>in production</strong>.</q> This
practice started at Netflix with <a href="https://github.com/Netflix/chaosmonkey">Chaos Monkey</a>.</p>
<p><img src="/images/all_day_devops_2021_chaos_engineering_ecosystem_bram_vogelaar.png" alt="Bram Vogelaar showing an image of the vast CNCF cloud native landscape with a whole section of chaos engineering products"></p>
<p>We need to become comfortable with experimenting. Have game day exercises and
analyze what happened, to improve your training. Do not just focus on the result
of the exercise itself, but also ask questions like &ldquo;was it the right
experiment?&rdquo;</p>
<p>Now that we use containers, add sidecar containers with tools to get metrics or
detect errors. Or to do chaos engineering e.g. with
<a href="https://github.com/Shopify/toxiproxy">Toxiproxy</a>.</p>
<p>Since checklists are boring, we can use gamification to spice things up.
Celebrate failure, and learn from it!</p>
<blockquote>
<p>Living in the year 3000: breaking production on purpose on Saturdays and have
the system remedy the problem itself.</p></blockquote>
<p>Convince management that failure is <em>normal</em> and <em>expected</em> behaviour. Promising
100% uptime is not realistic. Large, complex systems will always be in a
(somewhat) degraded state.</p>
<p>Let engineers be scientists to deal with this complex environment. Give them
training, allow them to do tests (experiments), which results in having valid
monitoring that leads to actionable alerts. Get the engineers in a state where
they are comfortable with failures.</p>
<p>(<a href="https://www.slideshare.net/attachmentgenie">Slides</a>)</p>
<h2 id="watch-your-wallet-cost-optimizations-in-aws--renato-losio">Watch Your Wallet! Cost Optimizations in AWS &mdash; Renato Losio</h2>
<p>Each AWS service has its own price components. It&rsquo;s a complex subject. Even a
simple service like a load balancer has multiple components. It looks simple
with the &ldquo;$0.008 per LCU-hour&rdquo; price tag, but now you have to figure out what an
LCU-hour is. Then you learn it has four dimensions that are measured: number of
new connections per second, active connections, processed bytes and rule
evaluations. Good luck predicting the costs.</p>
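<p>To make the LCU idea a bit more concrete, here is a small Python sketch. It
assumes an Application Load Balancer and uses the per-LCU capacities as I
understand them from the AWS pricing page; treat those numbers as assumptions
and verify them before relying on the outcome. It also ignores details such as
free rule evaluations.</p>

```python
# Capacity covered by one LCU per dimension. These values are assumptions based
# on AWS's published Application Load Balancer pricing; verify before use.
LCU_CAPACITY = {
    "new_connections_per_second": 25,
    "active_connections_per_minute": 3000,
    "processed_gb_per_hour": 1.0,
    "rule_evaluations_per_second": 1000,
}
PRICE_PER_LCU_HOUR = 0.008  # USD, the price tag quoted in the talk


def lcus_used(usage):
    """Only the dimension with the highest LCU consumption is billed."""
    return max(usage[name] / capacity for name, capacity in LCU_CAPACITY.items())


def hourly_cost(usage):
    """Cost in USD for one hour with the given average usage."""
    return lcus_used(usage) * PRICE_PER_LCU_HOUR


# Made-up usage figures for one hour.
example_usage = {
    "new_connections_per_second": 50,       # 2 LCUs
    "active_connections_per_minute": 6000,  # 2 LCUs
    "processed_gb_per_hour": 4.0,           # 4 LCUs, the dominant dimension
    "rule_evaluations_per_second": 500,     # 0.5 LCU
}
```

<p>With these made-up numbers the processed bytes dominate, so the hour is billed
as 4 LCUs.</p>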
<p>This presentation only sticks to the basics since cost optimization is such a
big topic.</p>
<p>To start managing and reducing your costs, you need to make billing and cost data
visible to your DevOps team. If you cannot measure your costs, you cannot manage them. Note
that you do not need an excessive amount of tags for cost management. First you
need to figure out what you are going to change and how it&rsquo;s going to affect the
bill for your company.</p>

  <figure>

<blockquote >
When savings can be measured, they can be recognized, and <strong>cost efficiency
projects become exciting opportunities</strong>. As of early 2021, the most viewed
dashboard at Airbnb is a dashboard of AWS costs.
</blockquote>

  <figcaption>
    &mdash;Anna Matlin, Airbnb
  </figcaption>
  </figure>


<p>What patterns can we avoid? In most organizations, the most expensive parts of your
bill will be:</p>
<ul>
<li>Compute</li>
<li>Storage</li>
<li>Data transfer</li>
</ul>
<p>So we will dive into these subjects.</p>
<h3 id="compute">Compute</h3>
<p>Tips to reduce costs:</p>
<ul>
<li>Avoid fixed IP addresses where possible. This is not so much about the costs
of the IP address itself, but mostly because architectures that require a
fixed IP tend to be complex and more expensive.</li>
<li>Use Graviton (ARM) instances. They have a better price to performance ratio.
You can also migrate managed services (RDS, Lambda functions) to Graviton.</li>
<li>Check out Lightsail. It&rsquo;s less flexible than using EC2 instances, but cheaper.</li>
</ul>
<h3 id="data-transfer">Data transfer</h3>
<p>We usually forget to take data transfer costs into account upfront. It&rsquo;s also a
complex subject and there are a lot of considerations to make.</p>
<p><img src="/images/all_day_devops_2021_aws_data_transfer_costs_renato_losio.png" alt="Renato Losio shows an image with all kind of ways data transfer can cost you money"></p>
<p>Tips:</p>
<ul>
<li>Multi AZ is always good, but are 3 zones always better than 2 zones?</li>
<li>Multi region is easy to do nowadays, but expensive with regard to data
transfer. Ask yourself if you really need it.</li>
<li>If you want to use multiple cloud providers always think about the data
transfer costs. Perhaps the storage costs are lower at provider X, but if you
have to transfer data, you&rsquo;ll probably pay an egress cost.</li>
<li>Using a CDN (CloudFront) might be cheaper than paying for the egress data
transfer otherwise.</li>
</ul>
<h3 id="storage">Storage</h3>
<p>Initially it was simple: there were only two storage classes. Currently there
are 6 different classes with their own prices and characteristics.</p>
<p><img src="/images/all_day_devops_2021_s3_storage_classes_renato_losio.png" alt="Renato Losio shows an image with the various S3 storage classes and their use cases"></p>
<p>The best option is to use lifecycle rules to move data between different
classes. Note that in a lifecycle policy you cannot filter the objects based
on an extension (e.g. <code>*.jpg</code>) but instead you need to think &ldquo;from left to
right.&rdquo; So you need to think upfront about the prefixes you are going to want to
use in your bucket.</p>
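<p>A sketch of what such a prefix-based rule could look like, written as the Python
dictionary that boto3 accepts for a lifecycle configuration. The rule name, the
prefix, the bucket name and the transition days are hypothetical values chosen
for this example.</p>

```python
# Lifecycle rules select objects by key prefix (matched from the left), so a
# layout like "images/..." and "logs/..." works where "*.jpg" cannot.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-images",           # hypothetical rule name
            "Filter": {"Prefix": "images/"},  # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def rule_applies(rule, key):
    """Mimic the left-to-right prefix match for a single object key."""
    return key.startswith(rule["Filter"]["Prefix"])


# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_configuration)
```

<p>The helper <code>rule_applies</code> only exists to show the matching behaviour
locally; S3 does the actual matching itself.</p>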
<p>Managed services can offer you automatic and manual backups, which can be great.
But what is the cost of that? Check how much retention you need for example.</p>
<p>With regard to EBS: for most use cases <code>gp3</code> is better and cheaper than <code>gp2</code>
(except for very large volumes). Note that you can change the EBS volume type
without stopping the machine. In most cases you can have a 20% cost saving
without affecting your performance.</p>
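<p>The arithmetic behind that saving can be sketched as follows. The prices are
assumptions (us-east-1 list prices as I remember them from around the time of
the talk) and the sketch ignores extra throughput charges, so check the current
EBS pricing before drawing conclusions.</p>

```python
# Prices in USD per month. These are assumptions; check the current EBS
# pricing page for your region.
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08
GP3_FREE_IOPS = 3000
GP3_PER_EXTRA_IOPS = 0.005


def gp2_cost(size_gb):
    return size_gb * GP2_PER_GB


def gp2_baseline_iops(size_gb):
    # gp2 performance scales with size: 3 IOPS per GB, between 100 and 16,000.
    return max(100, min(16000, 3 * size_gb))


def gp3_cost(size_gb, iops=GP3_FREE_IOPS):
    # gp3 decouples performance from size; IOPS above the baseline cost extra.
    extra_iops = max(0, iops - GP3_FREE_IOPS)
    return size_gb * GP3_PER_GB + extra_iops * GP3_PER_EXTRA_IOPS


def saving(size_gb):
    """Fraction saved on gp3 while keeping at least the gp2 baseline IOPS."""
    iops = max(GP3_FREE_IOPS, gp2_baseline_iops(size_gb))
    return 1 - gp3_cost(size_gb, iops) / gp2_cost(size_gb)
```

<p>Under these assumptions <code>saving(500)</code> is 20%, while
<code>saving(2000)</code> drops to 12.5% because the larger <code>gp2</code> volume
comes with more IOPS that you would have to pay for separately on <code>gp3</code>.</p>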
<h3 id="aws-changes-quickly">AWS changes quickly</h3>
<p>After each AWS re:Invent, your deployment is probably outdated with regard to cost
optimizations. Examples:</p>
<ul>
<li>Use <code>gp3</code> instead of <code>gp2</code> for your EBS volumes.</li>
<li>The instance type <code>m6</code> might be more interesting than the <code>m5</code> or <code>m4</code> you may
currently be using.</li>
<li>Dublin was the cheapest region in Europe, but for most things Stockholm is
cheaper than Dublin at the moment.</li>
</ul>
<p>So keep up to date with the offerings.</p>
<h2 id="keynote-call-for-code-with-the-linux-foundation-contributing-to-tech-for-good-even-if-youre-not-techincal--daniel-krook-and-demi-ajayi">Keynote: Call for Code with The Linux Foundation: Contributing to Tech-for-Good Even if You&rsquo;re Not Technical &mdash; Daniel Krook and Demi Ajayi</h2>
<p>Call for Code is a multi year program launched in 2018 to address humanitarian
issues and help bridge potential solutions. Last year the global challenge was
around climate change and a track was added for the social and business impact
of the COVID-19 pandemic.</p>
<p>It&rsquo;s not just about generating ideas to take on the issues. It should
eventually also lead to an adopted open source solution that is sustainable.</p>
<p>The 14 projects discussed today can be found on the <a href="https://linuxfoundation.org/projects/call-for-code/">Call for Code page on the Linux Foundation website</a>.
You can also read about them via the <a href="https://developer.ibm.com/callforcode/">IBM developer site</a>.
GitHub is central for how they iterate on the features. The related organizations are:</p>
<ul>
<li><a href="https://github.com/call-for-code">Call for Code with the Linux Foundation</a></li>
<li><a href="https://github.com/call-for-code-for-racial-justice">Call for Code for Racial Justice</a></li>
</ul>
<p>The key ask of this session is for us to help improve how the projects do
DevOps. The goal is to ensure that everyone can contribute to the projects, can do so
with confidence and that the projects can be deployed with speed.</p>
<p>The Call for Code for Racial Justice open source projects are categorised in three pillars:</p>
<ul>
<li>Police reform and judicial accountability:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/fairchange">Fair Change</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Open-Sentencing/">Open Sentencing</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Incident-Accuracy-Reporting-System">Incident Accuracy Reporting System (IARS)</a></li>
</ul>
</li>
<li>Diverse representation:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/TakeTwo">Take Two</a></li>
</ul>
</li>
<li>Policy and legislative reform:
<ul>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Legit-Info/blob/main/README.md">Legit-Info</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Five-Fifths-Voter">Five Fifths Voter</a></li>
<li><a href="https://github.com/Call-for-Code-for-Racial-Justice/Truth-Loop">Truth Loop</a></li>
</ul>
</li>
</ul>
<p>Demi talked about each of these projects in the program and their tech stacks. You
can read more about them on the
<a href="https://developer.ibm.com/callforcode/racial-justice/">Call for Code for Racial Justice section</a>
on the IBM developer site.</p>
<p>Daniel in turn talked about other Call for Code projects:</p>
<ul>
<li><a href="https://clusterduckprotocol.org/">ClusterDuck</a></li>
<li><a href="https://pyrrha-platform.org/">Pyrrha</a></li>
<li><a href="https://www.isac-simo.net/">ISAC-SIMO</a></li>
<li><a href="https://openeew.com/">OpenEEW</a></li>
<li><a href="https://github.com/Call-for-Code/Liquid-Prep">Liquid Prep</a></li>
<li><a href="https://github.com/Call-for-Code/DroneAid">DroneAid</a></li>
<li><a href="https://github.com/Rend-o-matic">Rend-o-matic</a></li>
</ul>
<p>Even if you cannot code, there are numerous ways you can contribute, e.g.
conducting user research, writing or reviewing documentation, doing design work, or
advocacy such as speaking at conferences.</p>
<p><img src="/images/all_day_devops_2021_call_for_code_non_technical_contributions_demi_ajayi.png" alt="Demi Ajayi talking about non-technical ways to contribute to Call for Code"></p>
<p>There are multiple ways to get involved:</p>
<ul>
<li>Via the virtual community (<a href="https://callforcode.org/slack">join slack</a>, join events)</li>
<li>Directly via the projects on GitHub</li>
<li>Conduct outreach (recruit friends, host events, etc)</li>
</ul>
<h2 id="devsecops-culture-laughing-through-the-failures--chris-romeo">DevSecOps Culture: Laughing Through the Failures &mdash; Chris Romeo</h2>
<p>Chris shared 10 DevSecOps failures and talked about how to change the culture
and turn these failures into successes.</p>
<h3 id="1-name-and-brand">#1 Name and brand</h3>
<p>This quote nicely sums it up:</p>

  <figure>

<blockquote cite="https://twitter.com/jvehent">
<p>Can we stop the {Sec}Dev{Sec}Ops{Sec} naming foolishness?</p>
<p>Just call it DevOps and focus on making security a natural part of building stuff.</p>

</blockquote>

  <figcaption>
    &mdash;Julien Vehent, <cite><a href="https://twitter.com/jvehent">@jvehent</a></cite>
  </figcaption>
  </figure>


<p>To change the culture:</p>
<ul>
<li>Embrace security and make it part of DevOps</li>
<li>Teach this new definition to everyone involved with your project (developers,
testers, product managers, executives, etc)</li>
<li>Create a &ldquo;DevSecOps&rdquo; swear jar ;-)</li>
</ul>
<h3 id="2-the-infinity-graph">#2 The infinity graph</h3>
<p><img src="/images/all_day_devops_2021_infinity_loop_chris_romeo.png" alt="Chris Romeo shows the often used infinity symbol"></p>
<p>The problem is that nothing ever gets done with this infinity graph. It&rsquo;s not an
accurate representation.</p>
<p>The solution is to talk about pipelines instead and integrate security into
them. Code review should include security, vulnerability scanning should be part
of the pipeline, etc. Ban the infinity graph.</p>
<h3 id="3-security-as-a-special-team">#3 Security as a special team</h3>
<p>Creating a specific security team is the <em>opposite</em> of what DevOps is about. It&rsquo;s about
working together. Security isn&rsquo;t a specialty, it is the responsibility of
everybody. This requires knowledge and expertise.</p>
<p>The other way around is also true: teach security people to code. They don&rsquo;t
have to become great coders, but it would be nice if they can review code and
make suggestions to make things more secure.</p>
<h3 id="4-vendor-defined-devops">#4 Vendor defined DevOps</h3>
<p>Sometimes we let vendors define what DevOps and security are for us, via the
products that they offer. It would be better to find the best of breed outside
of the offering of your cloud provider. Take a vendor independent approach and
determine what DevOps means to <strong>you</strong>.</p>
<h3 id="5-big-company-envy">#5 Big company envy</h3>
<p>Looking at the big companies can be discouraging. You most likely have not
invested the same time in it as e.g. Netflix, Etsy, etc. have. So while you
won&rsquo;t be at the same level, don&rsquo;t see this as an excuse to give up. Do the
DevOps that you do. Don&rsquo;t fixate on the top of the class. Get on that path and
make incremental progress.</p>
<p>Use the <a href="https://owasp.org/www-project-devsecops-maturity-model/">OWASP DevSecOps Maturity Model</a> to create a roadmap.</p>
<h3 id="6-overcomplicated-pipelines-and-doing-everything-now">#6 Overcomplicated pipelines and doing everything now</h3>
<p>This can be a complicated subject (see for example the
<a href="https://blog.sonatype.com/2020-devsecops-reference-architecture">DevSecOps Reference Architecture</a>
from Sonatype).</p>
<p><img src="/images/all_day_devops_2021_sonatype_reference_architecture_chris_romeo.png" alt="Chris Romeo shows the Sonatype DevSecOps reference architecture"></p>
<p>But keep it simple! Start with a small subset of security tools. Everybody
related to your project should be able to explain the build pipeline. Don&rsquo;t try
to solve all problems immediately. Take a phased approach.</p>
<h3 id="7-security-as-gatekeeper">#7 Security as gatekeeper</h3>
<p>Security might want to slow down the pipeline and act as a gatekeeper. Don&rsquo;t say
&ldquo;<em>no</em>,&rdquo; but &ldquo;<em>yes, if&hellip;</em>&rdquo; For example: &ldquo;<em>yes</em>, that would be a great feature <em>if</em> you
enable multifactor authentication.&rdquo;</p>
<p>Practice empathy: security people towards developers, and also the other way
around.</p>
<h3 id="8-noisy-security-tools">#8 Noisy security tools</h3>
<p>You buy a tool and enable every option to &ldquo;get your money&rsquo;s worth.&rdquo; The result
is 10,000 JIRA tickets of things that need to be fixed. This does not help.</p>
<p>It would be better to tune the tools and not waste time on security findings
that do not matter.</p>
<p>Start with a minimal policy focussing on the largest issue. Developers will then
start to trust the tool and <em>then</em> you can slowly increase the policy.</p>
<h3 id="9-lack-of-threat-modelling">#9 Lack of threat modelling</h3>
<p>Your scanning tools cannot find business logic flaws; there&rsquo;s no pattern to it.</p>
<p><img src="/images/all_day_devops_2021_threat_modelling_chris_romeo.png" alt="Chris Romeo shows the condescending Wonka meme with the text &ldquo;DevOps is too quick for threat modeling you say? &hellip; Thousands of business logic flaws beg to differ&rdquo;"></p>
<p>Perform threat modelling outside of the pipeline. It should be done when new
feature assignments go out.</p>
<h3 id="10-vulnerable-code-in-the-wild">#10 Vulnerable code in the wild</h3>
<p>There are lots of vulnerabilities in open source software; this is a supply
chain problem.</p>
<p>To improve this: embed software composition analysis (SCA) in <strong>all</strong> your
pipelines. Set the SCA policy to fail when a vulnerability is detected. If you
filter out a vulnerability (e.g. because there&rsquo;s no fix yet), make sure the
filter will not be active forever.</p>
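<p>One way to keep such a filter from living forever is to give every exception an
expiry date. The following Python sketch shows the idea; the ignore list, the
placeholder vulnerability IDs and the function names are hypothetical and not
tied to any specific SCA tool.</p>

```python
from datetime import date

# Hypothetical ignore list: vulnerability ID -> last day the exception is valid.
ignored_until = {
    "CVE-0000-0001": date(2021, 12, 1),
    "CVE-0000-0002": date(2021, 10, 1),
}


def blocking_findings(findings, today):
    """Return the findings that should fail the build on the given day."""
    return [
        finding for finding in findings
        if finding not in ignored_until or ignored_until[finding] < today
    ]


def build_passes(findings, today):
    """The build only passes when no finding is blocking."""
    return not blocking_findings(findings, today)
```

<p>Once the expiry date has passed, the same finding breaks the build again and
someone has to decide whether to fix it or consciously extend the exception.</p>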
<h3 id="successes">Successes</h3>
<p>The 10 DevOps successes we can distil from the failures above:</p>
<ol>
<li>Just call it DevOps</li>
<li>Pipelines</li>
<li>Security for everyone</li>
<li>Embrace <strong>your</strong> DevOps</li>
<li>Be content with <strong>your</strong> DevOps</li>
<li>Simple and staged pipeline</li>
<li>Security as trusted partner</li>
<li>Tuned and valuable security tools</li>
<li>Threat modelling for everyone</li>
<li>Breaking the build for vulnerabilities</li>
</ol>
<h3 id="key-takeaways">Key takeaways</h3>
<ul>
<li>Hopefully the real impact of DevOps (when looking back 50 years from now) is
going to be security culture related.</li>
<li>Some of the failures were outside of your control, but you can change them.</li>
<li>Everybody has to code, choose <strong>your</strong> DevOps, keep it simple, lower the
noise, add the best practices we discussed, do some threat modelling, break
the build for vulnerabilities.</li>
<li>Embrace and laugh at the failures.</li>
</ul>
<h2 id="keynote-managing-risk-with-service-level-objectives-and-chaos-engineering--liz-fong-jones">Keynote: Managing Risk with Service Level Objectives and Chaos Engineering &mdash; Liz Fong-Jones</h2>
<p>Besides being a principal developer advocate at Honeycomb, Liz is also a member
of the platform on-call rotation. Honeycomb deploys with confidence up to 14
times a day, every day of the week&mdash;so also on Fridays. How do they manage to
(mostly) meet their Service Level Objectives (SLOs) while also scaling out their
user traffic?</p>
<p>Their confidence recipe:</p>
<ul>
<li>Quantify the amount of reliability.</li>
<li>Be able to identify risk areas that might prevent them from fulfilling their
targets.</li>
<li>Test to verify that the assumptions about the systems are correct.</li>
<li>Respond to that feedback to address the problems found via those experiments
or through natural outages.</li>
</ul>
<h3 id="how-to-measure-reliability">How to measure reliability?</h3>
<p>You need to know how broken is &ldquo;too broken.&rdquo; You don&rsquo;t have to alert on all
problems when working at scale. You need to measure success of the service and
define SLOs. These are a way to measure and quantify your reliability.</p>
<p>Honeycomb&rsquo;s job is to reliably ingest telemetry, index it, store it safely and
let people query it in near-real-time. Honeycomb&rsquo;s SLOs measure the things
that their customers care about.</p>
<p>For example, they have set an SLO that the homepage needs to load quickly
(within a few hundred milliseconds) 99.9% of the time. User queries need to
run successfully &ldquo;only&rdquo; 99% of the time and are allowed to take up to 10 seconds.
On the other hand: ingestion needs to succeed 99.99% of the time since they
only have one shot at it.</p>
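<p>These percentages translate directly into an error budget. A minimal Python
sketch of that arithmetic, where the service names and request counts are
made-up example values and only the targets come from the talk:</p>

```python
from fractions import Fraction

# Targets from the talk; Fraction avoids floating point rounding surprises.
SLO_TARGETS = {
    "homepage": Fraction("99.9") / 100,
    "query": Fraction("99") / 100,
    "ingest": Fraction("99.99") / 100,
}


def error_budget(service, total_requests):
    """Number of requests that may fail before the SLO is missed."""
    return int(total_requests * (1 - SLO_TARGETS[service]))


def budget_remaining(service, total_requests, failed_requests):
    """Positive: room for risk. Negative: the SLO has been missed."""
    return error_budget(service, total_requests) - failed_requests
```

<p>For one million requests this gives a budget of 10,000 failed queries, but only
100 failed ingest requests.</p>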
<blockquote>
<p>Services are not just 100% down or 100% up (most of the time).</p></blockquote>
<p>These metrics help Honeycomb make decisions about reliability and product
velocity. If the service is down too much, they need to invest in reliability
(since having features that cannot be used does not add value). On the other
hand: if they exceed the SLO, they can move faster.</p>
<h3 id="recipe-for-shipping-reliably-and-quickly">Recipe for shipping reliably and quickly</h3>
<p>Practices used by Honeycomb:</p>
<ul>
<li>Code is instrumented. Think beforehand what a success or failure in
production would look like.</li>
<li>Functional and visual testing using libraries that create snapshots.</li>
<li>They design for feature flag deployment (only making a feature available to a
small percentage of users initially).</li>
<li>Practice automated integration plus human review. (There&rsquo;s an SLO for how
long the tests are allowed to take. Code reviews are high priority tasks
since you are blocking someone else by not reviewing their code.)</li>
<li>The main branch is safe to release any time.</li>
<li>Automatically roll out the changes.</li>
<li>After the deployment the engineers have to observe what is happening with
their code in production. Only after they confirm that everything is okay,
they can go home. So you are free to deploy on Friday at 19:30 as long as you
are willing to stick around until your code is deployed and you have confirmed
it is not causing issues.</li>
</ul>
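<p>The percentage-based feature flag mentioned in the list above could be sketched
like this in Python. This is not how Honeycomb implements it; the hashing scheme
and the names are assumptions for illustration.</p>

```python
import hashlib


def is_enabled(flag, user_id, rollout_percentage):
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Users in the lowest buckets get the feature first.
    return bucket < rollout_percentage
```

<p>Because the bucket only depends on the flag name and the user, a user keeps
seeing the same behaviour between requests, and raising the percentage only ever
adds users to the group that has the feature.</p>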
<p>For infrastructure Honeycomb also use infrastructure as code practices:</p>
<ul>
<li>They can use CI and feature flags for their infrastructure.</li>
<li>They can automatically provision fleets if needed.</li>
<li>They can automatically quarantine certain paths to keep the main fleet from
crashing or do performance profiling.</li>
</ul>
<h3 id="chaos-engineering">Chaos Engineering</h3>
<p>Left-over error budget is used for chaos engineering experiments, in which
you test a hypothesis. You need to control the percentage of
users affected by it and be able to revert the impact you are causing.</p>
<blockquote>
<p>Chaos engineering is engineering. It&rsquo;s not pure chaos.</p></blockquote>
<p>This works well for stateless things, but how does it work for stateful things?
In the case of the Honeycomb infrastructure, they make sure to only restart one
server or service at a time. They do not introduce too much chaos to reduce the
likelihood that something goes catastrophically wrong.</p>
<p>Two reasons why you will want to do these experiments at 3 PM and not at 3 AM:</p>
<ul>
<li>You want to test at peak traffic instead of low traffic, since the latter
could give you a false sense of security in situations when everything may
still look normal even though this would have been a problem in a high traffic
situation.</li>
<li>When doing things in the afternoon, there are more people available to deal
with something than there would be in the middle of the night.</li>
</ul>
<p>With the experiments they measure if they had an impact on the customer
experience. If they cause a change, does the telemetry reflect this? (Is the
node indeed reported as being offline, for example?) When you fix things, you
need to repeat the experiment and make sure the change indeed fixed the issue.</p>
<p><img src="/images/all_day_devops_2021_chaos_engineering_liz_fong-jones.png" alt="Liz Fong-Jones: Not every experiment succeeds. But you can mitigate the risks."></p>
<p>When you burn the error budget, the <a href="https://sre.google/sre-book/table-of-contents/">SRE book</a>
states that you should freeze deploys. Liz disagrees. If you freeze deploys, but
continue with feature development, the risk of the next deployment only
increases. Instead, Liz advocates for using the team&rsquo;s time to work on
reliability (i.e. change the nature of the work instead of stopping work).</p>
<blockquote>
<p>Fast and reliable: pick both!</p></blockquote>
<p>You don&rsquo;t have to pick between fast and reliable. In a lot of ways fast <strong>is</strong>
reliable. If you exercise your delivery pipelines every hour of every day,
stopping becomes the anomaly instead of deploying.</p>
<h3 id="takeaways">Takeaways</h3>
<ul>
<li>By designing the delivery pipeline for reliability, Honeycomb can meet their
SLOs.</li>
<li>Feature flags can reduce the blast radius, and keep you within your SLO.</li>
<li>And when they cannot: there are other ways to mitigate the risks.</li>
<li>By discovering risks at 3 PM and not 3 AM, you improve the customer experience
since the system is more resilient.</li>
<li>If something does go catastrophically wrong, remember that the SLO is a
guideline not a rule. SLOs are for managing predictable-ish unknown-unknowns
and not things that are completely outside of your control.</li>
<li>We are all part of sociotechnical systems. Customers, engineers and stakeholders alike.</li>
<li>Outages or failed experiments are learning opportunities, not reasons to fire someone.</li>
<li>SLOs are an opportunity to have discussions about trade-offs between stability and speed.</li>
<li>DevOps is about talking to each other and talking to our customers.</li>
</ul>
<h2 id="common-pitfalls-of-infrastructure-as-code-and-how-to-avoid-them--tim-davis">Common Pitfalls of Infrastructure as Code (And how to avoid them!) &mdash; Tim Davis</h2>
<p>We start with the basics: what is Infrastructure as Code (IaC)? With the advent
of cloud providers, you no longer use hardware, but a UI to stand up
infrastructure. This led to <a href="https://en.wikipedia.org/wiki/Shadow_IT">shadow IT</a>
since developers ran off with a credit card to provision what they needed
themselves, instead of using slow, internal IT systems.</p>
<p>Developers, however, would rather write code and use developer methodologies than click
through a UI. This is where IaC started. With it you can create and manage
infrastructure by writing code.</p>
<p><img src="/images/all_day_devops_2021_pitfalls_tim_davis.png" alt="Tim Davis explains that IaC combines the pitfalls from both infrastructure and code"></p>
<p>What are the pitfalls? The bad news: you get all the pitfalls of infrastructure
and all the pitfalls of code. But you&rsquo;ve probably already got a lot of
experience with those issues and teams to handle them. You just use a different
methodology.</p>
<p>The first pitfall is not fostering the communication between the groups that
have experience and tools.</p>
<h3 id="infrastructure-pitfalls">Infrastructure pitfalls</h3>
<p>Which framework/tool do you pick? There are basically two categories:
multi-cloud or cloud agnostic tools on the one hand (like
<a href="https://www.terraform.io/">Terraform</a> and <a href="https://www.pulumi.com/">Pulumi</a>)
and cloud specific tools on the other (like
<a href="https://aws.amazon.com/cloudformation/">CloudFormation</a>). Note that for example
with Terraform and Pulumi you still have to rewrite code when switching from one
cloud provider to another, but at least the tool is familiar.</p>
<p>Security is a huge thing. You still need to know how to design your VPC, IAM
policies, security policies, etc. You still need to communicate with all the
teams that have the experience. It&rsquo;s not just Dev and Ops. With tools like
<a href="https://github.com/accurics/terrascan">Terrascan</a> and
<a href="https://www.checkov.io/">Checkov</a> you can shift-left the security aspect
instead of trying to bolt it on afterwards.</p>
<h3 id="code-pitfalls">Code pitfalls</h3>
<p>The biggest issue is with default values. If you use the UI, there are a
lot of boxes that may be blank or have stuff in them. Some of the boxes can be
left blank, for some you need to specify what you want. The UI is going to
yell at you; if you use IaC things may be less in your face.</p>
<p>You don&rsquo;t want to deploy something with an open policy.
<a href="https://www.openpolicyagent.org/">Open Policy Agent</a> can really help you to make sure you
stay within your allowed parameters. For instance you can write a policy to make
sure you only use a specific region, don&rsquo;t deploy an open S3 bucket or that
you only use certain sizes of EC2 instances.</p>
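<p>Open Policy Agent policies are written in Rego. To show just the kind of checks
described above, here is a sketch in plain Python instead; the resource layout,
the allowed values and the function name are hypothetical.</p>

```python
# Hypothetical guardrails; a real setup would express these as Rego policies
# evaluated by Open Policy Agent against the planned changes.
ALLOWED_REGIONS = {"eu-west-1"}
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}


def violations(resource):
    """Return human-readable policy violations for one planned resource."""
    problems = []
    if resource.get("region") not in ALLOWED_REGIONS:
        problems.append("region not allowed")
    if resource.get("type") == "s3_bucket" and resource.get("acl") == "public-read":
        problems.append("S3 bucket is publicly readable")
    if (resource.get("type") == "ec2_instance"
            and resource.get("instance_type") not in ALLOWED_INSTANCE_TYPES):
        problems.append("instance type not allowed")
    return problems
```

<p>Running such checks in the pipeline, before anything is applied, means a
forgotten default shows up as a failed build instead of as an open bucket.</p>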
<p>If you hard code certain values in Terraform or other IaC tools, you might need
to copy/paste a lot of code if you want to create e.g. a test, acceptance and
production environment. To mitigate these <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY (don&rsquo;t repeat
yourself)</a> issues you can
for instance use
<a href="https://www.terraform.io/docs/language/modules/index.html">Terraform modules</a> or
<a href="https://terragrunt.gruntwork.io/">Terragrunt</a>.</p>
<p>State size can become a problem. If you want to, you can put all of your
infrastructure in the same state file (which is where your tool stores the state
of the infrastructure). However it means the tool will have to check <em>all</em>
resources in the state file to detect if it needs to do something. To mitigate
this, you can again use Terraform modules. It helps with performance and makes
the codebase more manageable.</p>
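<p>Besides modules, a common way to keep state files small is to give each part
of the infrastructure its own state and to read the outputs of another state
where needed. A minimal sketch using the <code>terraform_remote_state</code>
data source (Terraform 0.12 or later syntax); the bucket, key, region and output
name are assumptions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Read the outputs of a separately managed &#34;network&#34; state.
# Bucket, key and region are placeholders.
data &#34;terraform_remote_state&#34; &#34;network&#34; {
  backend = &#34;s3&#34;

  config = {
    bucket = &#34;example-terraform-state&#34;
    key    = &#34;network/terraform.tfstate&#34;
    region = &#34;eu-west-1&#34;
  }
}

# Assumes the network configuration declares an output named &#34;vpc_id&#34;.
locals {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
</code></pre></div>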
<h2 id="wrap-up">Wrap-up</h2>
<p>The conference is still ongoing while I publish this post. However, this is it
for me for All Day DevOps for this year. I learned new things and got inspired.</p>
<p>Thanks to the organizers, moderators and speakers for hosting another great
event.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[All Day DevOps]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/11/06/all-day-devops/" type="text/html" />
    <id>https://markvanlent.dev/2019/11/06/all-day-devops/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="aws" />
    <category term="conference" />
    <category term="culture" />
    <category term="devops" />
    <category term="observability" />
    <category term="infrastructure as code" />
    <category term="serverless" />
    <category term="sre" />
    <category term="tools" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-11-06T00:00:00Z</published>
    <content type="html"><![CDATA[<p><a href="https://www.alldaydevops.com/">All Day DevOps</a> is an online conference
which lasts for 24 hours. With 150 sessions across 5 tracks, there&rsquo;s enough
content to consume.</p>
<p><em>(As with all my recent conference notes: these are just notes, not complete summaries.)</em></p>
<p><img src="/images/all_day_devops_2019.png" alt="All Day DevOps"></p>
<h2 id="adidoescode-where-devops-meets-the-sporting-goods-industry--fernando-cornago">#adidoescode Where DevOps Meets The Sporting Goods Industry &mdash; Fernando Cornago</h2>
<p>Adidas is a big company, about 57,000 employees. Software is also having an
impact on sports, e.g. registering your performance, etc. Is Adidas already a
software company? Definitely not, but they should start behaving as if they are.
They are building their applications 10,000 times a day.</p>
<p>New book from Gene Kim (the author of
<a href="https://itrevolution.com/the-phoenix-project/">The Phoenix Project</a>):
<a href="https://itrevolution.com/the-unicorn-project/">The Unicorn Project</a>.
Expected end of November 2019.</p>
<p>The platform you need to be able to accelerate your business should be:</p>
<ul>
<li>on demand and self-service</li>
<li>scalable and elastic</li>
<li>pay per use (which helps with the efficiency mindset)</li>
<li>transparent and measurable (observability)</li>
<li>open source + inner source</li>
</ul>
<p>To improve sales, Adidas&rsquo; platform can run where their customers are (globally)
and can scale up if needed and scale down again when possible.</p>
<p>Making use of data is key. The data is seen as a baton in a relay race and makes
its way through the teams. A large part of the time it takes to create a service
is spent looking at the data. The data is used to train their AI systems.</p>
<p>You want to deliver value and make sure that things are working. Testing needs
to be done on the physical level (the actual products that are being sold), but
also on the virtual level: the IT platform also needs to perform and work as
expected.</p>
<p>At Adidas they went from having a couple of &ldquo;DevOps engineers&rdquo; to having a
DevOps culture in all teams. This removed bottlenecks and allowed them to grow
faster. They have substantially reduced costs and manual testing and improved
drastically on time to production.</p>
<h2 id="multi-cloud-day-to-day-devops-power-tools--ronen-freeman">Multi Cloud &ldquo;day-to-day&rdquo; DevOps Power Tools &mdash; Ronen Freeman</h2>
<p>About the relation between SRE (Site Reliability Engineering) and DevOps: SRE is
an implementation of DevOps.</p>
<p>What defines a great SRE:</p>
<ul>
<li>Quality
<ul>
<li>Be curious
<ul>
<li>Stay up to date</li>
<li>Attend events</li>
<li>Tinker with things</li>
</ul>
</li>
<li>Adapt to change (change <strong>will</strong> happen, anticipate it, monitor it, adapt
to it and prepare for the next change)</li>
<li>Learn: <q>learn fast so you may promptly begin again</q></li>
<li>Laziness: automate, but ask yourself the question whether it will
actually save time or not</li>
</ul>
</li>
<li>Support
<ul>
<li>Googling is an art form</li>
<li>You&rsquo;ll learn the answer itself but also discover things in the process</li>
<li>There are numerous forums (for example: Stack Overflow, GitHub, Slack,
etc)</li>
<li>Have a daily standup</li>
<li>Pass on knowledge</li>
</ul>
</li>
<li>Tools: there are a number of areas to consider:
<ul>
<li>Code: Bash</li>
<li>Package: Docker</li>
<li>Deliver: <a href="https://concourse-ci.org/">Concourse</a></li>
<li>Platform:
<ul>
<li>Infrastructure: Kubernetes (<q>relatively simple to get started with</q>);
<a href="https://cloud.google.com/kubernetes-engine/">GKE</a> is the most mature</li>
<li>Configuration tool: Terraform</li>
</ul>
</li>
<li>Monitor: <a href="https://cloud.google.com/products/operations">Google Stackdriver</a>
(since already running on GKE)</li>
</ul>
</li>
</ul>
<p>There&rsquo;s a demo app:
<a href="https://github.com/Darillium/addo-demo">https://github.com/Darillium/addo-demo</a></p>
<h2 id="how-to-run-smarter-in-production-getting-started-with-site-reliability-engineering--jennifer-petoff">How to Run Smarter in Production: Getting Started with Site Reliability Engineering &mdash; Jennifer Petoff</h2>
<p>Software engineering is focussed on designing and building the application, not
on running it.</p>
<p>There are multiple places in organizations where friction occurs:</p>
<ul>
<li>Business vs development: this is addressed by Agile</li>
<li>Development vs operations: this friction is addressed by DevOps</li>
</ul>
<p>SRE is a bridge between business, development and operations.</p>
<p>Four key SRE principles:</p>
<ul>
<li>SRE needs Service Level Objectives (SLOs) with consequences</li>
<li>SREs must have time to improve the world</li>
<li>SRE teams must be able to regulate their workload</li>
<li>Failure is an opportunity to learn (having a blameless culture)</li>
</ul>
<p>Some terminology:</p>
<ul>
<li>Service Level Objectives (SLOs): a goal you want to achieve. For example, you
may want to set an uptime SLO, but typically you want to dig deeper, like
99.99% of HTTP requests should succeed with a &ldquo;200 OK&rdquo; response. You want to
measure something your user cares about.</li>
<li>Service Level Agreements (SLAs): these are &ldquo;handshakes between companies,&rdquo;
contractual guarantees.</li>
<li>Error budget: the gap between your SLO and perfect reliability. E.g. the
allowed downtime to still match your 99.9% uptime SLO.</li>
<li>Error budget policy: this policy describes what you do when you have spent
your error budget. Example: no new features until within your error budget
again.</li>
</ul>
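<p>As a worked example of an error budget, assuming a 30-day measurement window
(the window length is an assumption; use whatever matches your SLO):</p>
<pre tabindex="0"><code>minutes in window     = 30 × 24 × 60            = 43,200
budget at 99.9% SLO   = (1 − 0.999) × 43,200   = 43.2 minutes
budget at 99.99% SLO  = (1 − 0.9999) × 43,200  = 4.32 minutes
</code></pre>
<p>Going from three to four nines cuts the allowed downtime by a factor of ten.</p>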
<blockquote>
<p>100% is the wrong reliability target for basically everything</p></blockquote>
<p>SRE is about balancing between reliability, engineering time, delivery speed and
costs.</p>
<p>Note that you <em>can</em> have an error budget policy without even hiring a single
SRE. You can already start with an SLO.</p>
<p>SLOs and error budgets are a necessary first step if you want to improve the
situation.</p>
<p>When talking about making tomorrow better than today, you need to address toil
(work that needs to be done that is manual, repetitive, automatable and does not
add value). If your SRE team is spending time on toil, they are basically an ops
team. Instead you want them to make the system more robust and improve the
operability of the system.</p>
<p>One way to be able to have your SRE team regulate their workload is to put them
in a different part of the organization.</p>
<p>If you want to start with SRE, start with SLOs and let your SRE team own the
most critical system first. Don&rsquo;t make them responsible for all systems at once.
Let them deal with the most important systems first. Once they have that under
control (removed toil, etc), they can take on more work.</p>
<p>You need leadership buy-in for every aspect of SRE work.</p>
<p>Aim for reliability and consistency upfront. SRE teams should consult on
architectural discussions to get to resilient systems.</p>
<p>SRE teams can benefit from automation:</p>
<ul>
<li>Eliminate toil</li>
<li>Capacity planning (auto scaling instead of forecasting)</li>
<li>To fix issues: if you can write a playbook, you can automate it.</li>
</ul>
<p>SRE teams embrace failure:</p>
<ul>
<li>Setting SLO less than 100%</li>
<li>Blamelessness at all levels</li>
<li>Learning from failure</li>
<li>Make the postmortems available so other teams can also learn from them</li>
</ul>
<p>Resources (books):</p>
<ul>
<li><a href="https://sre.google/sre-book/table-of-contents/">Site Reliability Engineering</a></li>
<li><a href="https://sre.google/workbook/table-of-contents/">The Site Reliability Workbook</a></li>
<li><a href="https://www.oreilly.com/library/view/seeking-sre/9781491978856/">Seeking SRE</a></li>
</ul>
<h2 id="everybody-gets-a-staging-environment--yaron-idan">Everybody Gets A Staging Environment! &mdash; Yaron Idan</h2>
<p>How can you give each developer an isolated and secure staging environment?
If you follow the path Yaron&rsquo;s company (Soluto) took, prepare for a rocky
transition, getting used to new terminology when moving to Kubernetes and users
that can get upset at first.</p>
<p>Their first step: automate the entire process. They realized they needed
onboarding documentation and templates. They used Helm for the templates.</p>
<p>Deploy feature branches to the production environment. When developers open a
pull request, the CI/CD tool adds a link to where this PR is deployed. This
empowers the developers and provides a sense of security. This accelerated the
way the developers built features and the adoption of the platform by the
company.</p>
<p>The benefits:</p>
<ul>
<li>Code is fit for production during entire life cycle of the feature branch.</li>
<li>Makes performance testing easier (after all: it&rsquo;s deployed on the production environment).</li>
<li>Easier to share things with the stakeholders and thus improves collaboration.</li>
<li>You can deploy an entire set of microservices.</li>
</ul>
<p>The tools used for their implementation:</p>
<ul>
<li>GitHub (VCS)</li>
<li>Docker (code as configuration)</li>
<li>CodeFresh (CI)</li>
<li>Helm (CD)</li>
<li>Kubernetes (production platform)</li>
</ul>
<p>You can swap the aforementioned tools with other tools, but you must be able to
automate them and make sure that they work with the push of a single button.</p>
<p>GitHub repository for the demo:
<a href="https://github.com/yaron-idan/staging-environments-example">https://github.com/yaron-idan/staging-environments-example</a></p>
<h2 id="devops-to-the-next-level-with-serverless-chatops--jan-de-vries">DevOps to the Next Level with Serverless ChatOps &mdash; Jan de Vries</h2>
<p>What DevOps is about:</p>
<ul>
<li>Shipping code</li>
<li>Adding value</li>
<li>Do this continuously</li>
</ul>
<p>It&rsquo;s <strong>not</strong> about:</p>
<ul>
<li>Adding additional load to your team</li>
<li>Lowering quality</li>
<li>More support calls</li>
<li>Slow response times when there are errors</li>
</ul>
<p>One option is to send emails if something is wrong. But these emails are slow
(knowing that there was an issue <em>yesterday</em> is too late) and they generate a lot
of noise. Monitoring dashboards are nice, but you have to actively look at them
to detect if there&rsquo;s something wrong.</p>
<p>A better solution is using chat applications (like Microsoft Teams or Slack).</p>
<p>Your first step is to emit events when something fails (in Azure you can send
such events to <a href="https://azure.microsoft.com/en-us/services/event-grid/">Event Grid</a>).
Next, you have to subscribe to these events/topics. You can use an Azure
Function for this and then have it post an actionable message to Teams.</p>
<p>These messages can have buttons to take immediate action (e.g. post to a web API
like an Azure Function) to recover from the problem. The output from the recovery
action can also send events, so you can get another notification in your chat
application about the result.</p>
<p>GitHub repository with demo code: <a href="https://github.com/Jandev/ServerlessDevOps">https://github.com/Jandev/ServerlessDevOps</a></p>
<h2 id="building-modular-infrastructure-in-code--fergal-dearle">Building Modular Infrastructure in Code &mdash; Fergal Dearle</h2>
<p>Why use infrastructure as code (IaC)?</p>
<ul>
<li>Fergal is a fan of immutable infrastructure and the pets vs cattle analogy (which
goes further than just servers)</li>
<li>Everything as code</li>
<li>You don&rsquo;t have to log into a server via SSH to fix things</li>
<li>Better testability and maintainability</li>
</ul>
<p>The <a href="https://itrevolution.com/book/the-devops-handbook/">DevOps handbook</a> (part
III, chapter 9) talks about spinning up production-like environments on demand.
The only way to do this is to use IaC.</p>
<p>First thing is understanding your infrastructure (VPC, networking, DNS,
services, database). Look at patterns like requestor/broker. Use these patterns
to create modular stacks.</p>
<p>Patterns:</p>
<ul>
<li>Singleton stack pattern: one stack per instance. But this is actually an
anti-pattern because you are mixing configuration and infrastructure.</li>
<li>Multiheaded stack pattern: multiple stacks in single project, config built in.</li>
<li>Template stack pattern: single stack, config separate, multiple instances from
that template</li>
</ul>
<p>Stack templates &amp; stack instances:</p>
<ul>
<li>Reusable template that implements building blocks</li>
<li>Appropriate tags</li>
<li>Must be able to uniquely identify instances (namespacing)</li>
</ul>
<blockquote>
<p>Stay clear of monoliths</p></blockquote>
<p>Monoliths are stacks with everything in them. This is one end of the spectrum.
The other end is to have one service per stack. You&rsquo;ll probably want to end
up somewhere in between, where you have a component stack (e.g. with a
requestor/broker combination). And then you can combine those stacks.</p>
<p>Static stack inputs: environment names, app names. You should externalize this
config from your stack, e.g. by feeding these parameters via the command line or
use the <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html">SSM Param Store</a>
and <a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html">Secrets Manager</a>.</p>
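<p>A minimal sketch of what externalized config can look like in Terraform,
assuming a parameter already exists in the SSM Parameter Store under a
hypothetical <code>/&lt;environment&gt;/&lt;app&gt;/log_level</code> name. The
variable and parameter names are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Static inputs are passed in from outside the stack, e.g.
#   terraform apply -var environment=test -var app_name=shop
variable &#34;environment&#34; {
  type = string
}

variable &#34;app_name&#34; {
  type = string
}

# Look up a value that lives outside the stack (hypothetical parameter name).
data &#34;aws_ssm_parameter&#34; &#34;log_level&#34; {
  name = &#34;/${var.environment}/${var.app_name}/log_level&#34;
}

locals {
  log_level = data.aws_ssm_parameter.log_level.value
}
</code></pre></div>
<p>The stack itself then contains no environment-specific values, so the same
code can produce multiple instances.</p>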
<p>(Note to self: Fergal mentioned <a href="https://github.com/Sceptre/sceptre">Sceptre</a>.
Check what this is about.)</p>
<p>Common problems:</p>
<ul>
<li>Drift can happen, especially if you also make manual changes (don&rsquo;t!)</li>
<li>Updates are not forward compatible</li>
<li>Region incompatibilities</li>
</ul>
<p>Best practices:</p>
<ul>
<li>Only use template stacks</li>
<li>Use layered and component stacks</li>
<li>Externalize config</li>
<li>Test, test, test!</li>
<li>Only use automation, no manual steps</li>
</ul>
<h2 id="crossing-the-river-by-feeling-the-stones--simon-wardley">Crossing The River By Feeling The Stones &mdash; Simon Wardley</h2>
<p>In <a href="https://en.wikipedia.org/wiki/The_Art_of_War">The Art of War</a> Sun Tzu
discusses five fundamental factors which you should consider:</p>
<ul>
<li>Purpose</li>
<li>Landscape</li>
<li>Climate</li>
<li>Doctrine</li>
<li>Leadership</li>
</ul>
<p>John Boyd developed the <a href="https://en.wikipedia.org/wiki/OODA_loop">OODA loop</a>,
which cycles through the following:</p>
<ul>
<li>Observe</li>
<li>Orient</li>
<li>Decide</li>
<li>Act</li>
</ul>
<p>You can combine these cycles:</p>
<p><img src="/images/all_day_devops_2019_simon_wardley1.png" alt="Simon Wardley combining Sun Tzu&rsquo;s five factors and John Boyd&rsquo;s OODA loop"></p>
<p>By moving we learn, as long as we can observe the environment.</p>
<p>Maps are ways to:</p>
<ul>
<li>explore</li>
<li>communicate</li>
<li>share understanding</li>
<li>de-personalize</li>
</ul>
<p>Graphs are not maps. Identical graphs can be laid out differently in space. On a map
space has meaning; in a graph it does not. And because space has meaning, we can
explore, e.g. by asking ourselves the question &ldquo;what if we go in <em>that</em>
direction?&rdquo;</p>
<p>What makes a map?</p>
<ul>
<li>Anchor (compass)</li>
<li>Position (places, relative to the compass)</li>
<li>Consistency of movement (moving towards the bottom right means going south-east,
assuming &ldquo;up&rdquo; is north on this map)</li>
</ul>
<p>In business there are a lot of maps, e.g. a systems diagram. But almost
everything in business we call a map actually is a graph. Which means we don&rsquo;t
have a sense of the landscape.</p>
<p>How do you create a map for a business? Start with the anchors, e.g. business
and customers. Work from there to get all related components and express them in
a chain of needs.</p>
<p>But how do you position those components? We can use &ldquo;visibility&rdquo; as an analogy
of distance. You can use this to create a &ldquo;partial ordered chain of needs.&rdquo;</p>
<p>And what about movement? You can describe movement as a chain of changes. But
how do you do <em>that</em>? You can order components based on their evolutionary
stage. Are they in the genesis stage, do we use custom built components, do we
have standard products or can we consider it a commodity?</p>
<p>By combining these we can create a map of components by using the position
(visibility) on the y-axis and the movement (evolution) on the x-axis.</p>
<figure><img src="/images/all_day_devops_2019_simon_wardley2.png"
    alt="Simon Wardley showing a map"><figcaption>
      <p>Simon Wardley shows a map of a fictional company selling cups of tea</p>
    </figcaption>
</figure>

<p>You can add intent (evolutionary flow) to your map, e.g. move a component from
custom built to using a commodity.</p>
<p>By using such a map, you can de-personalize discussions since now you talk about
the map, not a story.</p>
<p>Simon told a story of a company that wanted to invest in robots to do manual
labour. After putting their business in a map, the problem became clear: they
were using custom built racks and as a result needed to customize their servers
because they would not fit their non-standard rack. Because the market had changed
since the company started, they could switch to commodity (cloud) and as a
result also did not need robots.</p>
<p>But keep this in mind:</p>
<blockquote>
<p>All maps are imperfect representations of a space</p></blockquote>
<h2 id="deploying-microservices-to-aws-fargate--ariane-gadd">Deploying Microservices to AWS Fargate &mdash; Ariane Gadd</h2>
<p>At KPMG they moved a project from using Amazon EC2 to AWS Fargate. They also
replaced CloudFormation with Terraform for standardization.</p>
<p>Why managed container services and not EKS?</p>
<ul>
<li>Immutable deployments</li>
<li>Cluster provisioning handled by Fargate</li>
<li>Only pay for what you execute</li>
<li>No OS patching</li>
<li>Smaller attack surface</li>
</ul>
<p>They used <a href="https://github.com/fabfuel/ecs-deploy">ECS deploy</a> to deploy
containers to Fargate.</p>
<p>Why was the customer okay with this change?</p>
<ul>
<li>They were already familiar with Docker</li>
<li>Fargate has integrations with ECS and ECR.</li>
<li>Overhead of managing EC2 vs cost of AWS Fargate</li>
</ul>
<p>Fargate runs in your own VPC so you own the network, etc.</p>
<p>The microservices for this project are placed behind an application load
balancer (ALB) using AWS Web Application Firewall (WAF) and TLS termination.
Logs are sent to AWS CloudWatch Logs. They used Anchore for end-to-end container
security and compliance.</p>
<p>Benefits of AWS for KPMG:</p>
<ul>
<li>Cost reduction (automation tools were already in place, they already used
Lambda for account creation and had reserved instances)</li>
<li>Speed of delivery</li>
<li>Allows collaboration with developers (automation, DevOps).</li>
<li>70% of the engineers were already AWS certified.</li>
</ul>
<h2 id="automate-everyday-tasks-with-functions--sean-odell">Automate Everyday Tasks with Functions &mdash; Sean O&rsquo;Dell</h2>
<p>Common use cases for serverless applications are things like web applications
(such as static websites), data processing, Amazon Alexa (skills) and chatbots.</p>
<blockquote>
<p>Not every workload should run in a serverless fashion</p></blockquote>
<p>Organizational and application context are relevant. Serverless is just another
option. You pick what is right for your use case as serverless is not a silver
bullet.</p>
<p>AWS Lambda in a nutshell: there is an event source (e.g. a state change in data
or a resource), which triggers a function, which in turn interacts with a service.</p>
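<p>To make the chain of event source and function concrete, here is a minimal
Terraform sketch in which a daily schedule invokes a function. The function name,
the <code>cleanup.zip</code> package, the handler and the IAM role variable are
all hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># ARN of an existing IAM role the function may assume (supplied by the caller).
variable &#34;lambda_role_arn&#34; {
  type = string
}

# The function: code packaged as cleanup.zip, entry point cleanup.handler.
resource &#34;aws_lambda_function&#34; &#34;cleanup&#34; {
  function_name = &#34;nightly-cleanup&#34;
  filename      = &#34;cleanup.zip&#34;
  handler       = &#34;cleanup.handler&#34;
  runtime       = &#34;python3.12&#34;
  role          = var.lambda_role_arn
}

# The event source: a schedule that fires once a day.
resource &#34;aws_cloudwatch_event_rule&#34; &#34;nightly&#34; {
  name                = &#34;nightly-cleanup&#34;
  schedule_expression = &#34;rate(1 day)&#34;
}

# Wire the event source to the function.
resource &#34;aws_cloudwatch_event_target&#34; &#34;nightly&#34; {
  rule = aws_cloudwatch_event_rule.nightly.name
  arn  = aws_lambda_function.cleanup.arn
}

# Allow the event service to invoke the function.
resource &#34;aws_lambda_permission&#34; &#34;allow_events&#34; {
  statement_id  = &#34;AllowExecutionFromEventBridge&#34;
  action        = &#34;lambda:InvokeFunction&#34;
  function_name = aws_lambda_function.cleanup.function_name
  principal     = &#34;events.amazonaws.com&#34;
  source_arn    = aws_cloudwatch_event_rule.nightly.arn
}
</code></pre></div>
<p>The service the function talks to (the last link in the chain) is determined by
the function code and the permissions of its role, neither of which is shown here.</p>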
<p>GitHub repository with examples: <a href="https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions">https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions</a></p>
<p>(Note to self: check <a href="https://docs.amplify.aws/">Amplify</a>)</p>
<h2 id="beyond-dev--ops--patrick-debois">Beyond dev &amp; ops &mdash; Patrick Debois</h2>
<p>The DevOps pipeline usually starts at the backlog. But what happens before? And
how does that influence people outside the engineering department?</p>
<p>Patrick joined a startup five years ago. He fixed the DevOps pipeline. After a
while the organization wanted to sell a product.</p>
<p>The book <a href="https://www.amazon.com/Machine-Radical-Approach-Design-Function/dp/1626342245">The Machine</a>
by Justin Roff-Marsh describes &ldquo;the agile version of sales.&rdquo;</p>
<p>Sales had more in common with IT than Patrick expected. The sales pipeline
looked like a Kanban board. Sales people are also on call, especially if you do
international sales. If we can deliver faster, this increases the chance of a
sale. Sales persons are actually mind readers; this reminded Patrick how IT
people feel about tickets in the backlog: &ldquo;what is meant with this ticket, what
do the customers actually want?&rdquo;</p>
<p>Thinking about the life of the project after it has been delivered, in terms of
maintenance and operational costs, is important. Doing this upfront is
beneficial for the project.</p>
<p>The biggest impact sales had on IT was the never ending hammer of &ldquo;can you make
things simpler?&rdquo; Sales can help you with making your design less complex.
Together you can determine what actually adds value for the customer.</p>
<p>Having publicly available documentation really helps and makes it easier to sell
the product. Getting the customers on Slack also helped. This showed that the
company was willing to listen and help. This also helps sales.</p>
<p>After sales, the focus shifted to the marketing department. Patrick has seen the most
interesting automation within the marketing department (e.g. email sending).
Marketing has a lot of fancy metrics and tools to measure stuff; did they invent
metrics? The IT department can help marketing by providing information.
Conferences are also a marketing instrument. Marketing also has CALMS; they do
very similar things.</p>
<p>We&rsquo;ve got sales and marketing covered, but for a lot of things we still need
budgets. To budget things, you also need to estimate things. Finance knows what
it is to be on the receiving end of tickets where you have to guess what is
going on.</p>
<p>Sometimes you need to buy things. Patrick looked into agile procurement. Lots of
stuff has to be figured out: is it a fit? is there a way to be partners?</p>
<p>Serverless is a manifestation of servicefull, where you use a lot of services,
so you can focus on your domain instead of having to build everything yourself.</p>
<p>With cloud services there doesn&rsquo;t have to be communication initially: you read
the website, check the documentation, pull out your credit card and start with a
proof of concept. Communication is still important. You have to treat your
suppliers with respect.</p>
<p>HR: how does your company support hiring people? Have people join the team for
some actual work to see if they are a good fit or not.</p>
<p>Do you know what <a href="https://en.wikipedia.org/wiki/Pair_programming">pair programming</a>
is? Well, <a href="https://en.wikipedia.org/wiki/Mob_programming">mob programming</a> is a whole
other level of collaborating. (Note that it&rsquo;s more than just having one person
typing and a bunch of other people commenting&mdash;it has a lot of patterns going
on.) Teams can get very enthusiastic about this and can be very productive the
first week. But if they are not careful, the team is exhausted the next week.
You need to develop a sustainable pace.</p>
<p>Patrick learned a lot from talking to the different departments. They are solving
problems similar to those in IT.</p>
<p>Reading tip:
<a href="https://www.amazon.com/Company-wide-Agility-Beyond-Budgeting-Sociocracy/dp/154467287X">Company-wide Agility with Beyond Budgeting, Open Space &amp; Sociocracy</a>
by Jutta Eckstein and John Buck.</p>
<h2 id="the-open-source-observability-toolkit--mickey-boxell">The Open Source Observability Toolkit &mdash; Mickey Boxell</h2>
<p>Some characteristics of a modern application:</p>
<ul>
<li>Microservices</li>
<li>Distributed</li>
<li>Multiple programming languages</li>
<li>Scalable</li>
<li>Ephemeral</li>
</ul>
<p>This brings new challenges compared to the old situation with a monolithic
application running on a single machine. You still want to address threats to
customer satisfaction. But the old debugging solutions we used for a monolith
will not work.</p>
<p>Observability means designing and operating a more visible system. This includes
systems that can explain themselves without you having to deploy new code. The
tools help you understand the difference between a healthy and unhealthy system.</p>
<blockquote>
<p>Modern, distributed systems <strong>will</strong> experience failures.</p></blockquote>
<p>Observability takes a holistic approach that recognizes the impact an issue has
on the business as a whole. Outages affect the whole business, which should be
able to react. Observability gives tools to address issues when they arrive.</p>
<p>External outputs:</p>
<ul>
<li>Logs</li>
<li>Metrics</li>
<li>Traces</li>
</ul>
<p>These outputs don&rsquo;t make your system more observable, but they can be part of
the holistic approach. They can help you with e.g. a root cause analysis.</p>

  <figure>

<blockquote >
Monitoring tells you whether a system is working, observability lets you ask
why it isn&rsquo;t working.
</blockquote>

  <figcaption>
    &mdash;Baron Schwartz
  </figcaption>
  </figure>


<p>Key SRE concepts:</p>
<ul>
<li>Service Level Indicators (SLIs; what are you measuring), Service Level
Objectives (SLOs; what should the values be), Service Level Agreements (SLAs;
defines the planned reaction if the objectives are not met).</li>
<li>Subset of Google&rsquo;s golden SRE signals: RED (Rate, Errors, Duration).</li>
<li>Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR)</li>
</ul>
<p>Users are comfortable with a certain level of imperfection. A site that is
sometimes a bit slow is less annoying than a shopping cart that sometimes loses its
content. Use this to determine what you&rsquo;ll spend your time on first.</p>
<h3 id="logging">Logging</h3>
<p>Logging is the most fundamental pillar of monitoring. Logging records events
that took place at a given time. Supported by most libraries because it is such
a fundamental thing.</p>
<p>You only get logs if you actually put them in your code. You also have to make
sure you don&rsquo;t lose them, for instance by aggregating them.</p>
<p>Tools:</p>
<ul>
<li><a href="https://www.fluentd.org/">Fluentd</a>: scrape, process and ship logs</li>
<li><a href="https://www.elastic.co/elasticsearch/">Elasticsearch</a>: a data store</li>
<li><a href="https://www.elastic.co/kibana/">Kibana</a>: interact with your logs</li>
</ul>
<h3 id="metrics">Metrics</h3>
<p>Metrics are a numeric aggregation of data describing the behavior of a component
measured over time. They are useful to understand typical system behavior.</p>
<p>Tools:</p>
<ul>
<li><a href="https://prometheus.io/">Prometheus</a>: scrape data, store the data and query
the data</li>
<li><a href="https://grafana.com/">Grafana</a>: visualize the metrics, can aggregate data
from different sources</li>
</ul>
<p>Alerts are notifications to indicate that a human needs to take action.</p>
<h3 id="tracing">Tracing</h3>
<p>Tracing is capturing a set of causally related events. Helpful for debugging.
Example: each request has a global ID and if you insert this ID as metadata at
each step, you can trace what happens.</p>
<p>Tools: <a href="https://www.jaegertracing.io/">Jaeger</a> and <a href="https://zipkin.io/">Zipkin</a>
can visualize and inspect traces.</p>
<p>The challenge is that tracing is hard to retrofit into an existing application.
You need to instrument all components. Service meshes can make it easier to
retrofit tracing.</p>
<h3 id="service-meshes">Service meshes</h3>
<p>A service mesh is a configurable infrastructure layer for microservice
applications. It can monitor and control the traffic in your environment. It can
be a great approach for observability. You will get less information from a
service mesh compared to adding tracing to all of your components, but on the
other hand it takes less effort to implement than instrumenting your existing
code base.</p>
<p>Tools:</p>
<ul>
<li><a href="https://istio.io/">Istio</a></li>
<li><a href="https://linkerd.io/">Linkerd</a></li>
<li><a href="https://traefik.io/traefik-mesh/">Maesh</a><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li><a href="https://kuma.io/">Kuma</a></li>
<li><a href="https://www.consul.io/">Consul</a></li>
</ul>
<p>Using service meshes won&rsquo;t eliminate the need to instrument your code, but it
can make your life simpler.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Called Traefik Mesh these days.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Devopsdays Amsterdam 2019: workshops]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/06/26/devopsdays-amsterdam-2019-workshops/" type="text/html" />
    <id>https://markvanlent.dev/2019/06/26/devopsdays-amsterdam-2019-workshops/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="aws" />
    <category term="conference" />
    <category term="infrastructure as code" />
    <category term="serverless" />
    <category term="terraform" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-06-26T00:00:00Z</published>
    <content type="html"><![CDATA[<p>For the fourth year in a row, I was able to go to devopsdays Amsterdam. And as
usual I also joined the workshop day since I really like the deep dives these
workshops provide.</p>
<p>I selected two Terraform workshops and one about AWS Lambda this year.</p>
<h2 id="terraform-workshop--mikhail-advani-mendix">Terraform Workshop &mdash; Mikhail Advani (Mendix)</h2>
<p>Mikhail provided a Git repository to use during the workshop:
<a href="https://github.com/mikhailadvani/terraform-workshop">https://github.com/mikhailadvani/terraform-workshop</a></p>
<p>To dive right in, have a look at this snippet to write a file to the current
directory (in this case the <code>part-1</code> directory of the workshop repo):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl"><span class="k">resource</span> <span class="s2">&#34;local_file&#34; &#34;user&#34;</span> {
</span></span><span class="line"><span class="cl"><span class="n">  content</span> <span class="o">=</span> <span class="s2">&#34;...&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">  filename</span> <span class="o">=</span> <span class="s2">&#34;${path.module}/${var.filename}&#34;</span>
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>Terraform does not accept relative paths. But using absolute paths means you&rsquo;ll
probably end up with a path containing a home directory of a specific user,
which is annoying for team members that have a different username.</p>
<p>To prevent this, you can use the &ldquo;<code>${path.module}</code>&rdquo; variable just like in the
example above. You can also use Docker to have a standard environment you run
Terraform in. Using Docker can also be a practical approach when you want to use
Terraform in your CI.</p>
<p>You also cannot have one variable refer to another (at least, in Terraform 0.11).
You&rsquo;ll have to use a <a href="https://www.terraform.io/docs/language/values/locals.html">local value</a>,
for example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl"><span class="k">locals</span> {
</span></span><span class="line"><span class="cl"><span class="n">    filename</span> <span class="o">=</span> <span class="s2">&#34;${var.name}.txt&#34;</span>
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>Mikhail demonstrated the use of:</p>
<ul>
<li>
<p><code>terraform plan -var-file=&lt;filename&gt;</code>: this allows you to provide values for your
variables (<a href="https://www.terraform.io/docs/cli/commands/plan.html#var-file-foo">docs</a>).</p>
</li>
<li>
<p><code>terraform plan -out=&lt;filename&gt;</code>: you can save your plan. If you feed the
&ldquo;<code>terraform apply</code>&rdquo; command this file, it will not ask for confirmation since
Terraform assumes the plan has already been reviewed
(<a href="https://www.terraform.io/docs/cli/commands/plan.html#out-path">docs</a>).</p>
<p>If your (remote) state has changed between generating the plan and running
<code>apply</code> with that plan, Terraform will detect that and fail.</p>
</li>
<li>
<p><a href="https://www.terraform.io/docs/language/meta-arguments/depends_on.html">depends_on</a>:
ensure that resource A is created before B.</p>
</li>
</ul>
<p>If you work in a team, you want to use something like S3 for the state file and
DynamoDB for locks. And you probably want to use Terraform to create the S3
bucket and the DynamoDB table. But to be able to create infrastructure you need
to have a state file.
<a href="https://en.wikipedia.org/wiki/Catch-22_(logic)">Catch-22</a>. To solve this,
you&rsquo;ll have to split this up in two phases:</p>
<ol>
<li>In the module where you define the S3 bucket and DynamoDB table, you don&rsquo;t
add the S3 backend initially and run <code>init</code>, <code>plan</code> and <code>apply</code>. This will
create the resources, but keep the Terraform state locally.</li>
<li>Once you are done, you can add the S3 backend and run <code>terraform init</code> again.
Terraform now picks up that the state should be stored remotely and offers to
move the current state.</li>
</ol>
<p>Note that Terraform will create resources in parallel.</p>
<p>How do you manage secrets? Mikhail is using a <code>secret.tfvars</code> file and
<a href="https://www.agwa.name/projects/git-crypt/">git-crypt</a>. The benefit of using
vars files over environment variables is you can put the former in version
control. The official recommendation is to use
<a href="https://www.vaultproject.io/">Vault</a> (also made by Hashicorp). You can
possibly also use e.g. an AWS service to store your secrets.</p>
<p><img src="/images/devopsdays2019_mikhail_advani.jpg" alt="Mikhail Advani about moving resources into a separate module"></p>
<p>If you rename resources or move them into modules, Terraform detects that a
resource is no longer in your code and also picks up the new resource. It will
try to delete the old one and create the new resource. To prevent this, you&rsquo;ll
have to tell Terraform the resource has moved. Use &ldquo;<code>terraform state mv</code>&rdquo;
(<a href="https://www.terraform.io/docs/cli/commands/state/mv.html">docs</a>) to fix this.</p>
<p>An example of a module using
<a href="https://github.com/gruntwork-io/terratest">Terratest</a> to test the Terraform
code:
<a href="https://github.com/mikhailadvani/terraform-s3-backend">https://github.com/mikhailadvani/terraform-s3-backend</a></p>
<p>Tips from audience:</p>
<ul>
<li>Use “<code>terraform validate</code>”.</li>
<li>You probably don&rsquo;t want to put your Terraform state in version control since
it can contain credentials in plain text.</li>
</ul>
<h2 id="terraform-best-practices-with-examples-and-arguments---anton-babenko">Terraform best practices with examples and arguments &mdash;  Anton Babenko</h2>
<p>There are different types of modules:</p>
<ul>
<li>Resource modules</li>
<li>Infrastructure modules</li>
</ul>
<p>A composition is a collection of infrastructure modules. An infrastructure module consists of
resource modules, which implement resources.</p>
<p>You can use infrastructure modules e.g. to enforce tags and company standards.
You can also use things like preprocessors, jsonnet and cookiecutter in them.</p>
<p>Terraform modules frequently cannot be re-used because they are written for a
very specific situation and sometimes have hard coded assumptions in them.</p>
<p>Don&rsquo;t put provider information in your module. For instance: although you <em>can</em>
specify provider versions, please <strong>do not do this</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl"><span class="c1"># Don&#39;t do this!
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">terraform</span> {
</span></span><span class="line"><span class="cl">  <span class="k">required_providers</span> {
</span></span><span class="line"><span class="cl"><span class="n">    aws</span> <span class="o">=</span><span class="n"> &#34;&gt;</span><span class="o">=</span> <span class="m">2</span><span class="p">.</span><span class="m">7</span><span class="p">.</span><span class="m">0</span><span class="err">&#34;</span>
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>Avoid provisioners in modules. Use public cloud capabilities instead, like
<a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_configuration#user_data">user_data</a>.
(Mark: for an example, see
<a href="https://github.com/sjparkinson/terraform-ansible-example/tree/master/terraform">https://github.com/sjparkinson/terraform-ansible-example/tree/master/terraform</a>)</p>
<p><img src="/images/devopsdays2019_anton_babenko_1.jpg" alt="Anton Babenko talking about null_resources"></p>
<p>Traits of good modules:</p>
<ul>
<li>Documentation and examples: use <a href="https://github.com/terraform-docs/terraform-docs">terraform-docs</a></li>
<li>Feature rich</li>
<li>Sane defaults</li>
<li>Clean code</li>
<li>Tests</li>
</ul>
<p>More info: <a href="https://www.antonbabenko.com/using-terraform-continuously-common-traits-in-modules/">Using Terraform continuously — Common traits in modules</a></p>
<h3 id="organizing-your-code">Organizing your code</h3>
<p>There are two opposite ways of structuring your code:</p>
<ul>
<li>&ldquo;All-in-one&rdquo;, which has the benefit that variables and output are declared in
one place, but one of the downsides is that the blast radius in case of a
problem can be quite large.</li>
<li>&ldquo;1-in-1&rdquo; has benefits like a smaller blast radius and that it&rsquo;s easier to
understand. Downside is that things are more scattered around.</li>
</ul>
<p><img src="/images/devopsdays2019_anton_babenko_2.jpg" alt="Anton Babenko about all-in-one vs 1-in-1 code organization"></p>
<p><a href="https://github.com/gruntwork-io/terragrunt">Terragrunt</a> <q>reduces amount of code
to do similar things.</q> It has extra features like execution of hooks and a number
of additional functions. Terragrunt is opinionated.</p>
<p>Terraform workspaces are the worst feature of Terraform ever. (Provisioner is
the 2nd worst.) Workspaces allow us to execute the same set of Terraform configs
but with slightly different properties. (For example: if this is the production
environment spin up 5 instances, else 1 instance.) Workspaces are not
infrastructure as code friendly. You cannot answer, from the code:</p>
<ul>
<li>How many workspaces do I have?</li>
<li>What has been deployed in workspace X?</li>
<li>What is the difference between workspace X and Y?</li>
</ul>
<p>Better: use reusable modules instead of workspaces.</p>
<h3 id="terraform-012">Terraform 0.12</h3>
<p>Will it help us?</p>
<p>It&rsquo;s the biggest rewrite of Terraform since its creation. There are backward
incompatible changes though.</p>
<p>Main changes:</p>
<ul>
<li>HCL2 simplified syntax</li>
<li>Loops</li>
<li>Dynamic blocks (<code>for_each</code>)</li>
<li>Conditional operators (<code>... ? ... : ...</code>) that work as you expect</li>
<li>Extended types of variables</li>
<li>Templates in values</li>
<li>Links between resources are supported (<code>depends_on</code> everywhere)</li>
</ul>
<p>(The <a href="https://www.hashicorp.com/blog/products/terraform">Hashicorp blog</a> has a
number of articles about Terraform 0.12 if you want to know more.)</p>
<p>Terraform developers write and support Terraform modules, enforce company
standards, etc. Terraform users (everyone) use modules by specifying the right
values; they are the domain experts but don&rsquo;t care too much about the inner
workings of the Terraform modules.</p>
<p>Terraform 0.12 allows developers to implement more flexible/dynamic/reusable
modules. For Terraform users the only benefit is HCL2&rsquo;s lightweight syntax.</p>
<p>There is a command to check your code for 0.12 compatibility: “<code>terraform 0.12checklist</code>”. Once everything is fine you can use “<code>terraform 0.12upgrade</code>”.
See the <a href="https://www.terraform.io/upgrade-guides/0-12.html">upgrade guide</a> for
more information.</p>
<p>Note that the 0.12 state file is not compatible with the 0.11 state file. If you
have a remote (shared) state, once one member of the team upgrades to 0.12, the
whole team needs to upgrade.</p>
<h3 id="best-practices">Best practices</h3>
<ul>
<li>Check the <a href="https://www.terraform.io/docs/index.html">official documentation</a></li>
<li><a href="https://learn.hashicorp.com/">Hashicorp tutorials/workshops</a></li>
<li>Anton&rsquo;s <a href="https://www.terraform-best-practices.com/">Terraform Best Practices</a>
(It&rsquo;s not very up-to-date though, according to Anton himself.)</li>
</ul>
<h3 id="workshopqa-part">Workshop/Q&amp;A part</h3>
<p>Anton has a workshop you can do at your own pace at
<a href="https://github.com/antonbabenko/terraform-best-practices-workshop">https://github.com/antonbabenko/terraform-best-practices-workshop</a>.
The workshop builds real infrastructure using
<a href="https://github.com/terraform-aws-modules">https://github.com/terraform-aws-modules</a>.</p>
<p>During this section of his session Anton answered a bunch of questions people
in the audience had. Tools that were discussed:</p>
<ul>
<li>Terraform pull request automation with <a href="https://www.runatlantis.io/">Atlantis</a></li>
<li>Prettify plan output using <a href="https://github.com/dmlittle/scenery">scenery</a> (not
needed in Terraform 0.12)</li>
<li>Use Terraform while writing in a programming language instead of HCL with
<a href="https://www.pulumi.com/">Pulumi</a></li>
<li>Build your infrastructure as code: <a href="https://modules.tf/">modules.tf</a></li>
<li>Visualize your cloud architecture: <a href="https://www.cloudcraft.co/">Cloudcraft</a></li>
</ul>
<h3 id="other-resources">Other resources</h3>
<ul>
<li><a href="https://www.antonbabenko.com/my-terraform-aws-journey-june-2019/">My Terraform AWS journey — HashiTimes Interview</a></li>
<li><a href="https://registry.terraform.io/">Terraform modules</a></li>
</ul>
<h2 id="getting-started-with-serverless-development-with-aws-lambda--yan-cui">Getting started with Serverless development with AWS Lambda &mdash; Yan Cui</h2>
<p>Related repository: <a href="https://github.com/theburningmonk/getting-started-with-serverless-development-with-lambda-devopsdays-ams">https://github.com/theburningmonk/getting-started-with-serverless-development-with-lambda-devopsdays-ams</a></p>
<p>There is a (soft) limit of 1000 concurrently running Lambdas. This can be
increased by sending a request to AWS. But even if you convinced Amazon to
increase it to, for instance, 10,000, they only scale up at a rate of 500 per minute. So if you
have a workload where you have sudden spikes and need to scale quickly, Lambda
might not be what you need.</p>
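<p>To get a feel for what that scaling rate means, here is a minimal Python sketch of the arithmetic. It assumes the burst starts at the default limit of 1000 concurrent executions and that capacity then grows linearly by 500 per minute. The function name and the linear model are illustrative assumptions, not an AWS API.</p>

```python
import math


def minutes_to_reach(target: int, start: int = 1000, rate_per_minute: int = 500) -> int:
    """Whole minutes needed to grow from `start` to `target` concurrent executions.

    Assumes linear growth of `rate_per_minute` (the figure quoted in the talk).
    """
    return max(0, math.ceil((target - start) / rate_per_minute))


print(minutes_to_reach(10_000), "minutes to reach 10,000 concurrent executions")
```

<p>Under these assumptions it takes 18 minutes to get from 1000 to 10,000 concurrent executions, which is a long time when traffic spikes suddenly.</p>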
<p>Note that you can set a &ldquo;reserved concurrency&rdquo; on a Lambda; this setting acts as
a max concurrency of a function.</p>
<p>Use tags and have a naming convention. E.g. add a &ldquo;team&rdquo; tag so you know which
team to contact about a Lambda. This is useful when you have many, many
functions.</p>
<p>Separate customers also get their own instances to run their Lambdas; they are
isolated from each other.</p>
<p>By default functions don&rsquo;t have access to resources in your VPC. You can add a
function to your VPC, but currently there is a cold start penalty. This penalty can be
several seconds, which is a lot if the Lambda only takes a few milliseconds to
run. So you probably should not put your function in a VPC if you do not need
it.</p>
<p><img src="/images/devopsdays2019_yan_cui_1.jpg" alt="Yan Cui configuring an AWS Lambda function"></p>
<p>With regard to execution: you are billed in 100ms blocks. So if your function
takes 60ms to run, there is&mdash;from a cost perspective at least&mdash;no benefit in
speeding up the Lambda.</p>
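<p>The rounding behind that remark can be sketched in a few lines of Python. The 100ms block size is the figure from the talk and is kept as a parameter here, since pricing details change over time. The helper name and the assumption that every invocation is billed for at least one block are mine.</p>

```python
import math

BILLING_BLOCK_MS = 100  # block size mentioned in the talk (treat as a parameter)


def billed_ms(duration_ms: float, block_ms: int = BILLING_BLOCK_MS) -> int:
    """Round a run time up to the next full billing block.

    Assumption: every invocation is billed for at least one block.
    """
    return max(1, math.ceil(duration_ms / block_ms)) * block_ms


for duration in (20, 60, 100, 101):
    print(duration, "ms of run time is billed as", billed_ms(duration), "ms")
```

<p>A function that runs for 20ms and one that runs for 60ms are both billed for 100ms; only crossing a block boundary (for example going from 101ms to 100ms) changes the bill.</p>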
<p>If you want to debug your functions locally in VS Code and you have
the serverless framework installed locally, you can update your <code>launch.json</code>
file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;version&#34;</span><span class="p">:</span> <span class="s2">&#34;0.2.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;configurations&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;node&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;request&#34;</span><span class="p">:</span> <span class="s2">&#34;launch&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;Launch Program&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;program&#34;</span><span class="p">:</span> <span class="s2">&#34;${workspaceFolder}/node_modules/.bin/sls&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;args&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;invoke&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;local&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;-f&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;hello&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;-d&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;{}&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>How do you organize your code? Don&rsquo;t have a mono repo. Use microservices or at
least a service oriented architecture. Give every microservice its own repo,
along with code only used by this service.</p>
<p>Advice: give the service name the same name as the repo to make it easier to
find things.</p>
<blockquote>
<p>Organize functions into repositories according to boundaries you identify
within your system.</p></blockquote>
<p>How do you share code between Lambda functions? It depends. Within the same repo
you can use e.g. <code>lib/utils.js</code>. You can also create an npm package. Or create a
new service to provide the functionality.</p>
<p>When using Lambda, you&rsquo;ll probably end up using more external services. The
functions themselves are simple, but the risk shifts to the integration with the
external services.</p>
<p>Another risk is security. With microservices you have more control, but also
more things to keep secure, and thus more places where you can misconfigure something.</p>
<p>Combining those two: the risk profile for a serverless application is completely
different from &ldquo;traditional&rdquo; applications.</p>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Microsoft Ignite | The Tour: Amsterdam — day two]]></title>
    <link rel="alternate" href="https://markvanlent.dev/2019/03/21/microsoft-ignite-the-tour-amsterdam-day-two/" type="text/html" />
    <id>https://markvanlent.dev/2019/03/21/microsoft-ignite-the-tour-amsterdam-day-two/</id>
    <author>
      <name>Mark van Lent</name>
      <uri>https://markvanlent.dev/about/</uri>
    </author>
    <category term="azure" />
    <category term="cloud" />
    <category term="devops" />
    <category term="infrastructure as code" />
    <category term="monitoring" />
    <category term="observability" />
    <category term="sre" />
    
    <updated>2021-11-09T20:09:33Z</updated>
    <published>2019-03-21T00:00:00Z</published>
    <content type="html"><![CDATA[<p>On the second day of <em>Microsoft Ignite | The Tour: Amsterdam</em> I attended talks in
the &ldquo;Operating applications and infrastructure in the cloud&rdquo; learning path.</p>
<p>(On the <a href="/2019/03/20/microsoft-ignite-the-tour-amsterdam-day-one/">first day</a>
I followed the sessions of the &ldquo;Building your application for the cloud&rdquo;
learning path.)</p>
<p><img src="/images/mitta2019_banner.jpg" alt="Microsoft Ignite | The Tour: Amsterdam banner at the RAI Amsterdam"></p>
<h2 id="modernizing-your-infrastructure-moving-to-infrastructure-as-code-iac--emily-freeman">Modernizing your infrastructure: moving to Infrastructure as Code (IaC) &mdash; Emily Freeman</h2>
<p>Operations is changing&mdash;modernizing. DevOps reduces or eliminates the friction
between the development and operations teams.</p>
<p>One way of defining DevOps is with the CALMS acronym:</p>
<ul>
<li>Culture</li>
<li>Automation</li>
<li>Lean</li>
<li>Measurement</li>
<li>Sharing</li>
</ul>
<p>SRE has evolved beyond its original form. SRE can be seen as a prescriptive
implementation of DevOps. It has a focus on deploying and maintaining
applications.</p>
<p>About failure:</p>
<blockquote>
<p>In the cloud, sometimes it is rainy.</p></blockquote>
<p>If you are using the cloud, you are using APIs continuously. This allows us to
automate a lot of things and to do configuration via (YAML) files which can be
version controlled. This gives us repeatability and reliability.</p>
<p>Toil is defined as manual, repetitive work. Automation removes the toil. By
creating code to do the work, we can reuse it.</p>
<p><a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/">Azure Resource Manager templates</a>:
you can manually deploy infrastructure and then view the template from within
the Azure portal. This template is a configuration file which you can store and
use over and over again.</p>
<p>Deploying small batches of code is key. Small changes carry less risk than
big (and many) changes. Small batches also make it simpler to identify
bugs. Releasing often makes sure feedback on changes is received faster.</p>
<p><img src="/images/mitta2019_emily_freeman_1.jpg" alt="Emily Freeman demonstrating Azure Pipeline"></p>
<p>Microsoft works hard to make sure that
<a href="https://azure.microsoft.com/en-us/services/devops/pipelines/">Azure Pipelines</a> work
with the tools you already use.</p>
<p>Additional resources:</p>
<ul>
<li><a href="https://github.com/Microsoft/IgniteTheTour/tree/master/SRE%20-%20Operating%20applications%20and%20infrastructure%20in%20the%20cloud/SRE10">Code</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2945077">The Small Batches Principle</a></li>
</ul>
<h2 id="monitoring-your-infrastructure-and-applications-in-production--jason-hand">Monitoring your infrastructure and applications in production &mdash; Jason Hand</h2>
<p>Why, what and how do we monitor?</p>
<p>Let&rsquo;s first have a look at a definition of SRE:</p>
<blockquote>
<p>Site Reliability Engineering is an engineering discipline devoted to helping
an organization sustainably achieve the <strong>appropriate</strong> level of
<strong>reliability</strong> in their systems, services, and products.</p></blockquote>
<p>Note the terms &ldquo;appropriate&rdquo; and &ldquo;reliability.&rdquo; One hundred percent reliability is
expensive but not always needed. For planes and pacemakers you want absolute
reliability. For applications, this usually is not needed. Ask yourself what the
parts of reliability are that you want to be concerned with.</p>
<p>Why monitor?</p>
<ul>
<li>Are apps doing what we expect from them?</li>
<li>Are apps doing what <em>others</em> (customers) expect?</li>
<li>Are we meeting our goals or at least moving in the right direction?</li>
<li>Our systems are constantly changing. What is happening with these changes?</li>
</ul>
<p>Monitoring is the most important part of reliability.</p>
<figure><img src="/images/mitta2019_why_monitor_sheet.png"
    alt="Mikey Dickerson’s Hierarchy of Reliability"><figcaption>
      <p>Source: sheets from this presentation</p>
    </figcaption>
</figure>

<p>The difference between <em>reacting</em> and <em>responding</em> to incidents is being
prepared or not. You&rsquo;ll want to be prepared to be able to respond, not just
react.</p>
<p>Some organizations perform <a href="https://en.wikipedia.org/wiki/Root_cause_analysis">root cause analyses</a>
after failures. Complex systems usually don&rsquo;t have a single root cause though&mdash;more
often than not it is a combination of causes. Since having more than one root
cause is a contradiction in terms, Jason would much rather talk about &ldquo;post-incident
reviews.&rdquo;</p>
<p>Back to the term <em>reliability</em>. This can be measured in a lot of ways. To name a
few:</p>
<ul>
<li>Availability: will the system respond when I ask it something?</li>
<li>Latency: &ldquo;slow is the new down&rdquo; (we don&rsquo;t have the time for slow applications).</li>
<li>Throughput: can you get the volume of data from one point to another in a way you expect?</li>
<li>Coverage: can a process plough through all the available data?</li>
<li>Correctness: did it do what it was supposed to do?</li>
<li>Fidelity: See for example Netflix: perhaps the search is temporarily
unavailable but you can still scroll through the catalog and watch a movie.</li>
<li>Freshness: is the data you get the most recent stuff?</li>
<li>Durability: if you write data, can you retrieve it again?</li>
</ul>
<p>It is not just about what <em>we</em> see, but also about what <em>our customer</em> experiences. We
should put the customer first and look at things from their perspective.</p>
<p>Service Level Indicators (SLIs) are a ratio/proportion. For instance: the
number of successful HTTP calls divided by the total number of HTTP calls.</p>
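<p>As a minimal Python sketch, the HTTP example boils down to a single division. The function name, the sample numbers and the choice to report 1.0 when there was no traffic are assumptions for illustration.</p>

```python
def sli(successful: int, total: int) -> float:
    """Proportion of good events among all events.

    Assumption: with no traffic at all we report 1.0 (nothing failed).
    """
    if total == 0:
        return 1.0
    return successful / total


# Hypothetical numbers: 9,940 successful calls out of 10,000.
print(f"{sli(9_940, 10_000):.2%}")
```

<p>With these sample numbers the SLI is 99.40%.</p>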
<p>But where do the numbers come from? A few options:</p>
<ul>
<li>Reported by the server</li>
<li>Reported by clients</li>
<li>Reported by the application</li>
<li>Reported by the front-end (load balancer)</li>
<li>Reported by our monitoring / testing infrastructure</li>
</ul>
<p>You need to include the way you measure in the SLI.</p>
<p>There is no &ldquo;right way&rdquo; to measure things, but there are trade-offs. And you
need to be aware of them. Ideally you should monitor as close to the user as
possible.</p>
<p>The SLIs are the measurements. The Service Level Objectives (SLOs) determine
what is acceptable. If the SLI (indicator) dips below the SLO (objective), we
need to get people involved.</p>
<p>How to build a good SLO? Take the thing to monitor (e.g. HTTP requests), look at
the SLI proportions and add a time statement. An example: &ldquo;90% of the HTTP
requests as reported by the load balancer succeeded in the last 30-day window.&rdquo;
You can make your SLOs more complex if needed, e.g. by having a lower and upper
bound.</p>
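<p>The example SLO can be written down as data plus a check, roughly like the Python sketch below. The class and field names are hypothetical, the request counts are made up, and gathering the counts for the 30-day window is assumed to happen elsewhere (for instance in your monitoring system).</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    description: str   # what is measured, and where it is measured
    target: float      # minimum acceptable SLI, e.g. 0.90
    window_days: int   # the time statement of the SLO

    def is_met(self, successful: int, total: int) -> bool:
        """True when the measured proportion is at or above the target."""
        if total == 0:
            return True  # assumption: no traffic means nothing was violated
        return successful / total >= self.target


http_slo = Slo(
    description="HTTP requests as reported by the load balancer succeed",
    target=0.90,
    window_days=30,
)

# Hypothetical counts for the last 30 days.
print(http_slo.is_met(successful=9_200, total=10_000))  # 92% is above the objective
print(http_slo.is_met(successful=8_700, total=10_000))  # 87% is below the objective
```

<p>The second case is the situation where the indicator dips below the objective and people need to get involved.</p>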
<p>Alert on actionable things. Alerts are <strong>not</strong> logs, notifications, heartbeats
or normal situations. Alerts are about things a person needs to go do. And not
just any person, but the right person. It should not be something that can
be automated. Ideally alerts are pretty simple and include:</p>
<ul>
<li>Where is this coming from?</li>
<li>Which SLO is breached?</li>
<li>What is the impact for the customer?</li>
<li>What systems are impacted?</li>
<li>What are the first couple of steps to take?</li>
</ul>
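<p>One way to make sure every alert answers these questions is to give alerts a fixed shape. The Python sketch below is only an illustration: the class, its field names and the sample values are all hypothetical and not tied to any particular alerting tool.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str                  # where is this coming from?
    breached_slo: str            # which SLO is breached?
    customer_impact: str         # what is the impact for the customer?
    impacted_systems: list[str]  # what systems are impacted?
    first_steps: list[str] = field(default_factory=list)  # what to do first?

    def render(self) -> str:
        """Format the alert as a short, readable message for the responder."""
        steps = "\n".join(
            f"  {number}. {step}"
            for number, step in enumerate(self.first_steps, start=1)
        )
        return (
            f"Source: {self.source}\n"
            f"SLO breached: {self.breached_slo}\n"
            f"Customer impact: {self.customer_impact}\n"
            f"Systems impacted: {', '.join(self.impacted_systems)}\n"
            f"First steps:\n{steps}"
        )


# Hypothetical example values.
alert = Alert(
    source="load balancer",
    breached_slo="90% of HTTP requests succeed in the last 30 days",
    customer_impact="Some customers see errors when loading pages",
    impacted_systems=["web frontend", "API"],
    first_steps=["Check the most recent deployment", "Check the service health page"],
)
print(alert.render())
```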
<p>Possible future direction of monitoring:</p>
<ul>
<li>There&rsquo;s no one silver bullet. We&rsquo;ll probably want to integrate multiple monitoring solutions.</li>
<li>Modular monitoring</li>
<li>Aggregate monitoring (measure the system as a whole)</li>
<li>Correlation and anomaly detection (AI / ML)</li>
<li>Monitoring the monitoring</li>
</ul>
<p>Additional resources:</p>
<ul>
<li><a href="https://github.com/Microsoft/IgniteTheTour/tree/master/SRE%20-%20Operating%20applications%20and%20infrastructure%20in%20the%20cloud/SRE20">Code</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/overview">Azure Monitor overview</a></li>
</ul>
<h2 id="troubleshooting-failure-in-the-cloud--jason-hand">Troubleshooting failure in the cloud &mdash; Jason Hand</h2>
<p>Our systems have become more complex than they have been in the past. We are now
using microservices, containers, functions, etc. Our focus for this session:
going from the detection phase (previous session) to the response phase.</p>
<p>Before talking about how to detect something is wrong, you need to define what
you mean by &ldquo;wrong.&rdquo;</p>
<p>Azure has a nice toolbox to help you:</p>
<ul>
<li><a href="https://azure.microsoft.com/en-us/services/monitor/">Azure Monitor</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/service-health/service-notifications">Service health alerts</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview">Azure Application Insights</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/platform-logs-overview">Diagnostic logs</a> (application and system)</li>
<li>Live debugging via log streams</li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/logs/log-query-overview">Log Analytics</a></li>
</ul>
<p>The first question is often: is it just me? Azure (and the other cloud
providers) can also have problems. The
<a href="https://docs.microsoft.com/en-us/azure/service-health/resource-health-overview">Azure Resource Health</a>
page in the Service Health section shows you if your components have problems.
It even has historical data. But because you don&rsquo;t want to stare at a dashboard
all day long, you&rsquo;ll want to also have health alerts for when something goes
wrong.</p>
<p>Guidance for creating service health alerts:</p>
<ul>
<li>Don&rsquo;t create too many, but also not too few. Find your sweet spot.</li>
<li>Make sure they don&rsquo;t overlap (to avoid confusion).</li>
<li>Don&rsquo;t alert on non-production environments.</li>
<li>Keep the responders in mind (what would they want).</li>
<li>Separate alerts for issues, planned maintenance and advisories.</li>
</ul>
<p><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-map">Application Map</a> shows
components and how they interact with each other. Where is slowness? Where are
problems?</p>
<p>You will need to
<a href="https://docs.microsoft.com/en-us/azure/app-service/troubleshoot-diagnostic-logs">enable diagnostic logging</a>
to get a bunch of logs. And you don&rsquo;t have to go to the portal, you can use the
CLI tool to download them. You can then use a separate tool to analyse the logs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">az webapp log download --resource-group &lt;name&gt; --name &lt;webapp name&gt;
</span></span></code></pre></div><p><a href="https://azure.microsoft.com/en-us/features/cloud-shell/">Azure Cloud Shell</a>
allows you to have a terminal session in your browser. You don&rsquo;t need your
local CLI.</p>
<p>Use Log Analytics for more complex troubleshooting. It collects a lot of
information from a lot of sources.</p>
<p><img src="/images/mitta2019_jason_hand_3.jpg" alt="Jason Hand about Azure Log Analytics"></p>
<p>You&rsquo;ll need to use the <a href="https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/">Kusto Query Language</a>
to get to the information. Queries start with a table, then commands to filter,
sort, specify time range, etc. Once you have queries that answer your questions,
you can keep them handy by saving them right in the Azure portal.</p>
<p>Azure <a href="https://docs.microsoft.com/en-us/azure/azure-monitor/visualize/workbooks-overview">Monitor Workbooks</a>
and Troubleshooting Guides (a.k.a. runbooks) combine text, queries, metrics and
parameters into reports. This helps with onboarding and will also help your
first responders.</p>
<p>Workbooks are editable by others that have access to the same resources.
Troubleshooting Guides are like workbooks, but for dealing with problems.</p>
<p>Future for troubleshooting:</p>
<ul>
<li>Observability (<a href="https://opencensus.io/">OpenCensus</a> and other distributed
tracing tools)</li>
<li>Anomaly detection</li>
</ul>
<p>Additional resources:</p>
<ul>
<li><a href="https://github.com/Microsoft/IgniteTheTour/tree/master/SRE%20-%20Operating%20applications%20and%20infrastructure%20in%20the%20cloud/SRE30">Code</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/logs/get-started-queries">Get started with Azure Monitor log queries</a></li>
</ul>
<h2 id="scaling-for-growth-and-resiliency--jeramiah-dooley">Scaling for growth and resiliency &mdash; Jeramiah Dooley</h2>
<p>How do we take an application and how do we scale it? Bigger picture: asking a
lot of &ldquo;why?&rdquo; questions.</p>
<p>Do we need to scale out our application and the underlying infrastructure? Maybe.</p>
<ul>
<li>Perhaps we should first check if the application is buggy (memory leaks) or if
we can optimize it for performance (e.g. database queries).</li>
<li>Is there a temporary (or artificial) reason the site has a spike?</li>
<li>Is there unused capacity we can move around?</li>
<li>Scaling costs money, but if we don&rsquo;t do it we give the customers a bad
experience. This is a risk vs expense issue.</li>
<li>When is the application no longer meeting our expectations (SLOs)?</li>
</ul>
<p>We need to factor in both anticipated and unanticipated spikes.</p>
<p>One can scale up (vertical) or out (horizontal). Scaling up is adding more
resources (CPU, memory) to an instance. Scaling out is adding more machines.
When scaling out, the application needs to be aware of the change (more or
fewer instances).</p>
<p><img src="/images/mitta2019_jeramiah_dooley_2a.jpg" alt="Jeramiah Dooley about scaling up vs scaling out"></p>
<p>You might not want to scale just one portion of an app since it can cause
problems in another part of the app.</p>
<p>Autoscaling can trigger on quite specific conditions, for example: only
today, if CPU usage is above 80%, add more instances.</p>
<p>But should you use autoscaling?</p>
<ul>
<li>Do you know your app? When do you need to add more resources? How will your
app respond? How quickly? What happens when you scale down?</li>
<li>Do you know your traffic?</li>
<li>Autoscaling also has its limits. (E.g. there is going to be a delay.)</li>
<li>Do you understand the metrics that matter?</li>
</ul>
<blockquote>
<p>It&rsquo;s not magic, it&rsquo;s just automation.</p></blockquote>
<p>An autoscaling profile has:</p>
<ul>
<li>Capacity:
<ul>
<li>Minimum</li>
<li>Maximum</li>
<li>Default</li>
</ul>
</li>
<li>Triggers (a metric we want to trigger on and a value for that metric)</li>
<li>Recurrence (when do we want this to happen? Always? First week of the month?)</li>
</ul>
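<p>To make the parts of such a profile concrete, here is a minimal sketch in
Python. It is not the Azure API: the class names, the field names and the way
recurrence is modelled (as a set of weekdays) are assumptions made purely for
illustration.</p>

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Trigger:
    metric: str       # hypothetical metric name, e.g. "cpu_percent"
    threshold: float  # value of the metric that fires the trigger


@dataclass
class AutoscaleProfile:
    minimum: int  # never run fewer instances than this
    maximum: int  # never run more instances than this
    default: int  # fallback count (assumption: used when metrics are missing)
    triggers: list[Trigger] = field(default_factory=list)
    # Recurrence, simplified to ISO weekdays (1 = Monday ... 7 = Sunday).
    # An empty set means the profile always applies.
    weekdays: frozenset[int] = frozenset()

    def applies_on(self, day: date) -> bool:
        """Check whether the profile is active on the given day."""
        return not self.weekdays or day.isoweekday() in self.weekdays

    def clamp(self, wanted: int) -> int:
        """Keep a requested instance count within the capacity limits."""
        return max(self.minimum, min(self.maximum, wanted))
```

<p>A profile created with <code>AutoscaleProfile(minimum=2, maximum=10, default=3)</code>
would turn a request for 50 instances into 10, however hard the triggers push.</p>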
<p>Warnings with regard to autoscaling:</p>
<ul>
<li>Scaling out will happen when <em>any</em> of the triggers are met.</li>
<li>Scaling in will trigger when <em>all</em> triggers are met.</li>
<li>Autoscaling always wins over manual scaling settings.</li>
</ul>
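<p>The first two warnings boil down to an <em>any</em> versus <em>all</em> rule.
A rough sketch of that rule, assuming each trigger has already been evaluated
to true or false; the function name and the choice to let scaling out take
precedence over scaling in are my own assumptions:</p>

```python
def decide(scale_out_fired: list[bool], scale_in_fired: list[bool]) -> str:
    """Return "out", "in" or "hold" for one evaluation round."""
    # Scaling out needs only one of its triggers to be met.
    if any(scale_out_fired):
        return "out"
    # Scaling in requires every scale-in trigger to be met. The extra
    # truthiness check avoids scaling in when no triggers are defined,
    # because all([]) is True in Python.
    if scale_in_fired and all(scale_in_fired):
        return "in"
    return "hold"
```

<p>With this rule a single busy metric is enough to add instances, while
instances are only removed once every scale-in condition agrees.</p>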
<p>Okay, we know how to handle load. But do we know what to do when we have a
localized failure (e.g. a data center that goes down)? Use:</p>
<ul>
<li>Multiple regions/geographies</li>
<li>Region pairs</li>
<li>Availability zones</li>
</ul>
<figure><img src="/images/mitta2019_jeramiah_dooley_2b.jpg"
    alt="Jeramiah Dooley about the relation between Geography, Region Pairs, Regions and Availability Zones"><figcaption>
      <p>Jeramiah Dooley about the relation between Geography, Region Pairs, Regions and Availability Zones</p>
    </figcaption>
</figure>

<p><a href="https://azure.microsoft.com/en-us/services/frontdoor/">Azure Front Door</a>:</p>
<ul>
<li>Global HTTP load balancer with instant failover</li>
<li>Bonuses:
<ul>
<li>SSL offloading</li>
<li>Firewall and DDoS Protections</li>
<li>Central control plane for traffic orchestration</li>
</ul>
</li>
</ul>
<p>Future possible directions for scaling:</p>
<ul>
<li>Content delivery networks (CDNs)</li>
<li>Failovers</li>
<li>Traffic shaping</li>
<li>Draining data centers</li>
</ul>
<p>Additional resources:</p>
<ul>
<li><a href="https://github.com/Microsoft/IgniteTheTour/tree/master/SRE%20-%20Operating%20applications%20and%20infrastructure%20in%20the%20cloud/SRE40">Code</a>
(Tip: the code apparently has a very nice deployment script.)</li>
<li><a href="https://docs.microsoft.com/en-us/azure/app-service/manage-scale-up">Scale up an app in Azure</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/autoscale/autoscale-get-started">Get started with Autoscale in Azure</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions">Business continuity and disaster recovery (BCDR): Azure Paired Regions</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/frontdoor/front-door-overview">What is Azure Front Door Service?</a></li>
</ul>
<h2 id="responding-to-and-learning-from-failure--emily-freeman">Responding to and learning from failure &mdash; Emily Freeman</h2>
<p>Failure is inevitable.</p>
<p>Teach people how to solve problems. It will prevent you from being woken up for
everything at night.</p>
<p><img src="/images/mitta2019_emily_freeman_2.jpg" alt="Emily Freeman"></p>
<p>Having to react to failure leads to:</p>
<ul>
<li>Unplanned work</li>
<li>Context switching</li>
<li>Confusion</li>
<li>Increased time to repair</li>
</ul>
<p>We tolerate a certain level of failure every day. Things that are <strong>not</strong> fine
however:</p>
<ul>
<li>Alerts to inbox</li>
<li>Alerts to a common channel where they are mixed with memes, since you&rsquo;ll miss
the important stuff</li>
<li>Shoulder tapping</li>
<li>Delays in communication</li>
<li>Tiered and off-shore support</li>
<li>Network Operations Center</li>
</ul>
<p>We can borrow concepts from other disciplines. Like creating a space to
communicate and collaborate. Ideally there is a single space for all
communication for an incident; a space devoted to that single incident. The
benefits of having such a space: you&rsquo;ll have a time stamped record (which helps
the post-incident review), and it focusses communication.</p>
<p>Train your first responders and provide them with context and clear escalation
procedures.</p>
<p>Update your stakeholders throughout the incident.</p>
<p>When alerting, it is important to include the severity of the incident:</p>
<ul>
<li>Sev1 (critical): complete outage</li>
<li>Sev2 (critical): major functionality broken, revenue affected</li>
<li>Sev3 (warning): minor problem</li>
<li>Sev4 (warning): redundant component failure</li>
<li>Sev5 (info): false alarm or unactionable alert</li>
</ul>
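<p>The severity list above could be encoded along these lines. This is only a
sketch: the names are hypothetical, and the rule that only critical incidents
page someone is an assumption, not something stated in the talk.</p>

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # complete outage
    SEV2 = 2  # major functionality broken, revenue affected
    SEV3 = 3  # minor problem
    SEV4 = 4  # redundant component failure
    SEV5 = 5  # false alarm or unactionable alert


def alert_level(severity: Severity) -> str:
    """Map a severity to the alert level from the list above."""
    if severity in (Severity.SEV1, Severity.SEV2):
        return "critical"
    if severity in (Severity.SEV3, Severity.SEV4):
        return "warning"
    return "info"


def should_page(severity: Severity) -> bool:
    # Assumption: only critical incidents wake someone up.
    return alert_level(severity) == "critical"
```

<p>Attaching the severity to every alert lets the routing decide who gets
notified, instead of leaving that judgement to whoever happens to see the alert.</p>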
<p>The life cycle of an incident:</p>
<ul>
<li>Detection: discovering that there is an issue. You&rsquo;ll have to figure out what
is wrong ASAP.</li>
<li>Response: about 1/3 of your time is spent in this phase. The incident
commander (IC) is the person in charge of the incident response. A
communication chief (or scribe, or whatever you call this person) documents
the situation for a historical record. Don&rsquo;t use this to place blame, but to
analyze what happened.</li>
<li>Remediation: you know what went wrong and how to fix it. Always under-promise
and over-deliver in your timeline. ChatOps tools might help to remediate a
situation.</li>
<li>Analysis: host a post-incident review to learn from what happened. The goal is
not to produce artifacts; instead, the goal is to maximize organizational
learning. Prevent a similar incident in the future. Record countermeasures
and assign each one a due date and a responsible person.</li>
<li>Readiness: prepare for the future.</li>
</ul>
<blockquote>
<p>Incidents are not linear and neither is the response.</p></blockquote>
<p>The last two phases are often forgotten, even though they are the most
important: they prevent the same failure from happening again.</p>
<p>Additional resources:</p>
<ul>
<li><a href="https://github.com/Microsoft/IgniteTheTour/tree/master/SRE%20-%20Operating%20applications%20and%20infrastructure%20in%20the%20cloud/SRE50">Code</a></li>
<li><a href="https://risk-engineering.org/concept/Rasmussen-practical-drift">Rasmussen and practical drift</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/logic-apps/">Azure Logic Apps Documentation</a> (concerning building automated workflows)</li>
<li><a href="https://docs.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview">Overview of alerts in Microsoft Azure</a></li>
<li><a href="https://docs.microsoft.com/en-us/learn/modules/build-chat-bot-with-azure-bot-service/">Build a chat bot with the Azure Bot Service</a></li>
</ul>]]></content>
  </entry>
</feed>
