Posts tagged with “conference” on Mark van Lent’s weblog

PyGrunn 2024

2024-05-17T16:42:29Z

Notes from my day at the 12th edition of PyGrunn.

PyGrunn is a Python focussed, one day conference held in Groningen¹. Or as the organizers more eloquently phrase this:

PyGrunn is the “Python and friends” developer conference with a local footprint and global mindset. Firmly rooted in the open source culture, it aims to provide the leaders in advanced internet technologies a platform to inform, inspire and impress their peers.

Before I start with my notes I want to give a shout-out to Reinout van Rees. His (PyGrunn) summaries are excellent. I’m always impressed by the quality of them and how little time he needs to write them. Where I’m only able to make notes and need to write them out afterwards, Reinout has the summary (as a coherent story) ready before the speaker has unhooked their laptop. So if you are interested in one of the talks he has attended, head over there first.

Platform Engineering: Python Perspective — Andrii Mishkovskyi

Why do platform engineering teams exist? We’ve seen a “shift left” of responsibilities towards the developer e.g. QA, operations (DevOps). But we (software developers) are trained to write code, not to e.g. monitor it.

So where does a humble developer start? There is an abundance of choices on the tools to use. What do you pick for package management? Or continuous integration? Or deployment, code quality, observability, etc. This freedom of choice comes with a cost. We start with discussing which tools to use instead of the problem the customer is facing. And depending on which choice we make, the result may make reasoning about the software more complex.

So what should you do?

Andrii broke it down into three parts:

You observe (i.e. you read a lot to get the lay of the land)
You execute (this is the actual software development part)
And then you collect feedback (you reach out to teams, you observe how their work has changed)

Some of the platform engineering team deliverables:

Documentation
Self service portal (tip: look into Backstage)
Boilerplates
APIs

The “consumers” are developers, compliance teams and other platform teams. The goal is to:

Have reasonable defaults
Remove redundancy
Keep things consistent

To get a feel for the scale of things: at Andrii’s company there are over 160 services, developed on by 300+ developers in 500+ repositories. They total up to 3 million lines of code and 6 million lines of YAML.

Templates (boilerplates) provide a paved path. The goal is to have teams spend as little time as possible when starting a project. The templates use a certain set tools that are supported. Teams are free to use different tools though if they want to.

Andrii uses cookiecutter templates at work. It’s not his choice perse, but it’s what was already in place when he joined. There are currently three templates in use. They have evolved over time. For example in the last nine years, over 800 changes have been made (that is more than one change per week on average).

The evolution has left its marks: the templates have a lot of code duplication. There is also code specific to a tiny minority of projects (only about 8 out of the 500+). This means that most projects start with deleting code after using the boilerplate.

And that also relates the downside of using cookiecutter the way they do. Instead of just using it to get started with a project, they also use it incrementally. But cookiecutter does not have versioning built-in. So if you remove a file and then reapply cookiecutter again, the file is happily created again.

While Andrii is aware of the issues (and thinks copier might be a better alternative), it is hard to replace practices that are already in use. And cookiecutter is great for getting started with a project.

With regard to the standardization, they use the following in the templates:

pyproject.toml for all projects
Poetry (instead of setuptools + pip-tools)
sprinkle Renovate on top for automatically updating dependencies

As it currently stands, there are 99 project that migrated to pyproject.toml and Poetry in the last 2 years. It makes sense because it takes time for projects to transition. Plus they are not required to migrate; again: the template are there to help, not to limit the users. Renovate has been adopted more quickly.

Migrating from e.g. pkg_resources to pkgutil or PEP-420 for namespaces packages is hard. Templates can help with that. However, cookiecutter does not actually manage files. So if a file has been removed from a template, rerunning cookiecutter does not remove the file from the project. So that requires some care.

When they migrated from a monolithic application to a microservices architecture, authentication/authorization became an issue. There was no visibility for teams, no transparency and no accountability. To combat this, they created an API where applications can declare the required access and scopes in a YAML file. Maintainers can approve this access. And this also allows for CI/CD to check access. A CLI tool can verify the validity of the YAML and check if access is actually approved.

Securing your team, solution and company to embrace chaos — Edzo Botjes

Edzo started by sharing a link to his slides and warned us that it usually takes two weeks for people to digest the contents of his talk. He would overload us with information. And that’s where I decided to solely concentrate on the talk and not on note taking.

Nobody knows what they are doing. Embrace this.

Even a simple, deterministic system like a double pendulum has a nondeterministic outcome. In other (my) words: the whole world is in chaos and unpredictable.

When presented with information everyone processes it differently and understands something else (also see viral phenomenon of the dress).

What worked for Edzo was to embrace chaos. To do this he let go of his desire to control and trying to create a predictable outcome.

Descriptors: Decoding the Magic — Alex Dijkstra

Many people have used descriptors without them even being aware of it.

From the documentation:

Descriptors let objects customize attribute lookup, storage, and deletion.

—Descriptor Guide

You can view descriptors as reusable @properties. A descriptor implements the __get__ and __set__ methods (and when needed __delete__).

Alex showed a bunch of examples. This is the template he showed to introduce descriptors:

class MyDescriptor:

    def __get__(self, obj, owner):
        # owner is the class to which the instance belongs
        return obj.__dict__.get(self.private_name)

    def __set__(self, obj, val):
        # self is descriptor instance.
        obj.__dict__[self.private_name] = val

    def __set_name__(self, owner, name):
        self.public_name = name
        self.private_name = f'_{name}'

His example of how to use this:

class MyClass:
    value = MyDescriptor()


myinstance = MyClass()
myinstance.value = 5  # MyDescriptor.__set__
myinstance.value      # MyDescriptor.__get__

Using descriptors you can do things when getting or setting the value. E.g. in a class you can enforce that an attribute has a certain type.

The __set_name__ method was introduced in Python 3.6. It is not needed to use this in all of your descriptors, but also doesn’t hurt.

Do you want to use descriptors all over the place? No. They can be useful: you can create a clean APIs with them and this helps if the API is used frequently. However, it does create some overhead and the code is a bit more complex.

Resources:

https://docs.python.org/3/howto/descriptor.html
Luciano Ramalho’s book Fluent Python (Luciano also did a few talks about descriptors)

Release the KrakenD — Erik-Jan Blanksma

KrakenD is an API gateway product. Erik-Jan likes it so much he wanted to share his experience with it.

Projects can start out simple, with a monolith that is accessed via a web client. Before you know it, there are several services and multiple types of clients. The solution is to introduce an API Gateway in the middle. It can then handle the incoming requests.

API Gateway in short:

It sits between clients and backend (as a sort of portal).
It hides internal complexity of backend for the clients.
The gateway is a great place to introduce things like authorization/authentication, logging, load balancing, caching, etc.

KrakenD is one of the available API Gateways. It is open source, but there’s also an enterprise version with extra features. KrakenD is implemented in Go and offers a bunch of features out of the box (monitoring, throttling, request and response manipulation). KrakenD offers integrations with e.g. tools (Jaeger, Grafana, the Elastic stack), authorization/authentication services and queues (RabbitMQ).

KrakenD is a stateless process, so no database is needed. It takes JSON (or YAML) config. It can combine the results of multiple backends API calls and return it as a single response.

Tips:

Use the KrakenDesigner (makes it easy to explore what’s possible). Note that you do not want to use this in production.
You’ll want to split up the configuration when it grows. By using flexible configuration you can combine so called partials, settings and templates.
KrakenD can check and even audit you configuration.
Since the configuration is in JSON, you can generate OpenAPI.json from the KrakenD config. You can use this for Swagger.

KrakenD is a great tool to manage your APIs. It is lightweight, fast, easy to configure and has lots of functions out of the box. It is versatile and extensible. By using it you can make your architecture more agile.

However, it also means that you will have to manage the API Gateway configuration. And a change in the configuration means you will have to restart the process.

General tips

Some general notes:

Have a look at:
- Pydantic
- Chalice
- Xata
It can be helpful use functional programming (e.g. using closures) instead of by default using classes and methods.

“Grunn” is what Groningen is called in the regional language ↩︎

All Day DevOps 2022

2022-11-10T21:32:11Z

Today was the 7th All Day DevOps. Just like last year the organizers managed to get 180 speakers, spread over 6 tracks, to inform, teach and entertain us. As usual I made some notes of the talks that I attended.

Keynote: The Rise of the Supply Chain Attack — Sean Wright

Dependency managers (like Apache Maven) changed the way software is being developed: using those makes it much easier to use dependencies in your application (you don’t have to fetch them manually yourself). As a result, building your own libraries is now less common, using an available (open source) library is much easier.

To get a feel for the scale: NPM serves 2.1 trillion annual downloads (requests). This obviously includes automatic builds, but the growth over the last couple of years has been exponential none the less. And using third party libraries can be a very good thing. Good libraries can be much more secure and have less bugs than some library developed in house.

Types of supply chain attacks:

Dependency confusion: Register a private package (which you know is used by a company) in a public repository. If the package manager searches for the package in the public repository before the private one, the malicious package is used.
Typo squatting / masquerading: Use misspelled or “familiar” names in the hopes someone uses the wrong package.
Malicious package: Malicious payload in a package that promises (and perhaps even delivers) useful functionality.
Malicious code injection: Compromise of an existing package (e.g. by compromising the build system or repository).
Account takeover: Malicious actor can log in as the original author.
Exploiting vulnerabilities in packages: Using a known or unknown vulnerability in a library.
Protestware: For example a legitimate library with notes protesting against the war in Ukraine.

After this Sean unfortunately dropped from the session.

Taking Control Over Cloud Costs — Amir Shaked

Reasons to analyze your cloud costs:

Costs translate to profit
Usage is directly tied to costs (and if you don’t properly monitor, your spend could increase unexpected)
Be in control: by monitoring you can determine what impact certain actions have on your business
Be intelligent: by knowing what you spend where, you can connect it to your business KPIs
Save money

You’ll want to be aware when you scale (hyper growth) if your costs grow faster than the number of users/sales/etc. You don’t want to end up in a situation where more success means less profit.

Things to look for in your environment:

Unused resources
Over commitment
Under commitment
Price per unit we care about (e.g. users, sales, etc)
Bugs

Best practices you can arrange from day one:

Export billing data and create e.g. dashboards
Structure and organize resources for fine-grained cost reporting
Configure policies on who can spend and who has administrator permissions
Have a culture of “showback” in your organization. Make your employees aware of the costs, have budgets and alerts.

Beyond Monitoring: The Rise of Observability Platform — Sameer Paradkar

Customer experience is important for your revenue. You need to deliver your service and make sure you know how well you’re able to do so. Observability helps you find the needle in the haystack, identify the issue and respond before your customers are affected.

Pillars of observability:

Metrics: Numeric values measured over time. Easy to query and retained for longer periods.
Logs: Records of events that occurred at a particular time (plain text, binary, etc). Used to understand the system and what’s going on.
Traces: Represent the end-to-end journey of a user request through the subsystems of your architecture.

Relevant key performance indicators (KPIs) and key result areas (KRAs):

Customer experience
Mean time to repair (MTTR)
Mean time between failure (MTBF)
Reliability and availability
Performance and scalability

An observability platform gives you more visibility into your systems health and performance. It allows you to discover unknown issues. As a result you’ll have fewer problems and blackouts. You can even catch issues in the build phase of the software development process. The platform helps understand and debug systems in production via the data you collected.

AIOps applies machine learning to the data you’ve collected. It’s a next stage of maturity. Its goal is to create a system with automated functions, freeing up engineers to work on other things. Automating remediation of issues can also greatly reduce response time and mean time to repair. This means that the customer experience is restored faster (or is never degraded to begin with).

Failure Is Not an Option. It’s a Fact — Ixchel Ruiz

Failure can cause a deep emotional response, we can get depressed and it can make us physically sick. On one side of the spectrum, a failure can cause harm to other people. On the other side we could embrace failure and make things safe to fail. Failure in IT on the project level is quite common. Failure can also happen on a personal level.

Not all failures are created equally. There are three types:

Preventable: These are the “bad” failures. There’s no reason to allow these.
Complexity related: These failures are due to the inherent uncertainty of work. Usually it is a particular combination of needs, people and problems that cause them.
Intelligent: The “good” failures, since they provide new knowledge.

Steps to learn from failures:

Recognize
Understand the cause
Extract lessons to prevent future failure
Share the information
Practice for the next failure in a safe and controlled setting

Increase return by learning from every failure, share the lessons and review the pattern of failures. Do note that none of this can happen in an environment without psychological safety. You need to feel safe to discuss your failure, doubts or questions to be able to learn.

Accurate Metrics — Christina Zeller and Marcus Crestani

A manufacturing process was monitored and there was a nice dashboard to show whether there were any problems. However, at a certain moment there was a problem, but the dashboard was still claiming everything was fine.

What was going on? Unreliable network? Erratic monitoring system? Flawed collection of metrics? It turned out to be all of them.

Sampling rates, retention policy and network issues can cause missing measurements in your time series database. This missing information can cause a drop in failure rate if you are unlucky enough that the missing samples are failed ones. So you think everything is fine, but there is something wrong in your environment.

Modelling metrics differently can help. One possible improvement is to have a duration sum in your metrics instead of just the duration. If you now miss a sample, the sum will still indicate that there has been a failure.

Using histograms is even better since you place values in buckets, e.g. failures in our case. A disadvantage however is that the metrics creation system must now also know what qualifies as a success or failure.

Takeways:

Always collect metrics
Check that your metrics match reality
Always consider missing data
For duration metrics always use histograms

After improving their metrics, the monitoring system matches the actual state of the manufacturing environment again.

Tip: look into hexagonal architecture, also known as “ports and adapters architecture.”

Why Is It Always DNS, TLS, and Bad Configs? — Philipp Krenn

DNS, TLS and bad config are where failures are waiting to haunt us when we least expect it. We need to have a tool to find the issues in our system early on. Health checks can be this essential tool to alert us.

You are able to narrow down where the issue is if you structure your health checks like this:

Outside the network (different provider): This allows you to detect issues with e.g. DNS, the uplink, a firewall, a load balancer, service availability and latency.
On the network (different AZ): If you are on the network, you could see issues with the network, firewall, TLS, service availability and latency. You can compare your measurements from outside with what you see on the inside.
On the instance: Again, by comparing the local point of view with the outside, you can detect issues with service availability, proxy vs service, latency, dependencies (e.g. a database that is slow or the service cannot reach)

Examples of tests you can use:

Ping a host
Setup a TCP connection
Do an HTTP request
POST data to a service
Synthetic monitoring where you simulate button clicks

Health checks are cheap to run and give you a fast overview. They do not replace observability and only tell you something is broken, not what is going on. Start simple and only add synthetics when/where needed since they more complex.

Comprehending Terraform Infrastructure as Code: How to Evolve Fast & Safe — Anton Babenko

By using Terraform AWS modules you’ll have to write less Terraform code yourself, compared to using the AWS provider resources directly.

For example: if you write your own Terraform code from scratch for an example infrastructure with 40 resources, you’ll need about 200 lines of code. Once you introduce variables, that code base will grow to 1000 lines of code. When you then split it up into modules, you’ll need even more code.

Instead, if you use terraform-aws-modules you’ll have more features than your own modules would only need about 100 lines of code.

Questions with regard to terraform-aws-modules:

Why?: Understandability is an important issue. To help with these questions (like “why is it done this way?") there are autogenerated documentation and diagrams. A visualization of the dependencies also helps.
Does it work?: The project focusses on static analysis (using tflint and terraform validate) and running Terraform on examples. Anton thinks that testing code should be for humans (so HCL is better than Go).
How to get feedback quicker?: Options: run Terraform locally instead of in CI/CD pipeline, simplify the code, and restrict surface area of your code using policies, guardrails, etc. Split up your Terraform code, with different states, to speed up Terraform runs.

Most important word for this talk: understandability.

50-67% of time for software projects spent on maintenance.

Some useful links Anton listed:

Keynote: Journey to Auto-DevSecOps at Nasdaq — Benjamin Wolf

Why DevOps? It can be pretty complicated to explain, even though it is an obvious choice for Nasdaq. For most people Nasdaq is a stock exchange but it is actually a global technology company. Sure, they run the stock exchange, but also provide capital access platforms and protect the financial system.

Nasdaq develops and delivers solutions (value) for their users. They manage and operate complex systems. It has been around for a while. So again, why DevOps? The answer is: to get better at their practices.

They want to deliver solutions (value) to their users efficiently, reliably and safely.
They want to manage and operate complex systems efficiently, reliably and safely.

Years ago they had manually configured static servers and the development teams were growing. They automated software deployment to a point where the product owners could trigger the deployment, and even pick which branch to deploy. This was an important first evolution. They had a “DevOps team” to handle this automation.

The second evolution for Nasdaq was moving from a data center to the cloud, using infrastructure as code (IaC). The question they asked themselves was what to do first: migrate to cloud or get their data center infrastructure 100% managed via IaC? They made the ambitious decision to do both at once.

By turning your infrastructure into code, you can create and destroy the environment as many times as you like. And this was welcome: after about 2100 times they “got it right” and were able to move over the production environment to the cloud. Without IaC this would not have been possible as flawlessly as it did.

The cloud and IaC brought them:

Efficiency: maintenance, patching, capacity
Reliability: self-healing, immutable
Safety: immutability, destroy/create

Over time the DevOps team started to handle a lot more work. The team consisted of system administators, but they were required to work as developers (make code reusable, use git, etc). The DevOps team started to complain about being overloaded and they became a bottleneck since a lot of development teams came to them with problems (failing builds, cloud questions).

On the other side, the development teams stared to complain because they are dependent on the DevOps team but that team had become the bottleneck. And “just” scaling up the DevOps team would not solve the problem.

Where the second evolution was about the technology, the third evolution was about efficiency. They moved to a “distributed DevOps” model. Developers were empowered: access to logs and metrics, training (cloud, Terraform, Jenkins). By creating a central observability platform, developers could get insight in what is going on, without the need to have access to the production environment.

This resulted in more deployments and enhanced reliability of the deployments because of the observability platform.

A year or three later, new cracks appeared. Standards were diverging because teams were allowed to pick their own path (libraries, databases, pipelines, Terraform code, etc). It also lead to practices that needed to be fixed (e.g. lack of replication). Standardizing this led to quite a burden at the start of a project: lots of basic stuff to setup.

Developers needed to be experts in a lot of technology, from JavaScript and the JavaScript framework in use, via multiple .NET versions to Terraform and other deployment related tech. An easy way to solve this situation was to flip back to the previous situation with a single team responsible for deployment and such. But this would basically mean recreating the bottleneck.

The ownership itself was not the problem. The efficiency was, because of the boilerplate needed for teams. Stuff you want to do the same across the teams. They wanted to empower the development teams, but also give them the standards for databases, messaging, etc. Instead of copying a template and have teams diverge afterwards, they looked into packaging to also make it easier to update afterwards.

This lead to evolution four with marker files, packages, code generators and auto-devops pipelines. The pipeline looks at the markers (“hey, this is a .NET app”) and can then apply a standard pipeline. Nasdaqs code generators create the boilerplate for the teams so within two minutes of starting with a new application you’re able to write code to solve your business problem, instead of having to create boilerplate code yourself first.

The developers can get up and running quickly, but in a safe way.

The development teams are all DevOps teams now, but Nasdaq also has a specialized team for the complex areas (hardware, networking, etc). There is also a “developer experience” team that focusses on the tools for the developers, like the code generators.

Current status with regard to our three key areas:

Efficiency: code generators, CLI, package setup
Reliability: package standards, pipelines
Safety: code scanning, package scanning, disaster recovery preconfigured

Varieties of Incident Response — Kurt Andersen

No matter how reliable our systems are, they are never 100% — an incident can always happen. When the pager goes off, the first step is to recruit a response team. This team can then observe what is going on. They need to figure out what this means (orient themselves) and decide what to do. And finally they can act to resolve the problem. (The OODA loop.)

Getting the right people involved can be hard for the technical responder; they themselves might want to dive into the technical stuff first. This is where the “incident commander” role comes in. The incident responder will recruit the team to get the right people involved, coordinate who does what and handle communication with people outside of the response team. (The latter can also be handled by a dedicated “comms lead” if needed.)

But how does the incident commander get involved?

A fairly standard approach for an on-call system will be to have a tiered model: tier 1 (NOC), tier 2 (people generally familiar with the system) and tier 3 (the experts of the system having the issue). The problem with this model: where does the incident commander come from? Tier 1? If so, can the incident commander follow through if the issue is handed over to the next tier?

Another model (“one at a time”): team A gets involved, decides it is not their responsibility, hands over to team B, which kicks it to team C, etc. Where does the incident commander come from in this model?

The aforementioned models only work in the simplest cases. They share a few big problems: handoffs are hard and there is no ownership, which results in loss of context. To mitigate this, some teams have an “all hands” approach where everyone is paged and everyone swarms into the incident response. However, most people on the call (or in the war room) cannot contribute. This leads to a mentality of “how quickly can I get out of here?”

Yet another approach is an Incident Command System (ICS), which comes from emergency services. In this approach the alert goes to the incident commander who then involves the team. While this works in some organizations, in tech it’s usually a bit too regimented.

The ICS morphed to an “adaptive ICS” where the technical team has more autonomy, but the incident commander is still involved. This system can be scaled up to where there’s an “area commander” role which coordinates separate teams (via their respective incident commanders).

Summarizing the roles of the parties in the “response trio”:

incident commander: coordination
tech team: nitty-gritty to solve the problem
comms lead: maintain contact with stakeholders

Each role will perform their own OODA loop from their own perspective.

But we started the story in the middle. We need to get back to the beginning and ask the question “why is the pager making noise?” Perhaps the first question one should ask is: “is this something actionable?” If it is not or if it is something you can handle in the morning, perhaps you do not have to respond in the middle of the night.

Cut down the noise and focus on the signal.

The E Stands for Enablement - Modern SRE at PagerDuty — Paula Thrasher

PagerDuty had an idea what SRE meant: they were enablers. You can hit them up on Slack and they help you out with a problem. Having an SRE team initially reduced the total minutes in incident in a year. But when PagerDuty grew further, the number went up again. Oops.

The “get well” project required teams to have the following:

Fast rollbacks: all rollbacks have to happen within 5 minutes. The SRE team provided guidelines to help teams achieve this.
Canary deploys: teams must test changes (canary) first. The SRE team enabled this via tooling.
Product limits: reasonable limits. The SRE team made this possible via telemetry and monitoring.

Results:

Lower number of minutes in major incident
Lower mean time to resolve
More “incidents averted”: four minor incidents were resolved before they caused real harm.

Important elements that made this “get well” project possible:

Clear, measurable goals
Enabled by tools and templates
Gave teams a path to do it
Experts were available to coach
The teams owned the implementation themselves
The teams were held accountable

PagerDuty used Backstage as an internal “one stop shop” developer portal with documentation and insights. It also integrates with the development systems.

All Day DevOps 2021

2021-12-09T21:14:34Z

All Day DevOps, the free online DevOps conference that goes on for 24 hours, was held for the 6th time today. With a total of 180 speakers spread over 6 tracks, there’s even more content than the last time I attended. These are the notes I took during the day.

Derek Weeks opening the day for All Day DevOps

Keynote: CAMS Then and Now and the Path Forward — Thomas “VikingOps” Krag

When people think about DevOps, they think of CAMS, which stands for:

Culture
Automation
Measurement
Sharing

The term CAMS was formalized by John Willis and he wrote down his idea in the article What Devops Means to Me. This talk will focus on the most important aspect: culture, and specifically organizational learning.

Culture

Culture as described by Simon Sinek:

A group of people with a common set of values and beliefs. When we’re surrounded by people who believe what we believe something remarkable happens. Trust emerges.

—Simon Sinek

The 5 disciplines from organizational learning:

Shared vision: Have people play the game not just because of the rules, but because they feel responsible for the game.
Mental Models: Breaking your own assumptions and how they can influence your actions. You need to self-reflect on your own beliefs.
Personal Mastery: This is about owning yourself, learning and achieving your personal goals. And to do the latter you first need to define what is important for yourself.
Team learning: Effective teamwork leads to results that persons cannot achieve on their own. And individuals working as a team can learn faster than they would by themselves.
Systems Thinking: Looking into patterns that emerge inside your organization from a holistic viewpoint instead of just looking at your own team.

Automation

Automation is about automating the right things. You don’t want to do it just to automate things, but it has to fit in the system. So you have to start with culture before you think about automation. The ultimate goal is GitOps where everything happens via a pull request.

Measurement

Measurement started out with a focus on the tooling. But measuring how you work is as important as measuring your infrastructure.

Important key metrics, from DORA:

deployment frequency (how often do you release to production)
lead time for changes (the amount of time for a change to get into production)
time to restore service (how long does it take to recover from a failure in production)
change failure rate (the percentage of deployments causing a failure in production)

Share how you are doing (in) DevOps. This is how we ended up here now. Share how you are improving your own organization. But also share information within your organization: documentation, videos, presentations, open spaces, lean coffee sessions

Three ways

CAMS originated in 2010. The three ways are principles that came out of the book The Phoenix Project. They are:

1st way: create flow (systems thinking)
2nd way: feedback loops
3nd way: experimentation, risks, learning

If we map these to CAMS:

Culture: 3rd way
Automation: 1st way
Measurement: 2nd way
Sharing: 3rd way

Working on your culture is as important as doing your actual work.

So what’s next? CAMS is still as applicable as 10 years ago. It has always been important, but it was only put into words in 2010. We need to continue sharing to get the full value out of it.

Gamification of Chaos Testing — Bram Vogelaar

We think of Usain Bolt as the record breaking athlete, but he’s also the person that worked really hard to get there. It takes a lot of time and effort to become good at something. Pilots and firemen spend most of their time training and not doing what you expect them to do; just to make sure they perform well under pressure. Also note that pilots use a lot of checklists to prevent mistakes.

We should do the same: train for when our platform is in an error state. We should not just be able to detect it, but also solve the problem.

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This practice started at Netflix with Chaos Monkey.

We need to become comfortable with experimenting. Have game day exercises and analyze what happened, to improve your training. Do not just focus on the result of the exercise itself, but also ask questions like “was it the right experiment?”

Now that we use containers, add sidecar containers with tools to get metrics or detect errors. Or to do chaos engineering e.g. with Toxiproxy.

Since checklists are boring, we can use gamification to spice things up. Celebrate failure, and learn from it!

Living in the year 3000: breaking production on purpose on Saturdays and have the system remedy the problem itself.

Convince management that failure is normal and expected behaviour. Promising 100% uptime is not realistic. Large, complex systems will always be in a (somewhat) degraded state.

Let engineers be scientists to deal with this complex environment. Give them training, allow them to do tests (experiments), which results in having valid monitoring that lead to actionable alerts. Get the engineers in a state where they are comfortable with failures.

(Slides)

Watch Your Wallet! Cost Optimizations in AWS — Renato Losio

Each AWS service has it’s own price components. It’s a complex subject. Even a simple service like a load balancer has multiple components. It looks simple with the “$0.008 per LCU-hour” price tag, but now you have to figure out what an LCU-hour is. Then you learn it has four dimensions that are measured: number of new connections per second, active connections, processed bytes and rule evaluations. Good luck predicting the costs.

This presentation only sticks to the basics since cost optimization it such a big topic.

To start to manage/reduce your costs, you need to enable billing and costs for your DevOps team. If you cannot measure your costs, you cannot manage it. Note that you do not need an excessive amount of tags for cost management. First you need to figure out what you are going to change and how it’s going to affect the bill for your company.

When savings can be measured, they can be recognized, and cost efficiency projects become exciting opportunities. As of early 2021, the most viewed dashboard at Airbnb is a dashboard of AWS costs.

—Anna Matlin, Airbnb

What patterns can we avoid? In most organizations, the most expensive parts of your bill will be:

Compute
Storage
Data transfer

So we will dive into these subjects.

Compute

Tips to reduce costs:

Avoid fixed IP addresses where possible. This is not so much about the costs of the IP address itself, but mostly because architectures that require a fixed IP tend to be complex and more expensive.
Use Graviton (ARM) instances. They have a better price to performance ratio. You can also migrate managed services (RDS, Lambda functions) to Graviton.
Check out Lightsail. It’s less flexible than using EC2 instances, but cheaper.

Data transfer

We usually forget to take data transfer costs into account upfront. It’s also a complex subject and there are a lot of considerations to make.

Tips:

Multi AZ is always good, but are 3 zones always better than 2 zones?
Multi region is easy to do nowadays, but expensive with regard to data transfer. Ask yourself if you really need it.
If you want to use multiple cloud providers always think about the data transfer costs. Perhaps the storage costs are lower at provider X, but if you have to transfer data, you’ll probably pay an egress cost.
Using a CDN (CloudFront) might be cheaper than paying for the egress data transfer otherwise.

Storage

Initially it was simple: there were only two storage classes. Currently there are 6 different classes with their own prices and characteristics.

The best option is to use lifecycle rules to move data between different classes. Note that you in a lifecycle policy you cannot filter the objects based on an extension (e.g. *.jpg) but instead you need to think “from left to right.” So you need to think upfront about the prefixes you are going to want to use in your bucket.

Managed services can offer you automatic and manual backups, which can be great. But what is the cost of that? Check how much retention you need for example.

With regard to EBS: for most use cases gp3 is better and cheaper than gp2 (except for very large volumes). Note that you can change the EBS volume type without stopping the machine. In most cases you can have a 20% cost saving without affecting your performance.

AWS changes quickly

After each AWS re:Invent, your deployment is probably outdated with regard to cost optimizations. Examples:

Use gp3 instead of gp2 for your EBS volumes.
The instance type m6 might be more interesting than the m5 or m4 you may currently be using.
Dublin was the cheapest region in Europe, but for most things Stockholm is cheaper than Dublin at the moment.

So keep up to date with the offerings.

Keynote: Call for Code with The Linux Foundation: Contributing to Tech-for-Good Even if You’re Not Techincal — Daniel Krook and Demi Ajayi

Call for Code is a multi year program launched in 2018 to address humanitarian issues and help bridge potential solutions. Last year the global challenge was around climate change and a track was added for the social and business impact of the COVID-19 pandemic.

It’s not just about generating ideas to take on the issues. It should eventually also lead to an adopted open source solution that is sustainable.

The 14 projects discussed today can be found on Call for Code page on the Linux Foundation website. You can also read about them via the IBM developer site. GitHub is central for how they iterate on the features. The related organizations are :

The key takeaway for this session is to make us help improve how the projects do DevOps. The goal is to ensure that everyone can contribute to the projects, can do so with confidence and the projects can be deployed with speed.

The Call for Code for Racial Justice open source projects are categorised in three pillars:

Police reform and judicial accountability:
Diverse representation:
- Take Two
Policy and legislative reform:

Demi talked about each of these projects in the program and their tech stacks. You can read more about them on the Call for Code for Radical Justice section on the IBM developer site.

Daniel in turn talked about other Call for Code projects:

Even if you cannot code, there are numerous ways you can contribute, e.g. conducting user research, write/review documentation, do design work, advocacy like speaking at conferences.

There are multiple ways to get involved:

Via the virtual community (join slack, join events)
Directly via the projects on GitHub
Conduct outreach (recruit friends, host events, etc)

DevSecOps Culture: Laughing Through the Failures — Chris Romeo

Chris shared 10 DevSecOps failures and talked about how to change the culture and turn these failures into successes.

#1 Name and brand

This quote nicely sums it up:

Can we stop the {Sec}Dev{Sec}Ops{Sec} naming foolishness?

Just call it DevOps and focus on making security a natural part of building stuff.

—Julian Vehent, @jvehent

To change the culture:

Embrace security and make it part of DevOps
Teach this new definition to everyone involved with your project (developers, testers, product managers, executives, etc)
Create a “DevSecOps” swear jar ;-)

#2 The infinity graph

The problem is that nothing ever gets done with this infinity graph. It’s not an accurate representation.

The solution is to talk about pipelines instead and integrating security into them. Code review should include security, vulnerability scanning should be part of the pipeline, etc. Ban the infinity graph.

#3 Security as a special team

Creating a specific security team is the opposite of what DevOps is about. It’s about working together. Security isn’t a specialty, it is the responsibility of everybody. This requires knowledge and expertise.

The other way around is also true: teach security people to code. They don’t have to become great coders, but it would be nice if they can review code and make suggestions to make things more secure.

#4 Vendor defined DevOps

Sometimes we let vendors define what DevOps and security are for us, via the products that they offer. It would be better to find the best of breed outside of the offering of cloud provider. Take a vendor independent approach and determine what DevOps means to you.

#5 Big company envy

Looking at the big companies can be discouraging. You most likely have not invested the same time in it as e.g. Netflix, Etsy, etc. have. So while you won’t be at the same level, don’t see this as an excuse to give up. Do the DevOps that you do. Don’t fixate on the top of the class. Get on that path and make incremental progress.

Use the OWASP DevSecOps Maturity Model to create a roadmap.

#6 Overcomplicated pipelines and doing everything now

This can be a complicated subject (see for example the DevSecOps Reference Architecture from Sonatype).

But keep it simple! Start with a small subset of security tools. Everybody related to your project should be able to explain the build pipeline. Don’t try to solve all problems immediately. Take a phased approach.

#7 Security as gatekeeper

Security might want to slow down the pipeline and act as a gatekeeper. Don’t say “no,” but “yes, if…” For example: “yes, that would be a great feature if you enable multifactor authentication.”

Practice empathy. Both security people for developers, but also the other way around.

#8 Noisy security tools

You buy a tool and enable every option to “get your money’s worth.” The result is 10,000 JIRA tickets of things that need to be fixed. This does not help.

It would be better to tune the tools and don’t waste time with security findings that do not matter.

Start with a minimal policy focussing on the largest issue. Developers will then start to trust the tool and then you can slowly increase the policy.

#9 Lack of threat modelling

You scanning tools cannot find business logic flaws; there’s no pattern to it.

Perform threat modelling outside of the pipeline. It should be done when new feature assignments go out.

#10 Vulnerable code in the wild

There are lots of vulnerabilities in open source software; this is a supply chain problem.

To improve this: embed software composition analysis (SCA) in all your pipelines. Set the SCA policy to fail when a vulnerability is detected. If you filter out a vulnerability (e.g. because there’s no fix yet), make sure the filter will not be active forever.

Successes

The 10 DevOps successes we can distil from the failures above:

Just call it DevOps
Pipelines
Security for everyone
Embrace your DevOps
Be content with your DevOps
Simple and staged pipeline
Security as trusted partner
Tuned and valuable security tools
Threat modelling for everyone
Breaking the build for vulnerabilities

Key takeaways

Hopefully the real impact of DevOps (when looking back 50 years from now) is going to be security culture related.
Some of the failures were outside of your control, but you can change them.
Everybody has to code, choose your DevOps, keep it simple, lower the noise, add the best practices we discussed, do some threat modelling, break the build for vulnerabilities.
Embrace and laugh at the failures.

Keynote: Managing Risk with Service Level Objectives and Chaos Engineering — Liz Fong-Jones

Besides being a principal developer advocate at Honeycomb, Liz is also a member of the platform on-call rotation. Honeycomb deploys with confidence up to 14 times a day, every day of the week—so also on Fridays. How do they manage to (mostly) meet their Service Level Objectives (SLOs) while also scaling out their user traffic?

Their confidence recipe:

Quantify the amount of reliability.
Be able to identify risk areas that might prevent them from fulfilling their targets.
Test to verify that the assumptions about the systems are correct.
Respond to that feedback to address the problems found via those experiments or though natural outages.

How to measure reliability?

You need to know how broken is “too broken.” You don’t have to alert on all problems when working at scale. You need to measure success of the service and define SLOs. These are a way to measure and quantify your reliability.

Honeycomb’s jobs is to reliably ingest telemetry, index it, store it safely and let people query it in near-real-time. Honeycomb’s SLOs measure the things that their customers care about.

For example, they have set an SLO that the homepage needs to load quickly (within a few hundred milliseconds) in 99.9% of the times. User queries need to be run successful “only” 99% of the time and are allowed to take up to 10 seconds. On the other hand: ingestion needs to succeed in 99.99% of the time since they only have one shot at it.

Services are not just 100% down or 100% up (most of the time).

These metrics help Honeycomb make decisions about reliability and product velocity. If the service is down too much, they need to invest in reliability (since having features that cannot be used does not add value). On the other hand: if they exceed the SLO, they can move faster.

Recipe for shipping reliably and quickly

Practices used by Honeycomb:

Code is instrumented. Think beforehand what a success or failure in production would look like.
Functional and visual testing using libraries that create snapshots.
The design for feature flag deployment (only making a feature available to a small percentage of users initially).
Practice automated integration plus human review. (There’s an SLO for how long the tests are allowed to take. Code reviews are high priority tasks since you are blocking someone else by not reviewing their code.)
The main branch is safe to release any time.
Automatically roll out the changes.
After the deployment the engineers have to observe what is happening with their code in production. Only after they confirm that everything is okay, they can go home. So you are free to deploy on Friday at 19:30 as long as you are willing to stick around until your code is deployed and you have confirmed it is not causing issues.

For infrastructure Honeycomb also use infrastructure as code practices:

They can use CI and feature flags for their infrastructure.
They can automatically provision fleets if needed.
They can automatically quarantine certain paths to keep the main fleet from crashing or do performance profiling.

Chaos Engineering

Left-over error budget is used for chaos engineering experiments. This is something where you go test a hypothesis. You need to control the percentage of users affected by it and be able to revert the impact you are causing.

Chaos engineering is engineering. It’s not pure chaos.

This works well for stateless things, but how does it work for stateful things? In the case of the Honeycomb infrastructure, they make sure to only restart one server or service at a time. They do not introduce too much chaos to reduce the likelihood that something goes catastrophically wrong.

Two reasons why you will want to do these experiments at 3 PM and not at 3 AM:

You want to test at peak traffic instead of low traffic, since the latter could give you a false sense of security in situations when everything may still look normal even though this would have been a problem in a high traffic situation.
When doing things in the afternoon, there are more people available to deal with something than there would be in the middle of the night.

With the experiments they measure if they had an impact on the customer experience. If they cause a change, does the telemetry reflect this? (Is the node indeed reported as being offline, for example?) When you fix things, you need to repeat the experiment and make sure the change indeed fixed the issue.

When you burn the error budget, the SRE book states that you should freeze deploys. Liz disagrees. If you freeze deploys, but continue with feature development, the risk of the next deployment only increases. Instead, Liz advocates for using the team’s time to work on reliability (i.e. change the nature of the work instead of stopping work).

Fast and reliable: pick both!

You don’t have to pick between fast and reliable. In a lot of ways fast is reliable. If you exercise your delivery pipelines every hour of every day, stopping becomes the anomaly instead of deploying.

Takeaways

By designing the delivery pipeline for reliability, Honeycomb can meet their SLOs.
Feature flag can reduce blast radius, and keep you within your SLO.
And when they cannot: there are other ways to mitigate the risks.
By discovering risks at 3 PM and not 3 AM, you improve the customer experience since the system is more resilient.
If something does go catastrophically wrong, remember that the SLO is a guideline not a rule. SLOs are for managing predictable-ish unknown-unknowns and not things that are completely outside of your control.
We are all part of sociotechnical systems. Customers, engineers and stakeholders alike.
Outages or failed experiments are learning opportunities, not reasons to fire someone.
SLOs are an opportunity to have discussions about trade-offs between stability and speed.
DevOps is about talking to each other and talking to our customers.

Common Pitfalls of Infrastructure as Code (And how to avoid them!) — Tim Davis

We start with the basics: what is Infrastructure as Code (IaC)? With the advent of cloud providers, you no longer use hardware, but a UI to stand up infrastructure. This led to shadow IT since developers ran off with a credit card to provision what they needed themselves, instead of using slow, internal IT systems.

Developers however rather write code and use developer methodologies than click through a UI. This is where IaC started. With it you can create and manage infrastructure by writing code.

What are he pitfalls? The bad news: you get all the pitfalls of infrastructure and all the pitfalls of code. But you’ve probably already got a lot of experience with those issues and teams to handle them. You just use a different methodology.

The first pitfall is not fostering the communication between the groups that have experience and tools.

Infrastructure pitfalls

Which framework/tool do you pick? There are basically two categories: multi-cloud or cloud agnostic tools on the one hand (like Terraform and Pulumi) and cloud specific tools on the other (like CloudFormation). Note that for example with Terraform and Pulumi you still have to rewrite code when switching from one cloud provider to another, but at least the tool is familiar.

Security is a huge thing. You still need to know how to design your VPC, IAM policies, security policies, etc. You still need to communicate with all the teams that have the experience. It’s not just Dev and Ops. With tools like Terrascan and Checkov you can shift-left the security aspect instead of trying to bolt it on afterwards.

Code pitfalls

The biggest thing issue is with default values. If you use the UI, there are a lot of boxes that may be blank or have stuff in them. Some of the boxes can be left blank, for some you need to specify what you want. The UI is going to yell at you; if you use IaC things may be less in your face.

You don’t want to deploy something with an open policy. Open Policy Agent can really help you to make sure you stay within your allowed parameters. For instance you can write a policy to make sure you are only use a specific region, don’t deploy an open S3 bucket or that you only use certain sizes of EC2 instances.

If you hard code certain values in Terraform or other IaC tools, you might need to copy/paste a lot of code if you want to create e.g. a test, acceptance and production environment. To mitigate these DRY (don’t repeat yourself) issues you can for instance use Terraform modules or Terragrunt.

State size can become a problem. If you want to, you can put all of your infrastructure in the same state file (which is where your tool stores the state of the infrastructure). However it means the tool will have to check all resources in the state file to detect if it needs to do something. To mitigate this, you can again use Terraform modules. It helps with performance and makes the codebase more manageable.

Wrap-up

The conference is still ongoing while I publish this post. However, this is it for me for All Day DevOps for this year. I learned new things and got inspired.

Thanks to the organizers, moderators and speakers for hosting another great event.

Devopsdays oNLine 2021

2021-11-10T20:39:28Z

Since COVID-19 is still around, devopsdays Amsterdam was an online event again. These are the notes I took while watching the talks.

How cognitive biases and ranking can foster an ineffective DevOps culture — Kenny Baas & Evelyn van Kelle

In practice, not all team members are treated equally although we might think they are.

How to make sure everyone said what has to be said?

In each team ranking takes place. It determines who takes the lead, who expresses an opinion, etc.

We need to make sure ranking is made explicit. Your job title or position in the org chart is explicit ranking. But things like your gender, skin colour and level of charisma are part of the implicit ranking.

The result of this ranking can be a situation like this: a woman on the team just explained something on a certain topic, but questions about it are addressed to the white man in the team.

We all have a list of traits we associate with higher rank. So if someone ticks more of those boxes, they are seen as higher in rank. You can also rank yourself lower compared to others. As a result you might not express your opinion because someone you regard as having a higher rank said something that conflicts with what you wanted to say.

The person with the higher rank should be aware of their rank and “share it.” Ask yourself the question “what don’t I know?” in a discussion. Ask the question the other (with a lower rank) is afraid to ask.

Own, play and share your rank.

How can we create and include new insights?

We use cognitive bias to make decisions. Usually this helps us, but sometimes it works against us. What we know hinders us from taking on a new perspective and it limits us in our thinking. We get stuck in what we know (functional fixedness).

Be aware of your biases and try to break free from them.

Who makes decisions and how to get everyone onboard with the decision?

Have less “corporate” meetings and more “campfires” where we share ideas and talk about the why.

There are 4 ways to go into a decision:

Idea
Suggestion
Proposal
Command

If you have a proposal or command, be open about it: “this is what we are going to do, what does it take to go along with it?” This is better than pretending it is completely open for discussion.

Getting Started: Embrace Your Inner Child — Rain Leander

Learning to ride a bike when you are older is harder than when you are a child. This is not so much a physical thing, but we do not want to be embarrassed when we are older. This goes for everything new we do: we want to be the best—instantly. However you will fall down and fail. You will make mistakes. Embrace that, be brave and learn.

Bravery is something you have to develop. It is hard! And it is not a lack of fear, it is moving forward despite of fear.

The steps to become more brave that worked for Rain:

Embrace vulnerability
Admit you have fear
Do it anyway
When you fall down, get back up

Play!

Playing is instrumental to life. No one can tell you how to play. Do what is fun for you. And when it stops being fun, just stop doing it. Book time in your agenda to play, each day! (And napping counts.)

Explore! Some of the things Rains uses to stay curious and explore:

Listen
Ask questions
Avoid assumptions
Listen
Embrace learning as fun
Develop a beginner’s mind
Listen

You are a unicorn. Your experience, your background, your education, etc. makes you unique.

Be willing to fall down, and get back up. Be brave. Play! Conquer the world.

(The original blog post for this talk: Getting Started: Embrace Your Inner Child)

Prevent Heroism: How to Work Today to Reduce Work Tomorrow — Quintessence Anx

We might be familiar with “the angry sysadmin.” It is the person who is highly skilled, has lots of knowledge, gets asked all the questions and puts in overwork to get their own work done. This is basically the description of the character Brent from the book The Phoenix Project.

Being a Brent-like person is stressful and may result in anger, resentment and strained relationships. Brent-like persons typically have issues with things like time management, maintaining focus and allostatic load (where you’re too tired to be awake and too stressed to sleep). But there are business implications as well: diminishing returns, high turnover, reputational damage and financial performance goes down.

Instead of having—or being—a Brent-like person, we rather want no silos and no vacuums (the latter is where you, for instance, get no response to questions).

Steps to get out of this situation and do things different:

Brainstorm the work (every task done during the day, planned or unplanned)
Determine which “stream” each task is in: reactive or proactive
Look for patterns
Switch to “upstream mode” to prevent work in the future (short term/long term vs permanent/temporary matrix)
Plan and Prioritize work: upstream work is more important
Do the work
Iterate (what works, what does not, what can be improved)

Upstream work is the proactive and preventative work that reduces the need for reactive, or downstream, work

But how does one measure things? Things to keep in mind:

What are your needs?
Learn how to measure them
Establish baselines
Know your capacity
Track the rate of change
Set goals
Iterate

Slides and additional resources can be found at https://noti.st/quintessence

Real-world Continuous Delivery: Learn, Adapt, Improve — Michiel Rook

This talk is a real life case study, focussing on the deployment of the platform of a customer of Michiel.

There was a pretty long release check list with manual steps. Especially when under pressure steps were forgotten or done incorrectly. Ideally they released every two weeks (after each sprint). It took a team 2–3 days of manual work. Time not spent on features.

To improve this situation, the following goals were set:

Reduce costs by letting people do more valuable work
Faster feedback and ability to do more experiments
Reduce toil

For continuous delivery to work, you have to make things small. This is a key difference between high and low performing teams (see Accelerate).

Although the platform was a monolith, they deferred splitting it up. If they would have broken down the monolith into smaller pieces, they would have more manual steps for each release and made the situation worse. So the strategy was to automate first and then split.

Phase 1

The teams increased the release cadence. They switched to a weekly release cycle instead of a release every two weeks (on good weeks). Releasing more often makes problems visible faster.

The release process was also automated. They created a pipeline to build and deploy the platform. In this phase there was still a manual step between the deployment to acceptance and production. They didn’t trust the system enough yet.

They also made changes in their way of working. The big one: pair programming. This is superior to code reviews, for example because you also discuss code that does not get written and you have architecture discussions.

Other changes:

A “do not leave broken windows” mentality. This is from the book The Pragmatic Programmer) which states: neglect accelerates the rot faster than any other factor.
Measure how they were doing in terms of their continuous delivery journey. The book Measuring Continuous Delivery has useful information about this.

Phase 2

In the next phase they had a robot press the button to deploy to production. This required zero downtime deploys. They solved this by doing rolling updates (a load balancer in front of a couple of backend servers and then upgrade one server at a time until they are all up-to-date).

Pipeline failures come in all forms, e.g.:

Flaky tests
Timeouts
Network stability issues
External dependencies in tests

If you cannot trust a test, remove it. Otherwise you’ll probably work around it, which is more dangerous.

This level of automation also meant they treated pipeline failures as priority 1 issues (which warrants extreme feedback to notify people of a broken build: lights, sounds).

Phases 3

In the third phase they made it faster:

Scale vertically and horizontally
Parallelize tests

The Adjacent Possible: Evolution, Innovation & Catastrophe — Jason Yee

The four cornerstones of resilience:

Monitoring: knowing what to look for
Responding: knowing what to do
Learning: knowing what has happened
Anticipating: knowing what to expect

The term “adjacent possible” comes from biology in the context of evolution. Evolution requires a chain of changes to happen.

Complex systems usually do not have a single root cause, only contributing factors. Perhaps there’s an “adjacent possible” of failure. We cannot anticipate all failures, but perhaps we can explore them.

How do you explore adjacent possible failures?

To move from known knowns to known unknowns, we can use chaos engineering (thoughtful, planned experiments to improve our understanding of how systems work (and fail), so that we can improve them). This way we can explore the known unknowns and these then become the known knowns. From that standpoint, the original unknown unknowns become the known unknowns.

Scientific process:

Observe (what is normal)
Hypothesize (how do we think it will react to failure)
Experiment (inject failure)
Analyze (what did actually happen)
Share knowledge

Explore the adjacent possible!

Minimal viable presentations — Mark Smalley

There are four important points to take into consideration when creating a presentation:

Accept that it is not about you

It is the audience’s presentation. What you want to tell might not be what they want to hear. Kill your darlings.

Anticipate silent questions

These questions are:

Huh? What is he talking about? (Solution: be clear about the self evident stuff)
Really? (Solution: be clear about the evidence)
So? How is this relevant for me? What’s in it for me?
What’s next?

Think of learning objectives

Think about what the audience should know, believe, be able to do and feel after the presentation.

Dare to choose your ideal audience

Accept that the presentation will not be for the majority of the people listening.

Mark Smalley applies the advice given in his talk to this talk itself

Engineering Metrics That Matter — Dan Lines

Before we dive in, let’s start with some KPIs you want to avoid:

Lines of code
Number of commits
Individual stack ranking
Anything with story points

So what do you want to measure? What will improve your process?

Couple pull request size with cycle time (how much time from coding to deployment): What you’ll see is that bigger PR size takes longer to review and thus increase cycle time.
Deployment frequency: How often can we get new value into the hands of customers.
Idle time: Waiting for e.g. a release or review. Waiting means context switches which means less productivity.
Code churn: How much code are we reworking in a short period of time? High code churn indicates delays and quality issues.
Review depth: The amount of feedback per PR.
Mean time to restore: How long it takes to fix an incident in production.
Investment profile: New functions for our customers vs bug fixing, backend or non-functional work.

DevOps is no Walk in the Park — Sabine Wojcieszak

In a park you are safe, you do not need a map, can wander where you want and you need no preparation. DevOps is not like that.

The term CALMS has been coined by John Willis to explain DevOps. It stands for:

Culture
Automation
Lean
Measurement
Sharing

Most people talk a lot about the automation and some about the measurements part, but very few about sharing and even less about culture.

There is a similarity with Agile where the term began as a niche but became hyped and then mainstream, where people talk about Agile without knowing what it is. We should prevent this from happening with DevOps.

It’s no longer business and IT, but IT is part of business. One could even say IT is business.

DevOps is not cherry picking what we like (automation) or only picking the low hanging fruit and calling it “DevOps.” It needs an holistic approach. But this needs a lot of ingredients. In the right mix.

You should always ask yourself why you are doing DevOps. Do you think it is cheaper because of the level of automation? Because everyone is doing it? To deliver better software?

You need to understand the whole approach.

DevOps means to Sabine: deliver valuable products with better quality sooner to customers/users. This means we need to know what is valuable for our customers, what our customers think of as quality, etc.

The term CI/CD in context of DevOps traditionally stands for continuous integration / continuous delivery. But it could also stand for continuous improvement / continuous development, if we think of products and people. But we can also talk about continuous improvement and continuous discovery of opportunities and outcomes.

You cannot buy DevOps—neither with tools, nor with certifications. But it’s not for free either. You will fail for example.

Some anecdotes of where it went wrong:

“We are now the DevOps department.” Which means there is still a silo with a lack of communication.
“We have a DevOps team that is setting up the pipelines.” Again no communication, no transparency.
“We have a linter but it gave to many warnings so we turned it off.” No understanding about why it is done, no trust that it helps to grow, sign of fear to make mistakes.
“We are measured by the number of developed features, but not allowed to talk to sales.” It’s a feature factory, measuring the wrong things. Again also not enough communication and silos.
“IT has not reported any problems.” Be open to the obvious: if customers complain, there is a problem. Ask yourself what is actually monitored? Not business relevant metrics obviously otherwise the problem would have been detected by the company instead of the customers.

DevOps is more than automation, it is CALMS.

If you don’t get the C then don’t bother with the A, L, M or S.

—John Willis

Devopsdays oNLine 2020

2021-11-10T20:38:54Z

This year, due to the COVID-19 pandemic, the Dutch devopsdays are combined into a single day, online event.

These are my notes from the keynote sessions.

Why change matters now more than ever: the importance of DevOps in this new world — Michael Ducy

This year has challenged us in many ways. Many people got sick, many people have died. Our world has been turned upside-down. Some responses to this have been good, some bad. We’re more divided, communities have been shattered.

DevOps and digital transformation came about due to the rise of cloud and mobile. The relationship between businesses and consumers has changed. Businesses interact with us differently than before. DevOps is a response to serve digital transformation.

DevOps is about community and empathy. Digital transformation is really about capitalism, about finding new ways to exploit capital and resources. This is an interesting contrast since DevOps is to support digital transformation.

Are the digital possibilities we have nowadays, which have helped us in these challenging times (Netflix, order food online, etc), contributing for the good? Or do they divide the people that are fortunate enough to be able to use them and those that are not? What does digital transformation do to serve the underprivileged, the poor, the oppressed?

Be very aware that there are many people that are being oppressed. Many systems in our world are built upon the “technical debt” of not treating people as equals. We intentionally do not go into the bad neighbourhoods in our cities. We avoid them (at least in part) to not be confronted with the conditions that people from those parts of the city have to live in. We don’t want to be reminded about what we are not doing to help these people out, while enjoying our own lives.

In the DevOps community we talk about diversity and inclusion. We have the option to close a number of gaps. It is not just about the difference in pay between genders. It is also about us, privileged people, to use our voices to lift people up. We must invest in lifting up the lowest and destroy our systems of oppression that were setup hundreds of years ago.

Think about how you can use your voice to destroy systems of oppression.

There is a lot that is built into society and culture that needs to be changed. We have an opportunity to lead the change.

From waterfall to agile. Microsoft’s not-so-easy evolution into the world of DevOps — Abel Wang

The definition of DevOps used by Microsoft:

DevOps is the union of people, process and products to enable continuous delivery of value to our end users.

—Donovan Brown

The first lesson Microsoft had to learn in their evolution from shipping a boxed software product to a service: if it hurts, don’t avoid it but do it more often. By doing it more often, you get better at it over time until eventually it stops hurting.

The way you organize your company is very important. You don’t want different teams competing with each other, you want them to collaborate, to be one company.

To support that idea Microsoft started to use one engineering system instead of allowing each team to pick their own tools. Now Microsoft is using their own tools (Azure DevOps services: Azure Boards, Azure Pipelines, etc).

Microsoft’s definition of done:

Live in production
Collecting telemetry

To get there, dramatic changes in the team structure was needed. Previously there were basically three teams: program management, development and testing. There was a toxic culture between the development and the QA teams. QA was seen as a second class citizen.

They transitioned to an engineering team where the development and QA teams became one team: QA had to learn the codebase and the developers needed to learn how to test. Another transition they made was to create a “feature team,” which combined engineering, ops and program management. This makes it easier for customers to communicate with Microsoft since one team is responsible for one feature, instead of different teams, depending on where in the development phase a feature was at a certain time.

Engineers are allowed to pick a different team to work in each year. This makes them very happy even though only about 20% of the developers actually changes to a different team.

Another change is that every team has the same sprint cadence now. All sprints start on the same day and have the same duration (3 weeks). Sprints start with an email to the engineers and end with a summary of what has been accomplished plus a short video. At the end of every 4 sprints, all team leads (of the teams that are related to each other) have a planning meeting. Every 6 months the 12 month strategy is reviewed.

Leadership wanted to micromanage, but didn’t have enough context. Providing them the right context to make sound decisions took too much time. In the end it was decided that leadership was no longer allowed to look at the backlog and only operate at the 6-month and 12-month planning level.

The teams also introduced a new rule: if you find a bug, you have to fix it in that sprint. Teams are only allowed to carry four bugs to the new sprint. This is called the “bug cap.” If you are over this number, you are not allowed to build new features the next sprint but first have to reduce the number of open bugs. This is not meant as a punishment, but as a way to keep quality at a more stable level.

What to measure? Be very careful because people tend to want to game the system. Is code coverage important for your bonus? Be prepared to have 100% code coverage without actually improving the quality of the code. It is better to measure outcome, e.g. how much time it takes to fix a bug once found. Metrics are not used to punish or reward engineers—they are used to help the engineers and answer questions like “do you need training on the new framework?”

How do you get your developers to do the right thing? Make them feel the pain of their own code if they don’t. Make them part of the call to solve a live site incident for example.

Securing your DevOps transformation — April Edwards

(For the record: April also works for Microsoft and uses the same definition for DevOps as Abel.)

Why is DevOps important? Because your competition is also using it. It results in high performance companies that deploy more often, have a lower change failure rate and respond to the market faster (shorter time between commit and deployment).

95% of cloud breaches occur due to human errors such as configuration mistakes

—Gartner Inc.

We are constantly sacrificing security for speed. Things that hinder security:

Manual processes that can be automated
Friction between teams
A “we’ll do it later” approach

Ops and dev teams don’t always have the same goals. We need to align that and own a feature together.

(Due to a technical issue the talk could not be completed as planned and April had only a few moments to wrap up.)

The main takeaway: “shift left” because most defects are introduced during the code writing phase. Address these things earlier and tackle security during development. Add security checks to your pipelines.

BizDevOps: bridging the dominant divide between business and IT — Henk van der Schuur

There shouldn’t be a divide between the business and IT. Both sides should cross over to the other side. IT needs someone from business, and business should take more ownership over tech decisions and understand what they mean. Both sides should be part of one team. It should be a partnership.

You can start a project with a “pressure cooker session” to get a good start. Such a session consists of a number of phases:

Prepare: what is this (really) about?
Ideate: do we completely understand the problem?
Sketch: is this what you mean and what should be changed?
Prototype: what are the most important requirements and who is the audience?
Decide: was value delivered? what’s next?

If this session ends with a “go” all disciplines should be involved: business, operations and development.

DevOps needs a sequel. Let’s repeat the trick and involve the business. Get the business, operations, developers and product owners into the same room. The business wants to get involved and wants us to realise that it’s about delivering value. Technology is everyone’s business.

All Day DevOps

2021-11-09T20:09:33Z

All Day DevOps is an online conference which lasts for 24 hours. With 150 sessions across 5 tracks, there’s enough content to consume.

(As with all my recent conference notes: these are just notes, not complete summaries.)

#adidoescode Where DevOps Meets The Sporting Goods Industry — Fernando Cornago

Adidas is a big company, about 57,000 employees. Software is also having an impact on sports, e.g. registering your performance, etc. Is Adidas already a software company? Definitely not, but they should start behaving as if they are. They are building their applications 10,000 times a day.

New book from Gene Kim (the author of The Phoenix Project): The Unicorn Project. Expected end of November 2019.

The platform you need to be able to accelerate you business should be:

on demand and self-service
scalable and elastic
pay per use (which helps with the efficiency mindset)
transparent and measurable (observability)
open source + inner source

To improve sales, Adidas' platform can run where their customers are (globally) and can scale up if needed and scale down again when possible.

Making use of data is key. The data is seen as a baton in a relay race and makes its way through the teams. A large part of time consumed in creating a service is spent at looking at the data. The data is used to train their AI systems.

You want to deliver value and make sure that things are working. Testing needs to be done on the physical level (the actual products that are being sold), but also on the virtual level: the IT platform also needs to perform and work as expected.

At Adidas they went from having a couple of “DevOps engineers” to having a DevOps culture in all teams. This removed bottlenecks and allowed them to grow faster. They have substantially reduced costs and manual testing and improved drastically on time to production.

Multi Cloud “day-to-day” DevOps Power Tools — Ronen Freeman

About the relation between SRE (Site Reliability Engineering) and DevOps: SRE is an implementation of DevOps.

What defines a great SRE:

Quality
- Be curious
  - Stay up to date
  - Attend events
  - Tinker with things
- Adapt to change (change will happen, anticipate it, monitor it, adapt to it and prepare for the next change)
- Learn: learn fast so you may promptly begin again
- Laziness: automate, but ask your self the question whether it will actually save time or not
Support
- Googling is an art form
- You’ll learn the answer itself but also discover things in the process
- There are numerous forums (for example: Stack Overflow, GitHub, Slack, etc)
- Have a daily standup
- Pass on knowledge
Tools: there are a number of areas to consider:
- Code: Bash
- Package: Docker
- Deliver: Concourse
- Platform:
  - Infrastructure: Kubernetes (relatively simple to get started with); GKE is the most mature
  - Configuration tool: Terraform
- Monitor: Google Stackdriver (since already running on GKE)

There’s a demo app: https://github.com/Darillium/addo-demo

How to Run Smarter in Production: Getting Started with Site Reliability Engineering — Jennifer Petoff

Software engineering is focussed on design and building, not about running the application.

There are multiple places in organizations where friction occurs:

Business vs development: this is addressed by Agile
Development vs operations: this friction is addressed by DevOps

SRE is a bridge between business, development and operations.

Four key SRE principles:

SRE needs Service Level Objectives (SLOs) with consequences
SREs must have time to improve the world
SRE teams must be able to regulate their workload
Failure is an opportunity to learn (having a blamelessness culture)

Some terminology:

Service Level Objectives (SLOs): a goal you want to achieve. For example, you may want to set an uptime SLO, but typically you want to dig deeper, like 99.99% of HTTP requests should succeed with a “200 OK” response. You want to measure something your user cares about.
Service Level Agreements (SLAs): these are “handshakes between companies,” contractual guarantees.
Error budget: the gap between your SLO and perfect reliability. E.g. the allowed downtime to still match your 99.9% uptime SLO.
Error budget policy: this policy describes what do you do when you have spent your error budget. Example: no new features until within your error budget again.

100% is the wrong reliability target for basically everything

SRE is about balancing between reliability, engineering time, delivery speed and costs.

Note that you can have an error budget policy without even hiring a single SRE. You can already start with an SLO.

SLOs and error budgets are a necessary first step if you want to improve the situation.

When talking about making tomorrow better than today, you need to address toil (work that needs to be done that is manual, repetitive, automatable and does not add value). If your SRE team is spending time on toil, they are basically an ops team. Instead you want them to make the system more robust and improve the operability of the system.

One way to be able to have your SRE team regulate their workload is to put them in a different part of the organization.

If you want to start with SRE, start with SLOs and let your SRE team own the most critical system first. Don’t make them responsible for all systems at once. Let them deal with the most important systems first. Once they have that under control (removed toil, etc), they can take on more work.

You need leadership buy-in for every aspect of SRE work.

Aim for reliability and consistency upfront. SRE teams should consult on architectural discussions to get to resilient systems.

SRE teams can benefit from automation:

Eliminate toil
Capacity planning (auto scaling instead of forecasting)
To fix issues: if you can write a playbook, you can automate it.

SRE teams embrace failure:

Setting SLO less than 100%
Blamelessness at all levels
Learning from failure
Make the postmortems available so other teams can also learn from them

Resources (books):

Everybody Gets A Staging Environment! — Yaron Idan

How can you give an isolated and secure staging environment for each developer? If you follow the path Yaron’s company (Soluto) took, prepare for a rocky transition, getting used to new terminology when moving to Kubernetes and users that can get upset at first.

Their first step: automate the entire process. The realized they needed onboarding documentation and templates. They used Helm for the templates.

Deploy feature branches to the production environment. When developers open a pull request, the CI/CD tool adds a link to where this PR is deployed. This empowers the developers and provides a sense of security. This accelerated the way the developers built features and the adoption of the platform by the company.

The benefits:

Code is fit for production during entire life cycle of the feature branch.
Makes performance testing easier (after all: it’s deployed on the production environment).
Easier to share things with the stakeholders and thus improves collaboration.
You can deploy an entire set of microservices.

The tools used for their implementation:

GitHub (VCS)
Docker (code as configuration)
CodeFresh (CI)
Helm (CD)
Kubernetes (production platform)

You can swap the aforementioned tools with other tools, but you must be able to automate them and make sure that they work with the push of a single button.

GitHub repository for the demo: https://github.com/yaron-idan/staging-environments-example

DevOps to the Next Level with Serverless ChatOps — Jan de Vries

What DevOps is about:

Shipping code
Adding value
Do this continuously

It’s not about:

Adding additional load to your team
Lowering quality
More support calls
Slow response times when there are errors

One option is to send emails if something is wrong. But these emails are slow (knowing that there was an issue yesterday is too late) and it generates a lot of noise. Monitoring dashboards are nice, but you have to actively look at them to detect if there’s something wrong.

A better solution is using chat applications (like Microsoft Teams or Slack).

Your first step is to emit events when something fails (in Azure you can sent such events to Event Grid). Next, you have to subscribe to these events/topics. You can use an Azure Function for this and then have it post an actionable message to Teams.

These messages can have buttons to take immediate action (e.g. post to a web API like an Azure Function) to recover from the problem. The output from the recover action can also send events, so you can get another notification in your chat application about the result.

GitHub repository with demo code: https://github.com/Jandev/ServerlessDevOps

Building Modular Infrastructure in Code — Fergal Dearle

Why use infrastructure as code (IaC)?

Fergal is a fan of immutable infrastructure and the pets vs cattle analogy (which goes further than just servers)
Everything as code
You don’t have to log into a server via SSH to fix things
Better testability and maintainability

The DevOps handbook (part III, chapter 9) talks about on demand spinning up a production-like environment. The only way to do this is to use IaC.

First thing is understanding your infrastructure (VPC, networking, DNS, services, database). Look at patterns like requestor/broker. Use these patterns to create modular stacks.

Patterns:

Singleton stack pattern: one stack per instance. But this is actually an anti-pattern because you are mixing configuration and infrastructure.
Multiheaded stack pattern: multiple stacks in single project, config built in.
Template stack pattern: single stack, config separate, multiple instances from that template

Stack templates & stack instances:

Reusable template that implements building blocks
Appropriate tags
Must be able to uniquely identify instances (namespacing)

Stay clear of monoliths

Monoliths are stacks with everything in them. This is one end of the spectrum. The other end is to have one service per stack. You’ll probably will want to end up somewhere in between, where you have a component stack (e.g. with a requestor/broker combination). And then you can combine those stacks.

Static stack inputs: environment names, app names. You should externalize this config from your stack, e.g. by feeding these parameters via the command line or use the SSM Param Store and Secrets Manager.

(Note to self: Fergal mentioned Sceptre. Check what this is about.)

Common problems:

Drift can happen, especially if you also make manual changes (don’t!)
Updates are not forward compatible
Region incompatibilities

Best practices:

Only use template stacks
Use layered and component
Externalize config
Test, test, test!
Only use automation, no manual steps

Crossing The River By Feeling The Stones — Simon Wardley

In The Art of War Sun Tzu discusses five fundamental factors which you should consider:

Purpose
Landscape
Climate
Doctrine
Leadership

John Boyd developed the OODA loop, which cycles through the following:

Observe
Orient
Decide
Act

You can combine these cycles:

By moving we learn, as long as we can observe the environment.

Maps are ways to:

explore
communicate
share understanding
de-personalize

Graphs are not maps. Identical graphs can spatially be different. On a map, space has meaning, in a graph not. And because space has meaning, we can explore, e.g. by asking ourselves the question “what if we go in that direction?”

What makes a map?

Anchor (compass)
Position (places, relative to the compass)
Consistency of movement (going to right below means go to south-east, assuming on this map “up” is north)

In business there are a lot of maps, e.g. a systems diagram. But almost everything in business we call a map actually is a graph. Which means we don’t have a sense of the landscape.

How do you create a map for a business? Start with the anchors, e.g. business and customers. Work from there to get all related components and express them in a chain of needs.

But how do you position those components? We can use “visibility” as an analogy of distance. You can use this to create a “partial ordered chain of needs.”

And what about movement? You can describe movement as a chain of changes. But how do you do that? You can order components based on their evolutionary stage. Are they in the genesis stage, do we use custom built components, do we have standard products or can we consider it a commodity?

By combining these we can create a map of components by using the position (visibility) on the y-axis and the movement (evolution) on the x-axis.

Simon Wardley shows a map of a fictional company selling cups of tea

You can add intent (evolutionary flow) to your map, e.g. move a component from custom built to using a commodity.

By using such a map, you can de-personalize discussions since now you talk about the map, not a story.

Simon told a story of a company that wanted to invest in robots to do manual labour. After putting their business in a map, the problem became clear: they were using custom built racks and as a result needed to customize their servers because they would not fit their non-standard rack. Because the market had changed since the company started, they could switch to commodity (cloud) and as a result also did not need robots.

But keep this in mind:

All maps are imperfect representations of a space

Deploying Microservices to AWS Fargate — Ariane Gadd

At KPMG they moved a project from using Amazon EC2 to AWS Fargate. They also replaced CloudFormation with Terraform for standardization.

Why managed container services and not EKS?

Immutable deployments
Cluster provisioning handled by Fargate
Only pay for what you execute
No OS patching
Smaller attack service

They used ECS deploy to deploy containers to Fargate.

Why was the customer okay with this change?

They were already familiar with Docker
Fargate has integrations with ECS and ECR.
Overhead of managing EC2 vs cost of AWS Fargate

Fargate runs in your own VPC so you own the network, etc.

The microservices for this project are placed behind an application load balancer (ALB) using AWS Web Application Firewall (WAF) and TLS termination. Logs are sent to AWS CloudWatch Logs. They used Anchor for end-to-end container security and compliance.

Benefits of AWS for KPMG:

Cost reduction (automation tools were already in place, they already used Lambda for account creation and had reserved instances)
Speed of delivery
Allows collaboration with developers (automation, DevOps).
70% of the engineers were already AWS certified.

Automate Everyday Tasks with Functions — Sean O’Dell

Common use cases for serverless applications are things like web applications (such as static websites), data processing, Amazon Alexa (skills) and chatbots.

Not every workload should run in a serverless fashion

Organizational and application context are relevant. Serverless is just another option. You pick what is right for your use case as serverless is not a silver bullet.

AWS Lambda in a nutshell: there is an event source (e.g. a state change in data or a resource), which triggers a function, which in turn interacts with a service.

GitHub repository with examples: https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions

(Note to self: check Amplify)

Beyond dev & ops — Patrick Debois

The DevOps pipeline usually starts at the backlog. But what happens before? And how does that influence people outside the engineering department?

Patrick joined a startup five years ago. He fixed the DevOps pipeline. After a while the organization wanted to sell a product.

The book The Machine by Justin Roff-March describes “the agile version of sales.”

Sales had more in common with IT than Patrick expected. The sales pipeline looked like a Kanban board. Sales people are also on call, especially if you do international sales. If we can deliver faster, this increases the chance of a sale. Sales persons are actually mind readers; this reminded Patrick how IT people feel about tickets in the backlog: “what is meant with this ticket, what do the customers actually want?”

Thinking about the life of the project after it has been delivered, in terms of maintenance and operational costs, is important. Doing this upfront, is beneficial for the project.

The biggest impact sales had on IT was the never ending hammer of “can you make things simpler?” Sales can help you with making your design less complex. Together you can determine what actually adds value for the customer.

Having publicly available documentation really helps and makes it easier to sell the product. Getting the customers on Slack also helped. This showed that the company was willing to listen and help. This also helps sales.

After sales, the marketing department got focus. Patrick has seen the most interesting automation within the marketing department (e.g. email sending). Marketing has a lot of fancy metrics and tools to measure stuff; did they invent metrics? The IT department can help marketing by providing information. Conferences are also a marketing instrument. Marketing also has CALMS; they do very similar things.

We’ve got sales and marketing covered, but for a lot of things we still need budgets. To budget things, you also need to estimate things. Finance knows what it is to be on the receiving end of tickets where you have to guess what is going on.

Sometimes you need to buy things. Patrick looked into agile procurement. Lots of stuff has to be figured out: is it a fit? is there a way to be partners?

Serverless is a manifestation of servicefull, where you use a lot of services, so you can focus on your domain instead of having to build everything yourself.

With cloud services there doesn’t have to be communication initially: you read the website, check the documentation, pull your credit card and start with a proof of concept. Communication is still important. You have to treat your suppliers with respect.

HR: how does your company support hiring people? Have people join the team for some actual work to see if they are a good fit or not.

Do you know what pair programming is? Well, mob programming is a whole other level of collaborating. (Note that it’s more than just have one person typing and a bunch of other people commenting—it has a lot of patterns going on.) Teams can get very enthusiastic about this and can be very productive the first week. But if they are not careful, the team is exhausted the next week. You need to develop a sustainable pace.

Patrick learned a lot from talking to the different departments. They are solving similar problems as IT is.

Reading tip: Company-wide Agility with Beyond Budgeting, Open Space & Sociocracy by Jutta Eckstein and John Buck.

The Open Source Observability Toolkit — Mickey Boxell

Some characteristics of a modern application:

Microservices
Distributed
Multiple programming languages
Scalable
Ephemeral

This brings new challenges compared to the old situation with a monolithic application running on a single machine. You still want to address threats to customer satisfaction. But the old debugging solutions we used for a monolith will not work.

Observability means designing and operating a more visible system. This includes systems that can explain themselves without you having to deploy new code. The tools help you understand the difference between a healthy and unhealthy system.

Modern, distributed systems will experience failures.

Observability takes a holistic approach that recognizes the impact an issue has on the business as a whole. Outages affect the whole business, which should be able to react. Observability gives tools to address issues when they arrive.

External outputs:

Logs
Metrics
Traces

These outputs don’t make your system more observable, but they can be part of the holistic approach. They can help you with e.g. a root cause analysis.

Monitoring tells you whether a system is working, observability lets you ask why it isn’t working.

—Baron Schwartz

Key SRE concepts:

Service Level Indictors (SLIs; what are you measuring), Service Level Objectives (SLOs; what should the values be), Service Level Agreements (SLAs; defines the planned reaction if the objectives are not met).
Subset of Google’s golden SRE signals: RED (Rate, Errors, Duration).
Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR)

Users are comfortable with a certain level of imperfection. A site that is a bit slow sometimes, is less annoying than a shopping cart that sometimes looses its content. Use this to determine what you’ll spend your time on first.

Logging

Logging is the most fundamental pillar of monitoring. Logging records events that took place at a given time. Supported by most libraries because it is such a fundamental thing.

You only get logs if you actually put them in your code. You also have to make sure you don’t lose them, for instance by aggregating them.

Tools:

Fluentd: scrape, process and ship logs
Elasticsearch: a data store
Kibana: interact with your logs

Metrics

Metrics are a numeric aggregation of data describing the behavior of a component measured over time. They are useful to understand typical system behavior.

Tools:

Prometheus: scrape data, store the data and query the data
Grafana: visualize the metrics, can aggregate data from different sources

Alerts are notifications to indicate that a human needs to take action.

Tracing

Tracing is capturing a set of causally related events. Helpful for debugging. Example: each request has a global ID and if you insert this ID as metadata at each step, you can trace what happens.

Tools: Jaeger and Zipkin can visualize and inspect traces.

The challenge is that tracing is hard to retrofit into an existing application. You need to instrument all component. Service meshes can make it easier to retrofit tracing.

Service meshes

A service mesh is a configurable infrastructure layer for microservice applications. It can monitor and control the traffic in your environment. It can be a great approach for observability. You will get less information from a service mesh compared to adding tracing to all of your components, but on the other hand it takes less effort to implement than instrumenting your existing code base.

Tools:

Using service meshes won’t eliminate the need to instrument your code, but it can make your life more simple.

Called Traefik Mesh these days. ↩︎

Devopsdays Ghent 2019: random notes

2021-11-09T20:09:33Z

Where my previous articles were focussed on the notes I took of the talks, this article is a mix of random notes and observations I made throughout the conference.

The conference

First of all: this devopsdays conference—once again—inspired me. It refreshed my desire to make the world (or at least a part of it) a better place.

A big change in the format was that each speaker only had a 15 minute time slot. This was done to allow more people to speak. While the organization certainly succeeded in that regard, I felt that most talks were too short, that is: I would have loved them to last longer.

It felt like the conference was less technical. Perhaps this was because of the length of the talks, which meant the speakers could not go as deep into a subject as with a 30-minute talk? Perhaps it was because there were no workshops where I usually get (more than) my dose of technical deep dives? Perhaps it was just me?

Note that this is not a complaint about the speakers. Each and every one of them did a great job and had an inspiring talk. I really appreciate them taking the stage and sharing their knowledge, experience and opinions with us. Thank you all!

Other conferences

During the ignites on the second day, a few other conferences were mentioned:

Delivery Conf: 21 & 22 January 2020, Seattle, WA, USA
FOSDEM: 1 & 2 February 2020, Bruxelles, Belgium
Configuration Management Camp: 3–5 February 2020, Ghent, Belgium

Home lab

One of the open spaces I joined was about home labs. I only have a small setup at home, but I did get a few ideas out of this session that may be worth checking out.

Nextcloud
Have my Synology NAS backup stuff to AWS Glacier
Plex; perhaps I can use this to make my DVD collection (which is now gathering dust) accessible again?

But the most important question, in my opinion, that was asked during this session: do you have a disaster recovery plan for your home lab?

What if I’m not around and something breaks down, like the WiFi, how does my family get things up and running again? Spoiler: they probably won’t. And this is not their fault. I made things more complex than needed (from their perspective that is) and failed to provide instructions on how to solve issues.

Miscellaneous

On meetings:

Lean Coffee: an interesting, lightweight meeting format
POWER start: a meeting facilitation technique to increase the effectiveness of meetings

On toil:

Read up on the subject of toil: Google’s SRE book, chapter 5: Eliminating toil
Learn to better recognize toil and record it (what is it, what does it cost in time, energy, etc and what is the frequency)
Make time to eliminate toil
Read up on, and think about, immutable infrastructure and see if we can use it to remove toil

Random, unrelated stuff:

Look at CloudFormation Lint
Read the GitOps article by Weaveworks

Something that was said during an open space about testing infrastructure-as-code:

You deploy infrastructure not to have that infrastructure, but to run an application on top of.

I don’t recall who mentioned it, when and what the context was, but r/devops might be fun to read.

Devopsdays Ghent 2019: day two

2021-11-09T20:09:33Z

Devopsdays Ghent is a two day event. These are the notes I took at this second day of the ten year anniversary edition of devopsdays.

DevOps beyond dev and ops — Patrick Debois

This is a story about busting silos and making new friends.

Patrick retired from organizing devopsdays five years ago. His next challenge was helping a company with their DevOps journey.

After about a year, DevOps was a thing at the company. The development and operations teams worked together and Patrick started to look for the next bottlenecks.

For a lot of things they used services instead of building things themselves (monitoring, etc). Patrick calls this “servicefull”: use a lot of services. Make the suppliers of the services your friends.

There was internal DevOps and collaboration, but they did not have the same communication with their suppliers. Communication works differently there. As a customer they used documentation, which is a way of communication from the supplier to the customer. To communicate back, they went to conferences, used Slack and reported findings. Suppliers welcomed this feedback since it is apparently hard to get. This feedback might even help developers make a case for an issue with their managers since there is now proven customer demand.

Patrick went to other departments of their own company next:

Marketing: Your biggest fans: each new feature means something new they can talk about.
HR: They have their own SLA: paying employees which, by the way, was something that could be automated.
Sales: These people have enormous pressure from the customer, the speed needed in deployment is the speed needed by sales. They also pressure engineering to make things more simple and not over-architect things.
Legal: They are like ops: nobody sees them unless there is a problem.
Finance: In a way they are testers: if there’s no money in the bank, you have failed. Although they play in the shadows, they enable you to do the job you like.

Unfortunately they could not convince enough customers to buy the product they made. The company shut down, but it was an awesome experience.

DevOps was not the bottleneck. You need the rest of the company; every part is needed to succeed.

By the way: the reason Patrick used horses on the slides? He learned that unicorns do not exist.

Pipelines to Production: Detangling the DevOps Web (of Lies) — Jody Wolfborn and Kimball Johnson

Creating a DevOps team does not solve your problems. You need trust, which you’ll have to earn. You can start earning that trust by automating things and showing that the automation is reliable and helps with the repetitive stuff.

You cannot just throw tools at it. You’ll have to find the right tools and keep evaluating your tool set.

By deploying small, iterative changes, and tracking and versioning them, it is easier to find out where the problem lies. Would you rather debug all changes made in a month, of just a small set of changes?

You don’t want to react to issues, you want to proactively monitor production to detect broken stuff.

Also make small, iterative changes to the processes. Include the rest of the organization.

DevOps is about communication. Breakdowns in DevOps are breakdowns in communication.

Fix breakdowns by communicating your needs, but mostly by listening to the needs from others and responding with empathy.

Different paths, same place - human case studies — Michael Ducy

The people are what makes this community so special. They have had a tremendous impact on Michael. There’s a diversity of … all the things: culture, language, background, etc.

Background is special in this list. It’s not something you are “born into.” Some people have a background of drug addiction, have been in prison or are high school dropouts. That does not mean they cannot be successful.

Have empathy: we don’t come from the same place and have not had the same opportunities.

Never criticize a person until you’ve walked a mile in their shoes.

The Perfect Storm - How We Talk About Disasters — George Miranda

Super visual outages can cause a lot of reputational damage to a company. You should contain the damage and prevent it from getting worse.

It is important to have a plan beforehand. Your plan should include the way things are coordinated between the different departments in case of a severe technical incident. This plan should be well documented and practised regularly.

When PagerDuty had a visible, multiple hours outage, there was a lot of chaos in the business instead of calm. While the engineering department was used to having and solving these kind of issues (albeit not on this scale), for the majority of the company this was new—which lead to chaos. Engineering needed to share how to deal with these kind of technical problems with departments like marketing, legal, PR, finance, etc.

Most important takeaway: look at an incident response system, see for example the cut-down version of the PagerDuty Incident Response process. (Disclaimer: George works for PagerDuty.)

In case of a severe incident, the whole company should know what is happening in engineering. For example: the marketing department should stop campaigns, sales people should not try to live demo the system, etc. There needs to be a broad cross-functional response to the incident.

We often talk about a T-shaped individual: a person with broad knowledge but also a (single) deep expertise. When you put a lot of these individuals together, each person still has their own speciality, but the group as a whole covers a much larger area. With that in mind: DevOps does not really remove all silos, but we have learned to communicate better.

Where do we go from here? The fundamentals will stay the same, but the scope will be different. This is an opportunity to spread what we have learned as engineering to other parts of the company (and vice versa).

Engineering practices that make their way to other departments:

Chaos engineering
Blameless postmortems
Adoption of retrospectives (also after practising an outage)
Visiting other silos

There is an opportunity for us to learn more from the business. They are curious about learning and adopting practices from us. We should also be open and curious to learn from them.

More resources:

Sysadmin to DevOps: Where we came from and where we are going — Bryan Liles

We used to be able to save the day with some C or C++ code. In the nineties dev and ops belonged together. Somehow these two grew apart. Then we had a situation where parties were frustrated with each other: Why don’t developers understand networking or Unix/Linux? Why don’t sysadmins understand software that is more complicated than something in a shell script?

If we break the problem down, we have people that write software that solves a (business) problem. That software most likely runs on a server somewhere. You need a team to understand the full stack. We thought we had dev plus ops. But developers did not think about ops. Instead we had dev versus ops. With DevOps we are back where we started.

The good parts of DevOps:

automating and removing manual failures, infrastructure as code
empathy
collaboration
“little a” agile (not the certifiable, “big A” agile)

The bad parts:

DevOps Engineers – you can do DevOps, not have DevOps
SRE

Our progression:

We learned how to automate all the things. (But we take ourselves too seriously.)
We created new synergies (DevSecOps).

We’ve been doing this for 10 year now. It’s safe to say DevOps is no longer a fad. Dev{*}Ops might be still a fad. Time will tell.

DevOps, distributed — Bridget Kromhout

If you look at the number of devopsdays events, you can see quite a jump between 2015 and 2016. This is because in 2015 documentation was put up on the devopsdays website. Information on how to run such an event was made more accessible to others.

The main takeaway is that you can grow much more than you thought you could, if you put in the effort to allow others to do what you can do.

If you worry about the semantics of the word “DevOps,” you are are focusing on the wrong thing. Worry about impact instead.

Ignites

Ignites are short talks. Each speaker provides twenty slides which automatically advance every fifteen seconds.

Metatalk: An ignite about what I’ve learned giving ignites — Jason Yee

Some tips about ignite talks:

Don’t introduce yourself
No title slide, tell them directly what you want to tell, get straight to it
The big challenge is the timing; don’t wait for your slides but keep talking and they will eventually catch up
Don’t memorize your script, but get a feel for it.
You will feel nervous, but that is okay. Everybody wants you to succeed.
Remember the letters C,A,L,M and S: This spells “SCAML”: Spreading Culture As Messy, heavily indented text fiLes.

How Observability is not killing Monitoring — Blerim Sheqa

Monitoring: black box
Observability: look inside the box

Observability is not a replacement, it is an addition to your monitoring

The end

And that’s it for the devopsdays Ghent 2019 talks. Well… almost.

There’s one sheet I want to share. It’s not a great picture and out of context here (it’s one of the sheets Jason used in his ignite talk), but I think it’s a great way to close off.

Jason Yee with an ancient DevOps proverb: You don’t just do DevOps, you must feel DevOps

Devopsdays Ghent 2019: day one

2021-11-09T20:09:33Z

This year I was lucky enough to attend devopsdays for a second time. This time the conference was held in the beautiful Ghent, Belgium.

I’ve been to devopsdays Amsterdam a couple of times, but I never went to one in a different city. Because ten years ago the first devopsdays was organized in Ghent, I thought this was a great moment to visit the “birth place” of devopsdays.

These are the notes I made during/after the talks.

Enterprise DevOps 10 years on: What went wrong? and what went right? — Nigel Kersten

Some enterprises are cargo culting DevOps. They have shiny, polished processes and tools which are not in place out of experience, but out of an idea that that is how you are supposed to do DevOps.

Processes often are like scar tissue—adhesions formed after trauma. That is: something went wrong in the past and a process was created to prevent that thing from going wrong again. Unfortunately these processes are not always reviewed to see if they still match a world that has changed since then.

Most companies are not changing enough. Teams, and perhaps departments, can be successful in adopting DevOps, but often it does not scale up to the whole company.

So what went wrong? When you look at the CALMS framework (which stands for culture, automation, lean, measurement and sharing), culture and sharing are hard. This leaves you with automation, lean and measurement. But:

DevOps should be more than optimizing and automating builds!

The things we build need to be more accessible and more easily shareable.

What does DevOps mean? Often the picture of several blindfolded people examining (parts of) an elephant is shown to demonstrate that nobody is right.

But this approach misses the point. Language changes all the time. Words might loose some of its original meaning and start meaning something else. “Awful” originally was used to describe something worthy of respect or fear. Nowadays it is used to make clear something is extremely bad.

Arguing about the definition of DevOps does not help us reach consensus and also does not learn others what we want to convey about DevOps.

Teaching

Education can be used as a tool of oppression—forcing your ideas onto others. One way of doing this is by being selective in what we teach and what we omit. But how we teach can also make a difference.

Traditional approach in tech uses one-way broadcasts. The teacher fills up the student with knowledge. But people are not empty vessels, we already have experience and knowledge.

Instead of teaching being one-way communication, you can also explore solutions together with the student (problem-posing education). You can work together on a problem and find truth. Students can contribute to the knowledge.

Ways to improve sharing concepts like DevOps:

Focus on concepts over language. Set a context and move on.
Pose more problems and offer fewer solutions. Show the problems more clearly instead of focussing on the solutions.
Create more spaces for collaborative learning.

Reading tip: Pedagogy of the Oppressed by Paulo Freire.

Everything I need to know about DevOps I learned in The Marines — Ken Mugrage

Disclaimer: there are things in Marine boot camp that are counter productive when applied at DevOps, like instructors yelling at recruits.

Words matter a lot. In the marines they have very specific words for things. For example: a bed is a rack, a hat is a cover, a toilet is the head. In organizations it’s similar: a “unit test” means something specific. By having a shared, well defined vocabulary communication is more effective.

Boot camp is full of creating habits. For example: recruits have to sit cross legged on a concrete floor in a “classroom.” Later on in the course it becomes clear that this position is a stable position to fire a gun. By the time the recruits get to actually work with a gun, they are already used to the position.

I’m not a great programmer. I’m just a good programmer with great habits

—Kent Beck

Having muscle memory, ingraining habits, means the quality goes up.

A different level of risk means a different level of testing. This does not have to imply that deployments take longer. While the developers are running their functional tests, you can also already run additional tests that are required to promote that code to the next stage. So by the time the developers know that the functional tests pass, we also know whether there are security or performance issues that need to be addressed.

The military is not all about command and control. High level objectives are set by the highest in command and these objectives get more detailed in lower levels. In business: people share “corporate goals,” but the product team decides what the best implementation is.

Takeaways:

Words have meaning
Train like you play, play like you train
Diversity and inclusion makes us stronger

Supporting devops with technology procurement — Ramon van Alteren

Procurement is about obtaining stuff. In a modern context this means answering question like “which cloud vendor?” and “why that cloud vendor?”

Spotify was using their own data centers with compute, network and storage. Because provisioning new hardware takes time (e.g. weeks to order a new server), you start managing buffers and flow to be able to fulfil a demand.

Moving to the cloud puts you into a transactional mode, you pay directly for what you use. Moving to the cloud in a way where every team can spin up their own infrastructure also means there is no longer a single place of costs. However, cost is a factor in how you design and run systems. By decentralizing this place of costs, each team has to consider them.

And don’t forget that rules like writing off infrastructure does not work. In the cloud, compute is not “free” anymore after a number of years. When a project has a budget for infrastructure, you have to take into account that in a cloud setting you have to keep paying once the work on the project is finished and it goes to maintenance mode (and the costs shifts from engineering to an operations department).

Lessons:

Create a safety net. Predicting costs is not easy. Allow people to experiment, but make it safe for them. (E.g. alert on spikes in costs.)
Look at global cost optimization before local. E.g. look at using reserved instances instead of having a team of developers optimize a small thing while they also could be adding value to the company.
Haven an explicit frame of reference. Make it clear if cost are high or not.
Most importantly: cost reduction should not be a goal! Otherwise you end up with a cheap, very specific service; you’d rather focus on value for users.

You can’t buy DevOps — Julie Gunderson

You cannot buy it, but people will try to sell it to you.

You cannot change culture with a credit card.

Cultural changes rarely work top-down. Instead start small and prove its usefulness.

DevOps is not things like:

Moving to the cloud
Forming a “DevOps team”
Getting rid of the Ops team
A prescriptive checklist

DevOps is about culture, people over processes. Research by Google about what makes a successful team, shows that psychological safety fosters a culture of empowerment. Having psychological safety means you can embrace failure and learn from your mistakes.

We should stop talking about “a root cause,” but instead discuss “contributing factors.” With our complex systems, there often no longer is a single root cause. There are always contributing factors that allowed e.g. a person to push the button that took down the whole company.

Some practices you can adopt:

Configuration management
Automation
Use the right tools (the actual users of the tools should be able to decide which tools to use)
Working in small branches instead of massive integration (this makes it easier to rollback if needed)
Testing

You can start small and prove success. This can affect the entire organization.

What MMA taught me about working in tech — Daniel Maher

Mixed martial arts (MMA) is about mixing techniques.

Fundamentals win fights: there are some basic things (tools) you have to learn. Learn about the individual tools and how to put them together. But don’t be dogmatic about fundamentals; they are just tools. What matters is how you use the tools.

There are more ways than one to throw a jab. It’s the diversity that makes you better. You have to embrace diversity. MMA is all about the mixing, right? How does this relate to tech? You come up with new ideas by surrounding yourself with people who are different and have different experience, knowledge, etc. Fresh eyes and fresh ideas.

Tech needs diversity to succeed

By having a diverse team, you’ll be prepared for things you would not have thought about by yourself.

Admitting that hard problems are hard — Sasha Rosenbaum

Have you ever been at a meeting where your boss explains this horribly complex architecture he wants for a project? We reward complexity. (Resume driven development: use tech because it looks good on your resume, not because you actually needed it.) There is a cost to admitting ignorance and pushing back on this idea.

We should stop pretending that:

distributed systems are easy (running Kubernetes is not easy, it is error prone and hard)
people can be available 24/7
unicorn companies don’t have technical debt
that new product will solve all of your problems

There are costs to complexity:

time to learn
time to build
maintenance
increased chances of failure

This costs us money and productivity.

We use heuristics to navigate the world. And we need to because otherwise we would not last very long. Problem is that our current assumptions are not very good (e.g. “women and persons of color are not very smart”). These stereotypes are persistent and actually hurt people. Same resume: 6–14% more interviews for males vs female candidate, 50% more callbacks if the name of the candidate sounded white vs black-sounding names.

A consequence is also that the cost for admitting ignorance and errors are much higher for women, people of color and junior engineers.

Building effective teams (according to the same Google study that was mentioned by Julie Gunderson) is not related to background, personality types or skills, but psychological safety.

But psychological safety does not just happen. It has to be built by leaders.

Make it safe to express opinions
Make it safe to admit mistakes (e.g. admit mistakes yourself as a leader)
Realise that you do not know everything

This isn’t easy. This is a hard problem. But it is going to be worth solving it.

Reading tip: Mindset: The New Psychology of Success by Carol S. Dweck.

Ignites

Ignites are short talks. Each speaker provides twenty slides which automatically advance every fifteen seconds.

The danger of DevOps Certifications — Ken Mugrage

You cannot certify empathy. Most certifications only certify attendance, not actual knowledge or experience.

DevOps is about learning from failure, but companies that hire you to “implement DevOps” because you have a “DevOps certificate” are not going to give you the time to learn. If you fail, the CEO gets to say “see, DevOps doesn’t work (for us).” This can even be a danger for your career.

Training, courses, etc are useful. And if you want to get a certificate, that is also fine. But it’s not the thing that will make your career.

Certification on tools is ok. You cannot certify culture.

Some companies will want certification, fine, but understand that it’s not the key to your success.

4 ways to help people in their careers — Emanuil Tolev

Coaching: shorter term relationship to learn a specific skill
Mentoring: longer, exploratory relationship
Sponsorship: improve how others will think of you (potential, skills)
Career counselling: less formal version of the previous ways

The stories we tell about ourselves are powerful. Melancholy and contemplation are good things. You should prevent self-reinforcing feelings of doubt though.

For the coaches, mentors, etc: you cannot take the journey of others, but you can help them with their journey.

Serverless changes nothing — Ant Stanley

Serverless does not bring new ideas. Thinks like the 12 factor app, the Unix philosophy or microservices already existed. Operation practices are also the same, often even the same tools are used.

If you want to succeed in serverless, you need to adopt a DevOps way of working first; do the cultural transformation first. You cannot do serverless without DevOps.

Devopsdays Amsterdam 2019: day one

2021-11-09T20:09:33Z

This is the seventh time devopsdays is organized in Amsterdam. The event was sold out twee weeks in advance and has the highest attendance rate thus far. I was fortunate enough to attend again this year.

Observability for emerging infra: what got you here won’t get you there — Charity Majors

[Slides]

Charity has been a systems engineer since was 17 and she’s done her fair share of being on call. As a result her approach to software contains stuff like:

The only good diff is a red diff.
Junior engineers ship features, senior engineers kill features.

She co-founded Honeycomb.io a few years ago. Her company wants to help understanding and debugging complex systems.

Vendors often talk about the “three pillars of observability”, meaning metrics, logs (which is where data goes to die) and tracing. Which happen to be the categories of products they are trying to sell you.

The architecture that is being used today is much more complex than the humble LAMP stack we used years ago. Because of this increased complexity you go from “known unknowns” to “unknown unknowns.” Testing is also harder since you cannot simply “spin up a copy” if you want to test something. As a result, monitoring isn’t enough—you need observability since you cannot predict what you need to monitor. Plus: you should fix each problem. So every page should be something new; it should be an engineering problem, not a support problem.

Due to the infinitely long list of almost-impossible failures, your staging environment becomes pretty useless. Even if you capture all traffic from production and replay it in your staging environment, the problem you saw in production does not happen.

Good engineers test in production. Bad engineers also do this, but are just not aware of it. You cannot prevent failure. You can provide guardrails though.

Operational literacy is not a nice-to-have

You’ll need observability before you can think about chaos engineering. Otherwise you only have chaos. If it takes more than a few moments to figure out where a spike in a graph is coming from, you are not yet ready for chaos engineering.

Monitoring is dealing with the known unknowns. Great for the things you already know about. This means it is good for 80% of the old systems, but only 10% of the current systems.

Your system is never entirely “up”

The game is not to make something where nothing breaks. The game is to make something where as many things as possible can break without people noticing it. As long as you are within your SLOs, you don’t have to be in firefighting mode.

Not every system has to be working. As long as the users get the experience they expect, they are happy.

Engineers should know how to debug in production.

Everyone should know what they are doing. If an engineer has a bad feeling about deploying something, investigate that. We should trust the intuition of our engineers. However, the longer you run in staging, the longer you are training your intuition on wrong data. You should be knee-deep in production. That is a way better way to spend your time. Watch your system with real data, real users, etc.

Don’t accept a pull request if you don’t know how you would know if it breaks. And when you deploy something, watch it in production. See what happens and how it behaves. Not just when it breaks, but also when it’s working as intended. This catches 80% of the errors even before your users see them.

Dashboard are answers. You should not flip through them to find the one matching the current scenario. It’s the wrong way of thinking about problems. Instead you need an interactive system to be able to debug your system. Invest time in observability, not monitoring.

By putting software engineers (SWEs) on call, you can make the system better. The SWEs are the ones that can fix things so they don’t have to be woken up in the middle of the night again. If you do it right, and have systems that behave themselves, being on call doesn’t have to interfere with your life or sleep. This does require organizations to allow time to fix things and make things better instead of only adding new stuff.

Teams are usually willing to take responsibility, but there is an “observability gap.” Giving the SWEs the ops tools with things like uptime percentage and free memory is not enough. Most SWEs cannot interpret the dashboards to answer the questions they have. In other words, they don’t have the right tools to investigate the problems they encounter. You need to give them the right tooling to become fluent in production and to debug issues in their own systems.

High cardinality is no longer a nice-to-have. Those high cardinality things (e.g. request ID) are the things that are most identifying. You’ll want this kind of data when you are investigating a problem.

The characteristics of tooling that will get you where you need to go:

Well instrumented
High cardinality
High dimensionality
Event-driven
Structured
Well-owned
Sampled
Tested in production

By having this you get drastically less pager alerts.

Great talk by Beau Lyddon: What is Happening: Attempting to Understand Our Systems

You can use logs to implement observability as long as they are structured and have a lot of context (request IDs, shopping cart IDs, language runtime specifics, etc).

Senior software engineers should have proven to be able to own their code in production. They should notice and surface things that are not noticeable. Praise people for what they are doing good and make sure that those are also the things people get promoted for.

How Convenience Is Killing Open Standards — Bernd Erk

Thanks to the POSIX standards, Solaris admins could still do things on HP-UX systems. Then the GNU toolset came along. And Linux. Communities. Diversity.

As an industry we have a talent to forget everything we learned before. We jump from one new thing to another, but we keep making the same mistakes over and over.

We have more open source software than ever before. But we also have a dramatic change in the infrastructure. Public cloud becomes more popular. But is this a problem? Isn’t this just evolution?

Open source does not mean open standards however. An API is also not a standard, even if it is open. Standards are defined by organizations like IETF and W3C.

We have an unbalanced market. AWS and Azure are much bigger than the others. They all have their own APIs, but there is no open standard! This is why Terraform is so popular: it solved the missing standard problem for you. But you still need to be aware that an EC2 instance is different from an Azure machine though.

Open source has an impact on open standards.

Amazon is offering services that are based on open source software (e.g. Amazon Elasticsearch Service and Amazon ElastiCache). As a result customers don’t talk to Elastic, but to Amazon. Google does this better and (at least for some products) forwards issues to the developers.

Some companies that create open source software have now come up with new licenses to prevent e.g. Amazon offering their product as a service. This creates uncertainty for companies that want to use the technology.

Open source and open standards need support (money). As an industry we need to make sure that they have chance to have a place in the market.

Customers should ask their vendors to contribute back to the creators of the software.

For a lot of companies infrastructure is overhead. Using cloud services poses a risk, but a risk you perhaps can accept given the convenience those cloud services bring along. Lock-in is a risk, as is slowness.

Bernd himself would also use the cloud for a startup to get off the ground quickly. After one to two years you’ll probably have to redo your setup. Perhaps this is a good time to think about how to move forward? What do you want to be flexible about?

Economics in an unbalanced marked are not good. See the pharmaceutical market where there are a few big players and prices of medicines sometimes explode. Being able to work with different providers and be interoperable is important. We need variety.

Unfortunately the cloud providers don’t have an incentive to work towards open standards and to make it easier for customers to hop around from one provider to another.

Diversity is up to us. Not just on the work floor, but also in the technology we use.

Come listen to me, I’m a fraud! — Joep Piscaer

This talk about impostor syndrome. 70% of people experience it.

Joep’s “definition” of it:

Your internal measuring stick is broken.

Everybody is unsure, but nobody talks about it. In IT this problem becomes worse. We talk about “best practices,” but there are so many ways to do things, how do you know what is the “right” way? Perhaps the way you are doing things is perfectly fine.

Joep talked about the Dunning-Kruger effect. But in IT the technology landscape changes in a fast pace. The goalpost is moving constantly.

What can you do about it? How do you recalibrate your internal measuring stick?

Say positive things about yourself in front of a mirror to diminish the negative thoughts about yourself and strengthen the positive ones.
Don’t compare yourself to other people, especially people on the internet.
Learn how to receive compliments.
- Write down compliments and accomplishments and look at the list regularly. This also learns you about the perspective of others.
- Brag about one or two occasionally.
Learn from others and teach others about your thinking process (pair programming, pair review (instead of peer review), speak publicly, etc).
Trust others, you are not alone. Talk to others, participate in conversations. Show yourself.
Get hobbies. A hobby is something you like to do and where you can fuck up without consequences.

To discover why you have impostor syndrome, you have to identify your limiting beliefs.

A number of TED talks you should see:

The power of vulnerability, by Brené Brown (Tip: watch this in a comfortable setting with your partner nearby to support you if needed.)
Inside the mind of a master procrastinator by Tim Urban
Imposter Syndrome by Mike Cannon-Brookes

Using Design Methods to Establish Healthy DevOps Practices — Aras Bilgen

Companies that want to do DevOps are usually enthusiastic about tools and processes. Culture is a harder subject… Even if you show data that demonstrates the importance of the cultural aspect of DevOps, there’s not much understanding.

Most people think about design as making things beautiful. But it also makes things work better. It makes things more visible and matches the conceptual design with the mental model.

Design can reframe problems.

Uber is not just about getting a ride, but it is rethinking the way getting from A to B works. Airbnb is not only about getting a nice room, but it is rethinking the hospitality industry overall. Starbucks sells coffee, but also turns community spaces into places where you can hang out and have conversations over coffee.

Good designers:

Work with actual end users (not managers or product owners)
Welcome ambiguity
Give form to ideas
Co-create in a safe setting
Experiment and revise

Instead of talking to customers about their tools, processes and culture, you could start with interviews and having collaborative process maps workshops where you look at the entire flow from idea to when it explodes in production.

Don’t just offer the out-of-context “best practices” from the DORA State of DevOps report, but have challenge mapping workshops. This leads to solutions that are applicable and often don’t even need to be imposed but are willingly implemented.

Do diary studies: what do people actually do and what makes working hard for them?

Two takeaways:

Mindset matters more than background. You don’t have to be a designer to do this.
Stop and listen with an open heart and no agenda.

Serverless is more FinDev than DevOps — Yan Cui

(This was a fast-paced talk and I did not have time to make many notes.)

Serverless means that:

you only pay for what you actually use
you don’t need to worry about scaling
you don’t need to provision and manage servers

Function-as-a-service: upload code to the cloud which gets run when called and your function then handles the events.

Serverless is interesting because:

scalable
pricing (can be cheaper, but not always!)
resilience
security
greater velocity from idea to product
less ops responsibility (no servers to manage)

With regards to the velocity: with serverless you can remove a bunch of steps from your process, like figuring out deployment, configuring an AMI, ELB or autoscaling.

Return on investment

Companies are looking for a return on their investment.

We tend optimize what we can easily measure: operational stuff. But most of the budget is going to engineers, managers, tools, etc.

Serverless might cost just as much as e.g. running EC2 instance and perhaps it costs even more, but you can get so much more done.

With a server you have the same costs no matter if you have only a single transaction or 100. This means that the transaction costs (costs divided by the number of transactions) is variable. To make matters worse: there are often multiple services running on a since server. This makes it even harder to estimate the transaction cost or the cost of a feature.

The pay-per-use characteristic of serverless is useful in this contest. It makes it much more clear what the cost per transaction is.

Ignites

Ignites are short talks. Each speaker provides twenty slides which automatically advance every fifteen seconds.

The IJ & the shape of DevOps — Jason Yee

The shape of the IJ has changed though the years. Has the shape of DevOps changed over the years?

Well, now there is CLAMS (or CALMS), which stands for:

Culture
Lean
Automation
Metrics
Sharing

If you would plot each of these on an axis, you can plot an organization on it.

Now imagine the circumference is an elastic band: if you pull it on one axis, you’ll decrease the value on the other axes.

Work around the circle. If DevOps means automation for you, this helps being lean, which shapes your culture, which helps sharing, etc.

Evaluate your organization: what is it good at, what are you working on? It’s a natural process, a dynamic movement. Work on all axes.

What you see is what you get for AWS infrastructure — Anton Babenko

Cloud architects and DevOps engineers want to have a faster conversion from idea to working project.

With Cloudcraft you can visualize your AWS infrastructure. With modules.tf you can export that diagram to Terraform code. And with Terraform you can manage your infrastructure. And we have come full circle.

YAML Considered Harmful — Philipp Krenn

Why do we use YAML in the first place? We want to have a human readable format (so XML is out), and we want to be able to place comments in our code (which means JSON is out of the picture).

There are a few issues with YAML though. To name a few:

Suppose you have a port mapping in your docker-compose.yml:

ports:
  - 80:80
  - 22:22

Looks fine? Except it is not fine, because YAML will parse numbers separated by a colon as base 60 (sexagesimal) numbers if the numbers are smaller than 60. So the second line is not parsed as the string “22:22” but as the integer 80520, which is something Docker (understandably) cannot handle.

Another example is this:

countries:
  - DK
  - NO
  - SE
  - FI

However, “NO” is a falsy value, so the countries list will contain three strings and a boolean.

If you want a DSL, please use a DSL, not YAML.

Alternatives:

Maybe jsonnet?
XML: problematic for humans, it’s meant to be parsed by computers computers
JSON: doesn’t feel right
INI: no hierarchy, parsers on different platforms behave differently
TOML: indentation is optional
JSONx: worst of all worlds

When you want to write YAML:

indent properly
quote just about everything

Posts tagged with “conference” on Mark van Lent’s weblog

PyGrunn 2024

Platform Engineering: Python Perspective — Andrii Mishkovskyi

Securing your team, solution and company to embrace chaos — Edzo Botjes

Descriptors: Decoding the Magic — Alex Dijkstra

Release the KrakenD — Erik-Jan Blanksma

General tips

All Day DevOps 2022

Keynote: The Rise of the Supply Chain Attack — Sean Wright

Taking Control Over Cloud Costs — Amir Shaked

Beyond Monitoring: The Rise of Observability Platform — Sameer Paradkar

Failure Is Not an Option. It’s a Fact — Ixchel Ruiz

Accurate Metrics — Christina Zeller and Marcus Crestani

Why Is It Always DNS, TLS, and Bad Configs? — Philipp Krenn

Comprehending Terraform Infrastructure as Code: How to Evolve Fast & Safe — Anton Babenko

Keynote: Journey to Auto-DevSecOps at Nasdaq — Benjamin Wolf

Varieties of Incident Response — Kurt Andersen

The E Stands for Enablement - Modern SRE at PagerDuty — Paula Thrasher

All Day DevOps 2021

Keynote: CAMS Then and Now and the Path Forward — Thomas “VikingOps” Krag

Culture

Automation

Measurement

Sharing

Three ways

Gamification of Chaos Testing — Bram Vogelaar

Watch Your Wallet! Cost Optimizations in AWS — Renato Losio

Compute

Data transfer

Storage

AWS changes quickly

Keynote: Call for Code with The Linux Foundation: Contributing to Tech-for-Good Even if You’re Not Techincal — Daniel Krook and Demi Ajayi

DevSecOps Culture: Laughing Through the Failures — Chris Romeo

#1 Name and brand

#2 The infinity graph

#3 Security as a special team

#4 Vendor defined DevOps

#5 Big company envy

#6 Overcomplicated pipelines and doing everything now

#7 Security as gatekeeper

#8 Noisy security tools

#9 Lack of threat modelling

#10 Vulnerable code in the wild

Successes

Key takeaways

Keynote: Managing Risk with Service Level Objectives and Chaos Engineering — Liz Fong-Jones

How to measure reliability?

Recipe for shipping reliably and quickly

Chaos Engineering

Takeaways

Common Pitfalls of Infrastructure as Code (And how to avoid them!) — Tim Davis

Infrastructure pitfalls

Code pitfalls

Wrap-up

Devopsdays oNLine 2021

How cognitive biases and ranking can foster an ineffective DevOps culture — Kenny Baas & Evelyn van Kelle

How to make sure everyone said what has to be said?

How can we create and include new insights?

Who makes decisions and how to get everyone onboard with the decision?

Getting Started: Embrace Your Inner Child — Rain Leander

Prevent Heroism: How to Work Today to Reduce Work Tomorrow — Quintessence Anx

Real-world Continuous Delivery: Learn, Adapt, Improve — Michiel Rook

Phase 1

Phase 2

Phases 3

The Adjacent Possible: Evolution, Innovation & Catastrophe — Jason Yee

Minimal viable presentations — Mark Smalley

Engineering Metrics That Matter — Dan Lines

DevOps is no Walk in the Park — Sabine Wojcieszak

Devopsdays oNLine 2020

Why change matters now more than ever: the importance of DevOps in this new world — Michael Ducy

From waterfall to agile. Microsoft’s not-so-easy evolution into the world of DevOps — Abel Wang

Securing your DevOps transformation — April Edwards

BizDevOps: bridging the dominant divide between business and IT — Henk van der Schuur

All Day DevOps

#adidoescode Where DevOps Meets The Sporting Goods Industry — Fernando Cornago

Multi Cloud “day-to-day” DevOps Power Tools — Ronen Freeman

How to Run Smarter in Production: Getting Started with Site Reliability Engineering — Jennifer Petoff

Everybody Gets A Staging Environment! — Yaron Idan

DevOps to the Next Level with Serverless ChatOps — Jan de Vries