Posts tagged with “observability” on Mark van Lent’s weblog

All Day DevOps 2022

2022-11-10T21:32:11Z

Today was the 7th All Day DevOps. Just like last year the organizers managed to get 180 speakers, spread over 6 tracks, to inform, teach and entertain us. As usual I made some notes of the talks that I attended.

Keynote: The Rise of the Supply Chain Attack — Sean Wright

Dependency managers (like Apache Maven) changed the way software is being developed: using those makes it much easier to use dependencies in your application (you don’t have to fetch them manually yourself). As a result, building your own libraries is now less common, using an available (open source) library is much easier.

To get a feel for the scale: NPM serves 2.1 trillion annual downloads (requests). This obviously includes automatic builds, but the growth over the last couple of years has been exponential none the less. And using third party libraries can be a very good thing. Good libraries can be much more secure and have less bugs than some library developed in house.

Types of supply chain attacks:

Dependency confusion: Register a private package (which you know is used by a company) in a public repository. If the package manager searches for the package in the public repository before the private one, the malicious package is used.
Typo squatting / masquerading: Use misspelled or “familiar” names in the hopes someone uses the wrong package.
Malicious package: Malicious payload in a package that promises (and perhaps even delivers) useful functionality.
Malicious code injection: Compromise of an existing package (e.g. by compromising the build system or repository).
Account takeover: Malicious actor can log in as the original author.
Exploiting vulnerabilities in packages: Using a known or unknown vulnerability in a library.
Protestware: For example a legitimate library with notes protesting against the war in Ukraine.

After this Sean unfortunately dropped from the session.

Taking Control Over Cloud Costs — Amir Shaked

Reasons to analyze your cloud costs:

Costs translate to profit
Usage is directly tied to costs (and if you don’t properly monitor, your spend could increase unexpected)
Be in control: by monitoring you can determine what impact certain actions have on your business
Be intelligent: by knowing what you spend where, you can connect it to your business KPIs
Save money

You’ll want to be aware when you scale (hyper growth) if your costs grow faster than the number of users/sales/etc. You don’t want to end up in a situation where more success means less profit.

Things to look for in your environment:

Unused resources
Over commitment
Under commitment
Price per unit we care about (e.g. users, sales, etc)
Bugs

Best practices you can arrange from day one:

Export billing data and create e.g. dashboards
Structure and organize resources for fine-grained cost reporting
Configure policies on who can spend and who has administrator permissions
Have a culture of “showback” in your organization. Make your employees aware of the costs, have budgets and alerts.

Beyond Monitoring: The Rise of Observability Platform — Sameer Paradkar

Customer experience is important for your revenue. You need to deliver your service and make sure you know how well you’re able to do so. Observability helps you find the needle in the haystack, identify the issue and respond before your customers are affected.

Pillars of observability:

Metrics: Numeric values measured over time. Easy to query and retained for longer periods.
Logs: Records of events that occurred at a particular time (plain text, binary, etc). Used to understand the system and what’s going on.
Traces: Represent the end-to-end journey of a user request through the subsystems of your architecture.

Relevant key performance indicators (KPIs) and key result areas (KRAs):

Customer experience
Mean time to repair (MTTR)
Mean time between failure (MTBF)
Reliability and availability
Performance and scalability

An observability platform gives you more visibility into your systems health and performance. It allows you to discover unknown issues. As a result you’ll have fewer problems and blackouts. You can even catch issues in the build phase of the software development process. The platform helps understand and debug systems in production via the data you collected.

AIOps applies machine learning to the data you’ve collected. It’s a next stage of maturity. Its goal is to create a system with automated functions, freeing up engineers to work on other things. Automating remediation of issues can also greatly reduce response time and mean time to repair. This means that the customer experience is restored faster (or is never degraded to begin with).

Failure Is Not an Option. It’s a Fact — Ixchel Ruiz

Failure can cause a deep emotional response, we can get depressed and it can make us physically sick. On one side of the spectrum, a failure can cause harm to other people. On the other side we could embrace failure and make things safe to fail. Failure in IT on the project level is quite common. Failure can also happen on a personal level.

Not all failures are created equally. There are three types:

Preventable: These are the “bad” failures. There’s no reason to allow these.
Complexity related: These failures are due to the inherent uncertainty of work. Usually it is a particular combination of needs, people and problems that cause them.
Intelligent: The “good” failures, since they provide new knowledge.

Steps to learn from failures:

Recognize
Understand the cause
Extract lessons to prevent future failure
Share the information
Practice for the next failure in a safe and controlled setting

Increase return by learning from every failure, share the lessons and review the pattern of failures. Do note that none of this can happen in an environment without psychological safety. You need to feel safe to discuss your failure, doubts or questions to be able to learn.

Accurate Metrics — Christina Zeller and Marcus Crestani

A manufacturing process was monitored and there was a nice dashboard to show whether there were any problems. However, at a certain moment there was a problem, but the dashboard was still claiming everything was fine.

What was going on? Unreliable network? Erratic monitoring system? Flawed collection of metrics? It turned out to be all of them.

Sampling rates, retention policy and network issues can cause missing measurements in your time series database. This missing information can cause a drop in failure rate if you are unlucky enough that the missing samples are failed ones. So you think everything is fine, but there is something wrong in your environment.

Modelling metrics differently can help. One possible improvement is to have a duration sum in your metrics instead of just the duration. If you now miss a sample, the sum will still indicate that there has been a failure.

Using histograms is even better since you place values in buckets, e.g. failures in our case. A disadvantage however is that the metrics creation system must now also know what qualifies as a success or failure.

Takeways:

Always collect metrics
Check that your metrics match reality
Always consider missing data
For duration metrics always use histograms

After improving their metrics, the monitoring system matches the actual state of the manufacturing environment again.

Tip: look into hexagonal architecture, also known as “ports and adapters architecture.”

Why Is It Always DNS, TLS, and Bad Configs? — Philipp Krenn

DNS, TLS and bad config are where failures are waiting to haunt us when we least expect it. We need to have a tool to find the issues in our system early on. Health checks can be this essential tool to alert us.

You are able to narrow down where the issue is if you structure your health checks like this:

Outside the network (different provider): This allows you to detect issues with e.g. DNS, the uplink, a firewall, a load balancer, service availability and latency.
On the network (different AZ): If you are on the network, you could see issues with the network, firewall, TLS, service availability and latency. You can compare your measurements from outside with what you see on the inside.
On the instance: Again, by comparing the local point of view with the outside, you can detect issues with service availability, proxy vs service, latency, dependencies (e.g. a database that is slow or the service cannot reach)

Examples of tests you can use:

Ping a host
Setup a TCP connection
Do an HTTP request
POST data to a service
Synthetic monitoring where you simulate button clicks

Health checks are cheap to run and give you a fast overview. They do not replace observability and only tell you something is broken, not what is going on. Start simple and only add synthetics when/where needed since they more complex.

Comprehending Terraform Infrastructure as Code: How to Evolve Fast & Safe — Anton Babenko

By using Terraform AWS modules you’ll have to write less Terraform code yourself, compared to using the AWS provider resources directly.

For example: if you write your own Terraform code from scratch for an example infrastructure with 40 resources, you’ll need about 200 lines of code. Once you introduce variables, that code base will grow to 1000 lines of code. When you then split it up into modules, you’ll need even more code.

Instead, if you use terraform-aws-modules you’ll have more features than your own modules would only need about 100 lines of code.

Questions with regard to terraform-aws-modules:

Why?: Understandability is an important issue. To help with these questions (like “why is it done this way?") there are autogenerated documentation and diagrams. A visualization of the dependencies also helps.
Does it work?: The project focusses on static analysis (using tflint and terraform validate) and running Terraform on examples. Anton thinks that testing code should be for humans (so HCL is better than Go).
How to get feedback quicker?: Options: run Terraform locally instead of in CI/CD pipeline, simplify the code, and restrict surface area of your code using policies, guardrails, etc. Split up your Terraform code, with different states, to speed up Terraform runs.

Most important word for this talk: understandability.

50-67% of time for software projects spent on maintenance.

Some useful links Anton listed:

Keynote: Journey to Auto-DevSecOps at Nasdaq — Benjamin Wolf

Why DevOps? It can be pretty complicated to explain, even though it is an obvious choice for Nasdaq. For most people Nasdaq is a stock exchange but it is actually a global technology company. Sure, they run the stock exchange, but also provide capital access platforms and protect the financial system.

Nasdaq develops and delivers solutions (value) for their users. They manage and operate complex systems. It has been around for a while. So again, why DevOps? The answer is: to get better at their practices.

They want to deliver solutions (value) to their users efficiently, reliably and safely.
They want to manage and operate complex systems efficiently, reliably and safely.

Years ago they had manually configured static servers and the development teams were growing. They automated software deployment to a point where the product owners could trigger the deployment, and even pick which branch to deploy. This was an important first evolution. They had a “DevOps team” to handle this automation.

The second evolution for Nasdaq was moving from a data center to the cloud, using infrastructure as code (IaC). The question they asked themselves was what to do first: migrate to cloud or get their data center infrastructure 100% managed via IaC? They made the ambitious decision to do both at once.

By turning your infrastructure into code, you can create and destroy the environment as many times as you like. And this was welcome: after about 2100 times they “got it right” and were able to move over the production environment to the cloud. Without IaC this would not have been possible as flawlessly as it did.

The cloud and IaC brought them:

Efficiency: maintenance, patching, capacity
Reliability: self-healing, immutable
Safety: immutability, destroy/create

Over time the DevOps team started to handle a lot more work. The team consisted of system administators, but they were required to work as developers (make code reusable, use git, etc). The DevOps team started to complain about being overloaded and they became a bottleneck since a lot of development teams came to them with problems (failing builds, cloud questions).

On the other side, the development teams stared to complain because they are dependent on the DevOps team but that team had become the bottleneck. And “just” scaling up the DevOps team would not solve the problem.

Where the second evolution was about the technology, the third evolution was about efficiency. They moved to a “distributed DevOps” model. Developers were empowered: access to logs and metrics, training (cloud, Terraform, Jenkins). By creating a central observability platform, developers could get insight in what is going on, without the need to have access to the production environment.

This resulted in more deployments and enhanced reliability of the deployments because of the observability platform.

A year or three later, new cracks appeared. Standards were diverging because teams were allowed to pick their own path (libraries, databases, pipelines, Terraform code, etc). It also lead to practices that needed to be fixed (e.g. lack of replication). Standardizing this led to quite a burden at the start of a project: lots of basic stuff to setup.

Developers needed to be experts in a lot of technology, from JavaScript and the JavaScript framework in use, via multiple .NET versions to Terraform and other deployment related tech. An easy way to solve this situation was to flip back to the previous situation with a single team responsible for deployment and such. But this would basically mean recreating the bottleneck.

The ownership itself was not the problem. The efficiency was, because of the boilerplate needed for teams. Stuff you want to do the same across the teams. They wanted to empower the development teams, but also give them the standards for databases, messaging, etc. Instead of copying a template and have teams diverge afterwards, they looked into packaging to also make it easier to update afterwards.

This lead to evolution four with marker files, packages, code generators and auto-devops pipelines. The pipeline looks at the markers (“hey, this is a .NET app”) and can then apply a standard pipeline. Nasdaqs code generators create the boilerplate for the teams so within two minutes of starting with a new application you’re able to write code to solve your business problem, instead of having to create boilerplate code yourself first.

The developers can get up and running quickly, but in a safe way.

The development teams are all DevOps teams now, but Nasdaq also has a specialized team for the complex areas (hardware, networking, etc). There is also a “developer experience” team that focusses on the tools for the developers, like the code generators.

Current status with regard to our three key areas:

Efficiency: code generators, CLI, package setup
Reliability: package standards, pipelines
Safety: code scanning, package scanning, disaster recovery preconfigured

Varieties of Incident Response — Kurt Andersen

No matter how reliable our systems are, they are never 100% — an incident can always happen. When the pager goes off, the first step is to recruit a response team. This team can then observe what is going on. They need to figure out what this means (orient themselves) and decide what to do. And finally they can act to resolve the problem. (The OODA loop.)

Getting the right people involved can be hard for the technical responder; they themselves might want to dive into the technical stuff first. This is where the “incident commander” role comes in. The incident responder will recruit the team to get the right people involved, coordinate who does what and handle communication with people outside of the response team. (The latter can also be handled by a dedicated “comms lead” if needed.)

But how does the incident commander get involved?

A fairly standard approach for an on-call system will be to have a tiered model: tier 1 (NOC), tier 2 (people generally familiar with the system) and tier 3 (the experts of the system having the issue). The problem with this model: where does the incident commander come from? Tier 1? If so, can the incident commander follow through if the issue is handed over to the next tier?

Another model (“one at a time”): team A gets involved, decides it is not their responsibility, hands over to team B, which kicks it to team C, etc. Where does the incident commander come from in this model?

The aforementioned models only work in the simplest cases. They share a few big problems: handoffs are hard and there is no ownership, which results in loss of context. To mitigate this, some teams have an “all hands” approach where everyone is paged and everyone swarms into the incident response. However, most people on the call (or in the war room) cannot contribute. This leads to a mentality of “how quickly can I get out of here?”

Yet another approach is an Incident Command System (ICS), which comes from emergency services. In this approach the alert goes to the incident commander who then involves the team. While this works in some organizations, in tech it’s usually a bit too regimented.

The ICS morphed to an “adaptive ICS” where the technical team has more autonomy, but the incident commander is still involved. This system can be scaled up to where there’s an “area commander” role which coordinates separate teams (via their respective incident commanders).

Summarizing the roles of the parties in the “response trio”:

incident commander: coordination
tech team: nitty-gritty to solve the problem
comms lead: maintain contact with stakeholders

Each role will perform their own OODA loop from their own perspective.

But we started the story in the middle. We need to get back to the beginning and ask the question “why is the pager making noise?” Perhaps the first question one should ask is: “is this something actionable?” If it is not or if it is something you can handle in the morning, perhaps you do not have to respond in the middle of the night.

Cut down the noise and focus on the signal.

The E Stands for Enablement - Modern SRE at PagerDuty — Paula Thrasher

PagerDuty had an idea what SRE meant: they were enablers. You can hit them up on Slack and they help you out with a problem. Having an SRE team initially reduced the total minutes in incident in a year. But when PagerDuty grew further, the number went up again. Oops.

The “get well” project required teams to have the following:

Fast rollbacks: all rollbacks have to happen within 5 minutes. The SRE team provided guidelines to help teams achieve this.
Canary deploys: teams must test changes (canary) first. The SRE team enabled this via tooling.
Product limits: reasonable limits. The SRE team made this possible via telemetry and monitoring.

Results:

Lower number of minutes in major incident
Lower mean time to resolve
More “incidents averted”: four minor incidents were resolved before they caused real harm.

Important elements that made this “get well” project possible:

Clear, measurable goals
Enabled by tools and templates
Gave teams a path to do it
Experts were available to coach
The teams owned the implementation themselves
The teams were held accountable

PagerDuty used Backstage as an internal “one stop shop” developer portal with documentation and insights. It also integrates with the development systems.

All Day DevOps

2021-11-09T20:09:33Z

All Day DevOps is an online conference which lasts for 24 hours. With 150 sessions across 5 tracks, there’s enough content to consume.

(As with all my recent conference notes: these are just notes, not complete summaries.)

#adidoescode Where DevOps Meets The Sporting Goods Industry — Fernando Cornago

Adidas is a big company, about 57,000 employees. Software is also having an impact on sports, e.g. registering your performance, etc. Is Adidas already a software company? Definitely not, but they should start behaving as if they are. They are building their applications 10,000 times a day.

New book from Gene Kim (the author of The Phoenix Project): The Unicorn Project. Expected end of November 2019.

The platform you need to be able to accelerate you business should be:

on demand and self-service
scalable and elastic
pay per use (which helps with the efficiency mindset)
transparent and measurable (observability)
open source + inner source

To improve sales, Adidas' platform can run where their customers are (globally) and can scale up if needed and scale down again when possible.

Making use of data is key. The data is seen as a baton in a relay race and makes its way through the teams. A large part of time consumed in creating a service is spent at looking at the data. The data is used to train their AI systems.

You want to deliver value and make sure that things are working. Testing needs to be done on the physical level (the actual products that are being sold), but also on the virtual level: the IT platform also needs to perform and work as expected.

At Adidas they went from having a couple of “DevOps engineers” to having a DevOps culture in all teams. This removed bottlenecks and allowed them to grow faster. They have substantially reduced costs and manual testing and improved drastically on time to production.

Multi Cloud “day-to-day” DevOps Power Tools — Ronen Freeman

About the relation between SRE (Site Reliability Engineering) and DevOps: SRE is an implementation of DevOps.

What defines a great SRE:

Quality
- Be curious
  - Stay up to date
  - Attend events
  - Tinker with things
- Adapt to change (change will happen, anticipate it, monitor it, adapt to it and prepare for the next change)
- Learn: learn fast so you may promptly begin again
- Laziness: automate, but ask your self the question whether it will actually save time or not
Support
- Googling is an art form
- You’ll learn the answer itself but also discover things in the process
- There are numerous forums (for example: Stack Overflow, GitHub, Slack, etc)
- Have a daily standup
- Pass on knowledge
Tools: there are a number of areas to consider:
- Code: Bash
- Package: Docker
- Deliver: Concourse
- Platform:
  - Infrastructure: Kubernetes (relatively simple to get started with); GKE is the most mature
  - Configuration tool: Terraform
- Monitor: Google Stackdriver (since already running on GKE)

There’s a demo app: https://github.com/Darillium/addo-demo

How to Run Smarter in Production: Getting Started with Site Reliability Engineering — Jennifer Petoff

Software engineering is focussed on design and building, not about running the application.

There are multiple places in organizations where friction occurs:

Business vs development: this is addressed by Agile
Development vs operations: this friction is addressed by DevOps

SRE is a bridge between business, development and operations.

Four key SRE principles:

SRE needs Service Level Objectives (SLOs) with consequences
SREs must have time to improve the world
SRE teams must be able to regulate their workload
Failure is an opportunity to learn (having a blamelessness culture)

Some terminology:

Service Level Objectives (SLOs): a goal you want to achieve. For example, you may want to set an uptime SLO, but typically you want to dig deeper, like 99.99% of HTTP requests should succeed with a “200 OK” response. You want to measure something your user cares about.
Service Level Agreements (SLAs): these are “handshakes between companies,” contractual guarantees.
Error budget: the gap between your SLO and perfect reliability. E.g. the allowed downtime to still match your 99.9% uptime SLO.
Error budget policy: this policy describes what do you do when you have spent your error budget. Example: no new features until within your error budget again.

100% is the wrong reliability target for basically everything

SRE is about balancing between reliability, engineering time, delivery speed and costs.

Note that you can have an error budget policy without even hiring a single SRE. You can already start with an SLO.

SLOs and error budgets are a necessary first step if you want to improve the situation.

When talking about making tomorrow better than today, you need to address toil (work that needs to be done that is manual, repetitive, automatable and does not add value). If your SRE team is spending time on toil, they are basically an ops team. Instead you want them to make the system more robust and improve the operability of the system.

One way to be able to have your SRE team regulate their workload is to put them in a different part of the organization.

If you want to start with SRE, start with SLOs and let your SRE team own the most critical system first. Don’t make them responsible for all systems at once. Let them deal with the most important systems first. Once they have that under control (removed toil, etc), they can take on more work.

You need leadership buy-in for every aspect of SRE work.

Aim for reliability and consistency upfront. SRE teams should consult on architectural discussions to get to resilient systems.

SRE teams can benefit from automation:

Eliminate toil
Capacity planning (auto scaling instead of forecasting)
To fix issues: if you can write a playbook, you can automate it.

SRE teams embrace failure:

Setting SLO less than 100%
Blamelessness at all levels
Learning from failure
Make the postmortems available so other teams can also learn from them

Resources (books):

Everybody Gets A Staging Environment! — Yaron Idan

How can you give an isolated and secure staging environment for each developer? If you follow the path Yaron’s company (Soluto) took, prepare for a rocky transition, getting used to new terminology when moving to Kubernetes and users that can get upset at first.

Their first step: automate the entire process. The realized they needed onboarding documentation and templates. They used Helm for the templates.

Deploy feature branches to the production environment. When developers open a pull request, the CI/CD tool adds a link to where this PR is deployed. This empowers the developers and provides a sense of security. This accelerated the way the developers built features and the adoption of the platform by the company.

The benefits:

Code is fit for production during entire life cycle of the feature branch.
Makes performance testing easier (after all: it’s deployed on the production environment).
Easier to share things with the stakeholders and thus improves collaboration.
You can deploy an entire set of microservices.

The tools used for their implementation:

GitHub (VCS)
Docker (code as configuration)
CodeFresh (CI)
Helm (CD)
Kubernetes (production platform)

You can swap the aforementioned tools with other tools, but you must be able to automate them and make sure that they work with the push of a single button.

GitHub repository for the demo: https://github.com/yaron-idan/staging-environments-example

DevOps to the Next Level with Serverless ChatOps — Jan de Vries

What DevOps is about:

Shipping code
Adding value
Do this continuously

It’s not about:

Adding additional load to your team
Lowering quality
More support calls
Slow response times when there are errors

One option is to send emails if something is wrong. But these emails are slow (knowing that there was an issue yesterday is too late) and it generates a lot of noise. Monitoring dashboards are nice, but you have to actively look at them to detect if there’s something wrong.

A better solution is using chat applications (like Microsoft Teams or Slack).

Your first step is to emit events when something fails (in Azure you can sent such events to Event Grid). Next, you have to subscribe to these events/topics. You can use an Azure Function for this and then have it post an actionable message to Teams.

These messages can have buttons to take immediate action (e.g. post to a web API like an Azure Function) to recover from the problem. The output from the recover action can also send events, so you can get another notification in your chat application about the result.

GitHub repository with demo code: https://github.com/Jandev/ServerlessDevOps

Building Modular Infrastructure in Code — Fergal Dearle

Why use infrastructure as code (IaC)?

Fergal is a fan of immutable infrastructure and the pets vs cattle analogy (which goes further than just servers)
Everything as code
You don’t have to log into a server via SSH to fix things
Better testability and maintainability

The DevOps handbook (part III, chapter 9) talks about on demand spinning up a production-like environment. The only way to do this is to use IaC.

First thing is understanding your infrastructure (VPC, networking, DNS, services, database). Look at patterns like requestor/broker. Use these patterns to create modular stacks.

Patterns:

Singleton stack pattern: one stack per instance. But this is actually an anti-pattern because you are mixing configuration and infrastructure.
Multiheaded stack pattern: multiple stacks in single project, config built in.
Template stack pattern: single stack, config separate, multiple instances from that template

Stack templates & stack instances:

Reusable template that implements building blocks
Appropriate tags
Must be able to uniquely identify instances (namespacing)

Stay clear of monoliths

Monoliths are stacks with everything in them. This is one end of the spectrum. The other end is to have one service per stack. You’ll probably will want to end up somewhere in between, where you have a component stack (e.g. with a requestor/broker combination). And then you can combine those stacks.

Static stack inputs: environment names, app names. You should externalize this config from your stack, e.g. by feeding these parameters via the command line or use the SSM Param Store and Secrets Manager.

(Note to self: Fergal mentioned Sceptre. Check what this is about.)

Common problems:

Drift can happen, especially if you also make manual changes (don’t!)
Updates are not forward compatible
Region incompatibilities

Best practices:

Only use template stacks
Use layered and component
Externalize config
Test, test, test!
Only use automation, no manual steps

Crossing The River By Feeling The Stones — Simon Wardley

In The Art of War Sun Tzu discusses five fundamental factors which you should consider:

Purpose
Landscape
Climate
Doctrine
Leadership

John Boyd developed the OODA loop, which cycles through the following:

Observe
Orient
Decide
Act

You can combine these cycles:

By moving we learn, as long as we can observe the environment.

Maps are ways to:

explore
communicate
share understanding
de-personalize

Graphs are not maps. Identical graphs can spatially be different. On a map, space has meaning, in a graph not. And because space has meaning, we can explore, e.g. by asking ourselves the question “what if we go in that direction?”

What makes a map?

Anchor (compass)
Position (places, relative to the compass)
Consistency of movement (going to right below means go to south-east, assuming on this map “up” is north)

In business there are a lot of maps, e.g. a systems diagram. But almost everything in business we call a map actually is a graph. Which means we don’t have a sense of the landscape.

How do you create a map for a business? Start with the anchors, e.g. business and customers. Work from there to get all related components and express them in a chain of needs.

But how do you position those components? We can use “visibility” as an analogy of distance. You can use this to create a “partial ordered chain of needs.”

And what about movement? You can describe movement as a chain of changes. But how do you do that? You can order components based on their evolutionary stage. Are they in the genesis stage, do we use custom built components, do we have standard products or can we consider it a commodity?

By combining these we can create a map of components by using the position (visibility) on the y-axis and the movement (evolution) on the x-axis.

Simon Wardley shows a map of a fictional company selling cups of tea

You can add intent (evolutionary flow) to your map, e.g. move a component from custom built to using a commodity.

By using such a map, you can de-personalize discussions since now you talk about the map, not a story.

Simon told a story of a company that wanted to invest in robots to do manual labour. After putting their business in a map, the problem became clear: they were using custom built racks and as a result needed to customize their servers because they would not fit their non-standard rack. Because the market had changed since the company started, they could switch to commodity (cloud) and as a result also did not need robots.

But keep this in mind:

All maps are imperfect representations of a space

Deploying Microservices to AWS Fargate — Ariane Gadd

At KPMG they moved a project from using Amazon EC2 to AWS Fargate. They also replaced CloudFormation with Terraform for standardization.

Why managed container services and not EKS?

Immutable deployments
Cluster provisioning handled by Fargate
Only pay for what you execute
No OS patching
Smaller attack service

They used ECS deploy to deploy containers to Fargate.

Why was the customer okay with this change?

They were already familiar with Docker
Fargate has integrations with ECS and ECR.
Overhead of managing EC2 vs cost of AWS Fargate

Fargate runs in your own VPC so you own the network, etc.

The microservices for this project are placed behind an application load balancer (ALB) using AWS Web Application Firewall (WAF) and TLS termination. Logs are sent to AWS CloudWatch Logs. They used Anchor for end-to-end container security and compliance.

Benefits of AWS for KPMG:

Cost reduction (automation tools were already in place, they already used Lambda for account creation and had reserved instances)
Speed of delivery
Allows collaboration with developers (automation, DevOps).
70% of the engineers were already AWS certified.

Automate Everyday Tasks with Functions — Sean O’Dell

Common use cases for serverless applications are things like web applications (such as static websites), data processing, Amazon Alexa (skills) and chatbots.

Not every workload should run in a serverless fashion

Organizational and application context are relevant. Serverless is just another option. You pick what is right for your use case as serverless is not a silver bullet.

AWS Lambda in a nutshell: there is an event source (e.g. a state change in data or a resource), which triggers a function, which in turn interacts with a service.

GitHub repository with examples: https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions

(Note to self: check Amplify)

Beyond dev & ops — Patrick Debois

The DevOps pipeline usually starts at the backlog. But what happens before? And how does that influence people outside the engineering department?

Patrick joined a startup five years ago. He fixed the DevOps pipeline. After a while the organization wanted to sell a product.

The book The Machine by Justin Roff-March describes “the agile version of sales.”

Sales had more in common with IT than Patrick expected. The sales pipeline looked like a Kanban board. Sales people are also on call, especially if you do international sales. If we can deliver faster, this increases the chance of a sale. Sales persons are actually mind readers; this reminded Patrick how IT people feel about tickets in the backlog: “what is meant with this ticket, what do the customers actually want?”

Thinking about the life of the project after it has been delivered, in terms of maintenance and operational costs, is important. Doing this upfront, is beneficial for the project.

The biggest impact sales had on IT was the never ending hammer of “can you make things simpler?” Sales can help you with making your design less complex. Together you can determine what actually adds value for the customer.

Having publicly available documentation really helps and makes it easier to sell the product. Getting the customers on Slack also helped. This showed that the company was willing to listen and help. This also helps sales.

After sales, the marketing department got focus. Patrick has seen the most interesting automation within the marketing department (e.g. email sending). Marketing has a lot of fancy metrics and tools to measure stuff; did they invent metrics? The IT department can help marketing by providing information. Conferences are also a marketing instrument. Marketing also has CALMS; they do very similar things.

We’ve got sales and marketing covered, but for a lot of things we still need budgets. To budget things, you also need to estimate things. Finance knows what it is to be on the receiving end of tickets where you have to guess what is going on.

Sometimes you need to buy things. Patrick looked into agile procurement. Lots of stuff has to be figured out: is it a fit? is there a way to be partners?

Serverless is a manifestation of servicefull, where you use a lot of services, so you can focus on your domain instead of having to build everything yourself.

With cloud services there doesn’t have to be communication initially: you read the website, check the documentation, pull your credit card and start with a proof of concept. Communication is still important. You have to treat your suppliers with respect.

HR: how does your company support hiring people? Have people join the team for some actual work to see if they are a good fit or not.

Do you know what pair programming is? Well, mob programming is a whole other level of collaborating. (Note that it’s more than just have one person typing and a bunch of other people commenting—it has a lot of patterns going on.) Teams can get very enthusiastic about this and can be very productive the first week. But if they are not careful, the team is exhausted the next week. You need to develop a sustainable pace.

Patrick learned a lot from talking to the different departments. They are solving similar problems as IT is.

Reading tip: Company-wide Agility with Beyond Budgeting, Open Space & Sociocracy by Jutta Eckstein and John Buck.

The Open Source Observability Toolkit — Mickey Boxell

Some characteristics of a modern application:

Microservices
Distributed
Multiple programming languages
Scalable
Ephemeral

This brings new challenges compared to the old situation with a monolithic application running on a single machine. You still want to address threats to customer satisfaction. But the old debugging solutions we used for a monolith will not work.

Observability means designing and operating a more visible system. This includes systems that can explain themselves without you having to deploy new code. The tools help you understand the difference between a healthy and unhealthy system.

Modern, distributed systems will experience failures.

Observability takes a holistic approach that recognizes the impact an issue has on the business as a whole. Outages affect the whole business, which should be able to react. Observability gives tools to address issues when they arrive.

External outputs:

Logs
Metrics
Traces

These outputs don’t make your system more observable, but they can be part of the holistic approach. They can help you with e.g. a root cause analysis.

Monitoring tells you whether a system is working, observability lets you ask why it isn’t working.

—Baron Schwartz

Key SRE concepts:

Service Level Indictors (SLIs; what are you measuring), Service Level Objectives (SLOs; what should the values be), Service Level Agreements (SLAs; defines the planned reaction if the objectives are not met).
Subset of Google’s golden SRE signals: RED (Rate, Errors, Duration).
Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR)

Users are comfortable with a certain level of imperfection. A site that is a bit slow sometimes, is less annoying than a shopping cart that sometimes looses its content. Use this to determine what you’ll spend your time on first.

Logging

Logging is the most fundamental pillar of monitoring. Logging records events that took place at a given time. Supported by most libraries because it is such a fundamental thing.

You only get logs if you actually put them in your code. You also have to make sure you don’t lose them, for instance by aggregating them.

Tools:

Fluentd: scrape, process and ship logs
Elasticsearch: a data store
Kibana: interact with your logs

Metrics

Metrics are a numeric aggregation of data describing the behavior of a component measured over time. They are useful to understand typical system behavior.

Tools:

Prometheus: scrape data, store the data and query the data
Grafana: visualize the metrics, can aggregate data from different sources

Alerts are notifications to indicate that a human needs to take action.

Tracing

Tracing is capturing a set of causally related events. Helpful for debugging. Example: each request has a global ID and if you insert this ID as metadata at each step, you can trace what happens.

Tools: Jaeger and Zipkin can visualize and inspect traces.

The challenge is that tracing is hard to retrofit into an existing application. You need to instrument all component. Service meshes can make it easier to retrofit tracing.

Service meshes

A service mesh is a configurable infrastructure layer for microservice applications. It can monitor and control the traffic in your environment. It can be a great approach for observability. You will get less information from a service mesh compared to adding tracing to all of your components, but on the other hand it takes less effort to implement than instrumenting your existing code base.

Tools:

Using service meshes won’t eliminate the need to instrument your code, but it can make your life more simple.

Called Traefik Mesh these days. ↩︎

Podcasts I listen to — 2019 edition

2021-11-09T20:09:33Z

Most of the time I use my commute to listen to podcasts. Because they reflect my interests at this point in time, I thought it would be nice to share my current list.

The last time I posted my selection of podcasts was back in 2012. Since then, a lot has changed:

Only two out of the six podcasts on the 2012 list are still in my feed. But there are quite a few new podcasts to compensate.
My commute has reduced from six hours/week to about four.
Out of necessity I no longer listen to all episodes of all podcasts—I just do not have enough time to listen to them. But all podcasts in the list below have interesting episodes, so unsubscribing from them is also not an option. (I know, first world problems…)

My podcast feed with the reason why I listen to each show:

AWS Morning brief: I use this podcast to quickly get up to speed on AWS related developments. Oh, and I like the snarky way Corey Quinn summarises the news.
Command line heroes: a fun and educational podcast about topics that interest me as a developer. For example season 2 was focussed open source development and season 3 was all about (the history of) programming languages.
Darknet Diaries: interesting stories about the dark side of the internet. I love the stories and the way Jack Rhysider tells them.
Freakonomics Radio: a completely different show because it is not technology focussed. Stephen Dubner discusses topic I often don’t know much about or have not thought about much. I think the episodes are not only educational but also fun to listen to.
O11ycast: this one I recently discovered. Since I’m interested in the subject of observability, this podcast gives me new ideas to think about.
Real World DevOps: I really like Mike Julian’s interviews with people from the tech industry about DevOps related topics. Unfortunately there has not been a new episode for a few months; I hope Mike will start podcasting again in the future.
Reply All: a brilliant show. I had lost track of the hosts, PJ Vogt and Alex Goldman, after they stopped with the TLDR podcast. Unfortunately I did not search for them, otherwise I would have found Reply All a lot sooner! I listen to this podcast primarily to get amused; learning something from it is a nice side effect though.
Screaming in the cloud: another podcast from Corey Quinn which I listen to, to learn more about cloud related topics from the interviews on this show.
Security Now: I listen to this podcast for the security related news and an occasional deep dive into a technical topic.
Shop Talk Show: although these days I’m more into infrastructure and automation, I like listening to this show allow to keep in touch with web development.
Talk Python to me: more or less the same as with Shop Talk Show: I like this podcast for its (Python) development related topics.
Tech45: a podcast I’ve listened to for quite some years now and one of the few shows I want to listen to every single episode of. I enjoy the way the panel discusses the (tech related) news each week. Informative and “gezellig.”
The privacy, security & OSINT show: the most recent addition to my list. I have only listened to a few shows yet, but the topics seem interesting. We’ll see if this one sticks or not.

And there you have it: my list of podcasts of October 2019. (Also available in OPML format.)

Devopsdays Amsterdam 2019: day one

2021-11-09T20:09:33Z

This is the seventh time devopsdays is organized in Amsterdam. The event was sold out twee weeks in advance and has the highest attendance rate thus far. I was fortunate enough to attend again this year.

Observability for emerging infra: what got you here won’t get you there — Charity Majors

[Slides]

Charity has been a systems engineer since was 17 and she’s done her fair share of being on call. As a result her approach to software contains stuff like:

The only good diff is a red diff.
Junior engineers ship features, senior engineers kill features.

She co-founded Honeycomb.io a few years ago. Her company wants to help understanding and debugging complex systems.

Vendors often talk about the “three pillars of observability”, meaning metrics, logs (which is where data goes to die) and tracing. Which happen to be the categories of products they are trying to sell you.

The architecture that is being used today is much more complex than the humble LAMP stack we used years ago. Because of this increased complexity you go from “known unknowns” to “unknown unknowns.” Testing is also harder since you cannot simply “spin up a copy” if you want to test something. As a result, monitoring isn’t enough—you need observability since you cannot predict what you need to monitor. Plus: you should fix each problem. So every page should be something new; it should be an engineering problem, not a support problem.

Due to the infinitely long list of almost-impossible failures, your staging environment becomes pretty useless. Even if you capture all traffic from production and replay it in your staging environment, the problem you saw in production does not happen.

Good engineers test in production. Bad engineers also do this, but are just not aware of it. You cannot prevent failure. You can provide guardrails though.

Operational literacy is not a nice-to-have

You’ll need observability before you can think about chaos engineering. Otherwise you only have chaos. If it takes more than a few moments to figure out where a spike in a graph is coming from, you are not yet ready for chaos engineering.

Monitoring is dealing with the known unknowns. Great for the things you already know about. This means it is good for 80% of the old systems, but only 10% of the current systems.

Your system is never entirely “up”

The game is not to make something where nothing breaks. The game is to make something where as many things as possible can break without people noticing it. As long as you are within your SLOs, you don’t have to be in firefighting mode.

Not every system has to be working. As long as the users get the experience they expect, they are happy.

Engineers should know how to debug in production.

Everyone should know what they are doing. If an engineer has a bad feeling about deploying something, investigate that. We should trust the intuition of our engineers. However, the longer you run in staging, the longer you are training your intuition on wrong data. You should be knee-deep in production. That is a way better way to spend your time. Watch your system with real data, real users, etc.

Don’t accept a pull request if you don’t know how you would know if it breaks. And when you deploy something, watch it in production. See what happens and how it behaves. Not just when it breaks, but also when it’s working as intended. This catches 80% of the errors even before your users see them.

Dashboard are answers. You should not flip through them to find the one matching the current scenario. It’s the wrong way of thinking about problems. Instead you need an interactive system to be able to debug your system. Invest time in observability, not monitoring.

By putting software engineers (SWEs) on call, you can make the system better. The SWEs are the ones that can fix things so they don’t have to be woken up in the middle of the night again. If you do it right, and have systems that behave themselves, being on call doesn’t have to interfere with your life or sleep. This does require organizations to allow time to fix things and make things better instead of only adding new stuff.

Teams are usually willing to take responsibility, but there is an “observability gap.” Giving the SWEs the ops tools with things like uptime percentage and free memory is not enough. Most SWEs cannot interpret the dashboards to answer the questions they have. In other words, they don’t have the right tools to investigate the problems they encounter. You need to give them the right tooling to become fluent in production and to debug issues in their own systems.

High cardinality is no longer a nice-to-have. Those high cardinality things (e.g. request ID) are the things that are most identifying. You’ll want this kind of data when you are investigating a problem.

The characteristics of tooling that will get you where you need to go:

Well instrumented
High cardinality
High dimensionality
Event-driven
Structured
Well-owned
Sampled
Tested in production

By having this you get drastically less pager alerts.

Great talk by Beau Lyddon: What is Happening: Attempting to Understand Our Systems

You can use logs to implement observability as long as they are structured and have a lot of context (request IDs, shopping cart IDs, language runtime specifics, etc).

Senior software engineers should have proven to be able to own their code in production. They should notice and surface things that are not noticeable. Praise people for what they are doing good and make sure that those are also the things people get promoted for.

How Convenience Is Killing Open Standards — Bernd Erk

Thanks to the POSIX standards, Solaris admins could still do things on HP-UX systems. Then the GNU toolset came along. And Linux. Communities. Diversity.

As an industry we have a talent to forget everything we learned before. We jump from one new thing to another, but we keep making the same mistakes over and over.

We have more open source software than ever before. But we also have a dramatic change in the infrastructure. Public cloud becomes more popular. But is this a problem? Isn’t this just evolution?

Open source does not mean open standards however. An API is also not a standard, even if it is open. Standards are defined by organizations like IETF and W3C.

We have an unbalanced market. AWS and Azure are much bigger than the others. They all have their own APIs, but there is no open standard! This is why Terraform is so popular: it solved the missing standard problem for you. But you still need to be aware that an EC2 instance is different from an Azure machine though.

Open source has an impact on open standards.

Amazon is offering services that are based on open source software (e.g. Amazon Elasticsearch Service and Amazon ElastiCache). As a result customers don’t talk to Elastic, but to Amazon. Google does this better and (at least for some products) forwards issues to the developers.

Some companies that create open source software have now come up with new licenses to prevent e.g. Amazon offering their product as a service. This creates uncertainty for companies that want to use the technology.

Open source and open standards need support (money). As an industry we need to make sure that they have chance to have a place in the market.

Customers should ask their vendors to contribute back to the creators of the software.

For a lot of companies infrastructure is overhead. Using cloud services poses a risk, but a risk you perhaps can accept given the convenience those cloud services bring along. Lock-in is a risk, as is slowness.

Bernd himself would also use the cloud for a startup to get off the ground quickly. After one to two years you’ll probably have to redo your setup. Perhaps this is a good time to think about how to move forward? What do you want to be flexible about?

Economics in an unbalanced marked are not good. See the pharmaceutical market where there are a few big players and prices of medicines sometimes explode. Being able to work with different providers and be interoperable is important. We need variety.

Unfortunately the cloud providers don’t have an incentive to work towards open standards and to make it easier for customers to hop around from one provider to another.

Diversity is up to us. Not just on the work floor, but also in the technology we use.

Come listen to me, I’m a fraud! — Joep Piscaer

This talk about impostor syndrome. 70% of people experience it.

Joep’s “definition” of it:

Your internal measuring stick is broken.

Everybody is unsure, but nobody talks about it. In IT this problem becomes worse. We talk about “best practices,” but there are so many ways to do things, how do you know what is the “right” way? Perhaps the way you are doing things is perfectly fine.

Joep talked about the Dunning-Kruger effect. But in IT the technology landscape changes in a fast pace. The goalpost is moving constantly.

What can you do about it? How do you recalibrate your internal measuring stick?

Say positive things about yourself in front of a mirror to diminish the negative thoughts about yourself and strengthen the positive ones.
Don’t compare yourself to other people, especially people on the internet.
Learn how to receive compliments.
- Write down compliments and accomplishments and look at the list regularly. This also learns you about the perspective of others.
- Brag about one or two occasionally.
Learn from others and teach others about your thinking process (pair programming, pair review (instead of peer review), speak publicly, etc).
Trust others, you are not alone. Talk to others, participate in conversations. Show yourself.
Get hobbies. A hobby is something you like to do and where you can fuck up without consequences.

To discover why you have impostor syndrome, you have to identify your limiting beliefs.

A number of TED talks you should see:

The power of vulnerability, by Brené Brown (Tip: watch this in a comfortable setting with your partner nearby to support you if needed.)
Inside the mind of a master procrastinator by Tim Urban
Imposter Syndrome by Mike Cannon-Brookes

Using Design Methods to Establish Healthy DevOps Practices — Aras Bilgen

Companies that want to do DevOps are usually enthusiastic about tools and processes. Culture is a harder subject… Even if you show data that demonstrates the importance of the cultural aspect of DevOps, there’s not much understanding.

Most people think about design as making things beautiful. But it also makes things work better. It makes things more visible and matches the conceptual design with the mental model.

Design can reframe problems.

Uber is not just about getting a ride, but it is rethinking the way getting from A to B works. Airbnb is not only about getting a nice room, but it is rethinking the hospitality industry overall. Starbucks sells coffee, but also turns community spaces into places where you can hang out and have conversations over coffee.

Good designers:

Work with actual end users (not managers or product owners)
Welcome ambiguity
Give form to ideas
Co-create in a safe setting
Experiment and revise

Instead of talking to customers about their tools, processes and culture, you could start with interviews and having collaborative process maps workshops where you look at the entire flow from idea to when it explodes in production.

Don’t just offer the out-of-context “best practices” from the DORA State of DevOps report, but have challenge mapping workshops. This leads to solutions that are applicable and often don’t even need to be imposed but are willingly implemented.

Do diary studies: what do people actually do and what makes working hard for them?

Two takeaways:

Mindset matters more than background. You don’t have to be a designer to do this.
Stop and listen with an open heart and no agenda.

Serverless is more FinDev than DevOps — Yan Cui

(This was a fast-paced talk and I did not have time to make many notes.)

Serverless means that:

you only pay for what you actually use
you don’t need to worry about scaling
you don’t need to provision and manage servers

Function-as-a-service: upload code to the cloud which gets run when called and your function then handles the events.

Serverless is interesting because:

scalable
pricing (can be cheaper, but not always!)
resilience
security
greater velocity from idea to product
less ops responsibility (no servers to manage)

With regards to the velocity: with serverless you can remove a bunch of steps from your process, like figuring out deployment, configuring an AMI, ELB or autoscaling.

Return on investment

Companies are looking for a return on their investment.

We tend optimize what we can easily measure: operational stuff. But most of the budget is going to engineers, managers, tools, etc.

Serverless might cost just as much as e.g. running EC2 instance and perhaps it costs even more, but you can get so much more done.

With a server you have the same costs no matter if you have only a single transaction or 100. This means that the transaction costs (costs divided by the number of transactions) is variable. To make matters worse: there are often multiple services running on a since server. This makes it even harder to estimate the transaction cost or the cost of a feature.

The pay-per-use characteristic of serverless is useful in this contest. It makes it much more clear what the cost per transaction is.

Ignites

Ignites are short talks. Each speaker provides twenty slides which automatically advance every fifteen seconds.

The IJ & the shape of DevOps — Jason Yee

The shape of the IJ has changed though the years. Has the shape of DevOps changed over the years?

Well, now there is CLAMS (or CALMS), which stands for:

Culture
Lean
Automation
Metrics
Sharing

If you would plot each of these on an axis, you can plot an organization on it.

Now imagine the circumference is an elastic band: if you pull it on one axis, you’ll decrease the value on the other axes.

Work around the circle. If DevOps means automation for you, this helps being lean, which shapes your culture, which helps sharing, etc.

Evaluate your organization: what is it good at, what are you working on? It’s a natural process, a dynamic movement. Work on all axes.

What you see is what you get for AWS infrastructure — Anton Babenko

Cloud architects and DevOps engineers want to have a faster conversion from idea to working project.

With Cloudcraft you can visualize your AWS infrastructure. With modules.tf you can export that diagram to Terraform code. And with Terraform you can manage your infrastructure. And we have come full circle.

YAML Considered Harmful — Philipp Krenn

Why do we use YAML in the first place? We want to have a human readable format (so XML is out), and we want to be able to place comments in our code (which means JSON is out of the picture).

There are a few issues with YAML though. To name a few:

Suppose you have a port mapping in your docker-compose.yml:

ports:
  - 80:80
  - 22:22

Looks fine? Except it is not fine, because YAML will parse numbers separated by a colon as base 60 (sexagesimal) numbers if the numbers are smaller than 60. So the second line is not parsed as the string “22:22” but as the integer 80520, which is something Docker (understandably) cannot handle.

Another example is this:

countries:
  - DK
  - NO
  - SE
  - FI

However, “NO” is a falsy value, so the countries list will contain three strings and a boolean.

If you want a DSL, please use a DSL, not YAML.

Alternatives:

Maybe jsonnet?
XML: problematic for humans, it’s meant to be parsed by computers computers
JSON: doesn’t feel right
INI: no hierarchy, parsers on different platforms behave differently
TOML: indentation is optional
JSONx: worst of all worlds

When you want to write YAML:

indent properly
quote just about everything

Microsoft Ignite | The Tour: Amsterdam — day two

2021-11-09T20:09:33Z

On the second day of Microsoft Ignite | The Tour: Amsterdam I attended talks in the “Operating applications and infrastructure in the cloud” learning path.

(On the first day I followed the sessions of the “Building your application for the cloud” learning path.)

Modernizing your infrastructure: moving to Infrastructure as Code (IaC) — Emily Freeman

Operations is changing—modernizing. DevOps reduces or eliminates the friction between the development and operations teams.

One way of defining DevOps is with the CALMS acronym:

Culture
Automation
Lean
Measurement
Sharing

SRE has evolved beyond its original form. SRE can be seen as a prescriptive implementation of DevOps. It has a focus on deploying and maintaining applications.

About failure:

In the cloud, sometimes it is rainy.

If you are using the cloud, you are using APIs continuously. This allows us to automate a lot of things. And do configuration via (YAML) files which can be version controlled. This gives us repeatability and reliability.

Toil is defined as manual, repetitive work. Automation removes the toil. By creating code to do the work, we can reuse it.

Azure Resource Manager templates: you can manually deploy infrastructure and then view the template from within the Azure portal. This template is a configuration file which you can store and use over and over again.

Deploying small batches of code is key. The risk of small changes is less than having big (and many) changes. Small batches also make it simpler to identify bugs. Releasing often makes sure feedback on changes is received faster.

Microsoft works hard to make sure that Azure Pipelines work with the tools you already use.

Additional resources:

Monitoring your infrastructure and applications in production — Jason Hand

Why, what and how do we monitor?

Let’s first have a look at a definition of SRE:

Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products.

Note the terms “appropriate” and “reliability.” Hundred percent reliability is expensive but not always needed. For planes and pacemakers you want absolute reliability. For applications, this usually is not needed. Ask yourself what the parts of reliability are that you want to be concerned with.

Why monitor?

Are apps doing what we expect from them?
Are apps doing what others (customers) expect?
Are we meeting our goals or at least moving in the right direction?
Our systems are constantly in change. What is happening with these changes?

Monitoring is the most important part of reliability.

Source: sheets from this presentation

The difference between reacting and responding to incidents is being prepared or not. You’ll want to be prepared to be able to respond, not just react.

Some organizations perform root cause analyses after failures. Complex systems usually don’t have a single root cause though—more often than not it is a combination of causes. Since having more than one root cause is a contradiction in terms, Jason much rather talks about “post-incident reviews.”

Back to the term reliability. This can be measured in a lot of ways. To name a few:

Availability: will the system respond when I ask it something?
Latency: “slow is the new down” (we don’t have the time for slow applications).
Throughput: can you get the volume of data from one point to another in a way you expect?
Coverage: can a process plough through all the available data?
Correctness: did it do what it was supposed to do?
Fidelity: See for example Netflix: perhaps the search is temporarily unavailable but you can still scroll through the catalog and watch a movie.
Freshness: is the data you get the most recent stuff?
Durability: if you write data, can you retrieve it again?

It is not just about what we see, but also about what our customer experiences. We should put the customer first and look at things from their perspective.

Service Level Indicators (SLIs) are a ratio/proportion. For instance: the number of successful HTTP calls divided by the total number of HTTP calls.

But where do the numbers come from? A few options:

Reported by the server
Reported by clients
Reported by the application
Reported by the front-end (load balancer)
Reported by our monitoring / testing infrastructure

You need to include the way you measure in the SLI.

There is no “right way” to measure things, but there are trade-offs. And you need to be aware of them. Ideally you should monitor as close to the user as possible.

The SLIs are the measurements. The Service Level Objectives (SLOs) determine what is acceptable. If the SLI (indicator) dips below the SLO (objective), we need to get people involved.

How to build a good SLO? Take the thing to monitor (e.g. HTTP requests), look at the SLI proportions and add a time statement. An example: “90% of the HTTP requests as reported by the load balancer succeeded in the last 30 day window.” You can make your SLOs more complex if needed, e.g. by having a lower and upper bound.

Alert on actionable things. Alerts are not logs, notifications, heartbeats or normal situations. Alerts are about things a person needs to go do. And not just any person, but the right person. It should not be something that can be automated. Ideally alerts are pretty simple and include:

Where is this coming from?
Which SLO is breached?
What is the impact for the customer?
What systems are impacted?
What are the first couple of steps to take?

Possible future direction of monitoring:

There’s no one silver bullet. We’ll probably want to integrate multiple monitoring solutions.
Modular monitoring
Aggregate monitoring (measure the system as a whole)
Correlation and anomaly detection (AI / ML)
Monitoring the monitoring

Additional resources:

Troubleshooting failure in the cloud — Jason Hand

Our systems have become more complex than they have been in the past. We are now using microservices, containers, functions, etc. Our focus for this session: going from the detection phase (previous session) to the response phase.

Before talking about how to detect something is wrong, you need to define that you mean with “wrong.”

Azure has a nice toolbox to help you:

Azure monitor
Service health alerts
Azure Application Insights
Diagnostic logs (application and system)
Live debugging via log streams
Log Analytics

The first question is often: is it just me? Azure (and the other cloud providers) can also have problems. The Azure Resource Health page in the Service Health section shows you if your components have problems. It even has historical data. But because you don’t want to stare at a dashboard all day long, you’ll want to also have health alerts for when something goes wrong.

Guidance for creating service health alerts:

Don’t create too many, but also not too few. Find your sweet spot.
Make sure they don’t overlap (to avoid confusion).
Don’t alert on non-production environments.
Keep the responders in mind (what would they want).
Separate alerts for issues, planned maintenance and advisories.

Application Map shows components and how they interact with each other. Where is slowness? Where are problems?

You will need to enable diagnostic logging to get a bunch of logs. And you don’t have to go to the portal, you can use the CLI tool to download them. You can then use a separate tool to analyse the logs.

az webapp log download --resource-group  --name

Azure Cloud Shell allows you to have a terminal session in your browser. You don’t need your local CLI.

Use Log Analytics for more complex troubleshooting. It collects a lot of information from a log of sources.

You’ll need to use the Kusto Query Language to get to the information. Queries start with a table, then commands to filter, sort, specify time range, etc. Once you have queries that answer your questions, you can keep them handy by saving them right in the Azure portal.

Azure Monitor Workbooks and Troubleshooting Guides (a.k.a. runbooks) combine text, queries, metrics and parameters into reports. This helps with onboarding and will also help your first responders.

Workbooks are editable by others that have access to the same resources. Troubleshooting Guides are like workbooks, but for dealing with problems.

Future for troubleshooting:

Observability (OpenCensus and other distributed tracing tools)
Anomaly detection

Additional resources:

Scaling for growth and resiliency — Jeramiah Dooley

How do we take an application and how do we scale it? Bigger picture: asking a lot of “why?” questions.

Do we need to scale out our application and the underlying infrastructure? Maybe.

Perhaps we should first check if the application is buggy (memory leaks) or if we can optimize it for performance (e.g. database queries).
Is there a temporary (or artificial) reason the site has a spike?
Is there unused capacity we can move around?
Scaling costs money, but if we don’t do it we give the customers a bad experience. This is a risk vs expense issue.
When is the application no longer meeting our expectations (SLOs)?

We need to factor for both the anticipated and unanticipated spikes.

One can scale up (vertical) or out (horizontal). Scaling up is adding more resources (CPU, memory) to an instance. Scaling out is adding more machines. When scaling out, the application needs to be aware of the change (additional or less instances).

You might not want to scale just one portion of an app since it can cause problems in another part of the app.

Autoscaling can trigger on quite specific conditions: only today if CPU > 80% add more instances.

But should you use autoscaling?

Do you know your app? When do you need to add more resources? How will your app respond? How quickly? What happens when you scale down?
Do you know your traffic?
Autoscaling also has its limits. (E.g. there is going to be a delay.)
Do you understand the metrics that matter?

It’s not magic, it’s just automation.

An autoscaling profile has:

Capacity:
- Minimum
- Maximum
- Default
Triggers (a metric we want to trigger on and a value for that metric)
Recurrence (when do we want this to happen? Always? First week of the month?)

Warnings with regards to autoscaling:

Scaling out will happen when any of the triggers are met.
Scaling in will trigger when all triggers are met.
Autoscaling always wins over manual scaling settings.

Okay, we know how to handle load. But do we know what to do when we have a localized failure (e.g. a data center that goes down)? Use:

Multiple regions/geographies
Region pairs
Availability zones

Jeramiah Dooley about the relation between Geography, Region Pairs, Regions and Availability Zones

Azure Front Door:

Global HTTP load balancer with instance failover
Bonuses:
- SSL offloading
- Firewall and DDoS Protections
- Central control plane for traffic orchestration

Future possible directions for scaling:

Content delivery networks (CDNs)
Failovers
Traffic shaping
Draining data centers

Additional resources:

Code (Tip: the code apparently has a very nice deployment script.)
Scale up an app in Azure
Get started with Autoscale in Azure
Business continuity and disaster recovery (BCDR): Azure Paired Regions
What is Azure Front Door Service?

Responding to and learning from failure — Emily Freeman

Failure is inevitable.

Teach people how to solve problems. It will prevent you from being woken up for everything at night.

Having to react to failure leads to:

Unplanned work
Context switching
Confusion
Increased time to repair

We tolerate a certain level of failure every day. Things that are not fine however:

Alerts to inbox
Alerts to a common channel where memes are combined with alerts since you’ll miss the important stuff
Shoulder tapping
Delays in communication
Tiered and off-shore support
Network Operations Center

We can borrow concepts from other disciplines. Like creating a space to communicate and collaborate. Ideally there is a single space for all communication for an incident; a space devoted to that single incident. The benefits of having such a space: you’ll have a time stamped record (which helps the post-incident review), and it focusses communication.

Train your first responders and provide them with context and clear escalation procedures.

Update your stakeholders throughout the incident.

Important when alerting is adding a severity of the incident:

Sev1 (critical): complete outage
Sev2 (critical): major functionality broken, revenue affected
Sev3 (warning): minor problem
Sev4 (warning): redundant component failure
Sev5 (info): false alarm or unactionable alert

The life cycle of an incident:

Detection: discovering that there is an issue. You’ll have to figure out what is wrong ASAP.
Response: about 1/3 of your time is spent in this phase. The incident commander (IC) is the person in charge of the incident response. A communication chief (or scribe, or whatever you call this person) documents the situation for a historical record. Don’t use this to place blame, but to analyze what happened.
Remediation: you know what went wrong and how to fix it. Always under-promise and over-deliver in your timeline. ChatOps tools might help to remediate a situation.
Analysis: host a post-incident review to learn from what happened. The goal is not to produce artifacts, instead the goal is to maximize organizational learning. Prevent a similar incident in the future. Record counter measures and assign a date and responsible person.
Readiness: prepare for the future.

Incidents are not linear and neither is the response.

The last two phases are often forgotten while they are the most important. Since these will prevent the same failure in the future from happening.

Additional resources:

Code
Rasmussen and practical drift
Azure Logic Apps Documentation (concerning building automated workflows)
Overview of alerts in Microsoft Azure
Build a chat bot with the Azure Bot Service

Open tabs — March 2019

2021-11-09T20:09:33Z

Last May I published the list of tabs I had open on my phone at that moment in time. This another one of those posts.

I have switched phones but there were about 50 tabs open in the browser on my old phone. This article is a selection of the pages I want to (re)read in the future.

Go

I want to learn Go. This is a list of resources that seem useful:

Containerization of Golang applications: An example by Jonas-Taha El Sesiy (including a Dockerfile and Makefile) of how you can deploy a Go application in a Docker image.
Effective Go: I’ve heard positive things about this document.
Go for Python Programmers: Well… I’m a Python programmer and I want to learn Go. Nuff said. ;-)
Go by Example: Again practical examples to teach the concepts in Go. The examples are, as far as I’ve seen, relatively short. This is positive because it highlights the concept at hand, but on the other hand it can be useful to have more complex examples to get a better understanding of how concepts work together to create a larger application.
Let’s Go! Learn to Build Professional Web Applications With Golang: This book promises to teach Go while building an application. I find having practical examples helps me when learning a new programming language.
Programming Talks: A (community maintained) list of talks on programming language specifics.
The Go Programming Language Specification: It’s always good to have the specs nearby.

Observability and monitoring

Observability and Monitoring Best Practices: Terminology and a set of best practices by Mark McDonnell.
Lessons from Building Observability Tools at Netflix: Lessons learned from big companies usually make an interesting read. Even though things might not apply to your situation.
Monitoring for Distributed and Microservices Deployments: An article about the challenges of distributed architectures and adjustments needed to be able to respond to incidents in this changed environment.
Observability+: A blog with a number of interesting articles about observability.
Stack Overflow: How We Do Monitoring - 2018 Edition: This article (and Nick Craver’s previous posts by the way) makes for fascinating reading.
Why Your Server Monitoring (Still) Sucks: Five reasons why your current monitoring sucks and what to do about it.
10 monitoring talks that every developer should watch: Enough material in here to learn from.

Cloud services

How many AWS accounts do I need?
How should I organize my AWS accounts?: Two articles published on the #NoDrama DevOps blog about organizing your AWS account(s).
Serverless: From Azure to AWS: Comparing Azure and AWS serverless offerings.

Advanced multi-stage build patterns: Tõnis Tiigi has a number of examples of multi-stage Docker image build patterns.
How To Use Git Hooks To Automate Development and Deployment Tasks: A list of available Git hooks and a demonstration on how to use hooks to automate tasks.
Open sourcing Terratest: a swiss army knife for testing infrastructure code: I am using tools like Terraform and Packer quite a bit these days, but testing the code is still something I’m doing manually. Perhaps Terratest is a nice solution?
SRE University: A list of resources related to Site Reliability Engineering.
What Does a Site Reliability Engineer Do?: Activities an Site Reliability Engineer is participating in, according to Erik Dietrich.

Blog improvements

Once I get around to working on this site again, there are a few things I want to have a look at.

Bye, Bye Disqus - Say Hello to Isso: I am thinking about replacing the Disqus comments with something else. Isso is a contestant.
On Switching from HEX & RGB to HSL: Sara Soueidan convinced me in this article that expressing colors in HSL (Hue, Saturation and Lightness) can be an intuitive way to choose colors. I’ll want to keep this in mind if I overhaul the colors used on this site.
security.txt: A (proposed) standard to provide information on how to report security issues.
The Font Loading Checklist: I like having pretty (IMHO) fonts for this site. This resource might help me improve the use of them.
The headers we don’t want: Time to review the response headers this site is returning.

Miscellaneous

CyberChef: CyberChef is a simple, intuitive web app for carrying out all manner of “cyber” operations within a web browser.
Engineering Management: The Pendulum Or The Ladder: Another great article by Charity Majors. This time she writes about becoming an engineering manager and career options.
Hashicorp at Home
Hashicorp at Home - Part 2: Matt Wallace improved his home network with Vault (managing secret), Consul (managing DNS and service discovery), Nomad (managing containers), Traefik (dynamic reverse proxy/load balancing) and DataDog (monitoring).
How To Use SSHFS to Mount Remote File Systems Over SSH | DigitalOcean: Basically what the title says: instructions on how to mount a file system of a remote machine, using SSH.
Time Management for System Administrators: Something I could get better at: time management. Now to make the time to actually read this book…

Open Source Summit Europe 2018: day three

2021-10-26T18:57:58Z

My notes taken on the third and final day of the 2018 edition of the Open Source Summit Europe.

These are just notes. They are not proper summaries of the talks.

The dates for the Open Source Summit next year:

August 21–23 in San Diego, California
October 28–30 in Lyon, France

DevOps Meets Docs: Documentation as Code — Robert Kratky (Red Hat)

Documentation as code means writing, testing, publishing and maintaining documentation using the same tools developers use for software code.

Advantages of using plain text for documentation, instead of e.g. Word, are that plain text is more accessible and it is easier to do validation. You can also use the same version control tools you use for your code.

You could use issue trackers to automatically generate e.g. release notes.

You can build and test each commit/pull request (continuous integration). You can test all kinds of things: are components indeed available, are links valid, etc.

Generating documentation is one, publishing it is another thing. You could automatically deploy your documentation (continuous deployment).

To make the barrier to entry for contributing lower, you could use a containerized toolchain. This way people can contribute without having to worry about installing all required tools on their own system.

Multiple parties benefit from documentation as code:

Developers
- Familiarity with the toolchain
- Encourages quick contributions to the docs
Writers
- Tight integration with the development team
- Continuous collaboration
- Better integration with other teams, like QA
- Integrating writers with developers increases sense of responsibility; documentation is no longer a second class citizen
Organization/project
- No vendor lock-in (format, tools)
- Tighter integration of teams, departments

There are challenges though. Both writers and developers may resist. Writers have to retrain and it may be a steep learning curve. Developers may feel that doing documentation properly might slow down the development process. Since the documentation is more visible now, developers might become conscious of their language and grammar skills.

There are also technical challenges. Converting to another format might not be trivial and even lead to a disruption in the release cycle. There may also be resource and staffing issues.

(Slides)

What are My Microservices Doing? — Juraci Paixão Kröhling (Red Hat)

A microservice is observable if we can analyze its metrics, logs and tracing information. But you need more than metrics or logs of individual containers. You want the aggregation of all of them to see what is going on and detect patterns.

Distributed tracing tells a story of a request though our services (which services were touched, when, etc). It is especially useful when doing root cause analysis or you want to optimize performance (where am I spending most of my time).

But how does it work? A “span” is a data structure to store units of work in (what, when did it start, etc) plus metadata. Tracing includes references to other spans so you can build graphs of those spans.

Instrumentation can be both implicit and explicit. Explicit instrumentation is done in your code itself. Implicit instrumentation is done via the frameworks you use (e.g. in Sprint Boot in a Java project).

The OpenTracing project wants to document terminology so we are all talking about the same things. After documenting, the next step is offering an API (which the project does for 9 languages at the moment). The result can be used in compatible projects like Zipkin and Jaeger.

Jaeger is a client side tracer that collects data. It sends it to a backend component for storage. A frontend component can then display the results. Jaeger runs on bare metal, OpenShift and Kubernetes.

Data storage for Jaeger is typically on Elasticsearch or Cassandra.

Have a look at OpenTracing API contributions GitHub organization.

(Slides)

An Introduction to Building Clouds with Apache CloudStack — Dag Sonstebo (ShapeBlue)

Apache CloudStack is a scalable, multi-tenant, open-source, purpose-built, cloud orchestration platform for delivering turnkey Infrastructure-as-a-Service clouds.

Virtualization is never going to go away.

Features:

Broad hypervisor support
Scalable architecture
High availability (for VMs and hosts)
Multiple interfaces (web, CLI, API)

Active community: ~200 project committers, last month 25 merged PRs from 16 authors, plenty of mailing list activity. Once a year there is a CloudStack Collaboration Conference.

You can use CloudStack for private cloud, public cloud or a combination of both in a hybrid cloud.

What can CloudStack give you?

User friendly self service of all resources
Automation of all provisioning through API
Compute
Storage
Networking
Templates

CloudStack can be seen as competition to OpenStack. However, CloudStack is vendor neutral. As a result it is a bit of the invisible man in the cloud infrastructure space.

Why CloudStack?

Integrated end-to-end IaaS product
Proven at scale
Rapid time to value
Low implementations & operational costs
True multi-tennant
Focused, user led, development community
Narrow scope / easy integration

Hashicorp Terraform: Deep Dive with no Fear — Victor Turbinskii (Texuna)

Terraform is a resource orchestration management tool, not a configuration management tool. So you cannot compare Terraform with e.g. Ansible, Puppet or Chef.

It is not a cloud agnostic tool. It does however provide a single configuration language.

About a year ago, the Terraform Module Registry was introduced.

Chef and SaltStack are supported by Terraform; Ansible not (yet).

Tips:

Use “terraform plan -destroy” to safely review the destructive actions.
Use backends like S3 to backup your state files (use versioning and use encryption via KMS)
You can either use “data” resources to access resources created by other teams, or use their remote state.
Use the “TF_LOG” and “TF_LOG_PATH” variables for logging.
Use Delve (a Go debugger) to learn how Terraform core works. (Well integrated wth VSCode.)
For anything else than simple scenarios, isolate state files and use workspaces.
Use constraints to ensure that newer versions will not automatically be downloaded.
Use “terraform state rm type.resource.name” to keep resources but remove them from your state.

(Slides)

Posts tagged with “observability” on Mark van Lent’s weblog

All Day DevOps 2022

Keynote: The Rise of the Supply Chain Attack — Sean Wright

Taking Control Over Cloud Costs — Amir Shaked

Beyond Monitoring: The Rise of Observability Platform — Sameer Paradkar

Failure Is Not an Option. It’s a Fact — Ixchel Ruiz

Accurate Metrics — Christina Zeller and Marcus Crestani

Why Is It Always DNS, TLS, and Bad Configs? — Philipp Krenn

Comprehending Terraform Infrastructure as Code: How to Evolve Fast & Safe — Anton Babenko

Keynote: Journey to Auto-DevSecOps at Nasdaq — Benjamin Wolf

Varieties of Incident Response — Kurt Andersen

The E Stands for Enablement - Modern SRE at PagerDuty — Paula Thrasher

All Day DevOps

#adidoescode Where DevOps Meets The Sporting Goods Industry — Fernando Cornago

Multi Cloud “day-to-day” DevOps Power Tools — Ronen Freeman

How to Run Smarter in Production: Getting Started with Site Reliability Engineering — Jennifer Petoff

Everybody Gets A Staging Environment! — Yaron Idan

DevOps to the Next Level with Serverless ChatOps — Jan de Vries

Building Modular Infrastructure in Code — Fergal Dearle

Crossing The River By Feeling The Stones — Simon Wardley

Deploying Microservices to AWS Fargate — Ariane Gadd

Automate Everyday Tasks with Functions — Sean O’Dell

Beyond dev & ops — Patrick Debois

The Open Source Observability Toolkit — Mickey Boxell

Logging

Metrics

Tracing

Service meshes

Podcasts I listen to — 2019 edition

Devopsdays Amsterdam 2019: day one

Observability for emerging infra: what got you here won’t get you there — Charity Majors

How Convenience Is Killing Open Standards — Bernd Erk

Come listen to me, I’m a fraud! — Joep Piscaer

Using Design Methods to Establish Healthy DevOps Practices — Aras Bilgen

Serverless is more FinDev than DevOps — Yan Cui

Return on investment

Ignites

The IJ & the shape of DevOps — Jason Yee

What you see is what you get for AWS infrastructure — Anton Babenko

YAML Considered Harmful — Philipp Krenn

Microsoft Ignite | The Tour: Amsterdam — day two

Modernizing your infrastructure: moving to Infrastructure as Code (IaC) — Emily Freeman

Monitoring your infrastructure and applications in production — Jason Hand

Troubleshooting failure in the cloud — Jason Hand

Scaling for growth and resiliency — Jeramiah Dooley

Responding to and learning from failure — Emily Freeman

Open tabs — March 2019

Go

Observability and monitoring

Cloud services

Other devops related articles

Blog improvements

Miscellaneous

Open Source Summit Europe 2018: day three

DevOps Meets Docs: Documentation as Code — Robert Kratky (Red Hat)

What are My Microservices Doing? — Juraci Paixão Kröhling (Red Hat)

An Introduction to Building Clouds with Apache CloudStack — Dag Sonstebo (ShapeBlue)

Hashicorp Terraform: Deep Dive with no Fear — Victor Turbinskii (Texuna)