(As with all my recent conference notes: these are just notes, not complete summaries.)
Adidas is a big company, about 57,000 employees. Software is also having an impact on sports, e.g. registering your performance, etc. Is Adidas already a software company? Definitely not, but they should start behaving as if they are. They are building their applications 10,000 times a day.
New book from Gene Kim (the author of The Phoenix Project): The Unicorn Project. Expected end of November 2019.
The platform you need to be able to accelerate your business should be:
To improve sales, Adidas' platform can run where their customers are (globally) and can scale up if needed and scale down again when possible.
Making use of data is key. The data is seen as a baton in a relay race and makes its way through the teams. A large part of the time consumed in creating a service is spent looking at the data. The data is used to train their AI systems.
You want to deliver value and make sure that things are working. Testing needs to be done on the physical level (the actual products that are being sold), but also on the virtual level: the IT platform also needs to perform and work as expected.
At Adidas they went from having a couple of “DevOps engineers” to having a DevOps culture in all teams. This removed bottlenecks and allowed them to grow faster. They have substantially reduced costs and manual testing and improved drastically on time to production.
About the relation between SRE (Site Reliability Engineering) and DevOps: SRE is an implementation of DevOps.
What defines a great SRE:
learn fast so you may promptly begin again
Relatively simple to get started with; GKE is the most mature.
There’s a demo app: https://github.com/Darillium/addo-demo
Software engineering is focused on designing and building, not on running the application.
There are multiple places in organizations where friction occurs:
SRE is a bridge between business, development and operations.
Four key SRE principles:
Some terminology:
100% is the wrong reliability target for basically everything
SRE is about balancing between reliability, engineering time, delivery speed and costs.
Note that you can have an error budget policy without even hiring a single SRE. You can already start with an SLO.
SLOs and error budgets are a necessary first step if you want to improve the situation.
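To make the error-budget idea concrete, here is a minimal sketch (with illustrative numbers, not from the talk) of how an SLO translates into a budget of allowed downtime:

```python
# Sketch: turning an availability SLO into an error budget.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

Once the budget is spent, the error budget policy kicks in (e.g. pausing feature work in favor of reliability work).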
When talking about making tomorrow better than today, you need to address toil (work that needs to be done that is manual, repetitive, automatable and does not add value). If your SRE team is spending time on toil, they are basically an ops team. Instead you want them to make the system more robust and improve the operability of the system.
One way to be able to have your SRE team regulate their workload is to put them in a different part of the organization.
If you want to start with SRE, start with SLOs and let your SRE team own the most critical system first. Don’t make them responsible for all systems at once. Let them deal with the most important systems first. Once they have that under control (removed toil, etc), they can take on more work.
You need leadership buy-in for every aspect of SRE work.
Aim for reliability and consistency upfront. SRE teams should consult on architectural discussions to get to resilient systems.
SRE teams can benefit from automation:
SRE teams embrace failure:
Resources (books):
How can you give an isolated and secure staging environment to each developer? If you follow the path Yaron’s company (Soluto) took, prepare for a rocky transition: getting used to new terminology when moving to Kubernetes, and users that can get upset at first.
Their first step: automate the entire process. They realized they needed onboarding documentation and templates. They used Helm for the templates.
Deploy feature branches to the production environment. When developers open a pull request, the CI/CD tool adds a link to where this PR is deployed. This empowers the developers and provides a sense of security. This accelerated the way the developers built features and the adoption of the platform by the company.
The benefits:
The tools used for their implementation:
You can swap the aforementioned tools with other tools, but you must be able to automate them and make sure that they work with the push of a single button.
GitHub repository for the demo: https://github.com/yaron-idan/staging-environments-example
What DevOps is about:
It’s not about:
One option is to send emails if something is wrong. But these emails are slow (knowing that there was an issue yesterday is too late) and it generates a lot of noise. Monitoring dashboards are nice, but you have to actively look at them to detect if there’s something wrong.
A better solution is using chat applications (like Microsoft Teams or Slack).
Your first step is to emit events when something fails (in Azure you can send such events to Event Grid). Next, you have to subscribe to these events/topics. You can use an Azure Function for this and then have it post an actionable message to Teams.
These messages can have buttons to take immediate action (e.g. post to a web API like an Azure Function) to recover from the problem. The output from the recovery action can also send events, so you can get another notification in your chat application about the result.
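As a sketch of what such an actionable message could look like: the legacy Office 365 “MessageCard” format supports buttons via `potentialAction`. The field values and the recovery URL below are made up.

```python
import json

def build_alert_card(alert: str, recover_url: str) -> dict:
    """Sketch of a legacy 'MessageCard' payload with one action button.

    recover_url is a hypothetical web API (e.g. an Azure Function)
    that performs the recovery action when the button is pressed.
    """
    return {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "summary": alert,
        "sections": [{"activityTitle": alert}],
        "potentialAction": [{
            "@type": "HttpPOST",
            "name": "Recover",
            "target": recover_url,
        }],
    }

card = build_alert_card("Deployment failed", "https://example.com/api/recover")
print(json.dumps(card)[:40])
```

Posting the resulting JSON to the Teams incoming-webhook URL is then the job of the Azure Function subscribed to the Event Grid topic.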
GitHub repository with demo code: https://github.com/Jandev/ServerlessDevOps
Why use infrastructure as code (IaC)?
The DevOps handbook (part III, chapter 9) talks about spinning up a production-like environment on demand. The only way to do this is to use IaC.
First thing is understanding your infrastructure (VPC, networking, DNS, services, database). Look at patterns like requestor/broker. Use these patterns to create modular stacks.
Patterns:
Stack templates & stack instances:
Stay clear of monoliths
Monoliths are stacks with everything in them. This is one end of the spectrum. The other end is to have one service per stack. You’ll probably want to end up somewhere in between, where you have a component stack (e.g. with a requestor/broker combination). And then you can combine those stacks.
Static stack inputs: environment names, app names. You should externalize this config from your stack, e.g. by feeding these parameters via the command line or use the SSM Param Store and Secrets Manager.
(Note to self: Fergal mentioned Sceptre. Check what this is about.)
Common problems:
Best practices:
In The Art of War Sun Tzu discusses five fundamental factors which you should consider:
John Boyd developed the OODA loop, which cycles through the following:
You can combine these cycles:
By moving we learn, as long as we can observe the environment.
Maps are ways to:
Graphs are not maps. Identical graphs can spatially be different. On a map, space has meaning, in a graph not. And because space has meaning, we can explore, e.g. by asking ourselves the question “what if we go in that direction?”
What makes a map?
In business there are a lot of maps, e.g. a systems diagram. But almost everything in business we call a map actually is a graph. Which means we don’t have a sense of the landscape.
How do you create a map for a business? Start with the anchors, e.g. business and customers. Work from there to get all related components and express them in a chain of needs.
But how do you position those components? We can use “visibility” as an analogy of distance. You can use this to create a “partial ordered chain of needs.”
And what about movement? You can describe movement as a chain of changes. But how do you do that? You can order components based on their evolutionary stage. Are they in the genesis stage, do we use custom built components, do we have standard products or can we consider it a commodity?
By combining these we can create a map of components by using the position (visibility) on the y-axis and the movement (evolution) on the x-axis.
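A toy sketch of that coordinate system (component names and positions are invented, purely to illustrate the two axes):

```python
# Sketch: components positioned by evolution (x-axis) and visibility (y-axis).
# Names and positions are made-up examples, not from the talk.
EVOLUTION = ["genesis", "custom built", "product", "commodity"]

components = {
    "customer": {"evolution": "product",      "visibility": 1.0},
    "web shop": {"evolution": "custom built", "visibility": 0.8},
    "platform": {"evolution": "product",      "visibility": 0.4},
    "compute":  {"evolution": "commodity",    "visibility": 0.1},
}

def position(name):
    """Map a component onto (evolution, visibility) coordinates in [0, 1]."""
    c = components[name]
    x = EVOLUTION.index(c["evolution"]) / (len(EVOLUTION) - 1)
    return (x, c["visibility"])

# "compute" sits far right (commodity) and low (invisible to the user).
print(position("compute"))  # (1.0, 0.1)
```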
You can add intent (evolutionary flow) to your map, e.g. move a component from custom built to using a commodity.
By using such a map, you can de-personalize discussions since now you talk about the map, not a story.
Simon told a story of a company that wanted to invest in robots to do manual labour. After putting their business in a map, the problem became clear: they were using custom built racks and as a result needed to customize their servers because they would not fit their non-standard rack. Because the market had changed since the company started, they could switch to commodity (cloud) and as a result also did not need robots.
But keep this in mind:
All maps are imperfect representations of a space
At KPMG they moved a project from using Amazon EC2 to AWS Fargate. They also replaced CloudFormation with Terraform for standardization.
Why managed container services and not EKS?
They used ECS deploy to deploy containers to Fargate.
Why was the customer okay with this change?
Fargate runs in your own VPC so you own the network, etc.
The microservices for this project are placed behind an application load balancer (ALB) using AWS Web Application Firewall (WAF) and TLS termination. Logs are sent to AWS CloudWatch Logs. They used Anchore for end-to-end container security and compliance.
Benefits of AWS for KPMG:
Common use cases for serverless applications are things like web applications (such as static websites), data processing, Amazon Alexa (skills) and chatbots.
Not every workload should run in a serverless fashion
Organizational and application context are relevant. Serverless is just another option. You pick what is right for your use case as serverless is not a silver bullet.
AWS Lambda in a nutshell: there is an event source (e.g. a state change in data or a resource), which triggers a function, which in turn interacts with a service.
GitHub repository with examples: https://github.com/vmwarecloudadvocacy/cloudhealth-lambda-functions
(Note to self: check Amplify)
The DevOps pipeline usually starts at the backlog. But what happens before? And how does that influence people outside the engineering department?
Patrick joined a startup five years ago. He fixed the DevOps pipeline. After a while the organization wanted to sell a product.
The book The Machine by Justin Roff-Marsh describes “the agile version of sales.”
Sales had more in common with IT than Patrick expected. The sales pipeline looked like a Kanban board. Sales people are also on call, especially if you do international sales. If we can deliver faster, this increases the chance of a sale. Sales persons are actually mind readers; this reminded Patrick how IT people feel about tickets in the backlog: “what is meant by this ticket, what do the customers actually want?”
Thinking about the life of the project after it has been delivered, in terms of maintenance and operational costs, is important. Doing this upfront is beneficial for the project.
The biggest impact sales had on IT was the never ending hammer of “can you make things simpler?” Sales can help you with making your design less complex. Together you can determine what actually adds value for the customer.
Having publicly available documentation really helps and makes it easier to sell the product. Getting the customers on Slack also helped. This showed that the company was willing to listen and help. This also helps sales.
After sales, the marketing department got focus. Patrick has seen the most interesting automation within the marketing department (e.g. email sending). Marketing has a lot of fancy metrics and tools to measure stuff; did they invent metrics? The IT department can help marketing by providing information. Conferences are also a marketing instrument. Marketing also has CALMS; they do very similar things.
We’ve got sales and marketing covered, but for a lot of things we still need budgets. To budget things, you also need to estimate things. Finance knows what it is to be on the receiving end of tickets where you have to guess what is going on.
Sometimes you need to buy things. Patrick looked into agile procurement. Lots of stuff has to be figured out: is it a fit? is there a way to be partners?
Serverless is a manifestation of servicefull, where you use a lot of services, so you can focus on your domain instead of having to build everything yourself.
With cloud services there doesn’t have to be communication initially: you read the website, check the documentation, pull your credit card and start with a proof of concept. Communication is still important. You have to treat your suppliers with respect.
HR: how does your company support hiring people? Have people join the team for some actual work to see if they are a good fit or not.
Do you know what pair programming is? Well, mob programming is a whole other level of collaborating. (Note that it’s more than just having one person typing and a bunch of other people commenting—it has a lot of patterns going on.) Teams can get very enthusiastic about this and can be very productive the first week. But if they are not careful, the team is exhausted the next week. You need to develop a sustainable pace.
Patrick learned a lot from talking to the different departments. They are solving similar problems as IT is.
Reading tip: Company-wide Agility with Beyond Budgeting, Open Space & Sociocracy by Jutta Eckstein and John Buck.
Some characteristics of a modern application:
This brings new challenges compared to the old situation with a monolithic application running on a single machine. You still want to address threats to customer satisfaction. But the old debugging solutions we used for a monolith will not work.
Observability means designing and operating a more visible system. This includes systems that can explain themselves without you having to deploy new code. The tools help you understand the difference between a healthy and unhealthy system.
Modern, distributed systems will experience failures.
Observability takes a holistic approach that recognizes the impact an issue has on the business as a whole. Outages affect the whole business, which should be able to react. Observability gives tools to address issues when they arise.
External outputs:
These outputs don’t make your system more observable, but they can be part of the holistic approach. They can help you with e.g. a root cause analysis.
Key SRE concepts:
Users are comfortable with a certain level of imperfection. A site that is a bit slow sometimes is less annoying than a shopping cart that sometimes loses its contents. Use this to determine what you’ll spend your time on first.
Logging is the most fundamental pillar of monitoring. Logging records events that took place at a given time. Supported by most libraries because it is such a fundamental thing.
You only get logs if you actually put them in your code. You also have to make sure you don’t lose them, for instance by aggregating them.
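A minimal sketch of structured logging: emitting each event as a JSON line makes it easy for an aggregator to parse and keep them. The field names below are illustrative.

```python
import json
import logging
import sys

# Sketch: emit log events as JSON lines so an aggregator can parse them.
logger = logging.getLogger("shop")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

def log_event(event, **context):
    """Log one structured event with arbitrary context fields."""
    line = json.dumps({"event": event, **context})
    logger.info(line)
    return line

line = log_event("cart.checkout", request_id="abc-123", items=3)
```

The context fields (request IDs and the like) are what make these logs useful later, when you need to correlate events across components.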
Tools:
Metrics are a numeric aggregation of data describing the behavior of a component measured over time. They are useful to understand typical system behavior.
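As a small illustration of a metric as a numeric aggregation: computing a p95 latency (nearest-rank method) from raw request durations. The sample values are made up.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are at or below it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Made-up request durations for one time window, in milliseconds.
latencies_ms = [12, 15, 11, 200, 13, 14, 16, 12, 18, 950]
print(percentile(latencies_ms, 95))  # 950: one slow request dominates the tail
```

This also shows why averages hide problems: the mean of these samples looks modest while the tail is terrible.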
Tools:
Alerts are notifications to indicate that a human needs to take action.
Tracing is capturing a set of causally related events. Helpful for debugging. Example: each request has a global ID and if you insert this ID as metadata at each step, you can trace what happens.
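A toy sketch of that idea: every causally related event records the same trace ID, so you can later filter all events belonging to one request. The service names are invented.

```python
import uuid

# Sketch: propagate one request-wide trace ID through each step as metadata.
def handle_request(trace_id=None):
    """Simulate a request flowing through several services."""
    trace_id = trace_id or str(uuid.uuid4())
    events = []

    def record(service, action):
        # Each causally related event carries the same trace ID.
        events.append({"trace_id": trace_id, "service": service, "action": action})

    record("frontend", "receive")
    record("cart", "lookup")
    record("payments", "charge")
    return events

events = handle_request("req-42")
```

Feeding events like these into a tool such as Jaeger or Zipkin is what lets you reconstruct the request's path afterwards.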
Tools: Jaeger and Zipkin can visualize and inspect traces.
The challenge is that tracing is hard to retrofit into an existing application. You need to instrument all components. Service meshes can make it easier to retrofit tracing.
A service mesh is a configurable infrastructure layer for microservice applications. It can monitor and control the traffic in your environment. It can be a great approach for observability. You will get less information from a service mesh compared to adding tracing to all of your components, but on the other hand it takes less effort to implement than instrumenting your existing code base.
Tools:
Using service meshes won’t eliminate the need to instrument your code, but it can make your life simpler.
Called Traefik Mesh these days.
[Slides]
Charity has been a systems engineer since she was 17 and she’s done her fair share of being on call. As a result her approach to software contains stuff like:
She co-founded Honeycomb.io a few years ago. Her company wants to help understanding and debugging complex systems.
Vendors often talk about the “three pillars of observability”, meaning metrics, logs (which is where data goes to die) and tracing. Which happen to be the categories of products they are trying to sell you.
The architecture that is being used today is much more complex than the humble LAMP stack we used years ago. Because of this increased complexity you go from “known unknowns” to “unknown unknowns.” Testing is also harder since you cannot simply “spin up a copy” if you want to test something. As a result, monitoring isn’t enough—you need observability since you cannot predict what you need to monitor. Plus: you should fix each problem. So every page should be something new; it should be an engineering problem, not a support problem.
Due to the infinitely long list of almost-impossible failures, your staging environment becomes pretty useless. Even if you capture all traffic from production and replay it in your staging environment, the problem you saw in production does not happen.
Good engineers test in production. Bad engineers also do this, but are just not aware of it. You cannot prevent failure. You can provide guardrails though.
Operational literacy is not a nice-to-have
You’ll need observability before you can think about chaos engineering. Otherwise you only have chaos. If it takes more than a few moments to figure out where a spike in a graph is coming from, you are not yet ready for chaos engineering.
Monitoring is dealing with the known unknowns. Great for the things you already know about. This means it is good for 80% of the old systems, but only 10% of the current systems.
Your system is never entirely “up”
The game is not to make something where nothing breaks. The game is to make something where as many things as possible can break without people noticing it. As long as you are within your SLOs, you don’t have to be in firefighting mode.
Not every system has to be working. As long as the users get the experience they expect, they are happy.
Engineers should know how to debug in production.
Everyone should know what they are doing. If an engineer has a bad feeling about deploying something, investigate that. We should trust the intuition of our engineers. However, the longer you run in staging, the longer you are training your intuition on wrong data. You should be knee-deep in production. That is a way better way to spend your time. Watch your system with real data, real users, etc.
Don’t accept a pull request if you don’t know how you would know if it breaks. And when you deploy something, watch it in production. See what happens and how it behaves. Not just when it breaks, but also when it’s working as intended. This catches 80% of the errors even before your users see them.
Dashboards are answers. You should not flip through them to find the one matching the current scenario. It’s the wrong way of thinking about problems. Instead you need an interactive system to be able to debug your system. Invest time in observability, not monitoring.
By putting software engineers (SWEs) on call, you can make the system better. The SWEs are the ones that can fix things so they don’t have to be woken up in the middle of the night again. If you do it right, and have systems that behave themselves, being on call doesn’t have to interfere with your life or sleep. This does require organizations to allow time to fix things and make things better instead of only adding new stuff.
Teams are usually willing to take responsibility, but there is an “observability gap.” Giving the SWEs the ops tools with things like uptime percentage and free memory is not enough. Most SWEs cannot interpret the dashboards to answer the questions they have. In other words, they don’t have the right tools to investigate the problems they encounter. You need to give them the right tooling to become fluent in production and to debug issues in their own systems.
High cardinality is no longer a nice-to-have. Those high cardinality things (e.g. request ID) are the things that are most identifying. You’ll want this kind of data when you are investigating a problem.
The characteristics of tooling that will get you where you need to go:
By having this you get drastically fewer pager alerts.
Great talk by Beau Lyddon: What is Happening: Attempting to Understand Our Systems
You can use logs to implement observability as long as they are structured and have a lot of context (request IDs, shopping cart IDs, language runtime specifics, etc).
Senior software engineers should have proven to be able to own their code in production. They should notice and surface things that are not noticeable. Praise people for what they are doing well and make sure that those are also the things people get promoted for.
Thanks to the POSIX standards, Solaris admins could still do things on HP-UX systems. Then the GNU toolset came along. And Linux. Communities. Diversity.
As an industry we have a talent to forget everything we learned before. We jump from one new thing to another, but we keep making the same mistakes over and over.
We have more open source software than ever before. But we also have a dramatic change in the infrastructure. Public cloud becomes more popular. But is this a problem? Isn’t this just evolution?
Open source does not mean open standards however. An API is also not a standard, even if it is open. Standards are defined by organizations like IETF and W3C.
We have an unbalanced market. AWS and Azure are much bigger than the others. They all have their own APIs, but there is no open standard! This is why Terraform is so popular: it solved the missing standard problem for you. But you still need to be aware that an EC2 instance is different from an Azure machine though.
Open source has an impact on open standards.
Amazon is offering services that are based on open source software (e.g. Amazon Elasticsearch Service and Amazon ElastiCache). As a result customers don’t talk to Elastic, but to Amazon. Google does this better and (at least for some products) forwards issues to the developers.
Some companies that create open source software have now come up with new licenses to prevent e.g. Amazon offering their product as a service. This creates uncertainty for companies that want to use the technology.
Open source and open standards need support (money). As an industry we need to make sure that they have chance to have a place in the market.
Customers should ask their vendors to contribute back to the creators of the software.
For a lot of companies infrastructure is overhead. Using cloud services poses a risk, but a risk you perhaps can accept given the convenience those cloud services bring along. Lock-in is a risk, as is slowness.
Bernd himself would also use the cloud for a startup to get off the ground quickly. After one to two years you’ll probably have to redo your setup. Perhaps this is a good time to think about how to move forward? What do you want to be flexible about?
Economics in an unbalanced market are not good. See the pharmaceutical market where there are a few big players and prices of medicines sometimes explode. Being able to work with different providers and be interoperable is important. We need variety.
Unfortunately the cloud providers don’t have an incentive to work towards open standards and to make it easier for customers to hop around from one provider to another.
Diversity is up to us. Not just on the work floor, but also in the technology we use.
This talk was about impostor syndrome. 70% of people experience it.
Joep’s “definition” of it:
Your internal measuring stick is broken.
Everybody is unsure, but nobody talks about it. In IT this problem becomes worse. We talk about “best practices,” but there are so many ways to do things, how do you know what is the “right” way? Perhaps the way you are doing things is perfectly fine.
Joep talked about the Dunning-Kruger effect. But in IT the technology landscape changes at a fast pace. The goalpost is moving constantly.
What can you do about it? How do you recalibrate your internal measuring stick?
To discover why you have impostor syndrome, you have to identify your limiting beliefs.
A number of TED talks you should see:
Companies that want to do DevOps are usually enthusiastic about tools and processes. Culture is a harder subject… Even if you show data that demonstrates the importance of the cultural aspect of DevOps, there’s not much understanding.
Most people think about design as making things beautiful. But it also makes things work better. It makes things more visible and matches the conceptual design with the mental model.
Design can reframe problems.
Uber is not just about getting a ride, but it is rethinking the way getting from A to B works. Airbnb is not only about getting a nice room, but it is rethinking the hospitality industry overall. Starbucks sells coffee, but also turns community spaces into places where you can hang out and have conversations over coffee.
Good designers:
Instead of talking to customers about their tools, processes and culture, you could start with interviews and collaborative process-mapping workshops where you look at the entire flow from idea to when it explodes in production.
Don’t just offer the out-of-context “best practices” from the DORA State of DevOps report, but have challenge mapping workshops. This leads to solutions that are applicable and often don’t even need to be imposed but are willingly implemented.
Do diary studies: what do people actually do and what makes working hard for them?
Two takeaways:
(This was a fast-paced talk and I did not have time to make many notes.)
Serverless means that:
Function-as-a-service: upload code to the cloud which gets run when called and your function then handles the events.
Serverless is interesting because:
With regards to the velocity: with serverless you can remove a bunch of steps from your process, like figuring out deployment, configuring an AMI, ELB or autoscaling.
Companies are looking for a return on their investment.
We tend to optimize what we can easily measure: operational stuff. But most of the budget is going to engineers, managers, tools, etc.
Serverless might cost just as much as e.g. running an EC2 instance, and perhaps it costs even more, but you can get so much more done.
With a server you have the same costs no matter whether you have only a single transaction or 100. This means that the transaction cost (costs divided by the number of transactions) is variable. To make matters worse: there are often multiple services running on a single server. This makes it even harder to estimate the transaction cost or the cost of a feature.
The pay-per-use characteristic of serverless is useful in this context. It makes it much clearer what the cost per transaction is.
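A back-of-the-envelope sketch of that difference (all numbers are illustrative):

```python
# Sketch: fixed server costs vs. pay-per-use, per transaction.
def server_cost_per_txn(monthly_server_cost, transactions):
    """Fixed cost: the fewer transactions, the more each one 'costs'."""
    return monthly_server_cost / transactions

def serverless_cost_per_txn(price_per_invocation):
    """Pay-per-use: the cost per transaction is constant by construction."""
    return price_per_invocation

print(server_cost_per_txn(100.0, 1))    # 100.0 per transaction
print(server_cost_per_txn(100.0, 100))  # 1.0 per transaction
print(serverless_cost_per_txn(2e-07))   # constant, regardless of volume
```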
Ignites are short talks. Each speaker provides twenty slides which automatically advance every fifteen seconds.
The shape of the IJ has changed through the years. Has the shape of DevOps changed over the years?
Well, now there is CLAMS (or CALMS), which stands for:
If you would plot each of these on an axis, you can plot an organization on it.
Now imagine the circumference is an elastic band: if you pull it on one axis, you’ll decrease the value on the other axes.
Work around the circle. If DevOps means automation for you, this helps being lean, which shapes your culture, which helps sharing, etc.
Evaluate your organization: what is it good at, what are you working on? It’s a natural process, a dynamic movement. Work on all axes.
Cloud architects and DevOps engineers want to have a faster conversion from idea to working project.
With Cloudcraft you can visualize your AWS infrastructure. With modules.tf you can export that diagram to Terraform code. And with Terraform you can manage your infrastructure. And we have come full circle.
Why do we use YAML in the first place? We want to have a human readable format (so XML is out), and we want to be able to place comments in our code (which means JSON is out of the picture).
There are a few issues with YAML though. To name a few:
Suppose you have a port mapping in your `docker-compose.yml`:

```yaml
ports:
  - 80:80
  - 22:22
```
Looks fine? Except it is not fine, because YAML will parse colon-separated numbers as base 60 (sexagesimal) integers if the digit groups after the colon are smaller than 60. So the second line is not parsed as the string “22:22” but as the integer 1342 (22 × 60 + 22), which is something Docker (understandably) cannot handle.
Another example is this:

```yaml
countries:
  - DK
  - NO
  - SE
  - FI
```

However, “NO” is a falsy value, so the `countries` list will contain three strings and a boolean.
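Both gotchas come from YAML 1.1’s implicit typing rules. Below is a heavily simplified, hypothetical resolver (not a real YAML parser) that mimics this behaviour for unquoted scalars:

```python
import re

# Simplified version of YAML 1.1's sexagesimal integer pattern:
# groups after the first must be valid base-60 digits (00-59).
SEXAGESIMAL = re.compile(r"^[-+]?[0-9][0-9_]*(:[0-5]?[0-9])+$")
BOOL_TRUE = {"y", "yes", "true", "on"}
BOOL_FALSE = {"n", "no", "false", "off"}

def resolve(scalar):
    """Resolve an unquoted scalar roughly the way YAML 1.1 does."""
    lowered = scalar.lower()
    if lowered in BOOL_TRUE:
        return True
    if lowered in BOOL_FALSE:
        return False
    if SEXAGESIMAL.match(scalar):
        value = 0
        for group in scalar.split(":"):
            value = value * 60 + int(group)
        return value
    return scalar

print(resolve("22:22"))  # 1342, not the string "22:22"
print(resolve("80:80"))  # stays a string: 80 is not a valid base-60 digit
print(resolve("NO"))     # False
```

Quoting the scalars (`"22:22"`, `"NO"`) sidesteps implicit typing entirely, which is why quoting everything is a common recommendation.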
If you want a DSL, please use a DSL, not YAML.
Alternatives:
When you want to write YAML:
I selected two Terraform workshops and one about AWS Lambda this year.
Mikhail provided a Git repository to use during the workshop: https://github.com/mikhailadvani/terraform-workshop
To dive right in, have a look at this snippet, which writes a file to the current directory (in this case the `part-1` directory of the workshop repo):

```hcl
resource "local_file" "user" {
  content  = "..."
  filename = "${path.module}/${var.filename}"
}
```
Terraform does not accept relative paths. But using absolute paths means you’ll probably end up with a path containing a home directory of a specific user, which is annoying for team members that have a different username.
To prevent this, you can use the `${path.module}` variable just like in the example above. You can also use Docker to have a standard environment to run Terraform in. Using Docker can also be a practical approach when you want to use Terraform in your CI.
You also cannot have one variable refer to another (at least, not in Terraform 0.11). You’ll have to use a local value, for example:

```hcl
locals {
  filename = "${var.name}.txt"
}
```
Mikhail demonstrated the use of:

- `terraform plan -var-file=<filename>`: this allows you to provide values for your variables (docs).
- `terraform plan -out=<filename>`: you can save your plan. If you feed the `terraform apply` command this file, it will not ask for confirmation, since Terraform assumes the plan has already been reviewed (docs).

If your (remote) state has changed between generating the plan and running `apply` with that plan, Terraform will detect that and fail.
`depends_on`: ensures that resource A is created before B.
If you work in a team, you want use something like S3 for the state file and DynamoDB for locks. And you probably want to use Terraform to create the S3 bucket and the DynamoDB table. But to be able to create infrastructure you need to have a state file. Catch-22. To solve this, you’ll have to split this up in two phases:
1. Run `init`, `plan` and `apply`. This will create the resources, but keep the Terraform state locally.
2. Configure the remote backend and run `terraform init` again. Terraform now picks up that the state should be stored remotely and offers to move the current state.

Note that Terraform will create resources in parallel.
How do you manage secrets? Mikhail is using a `secret.tfvars` file and git-crypt. The benefit of using vars files over environment variables is that you can put the former in version control. The official recommendation is to use Vault (also made by HashiCorp). You can possibly also use e.g. an AWS service to store your secrets.
If you rename resources or move them into modules, Terraform detects that a resource is no longer in your code and also picks up the new resource. It will try to delete the old one and create the new resource. To prevent this, you’ll have to tell Terraform the resource has moved. Use `terraform state mv` (docs) to fix this.
An example of a module using Terratest to test the Terraform code: https://github.com/mikhailadvani/terraform-s3-backend
Tips from the audience: use “terraform validate”.
There are different types of modules:
A composition is a collection of infrastructure modules. An infrastructure module consists of resource modules, which implement resources.
You can use infrastructure modules e.g. to enforce tags and company standards. You can also use things like preprocessors, jsonnet and cookiecutter in them.
Terraform modules frequently cannot be re-used because they are written for a very specific situation and sometimes have hard coded assumptions in them.
Don’t put provider information in your module. For instance: although you can specify provider version requirements in a module, please do not do this:
# Don't do this!
terraform {
  required_providers {
    aws = ">= 2.7.0"
  }
}
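A sketch of the alternative (the region, version and module path are illustrative): pin the provider in the root configuration that calls the module, so the users of the module stay in control:

```hcl
# In the root module (composition), not in the reusable module:
provider "aws" {
  region  = "eu-west-1"
  version = ">= 2.7.0"
}

module "network" {
  source = "./modules/network"   # hypothetical module path
}
```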
Avoid provisioners in modules. Use public cloud capabilities instead, like user_data. (Mark: for an example, see https://github.com/sjparkinson/terraform-ansible-example/tree/master/terraform)
Traits of good modules:
More info: Using Terraform continuously — Common traits in modules
There are two opposite ways of structuring your code.
Terragrunt reduces the amount of code needed to do similar things. It has extra features like execution of hooks and a number of additional functions. Terragrunt is opinionated.
Terraform workspaces are the worst feature of Terraform ever. (The provisioner is the second worst.) Workspaces allow you to execute the same set of Terraform configs but with slightly different properties. (For example: if this is the production environment, spin up 5 instances, else 1.) Workspaces are not infrastructure-as-code friendly. You cannot answer, from the code:
Better: use reusable modules instead of workspaces.
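A rough sketch of that advice (module paths and counts are made up): every environment gets its own small root configuration that reuses the same module, so the differences between environments are visible in the code:

```hcl
# environments/production/main.tf
module "web" {
  source         = "../../modules/web"
  instance_count = 5
}

# environments/staging/main.tf
module "web" {
  source         = "../../modules/web"
  instance_count = 1
}
```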
Will it help us?
It’s the biggest rewrite of Terraform since its creation. There are backward incompatible changes though.
Main changes:
loops (for_each)
conditional expressions (... ? ... : ...) that work as you expect
depends_on everywhere
(The HashiCorp blog has a number of articles about Terraform 0.12 if you want to know more.)
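A minimal sketch of what these 0.12 changes look like in practice (the resource and variable names are invented for illustration):

```hcl
variable "bucket_names" {
  type = list(string)
}

variable "environment" {
  type = string
}

# Loops: one resource instance per element of the set.
resource "aws_s3_bucket" "example" {
  for_each = toset(var.bucket_names)
  bucket   = each.value
}

# Conditionals that only evaluate the branch that is taken.
locals {
  instance_type = var.environment == "prod" ? "m5.large" : "t3.micro"
}
```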
Terraform developers write and support Terraform modules, enforce company standards, etc. Terraform users (everyone) use modules by specifying the right values; they are the domain experts but don’t care too much about the inner workings of the Terraform modules.
Terraform 0.12 allows developers to implement more flexible/dynamic/reusable modules. For Terraform users the only benefit is HCL2’s lightweight syntax.
There is a command to check your code for 0.12 compatibility: “terraform 0.12checklist”. Once everything is fine you can use “terraform 0.12upgrade”. See the upgrade guide for more information.
Note that the 0.12 state file is not compatible with the 0.11 state file. If you have a remote (shared) state, once one member of the team upgrades to 0.12, the whole team needs to upgrade.
Anton has a workshop you can do at your own pace at https://github.com/antonbabenko/terraform-best-practices-workshop. The workshop builds real infrastructure using https://github.com/terraform-aws-modules.
During this section of his session Anton answered a bunch of questions people in the audience had. Tools that were discussed:
Related repository: https://github.com/theburningmonk/getting-started-with-serverless-development-with-lambda-devopsdays-ams
There is a (soft) limit of 1,000 concurrently running Lambdas. This can be increased by sending a request to AWS. But even if you convince Amazon to increase it to, for instance, 10,000, they only scale up at a rate of 500 per minute. So if you have a workload with sudden spikes and need to scale quickly, Lambda might not be what you need.
Note that you can set a “reserved concurrency” on a Lambda; this setting acts as a max concurrency of a function.
Use tags and have a naming convention. E.g. add a “team” tag so you know which team to contact about a Lambda. This is useful when you have many, many functions.
Separate customers also get their own instances to run their Lambdas; they are isolated from each other.
By default, functions don’t have access to resources in your VPC. You can add a function to your VPC, but currently there is a cold start penalty. This penalty can be several seconds, which is a lot if the Lambda only takes a few milliseconds to run. So you probably should not put your function in a VPC if you do not need to.
With regards to execution: you are billed in 100ms blocks. So if your function takes 60ms to run, there is—from a cost perspective at least—no benefit in speeding up the Lambda.
If you want to debug your functions locally in VS Code and you have the serverless framework installed locally, you can update your launch.json file:
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Launch Program",
      "program": "${workspaceFolder}/node_modules/.bin/sls",
      "args": ["invoke", "local", "-f", "hello", "-d", "{}"]
    }
  ]
}
How do you organize your code? Don’t have a mono repo. Use microservices or at least a service oriented architecture. Give every microservice its own repo, along with code only used by this service.
Advice: give the service name the same name as the repo to make it easier to find things.
Organize functions into repositories according to boundaries you identify within your system.
How do you share code between Lambda functions? It depends. Within the same repo you can use e.g. lib/utils.js. You can also create an npm package. Or create a new service to provide the functionality.
When using Lambda, you’ll probably end up using more external services. The functions themselves are simple, but the risk shifts to the integration with the external services.
Another risk is the security. With microservices you have more control, but also more things to keep secure. And thus also more places you can misconfigure.
Combining those two: the risk profile for a serverless application is completely different from “traditional” applications.
(On the second day I followed the sessions of the “Operating applications and infrastructure in the cloud” learning path.)
Most of the things in this talk allow you to scale your app without changing your application itself.
We want to design for security. So instead of relying on e.g. environment variables to distribute secrets, you are better off using a centralized solution, like Azure Key Vault.
Some of its features:
Use Managed identities (formerly “Managed Service Identity” or MSI) on your VM, app service or function to authenticate to Azure Key Vault.
To make your database more resilient, you can use Cosmos DB, which has features like:
Failovers can be done automatically and manually. But whatever you choose, you don’t have to touch your application. It is possible (and easy) to have a multi master setup. (It is more expensive though.)
There are several extensions for VS Code to connect to e.g. Cosmos DB from within VS Code. This means you could edit data from within VS Code. However, it might be better to hand out read-only credentials for your production database.
Azure App Service is a managed platform that makes it easy to run your application (like .NET, .NET Core, Node.js, Java or Python). It makes scaling up and scaling out on the fly easy. Even better: you can configure your application to autoscale.
Check out the Azure Extension Pack extension for VS Code. It has a lot of useful features like connecting with your Cosmos DB, as mentioned above, but you can also do things like deploying your application from within your editor.
Azure Storage is useful for static things you want stored in the cloud, e.g. blobs, tables, files and queues. Azure Static Sites are currently in preview. It allows you to serve static files via a web server. Even the right MIME types are returned.
To have a resilient architecture, you’ll have to have the service available across multiple regions. Use Azure Front Door Service (currently in preview). It can be used with Traffic Manager. In the Front Door designer there are “frontend hosts” (domains the service is hosted on), “backend pools” (the available backends) and “routing rules” which map frontend hosts to backend pools based on path patterns (e.g. /api/products/.*).
Additional resources:
Damian Brady uses the analogy of a pit stop in 1905 compared to one in 2013. Summary: much faster, more people involved.
What is “DevOps”? There are a number of different definitions. At Microsoft it is defined as:
DevOps is the union of people, process and products to enable continuous delivery of value to our end users.
Why DevOps is important:
Have a look at the State of DevOps Report. It is a yearly report full of interesting information. The report compares high-performing teams with low-performing teams. The high performers have a 46 times higher deployment frequency, fail less frequently, fix failures 2,555 times faster, etc. It’s a great report to convince your manager to change things.
Development used to toss software over the wall to the operations team. When things go wrong, people tend to blame the other team. Developers want change but operations want to have stability (in other words: not change anything)—the teams have opposite incentives.
To change this, you’ll need a process. And you’ll need good products to help you with the process. Azure DevOps can do everything you need. Fortunately Microsoft does allow you to use your current products as well.
Azure DevOps components:
This talk will demo Azure Pipelines.
Azure Pipelines is on the GitHub Marketplace. It is free, assuming you stay within the limits set.
You can configure your Pipeline in your browser and by doing so, a file called azure-pipelines.yml is created in your repository. This immediately creates a build check in GitHub.
Build configuration is similar to deployment configuration.
You can have multiple stages and have pre-deployment conditions between them. This way you can for instance make sure that person X or group Y has to approve the deployment to production.
You can have build artifacts, which you can use in your build pipeline. These artifacts are not available externally—they are part of the release pipeline. If you need more persistent artifacts, you’ll want to use Azure Artifacts.
You can configure canary deployments and do so in different ways. In this demo “deployment slots” are used. They allow you to create a duplicate deployment and direct a part of the traffic to the cloned slot (canary environment).
LaunchDarkly allows you to roll out based on many more criteria (e.g. regional, a specific customer, etc). Note that this is a paid service.
Manual approval can be replaced by a so called deployment gate. These deployment gates are automated approval processes. You can perform a number of checks (security and compliance assessment, work items, Azure Monitor Alerts or invoke a REST API or Azure Function). When they are successful, the application is deployed, otherwise not. This makes deployment faster and safer.
Additional resources:
A demonstration of Azure Application Insights.
Jason Hand is using the analogy of a car: if it breaks down (e.g. runs out of gas), it becomes useless. You’ll want to prevent this. Cars have dashboards to warn if things go out of bounds.
Application Insights has a lot of tools to help you monitor your application. Going back to the car analogy: it’s like having a mechanic with tools in the vehicle with you. It has, among other things, Smart Detection to warn about performance problems and failures. Something else it provides is Application Map where you can more easily spot performance issues.
More in depth information in his other talks:
Additional resources:
It is important to think about your data storage. The benefits of having a storage strategy:
Azure might give you a different way to look at data storage. There are a number of different storage options. Do note that storage is only a small part of the whole.
Storage and compute are abstracted away more and more these days, but everything eventually depends on the storage sitting beneath it. If your storage breaks, you will have a bad day.
Why not use a relational database to store all data? By using other options, we may make the application more granular to deploy and/or give developers more control.
Different kinds of data:
Different properties of data:
Strategies:
An example of function driven approach is building a search functionality. We could implement it as full-text search on our existing SQL database. However, this has an impact on the production database. If we would use Azure Search (search-as-a-service), we would not have to worry about the storage or paying for storage we don’t use.
Data architectures summary:
Additional resources:
On premise infrastructure requires you to answer a lot of questions, like:
Moving to infrastructure as a service still means dealing with a bunch of questions, but less lower level. You pay for the service of someone else worrying about the hardware, access, etc. You are left with questions like:
Platform as a service means even higher level questions:
With serverless, you are down to a single question: how do I architect my app? You focus on the event that happened (a request was sent to an endpoint) not on the whole infrastructure to make that happen.
What is serverless (according to Jeremy)?
Checklist to determine if something is serverless:
Functions: you pay for functions that get called and consume memory, not for servers. You don’t care how much machines are needed to handle the number of requests.
Functions have triggers and bindings. Triggers are things like blob storage, Cosmos DB, HTTP, timer and a webhook. Bindings allow you to work with resources (like files, tables, emails, notifications).
Example: a file is added to storage, a function parses and transforms the file and stores a chart graphic.
Managing events is cumbersome. Event grid can help you. It offers:
You can focus on the messages. The infrastructure ensures reliability and performance. It can, for instance, handle ten million events per second per region.
Event grid delivers the event to you; you never have to go out to grab events.
Scenarios:
Durable functions are an extension to functions for stuff that e.g. has to wait for an asynchronous event. They are paused, but state is stored. And you are only billed for the time the function is actually doing things; you don’t pay when the function is paused.
Durable function patterns:
Logic apps are the integration engine that connect things together. They allow you to design workflows and processes. For example: when a new product is added, an email is sent to the products team to add an image, etc.
Logic apps also have triggers.
Additional resources:
These are just notes. They are not proper summaries of the talks.
Abstracted tools help us to make systems manageable. We layer abstractions on top of each other.
Kafka is a publish/subscribe messaging system. It decouples the source of the messages from the system where they are needed.
Origin aggregated logging is a part of OKD. It is based on Elasticsearch, Fluentd and Kibana (EKF).
Taking a look at the journey of a log message, it first encounters a log collector. This component can tag and enrich log messages. The message then goes to a Kafka source connector, which imports data from external systems into Kafka brokers. (There are many types of connectors.) These brokers are like post boxes: messages are delivered there and other systems can pull them out. Messages are written to topics. Finally a Kafka sink connector is used to export broker data to external systems, for instance Elasticsearch.
Kafka is not a cloud native application. Strimzi Operators integrate Kubernetes, OKD and Kafka; it manages the Kafka deployment. The most important operator is the cluster operator which manages Kafka, Kafka Connect, Zookeeper and MirrorMaker.
(Slides)
A quick intro to Prometheus: Prometheus can pull metrics from targets and store them in a time series database. It will push alerts to the AlertManager, based on rules you specify.
For machine learning we need data. Prometheus is not configured to keep your data for the long term. Thanos is a project for reliable historical data storage. Until it is production ready, you can use InfluxDB. Unfortunately it eats RAM for breakfast. You could instead store your metrics using Ceph. This is also a future-proof solution as it provides a path to Thanos.
Prophet is a Python library to predict future data and dynamic thresholds.
Interesting links:
(Slides)
Packaging makes software integrated, tested, updated and easily installable. Linux distributions pick versions of packages and put them in the same lifecycle.
Containers are great, but they are not the solution for this problem. Fedora Modularity tries to provide a solution. Its core concepts are:
The true benefits of containers are that you can run them (almost) anywhere and build and test the images upfront. With apps in containers, the OS can be immutable (see CoreOS, Silverblue).
OpenFaaS, functions as a service; serverless on your terms. Used by dozens of companies.
Serverless is an architectural pattern. How do you arrange your compute and how do you use it?
Functions:
These properties allow them to auto-scale.
Core values of OpenFaaS:
(Slides)
What (kind of) tools do we use? A lot!
Don’t just use a tool and let it define your work. Define a goal, determine how to achieve the goal and pick the tools that match what you actually need so they work for you.
Make use of automation, as much as you can.
You can use a number of workflows:
(Slides)