Anjul Sahu


DevOps Roadmap - 2022

Posted at — Feb 15, 2022

Over the last few weeks, I met several people in my mentoring sessions, some new to DevOps and others in the middle of their careers, who wanted to know what to learn in 2022. DevOps skills are in high demand, and constant learning is required to keep yourself in sync with the market.

This post shares notes that I hope will help you. The guidance below is based on my own experience and understanding.

Roadmap

Be strong in networking fundamentals

Understand concepts such as HTTP/2, QUIC and HTTP/3, Layer 4 and Layer 7 protocols, mTLS, proxies, DNS, BGP, how load balancing works, iptables, how the Internet works, IP addressing schemes, and network design. Julia Evans’s blog is my go-to place when I need something explained simply; she covers a wide variety of topics in her blog posts and zines.
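
To make "how load balancing works" concrete, here is a minimal round-robin sketch in Python. It assumes a static list of backends whose health is set by some external checker; the class and method names are hypothetical, and real proxies add weights, active health checks, and connection draining on top of this.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class RoundRobinBalancer:
    """Cycle through backends in order, skipping any marked unhealthy."""
    backends: list[str]
    unhealthy: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # itertools.cycle repeats the backend list forever.
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        self.unhealthy.add(backend)

    def mark_up(self, backend: str) -> None:
        self.unhealthy.discard(backend)

    def next_backend(self) -> str:
        # Look at no more than one full rotation before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate not in self.unhealthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
    lb.mark_down("10.0.0.2:8080")
    print([lb.next_backend() for _ in range(4)])
    # ['10.0.0.1:8080', '10.0.0.3:8080', '10.0.0.1:8080', '10.0.0.3:8080']
```

Round robin is the simplest policy; least-connections or weighted variants change only how `next_backend` picks a candidate.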

Master operating system fundamentals, particularly Linux

As most systems (VMs, containers, etc.) run Linux, it is important to know it from top to bottom. Learn process scheduling, the init system and the systemd interface, cgroups and namespaces, and performance tuning, and master command-line utilities such as awk, sed, jq, yq, curl, ssh, and openssl. Learn performance troubleshooting from Brendan Gregg’s blog.
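
As a small example of how cgroups surface in practice, the sketch below reads a container's CPU limit from the cgroup v2 `cpu.max` interface file, which holds `$MAX $PERIOD` in microseconds, with `max` meaning no limit. It assumes a cgroup v2 (unified hierarchy) mount at `/sys/fs/cgroup`; the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

DEFAULT_PERIOD_US = 100_000  # kernel default CFS period: 100 ms


def parse_cpu_max(content: str) -> Optional[float]:
    """Turn the contents of a cgroup v2 cpu.max file into a CPU count.

    A quota of "max" means no CPU limit, which is reported as None.
    """
    fields = content.split()
    quota = fields[0]
    # Reading the file normally yields both fields; fall back to the default period.
    period = int(fields[1]) if len(fields) > 1 else DEFAULT_PERIOD_US
    if quota == "max":
        return None
    return int(quota) / period


def cpu_limit(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[float]:
    """Read the CPU limit of the current cgroup (assumes cgroup v2)."""
    try:
        return parse_cpu_max(Path(path).read_text())
    except FileNotFoundError:
        # e.g. the root cgroup, or a cgroup v1 host.
        return None
```

Inside a container started with `docker run --cpus=1.5` on a cgroup v2 host, the file typically reads `150000 100000`, which this returns as 1.5.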

CI/CD

If you are still using Jenkins, that is fine, but the world has moved to cloud-native pipelines. Conceptually not much has changed in this space, but you can look into GitHub Actions, Tekton, etc. To learn how to do releases better, understand deployment strategies such as blue-green and canary.
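
Blue-green keeps two full environments and switches all traffic at once, while a canary release first sends a small share of traffic to the new version. One common way to split canary traffic is to hash a stable key such as the user ID, so each user consistently sees the same version. The Python sketch below assumes a user ID is available for every request; the function names are hypothetical, and in practice this routing usually lives in a load balancer, ingress, or service mesh rather than in application code.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float) -> bool:
    """Return True if this user should be routed to the canary release.

    Hashing gives each user a stable bucket, so the same user keeps hitting
    the same version while the rollout percentage is unchanged.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket 0..9999 (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100


def pick_version(user_id: str, canary_percent: float) -> str:
    return "canary" if in_canary(user_id, canary_percent) else "stable"
```

Because a user's bucket never changes, raising the percentage from 10 to 50 only adds users to the canary; nobody who was already on it flips back to stable.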

Containerisation and Virtualisation

Apart from the popular Docker runtime, try containerd, Podman, etc. Learn how to containerise applications, how to implement container security, and how to run and orchestrate VMs in Kubernetes (see the KubeVirt project).

Container Orchestration

Kubernetes is now the de facto standard for running containers, and there is plenty of content on the Internet for learning it. Focus on configuration best practices, application design, security, and scheduling. Setting up a cluster is becoming trivial, but people will expect you to handle day-2 operations: setting up monitoring, logging, and CI/CD, scaling the cluster, cost optimisation, and security.
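
As an example of checking configuration best practices, here is a small Python sketch that flags containers without CPU and memory requests or a memory limit. The rule itself is an assumption (many teams adopt something similar, but tune it to your own policy). It works on the `spec` section of a Pod held as a plain dict, such as one produced by parsing a manifest with a YAML library, and only inspects `containers`, not init containers.

```python
from typing import Any


def missing_resources(pod_spec: dict[str, Any]) -> list[str]:
    """List containers lacking CPU/memory requests or a memory limit."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        # Requests drive scheduling decisions, so require both.
        for key in ("cpu", "memory"):
            if key not in requests:
                problems.append(f"{name}: missing requests.{key}")
        # A memory limit bounds how much a container can consume before being OOM-killed.
        if "memory" not in limits:
            problems.append(f"{name}: missing limits.memory")
    return problems
```

A check like this can run in CI against rendered manifests, which is cheaper than discovering unbounded pods during an incident.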

Observability at Scale

Most engineers are aware of the Prometheus and Grafana stack or something similar. The trend suggests that many organisations are consolidating their Kubernetes clusters and observability, which helps from both a performance and a cost perspective. Learn advanced Prometheus configuration and architectures, and how to scale them. Look into technologies like Thanos, Cortex, VictoriaMetrics, Datadog, and Loki, continuous profiling tools such as Parca and Periscope, and distributed tracing with Hypertrace and OpenTelemetry. Service meshes such as Istio are a popular ingredient in the cloud-native recipe.
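
Before scaling Prometheus, it helps to understand how counters are queried. PromQL functions such as `rate()` and `increase()` treat any drop in a counter as a reset (for example, a process restart). The Python sketch below shows only that reset-handling idea on evenly spaced samples; it deliberately omits the extrapolation to the edges of the range window that Prometheus performs, so its numbers will not match PromQL exactly. The function names are hypothetical.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase of a monotonic counter across samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Reset detected: the counter restarted from zero, so count the new value.
            total += cur
    return total


def per_second_rate(samples: list[float], interval_seconds: float) -> float:
    """Average per-second rate, assuming samples are evenly spaced."""
    if len(samples) < 2:
        return 0.0
    return counter_increase(samples) / (interval_seconds * (len(samples) - 1))
```

Naively subtracting the first sample from the last would report a negative or far too small value whenever a pod restarts, which is why counters should be queried through `rate()` or `increase()`.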

Platform team as a Product team

The platform team is increasingly functioning like a centralised product team, focused on its internal customers such as developers and testers. The goal is to improve ways of working and bring some order to the teams. Try to address the problems the development and QA teams face. You are an enabler for other teams: instead of taking all the work into a central team, coach development teams to take on typical DevOps responsibilities. That way you can scale without burning yourself out.

DevOps functions

Security

In many small organisations, security was a second-class citizen, and product features were given higher priority. But due to increasingly sophisticated attacks and strict compliance requirements, companies are adopting a shift-left security strategy. End-to-end encryption, strong RBAC, IAM policies, governance and auditing, and implementation of standards and benchmarks such as NIST, CIS, and ISO 27001 are common. Container security, policy as code, cloud governance, and supply chain security are hot topics.
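
Policy as code means expressing rules as versioned, testable code that is evaluated automatically, for example in CI or at admission time. Teams typically use an engine such as Open Policy Agent or Kyverno for this; the plain-Python sketch below only illustrates the idea against the `spec` of a Kubernetes Pod held as a dict. The allowed registries and the three rules (allowed registry, digest-pinned image, no privileged containers) are hypothetical examples, not a recommended baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    container: str
    reason: str


# Hypothetical policy: only images from these registries are allowed.
ALLOWED_REGISTRIES = ("registry.example.com/", "ghcr.io/example-org/")


def check_images(pod_spec: dict) -> list[Violation]:
    violations = []
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        image = c.get("image", "")
        if not image.startswith(ALLOWED_REGISTRIES):
            violations.append(Violation(name, f"registry not allowed: {image}"))
        # Tags are mutable; pinning by digest supports supply chain integrity.
        if "@sha256:" not in image:
            violations.append(Violation(name, "image not pinned by digest"))
        if c.get("securityContext", {}).get("privileged", False):
            violations.append(Violation(name, "privileged containers are denied"))
    return violations
```

Because the rules are ordinary code, they can be reviewed in pull requests and unit-tested like any other component.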

Programming

The DevOps or SRE role now takes on developers’ cross-cutting concerns and builds tooling that improves their productivity while enforcing standards. Good software engineering practices and skills are required to craft high-quality platform components.

I can’t stress this enough. Good organisations look for solid programming experience in platform engineers. It is important in site reliability engineering as well, where you need to be fluent in programming and able to read, understand, debug, and if necessary fix code written by others.

Python and Golang are the most popular choices. I suggest Golang for its strong concurrency support, strict type checking, toolchain, and adoption across organisations; since many major projects are built with Golang, it makes sense to learn it over Python.

A few simple things you can try:

  • Write a CLI in your chosen language.
  • Learn to write a REST API that interacts with a database.
  • Practise parallelism and concurrency.
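
Combining the first and third suggestions, here is a small Python sketch of a CLI that hashes files concurrently with a thread pool and prints results in the same `digest  path` layout that `sha256sum` uses. The program name `hashfiles` and its flags are hypothetical.

```python
import argparse
import hashlib
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def sha256_file(path: str) -> str:
    """Hash one file in 64 KiB chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(prog="hashfiles", description="Hash files concurrently.")
    parser.add_argument("files", nargs="+", help="files to hash")
    parser.add_argument("-j", "--jobs", type=int, default=4, help="number of worker threads")
    args = parser.parse_args(argv)

    exit_code = 0
    with ThreadPoolExecutor(max_workers=args.jobs) as pool:
        futures = [pool.submit(sha256_file, path) for path in args.files]
        # Print in input order even though the hashing runs in parallel.
        for path, future in zip(args.files, futures):
            try:
                print(f"{future.result()}  {path}")
            except OSError as exc:
                print(f"hashfiles: {path}: {exc}", file=sys.stderr)
                exit_code = 1
    return exit_code
```

To run it as a command, call `sys.exit(main())` under an `if __name__ == "__main__":` guard (or register `main` as a console-script entry point), then invoke it as, e.g., `hashfiles -j 8 a.tar.gz b.tar.gz`. Threads suit this I/O-heavy workload; for CPU-bound work in Python, a `ProcessPoolExecutor` is usually the better choice.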

Infrastructure as Code

Terraform is the de facto standard in most projects. Once you understand the concepts, it is easy to adapt to other tools, as most of them are based on a DSL.
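
The core idea behind declarative IaC tools is comparing the state you declare with the state that exists and deriving the actions needed to reconcile them. The toy Python model below illustrates only that idea; it is not Terraform's algorithm (which also refreshes real resources through providers, tracks a state file, and distinguishes in-place updates from replacements), and the resource addresses are hypothetical examples.

```python
from typing import Dict, List, Tuple

# Resource address -> its configured attributes.
Resources = Dict[str, Dict[str, str]]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """Return (action, address) pairs that would move actual towards desired."""
    actions = []
    for addr in sorted(desired.keys() | actual.keys()):
        if addr not in actual:
            actions.append(("create", addr))
        elif addr not in desired:
            actions.append(("delete", addr))
        elif desired[addr] != actual[addr]:
            actions.append(("update", addr))
    return actions
```

Running the same plan twice against an unchanged environment yields no actions, which is the idempotence that makes declarative tooling safe to re-apply.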

Cloud

Most clouds work in a similar way, so if you know one cloud well, you can easily work with other providers. Focus on how to design applications using cloud-native components in a highly available, resilient, secure, and cost-effective way.
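
A quick way to reason about high availability is the standard composite-availability arithmetic: components that are all required multiply, while independent redundant replicas are down only when every replica is down. The sketch below assumes failures are independent, which real zones only approximate, so treat the redundancy figures as optimistic. All numbers in the example are hypothetical.

```python
import math
from typing import Iterable


def serial(availabilities: Iterable[float]) -> float:
    """Every component is required, so availabilities multiply."""
    return float(math.prod(availabilities))


def redundant(availability: float, replicas: int) -> float:
    """Any one of N independent replicas suffices: 1 - P(all replicas down)."""
    return 1 - (1 - availability) ** replicas


# Hypothetical figures: one replica per zone at 99.5%, a load balancer at 99.99%.
app_tier = redundant(0.995, replicas=2)
overall = serial([0.9999, app_tier])
print(f"app tier: {app_tier:.4%}, overall: {overall:.4%}")
```

With these assumed inputs, two zones lift the app tier to about 99.9975%, and the load balancer in front caps the whole path at roughly 99.9875%: the serial dependency now dominates, which is where further design effort should go.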

Technical Writing

You might be wondering why I am talking about technical writing in a DevOps roadmap. A lot of folks don’t pay enough attention to it, but how you communicate and work with other teams is very important. The future of work is remote, and email, Slack/Teams, and chat are the primary channels for conveying ideas to others.

On a regular basis, you might be creating documents such as runbooks, postmortems, RFCs, architectural decision records, and software design docs, to name a few. A clear, easy-to-understand document does wonders: it saves both your time and the reader’s, and improves overall productivity. I suggest reading this article.

Site Reliability Engineering

The boundary between DevOps and SRE is getting thinner, and in some organisations the same person performs both roles. Understand the concepts behind SLIs, SLOs, and error budgets, and SRE practices in general. Each organisation does it differently, so I don’t suggest copy-pasting someone else’s culture into your team. Use Google’s SRE culture as a reference.
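
The arithmetic behind SLIs, SLOs, and error budgets is simple enough to sketch. For a time-based SLO, the budget is the share of the window you may be unavailable; for a request-based SLO, it is the number of failed requests you can tolerate. The Python below assumes a 30-day window by default, treats a period with no traffic as fully good, and uses hypothetical function names.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; no traffic counts as fully good here."""
    return good_events / total_events if total_events else 1.0


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by a time-based SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget left; negative means overspent."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("need total_requests > 0 and an SLO below 100%")
    return 1 - failed_requests / allowed_failures
```

For example, a 99.9% SLO over 30 days allows about 43.2 minutes of downtime; with 1,000,000 requests it tolerates 1,000 failures, so 250 failures leave 75% of the budget.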

Conclusion

Personally, I am excited about the following this year. This is not a definitive list, as it keeps changing over time.

  • Service mesh - Istio, Cilium’s sidecarless mesh, and the mesh offerings from Tetrate and Solo (Gloo Mesh).
  • How to improve developer productivity? It is a mix of culture, automation, and tools.
  • SRE platforms - Honeycomb, Last9.
  • DevPortals - again linked to improving productivity and bridging knowledge gaps.
  • Observability - technologies such as OpenTelemetry, Hypertrace, Thanos, VictoriaMetrics, and Vector.
  • Security - supply chain security, code signing, and tightening cloud security.
  • Golang - improving my current skills.
  • Serverless computing and event-driven architectures.
  • Web3 - understanding the landscape as it relates to DevOps and infrastructure.

Be curious and keep learning. Continuous, bite-sized learning is easy to fit in alongside your full-time job. If you still have any questions, feel free to book some time with me. I am more than happy to help.