Eran Levy
8 min read · Jun 1, 2020


credit: cloudelicious.net

The Cloud Native Engineer: The engineer's evolution at a glance

It's very interesting to see the transition that engineers have gone through in the last few years.

As engineers, we used to write code that interacted with a well-defined set of other applications. Even though those applications were complicated, we could still understand the limited space they operated in. The stack was limited (for illustration purposes; it obviously depends on scale): you had one type of database, one type of cache server, one type of message queue, and the application was most probably written in a single programming language. You usually had a set of services running in a well-defined environment maintained by the same teams. Engineers usually didn't have much choice or impact on tech stack decisions: one team managed the database cluster (schema, tables, upgrades, replication, etc.), another team shipped the product to the servers, and another team monitored the application along with its infrastructure. This is how it worked. Teams were built mostly around those flows: you had the application team, the infra team, the operations team, the QA team and so on. Naturally, the number of teams was small, but each team could have a large number of members.

It's nothing new that during the design of any piece of software you have to make many decisions. Here is just a glimpse of what goes on in an engineer's mind while thinking about a service's internal and external boundaries:

  • Service boundary: if we pick caching as an example for this simulation, do I need a cache at all? Is it just a local cache or a distributed cache? Which cache am I going to use? What is the size of the data I'm going to store there? Do I need to set a TTL on the cache entries? When am I going to cache? (A minimal local-cache sketch follows this list.)
  • Service interfaces: through which channels will other services be able to interact with mine? Are they going to interact through sync channels, async channels or both? What is the contract I'm exposing to them? In case of a data change, are there any other services that should know about it?
  • Other aspects we have to deal with, such as the technology stack: which cache solutions do we already use in our stack? If we use one, does it fit my requirements? If anything is missing, what is it and how can it be achieved with another solution? Can it be deployed in our cloud? Is there any benefit to using a managed service instead? Do I need to perform a POC to understand its limitations and capabilities?
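
To make the cache bullet a bit more concrete (purely as an illustration, not a recommendation), here is a minimal Go sketch of the simplest possible answer to those questions: a local, in-process cache with a per-entry TTL. The type and method names are invented for the example.

```go
package main

import (
	"sync"
	"time"
)

// entry holds a cached value together with its expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// TTLCache is a tiny in-process cache with a per-entry TTL. This is the
// "local cache" branch of the decision tree; a distributed cache (Redis,
// Memcached, etc.) would be a very different choice with different trade-offs.
type TTLCache struct {
	mu    sync.RWMutex
	items map[string]entry
	ttl   time.Duration
}

func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{items: make(map[string]entry), ttl: ttl}
}

// Set stores a value and stamps it with an expiry time.
func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

// Get returns a value only if it exists and has not expired yet.
// (Expired entries are never evicted here; a real cache has to decide that too.)
func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false
	}
	return e.value, true
}
```

Even this toy version forces you to answer a few of the questions above: what the TTL should be, how big the map is allowed to grow, and what happens on a cache miss.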

We could continue this simulation, but as you quickly realize, engineers have to make many decisions that will affect the outcome of the service they are developing. This is just a very small part of what an engineer has to think about while designing their next service, and obviously it's what is expected of us as engineers to consider. The simulation so far should not be new to you; it may have evolved over the years along with the technology and workflows, but it is a natural part of our engineering role.

The evolution of cloud native technologies and the need to move fast led organizations to redesign their structure. As the adoption of "microservices" grew across many companies, so did the need for autonomous, self-organized teams that have all the necessary skills in-house. This autonomy led engineering teams not just to write their applications but also to deploy, maintain and support them.

As a result of this evolution, engineers these days are closer to the product and the customer needs. There is still a long way to go, and companies are still struggling to get engineers close enough to their customers to understand in depth, while developing anything: what problem they are solving, what their influence on the customer is, and what their impact on the product is. There is a transition in the engineering mindset: we ship products, not just code!

Engineers are now required to write services that are just one of many other services that together solve a certain customer problem. It gets much more complicated, since you are not just writing code that somebody else ships to production; as mentioned earlier, you write it, you maintain it, you support it, and your service is just one part of a larger piece of software. Yes, your services are smaller than they used to be, but they don't live in a vacuum, and you have to understand the problem space your service lives in. Ben Sigelman calls these "deep systems" in his recent posts and talks, and an image is better than words here; this one explains it all:

https://lightstep.com/deep-systems/

As part of the transition to being more cloud native, distributed and built on top of Kubernetes, engineers face more and more challenges that they didn't have to deal with before. Just one example: when you are on-call during an incident, you have to identify the root cause as quickly as possible, or at least recover fast, and that usually requires a different set of expertise in understanding the problem space.

These days engineers aren't just writing code and building packages; they are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, containerize their service and ship it to different environments. It isn't enough to know this at a high level. You should keep updating your knowledge and understand cloud native technologies at the same level at which you know how to develop your Go (or whatever language you write) service. Sometimes it just looks easy, and engineers don't pay much attention to the details or don't set aside some percentage of their time to learn (or maybe their companies don't understand its importance). Hence, they just ship their services to production without really understanding the impact on the ecosystem their services live in. Knowing the environment and the tools is crucial to succeed in delivering value. More than that, how are you going to debug your service, or any flow, if you don't know how your Deployment resources behave in certain situations, or can't understand why your exposed Ingress is not accessible?
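
One concrete example of why this matters: the liveness and readiness probes you declare in your Deployment YAML only do something useful if your service actually backs them. A rough Go sketch, assuming (hypothetically) that the probes point at /healthz and /readyz on port 8080:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to 1 once startup work (config, connections, warm-up) is done.
var ready int32

func main() {
	// Liveness: the process is up and can serve HTTP at all.
	// If this starts failing, Kubernetes restarts the container.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: report OK only once dependencies are reachable, so the
	// Service doesn't route traffic to a pod that can't handle it yet.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&ready) == 1 {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		// ... startup work goes here ...
		atomic.StoreInt32(&ready, 1)
	}()

	http.ListenAndServe(":8080", nil)
}
```

If you don't know how these probes interact with your Deployment's rollout settings, debugging an "Ingress not accessible" incident quickly becomes guesswork.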

When I was at QCon London 2020 (unbelievable that it was not that long ago, and today we are in a totally different world with the many challenges that COVID-19 brings), one of the slides that caught my eye was from Bernd Ruecker's talk (I really love the way he explains the complexities around distributed systems):

Bernd Ruecker — QCon London 2020

Your services will fail; you should know how to handle that safely and make sure your architecture and design have been adapted to it. You should know which questions to ask and which flags to raise. If you know it's going to happen, you have to make sure you have the know-how to take care of the end-to-end development and maintenance of your service.

We as engineers are now building distributed systems, which on the one hand have many advantages that I'm not going to list here, but on the other hand bring many complexities, as Sidney Dekker wrote in his book and as Crystal Hirschorn quoted:

Crystal Hirschorn — QCon London 2020

Modern engineers have to know the frameworks and tools available to them in very much the same way they understand their code. If those tools aren't already in the organization's toolbox, engineers should push to adopt them as quickly as possible in order to be efficient and move fast. As an engineer you should know the Kubernetes command line and its tooling: how to inspect your Deployment's environment variables and secrets, how to port-forward to reach your service or other servers. You also need to know the rest of the ecosystem that should be part of your toolbox: how to query the metrics your services expose (in Prometheus or anything else), how to debug locally, and how to inspect your Kubernetes resources (Pod, Deployment, Service, Ingress and the rest) to understand whether your configuration changes have propagated successfully. I could continue, but I'm pretty sure you get the idea. Since we are dealing with cloud native systems that are mostly distributed, many things can fail; you should know how to monitor, trace and ask the right questions in order to debug efficiently.
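
As a small illustration of the "metrics your services expose" part, here is a minimal sketch using the Prometheus Go client to count requests and expose a /metrics endpoint. The metric name and paths are invented for the example:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts handled HTTP requests, labeled by path.
// The metric name is hypothetical; pick one that fits your conventions.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_http_requests_total",
		Help: "Total HTTP requests handled, by path.",
	},
	[]string{"path"},
)

func main() {
	http.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues("/orders").Inc()
		w.Write([]byte("ok"))
	})

	// The endpoint Prometheus scrapes; this is what you query later
	// when you are on-call and need to understand what the service is doing.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```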

Besides the cloud native technology knowledge listed above, which you have to keep up with, make sure the services you develop are as simple, instrumented and resilient as possible. Network connections tend to fail and things can go wrong; you should know how to recover from those failures and not throw the hot ball back to upstream services or to your customer's UI. Obviously, as cloud frameworks mature and gain adoption (whether it's Istio or anything else), developers can focus mostly on the business logic, but you still shouldn't think it's someone else's problem. You need to know what takes care of your service-to-service authentication, how your pods are rescheduled in case of a failure in one of your availability zones, who handles your HTTP retry logic, and any other responsibility you delegate to your orchestration frameworks. It reminds me of a funny screenshot that Bernd captured from the easyJet website one of the times he tried to check in:

Bernd Ruecker — QCon London 2020

One of their applications failed, but you can't relax: they throw the ball back to you. Do whatever steps are needed to recover, just don't call us ;) It doesn't smell good from a design perspective, and obviously not from a user experience perspective either.
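
To avoid throwing that ball back to the user, the calling service can absorb transient failures itself: a timeout, a small bounded number of retries with backoff, and one clear error (or a fallback) when the downstream is really gone. A rough Go sketch; the pricing service and its URL are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// client enforces an overall timeout so a slow downstream
// cannot stall our own callers indefinitely.
var client = &http.Client{Timeout: 2 * time.Second}

// fetchPrice calls a hypothetical downstream pricing service with a bounded
// number of retries and exponential backoff. If it still fails, the caller
// gets one clear error instead of the raw, repeated failure.
func fetchPrice(ctx context.Context, url string) (string, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return "", err
		}
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode == http.StatusOK {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil {
				return string(body), nil
			}
			lastErr = readErr
		} else if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %d from pricing service", resp.StatusCode)
		}
		time.Sleep(backoff)
		backoff *= 2 // back off before the next attempt
	}
	return "", fmt.Errorf("pricing service unavailable after retries: %w", lastErr)
}
```

Whether this kind of logic lives in your own code or is delegated to a service mesh like Istio is exactly the sort of thing you need to know rather than assume.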

Finally, as an engineer you have lots of power in your hands, but it always comes with a cost: "with great power there must also come great responsibility". Know that your decisions have an impact.

Happy coding!
