Thursday 12 September 2019

Testing in DevOps

I've just spent the past year embedded in a "devops" team (quotation marks explained later) and I've got a few different points to make, so bear with me - this is going to be a long post. It's also a bit of a brain dump, so it might not be my best writing ever, as I want to get this down while it's relatively fresh in my head.
This post is also going to be a little technical and assume some knowledge of DevOps. If you're new to the phrase, I highly recommend Katrina Clokie's book "A Practical Guide to Testing in DevOps", found here - https://leanpub.com/testingindevops

That word "Devops"

In my experience, there are two different understandings of the word/phrase "devops". Basically it boils down to:
  • "Devops" is not a role, it's a set of practices, which makes it a bit woolly and vague but in general is about bringing two traditionally separate roles together so that a "team" can both deliver (for example) a software application and its hardware but also maintain, operate and support it in production/live/whatever you want to call it. This can be achieved by training the team in operations or by embedding ops engineers.
  • "Devops" is where developers write infrastructure-as-code, typically Ops engineers interested in programming. But sometimes also software programmers interested in getting their hands dirty with more Opsy work. In this definition, you tend to see this become a role called "Devops engineer" and they tend to write re-usable chunks of code that builds infrastructure for software teams to use. For example, creating a generic set of code that provides a MySQL cluster in AWS.

I highlight this because I've realised people aren't aware of this difference, and personally I prefer to encourage the former rather than the latter. The latter is a bit like SDETs/Developers in Test, where you create a whole new role at a distance from the end goal, with its own communication gap - writing tests that the dev team breaks because they have no idea those tests exist. I like the first definition because it's about teamwork and delivery, encouraging the team to take a more holistic view of software delivery, or rather product delivery as a whole. After all, who cares if your code is shit hot if we've put it on hardware too small to run it?

Cloud and infrastructure-as-code

Regardless of those definitions, if you're going to work with cloud-based infrastructure (as opposed to on-premise, where your company owns or rents the physical servers), you're going to be writing infrastructure-as-code. Why? Because in the cloud you are sharing hardware with other companies and people, which means you have less control over how it operates, and that changes the risk profile. The cloud is cheaper than owning the physical servers, but this comes at the cost of reliability - individual machines can disappear at any time.
Therefore you may want to automate many aspects of operating your product, such as recoverability and scalability, which is where infrastructure-as-code comes in.
There are two main areas that can be expressed in code:

  • Code for provisioning the infrastructure you need e.g. the hardware, the network routing, firewall rules and so on.
  • Code for configuring an individual server, e.g. setting up users and file permissions, installing software, and configuring software (such as setting up Java and then setting up a Java-based application to run).
Why consider these as two separate areas? In my opinion there is so much to consider and think about in each that it's worthwhile discussing them in their own right, even though you will want to develop and test them together.
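
To make the first area concrete, here's a minimal sketch of provisioning code using boto3, the AWS SDK for Python. In real projects you'd more likely use a declarative tool like Terraform or CloudFormation, and all the names, IDs and CIDR ranges here are placeholders, but the idea is the same: the firewall rule and the server are defined in repeatable, reviewable code rather than clicked together in a console.

    # A minimal provisioning sketch using boto3 - names, AMI ID and
    # CIDR ranges are placeholders, region is assumed.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # A firewall rule: only allow HTTPS in from an internal network range.
    sg = ec2.create_security_group(
        GroupName="app-servers",
        Description="Application servers",
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # internal only, not 0.0.0.0/0
        }],
    )

    # A server launched into that security group (placeholder AMI).
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        SecurityGroupIds=[sg["GroupId"]],
    )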

Testing in DevOps

So what to test? How do you test? What tools do you use? Is there anything to really test?
Hell yes there's loads to test. You've now got code that builds the foundations of your product - and not just that, it's code that defines how your product will scale and recover, and determines its reliability. Suddenly a developer can potentially make a small change and open all of your servers up to the public.
These are some ideas of what you can test in an Ops world:

  • You can manually run and try out the code as an obvious starting point. Does it build the servers correctly? Can you use the application after destroying it and building it again?
  • Destroying the servers leads to OAT (Operational Acceptance Testing) or general operability: testing what happens in disaster and failure scenarios. Will your product recover if the servers suddenly disappear and new ones are built? Do you test your backups regularly? This also neatly leads to ideas such as chaos engineering.
  • The code itself can be sort-of unit tested. For example, Ansible has a framework called Molecule which allows you to run your Ansible scripts against a Docker container and assert what state the scripts will leave a server in - the first sketch after this list gives a flavour. There are also broader integration test tools, such as Test Kitchen, with slightly more capabilities.
  • Using tools such as AWS Trusted Advisor or Well-Architected to analyse your infrastructure for common mistakes (such as firewall rules left completely open to the public - the second sketch after this list shows a hand-rolled version of that check) or under-utilised hardware that could be run more cheaply.
  • Given cloud infrastructure is inherently prone to failure, can you monitor and alert on those failures? Do you know if your servers fell over overnight? How many errors are happening in your environments? Usually cloud providers don't have access inside your servers to know what is going on, so you need to set up your own access to software logs (e.g. Java app errors) - and have you centralised those logs for easy access? (See the third sketch after this list.)
  • Tools like Sensu allow you to write custom automated checks to monitor your servers. This is very useful for more granular checks, like specific software health checks (e.g. your server never failed but the software application has crashed - can you tell from your monitoring?). I think there is a lot of value here for testers helping to design and write new simple-but-smart checks, improving how observable systems are, not just in production but in all environments! There's a sketch of one of these checks further down, after the next list.
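
Picking up the Molecule point above, here's a minimal sketch of the kind of assertions you can run against a Docker container once your Ansible scripts have converged it. This uses Testinfra, one of the verifiers Molecule supports; the service name and config file path are hypothetical.

    # Testinfra assertions about the state Ansible should leave a server in.
    # "myapp" and the config path are hypothetical examples.
    def test_app_service_is_running(host):
        service = host.service("myapp")
        assert service.is_running
        assert service.is_enabled

    def test_config_file_is_locked_down(host):
        cfg = host.file("/etc/myapp/config.yml")
        assert cfg.exists
        assert cfg.user == "myapp"
        assert cfg.mode == 0o640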
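
And here's a hand-rolled sketch of the "open to the public" analysis that Trusted Advisor automates, assuming boto3 and an AWS region: scan every security group for ingress rules that allow 0.0.0.0/0.

    # Flag any security group ingress rule open to the whole internet.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    for page in ec2.get_paginator("describe_security_groups").paginate():
        for group in page["SecurityGroups"]:
            for rule in group["IpPermissions"]:
                for ip_range in rule.get("IpRanges", []):
                    if ip_range["CidrIp"] == "0.0.0.0/0":
                        port = rule.get("FromPort", "all")
                        print(f"{group['GroupId']} ({group['GroupName']}) "
                              f"is open to the world on port {port}")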
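
On centralising logs: once they are shipped somewhere central (CloudWatch Logs, an ELK stack, whatever), even a small script can answer "how many errors happened overnight?". A minimal sketch, assuming boto3 and a hypothetical CloudWatch Logs group (pagination omitted for brevity):

    # Count recent ERROR entries in a (hypothetical) centralised log group.
    import time
    import boto3

    logs = boto3.client("logs", region_name="eu-west-1")

    response = logs.filter_log_events(
        logGroupName="/myapp/application",                # hypothetical log group
        filterPattern="ERROR",
        startTime=int((time.time() - 12 * 3600) * 1000),  # last 12 hours, in ms
    )

    for event in response["events"]:
        print(event["logStreamName"], event["message"].strip())
    print(f"{len(response['events'])} errors in the last 12 hours")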


Some other things I've worked on that were very context-specific, and not very DevOpsy or testing-focused, but could be useful ideas:

  • Creating Jenkins pipelines to test out rebuilds of infrastructure-as-code on a daily basis. This covered a specific risk we had: our code was very tightly coupled and we were breaking a lot of projects with our changes. It was an interim solution until we de-coupled our architecture, but it was useful to be able to do it.
  • Off the back of the Jenkins pipelines for rebuilding infrastructure-as-code, I also created jobs that would manage downtime periods, saving a good two-thirds of our monthly costs. In general we only needed our test environments up 8 hours a day, 5 days a week, not 24/7. I used Jenkins so I could manage dependencies and run custom checks, but this could also be achieved with the right auto-scaling policies in AWS (there's a sketch of the idea in the first example after this list).
  • Fixing and writing my own Sensu checks. This was useful, although I would be careful: it's easy to write a check that produces false positives or negatives. It's very hard to think of all the scenarios a check could encounter, so avoid writing scenario-based checks where possible. It's not helpful to have monitoring checks that have bugs themselves and are difficult to debug when they fail - keep them simple (see the second sketch after this list).
  • Hooking together the Dev teams' Selenium tests. This was because my context was an Ops team changing infrastructure for Dev teams, and I wanted a way to test our changes before we potentially broke the Dev teams' dev environments. This isn't recommended if you can avoid it, as obviously it's not very DevOps. But in general, finding a way for infrastructure-as-code to eventually be end-to-end tested in an automated way is useful, because it's hard to really test things like firewall and network routing configuration or file permissions until you actually try to perform certain actions from the server. The hard part is knowing when to run the tests: you need to know when your infrastructure-as-code has finished, when the servers and various components have actually completed their setup, and when the app is running. I achieved this with a Jenkins pipeline which polled a Sensu check that looked at a health endpoint from the app; when this went green I knew to proceed with the tests, and it would time out if it took longer than usual (the third sketch after this list shows the polling step).
  • Writing simple scripts for monitoring or analysing our AWS account. In our context we needed to tag the hardware we were using for our own internal billing purposes, so we could appropriately budget for certain projects. As this relied on humans remembering to include the tagging in their infrastructure-as-code, it was useful to regularly audit the account for servers that were missing tags and therefore wouldn't be billed appropriately (the last sketch below). This also made it easier to investigate under-utilised servers and talk to the owners about saving costs.
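
Here's a minimal sketch of that downtime idea, assuming boto3, an AWS region and a hypothetical "environment=test" tagging convention. Run it from a scheduled evening Jenkins job, with a matching morning job calling start_instances:

    # Stop all running test-environment instances out of hours.
    # The "environment=test" tag is a hypothetical convention.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["test"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} test instances overnight")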
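
For the Sensu checks, the contract is simply an executable that prints a status line and exits 0 for OK, 1 for warning or 2 for critical. This minimal sketch (with a hypothetical health endpoint URL) shows how little logic a simple, debuggable check needs:

    #!/usr/bin/env python3
    # A Sensu-style check: exit 0 = OK, 1 = warning, 2 = critical.
    # The health endpoint URL is a hypothetical example.
    import sys
    import urllib.request

    URL = "http://localhost:8080/health"

    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            print(f"CheckAppHealth OK: health endpoint returned {response.status}")
            sys.exit(0)
    except Exception as exc:
        # Covers connection failures and non-2xx responses alike.
        print(f"CheckAppHealth CRITICAL: {exc}")
        sys.exit(2)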
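
The "when do I run the tests?" problem boils down to polling until green, with a timeout. A minimal sketch of that gating step, with placeholder URL and timings - in my case the pipeline polled a Sensu check result rather than hitting the endpoint directly, but the shape is the same:

    # Poll a health endpoint until the environment is up, or give up.
    import time
    import urllib.request

    URL = "https://dev-env.example.com/health"  # hypothetical endpoint
    DEADLINE = time.monotonic() + 15 * 60       # fail after 15 minutes

    while time.monotonic() < DEADLINE:
        try:
            with urllib.request.urlopen(URL, timeout=10):
                print("Environment is up - safe to run the end-to-end tests")
                break
        except OSError:
            pass  # not up yet, keep polling
        time.sleep(30)
    else:
        raise SystemExit("Environment never came up - aborting the test run")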
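
And finally the tag audit. A minimal sketch that lists every instance missing a hypothetical "billing-project" tag (standing in for whatever internal billing tags you use):

    # List EC2 instances missing the (hypothetical) "billing-project" tag.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    untagged = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tag_keys = {t["Key"] for t in instance.get("Tags", [])}
                if "billing-project" not in tag_keys:
                    untagged.append(instance["InstanceId"])

    print(f"{len(untagged)} instances won't be billed to a project: {untagged}")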