Roll out code at a fast pace
Systematic is looking for more members for the team that uses Site Reliability Engineering to automate operation, deployment, and monitoring.
At Systematic, a small team of eight people focuses on Site Reliability Engineering (SRE), which combines development, operation, and automation. The team is looking for more developers who want to try their hand at a discipline that is challenging, fun, and unusually versatile, at least according to Jesper Skelmose Mathiassen, Senior Systems Engineer at Systematic.
“For one of our public product systems, we roll out an update every 14 days. It's very standard - but in the past, we've actually had a lot more speed and rolled out two updates a week. The customers could not keep up with the documentation, so we had to slow down again”, he says, adding that on some systems the team really puts the turbo on.
“In another project, we micro-update up to 100 times per week. It’s a bit like operating constantly on a heart that is still beating. But it is completely unproblematic because the deployment process is fully automated and independent of service windows”, says Jesper Skelmose Mathiassen.
"We lay the rails while there is steam in the locomotive"
SRE is basically about enabling developers to solve operational tasks. As a rule, the goal is fully automated monitoring or rollout of updates on a system that is in operation.
“Site Reliability Engineering is a lot of fun because we are constantly solving tasks across a lot of different systems and languages. We simply combine software development - which is something we, in all modesty, are quite good at in Systematic - with solving very specific operational tasks and we can see the result of our work immediately. You could say that we lay the rails while there is steam in the locomotive”, explains Jesper Skelmose Mathiassen.
The SRE team often joins a project that is already in operation and starts by, for example, rolling out an update manually, following the manual to the letter. Next, they take the rollout process apart, look into the heart of the processes, and find ways to automate each element by developing small programs and snippets of code. Among other things, this is useful when you need to automate the rollout of code from development and testing to full production operation.
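As a rough illustration of that idea, each item on a manual rollout checklist can become a small function, and a runner can execute them in order and stop at the first failure. This is only a minimal sketch: the step names and the `run_rollout` helper are hypothetical and are not taken from Systematic's actual tooling.

```python
from typing import Callable, List, Tuple

# Each step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_rollout(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # a failed step halts the rollout; later steps never run
        completed.append(name)
    return completed


# Hypothetical stand-ins for items from a manual rollout checklist.
def build_package() -> bool:
    return True


def run_tests() -> bool:
    return True


def deploy_to_production() -> bool:
    return True


ROLLOUT_STEPS: List[Step] = [
    ("build", build_package),
    ("test", run_tests),
    ("deploy", deploy_to_production),
]
```

Calling `run_rollout(ROLLOUT_STEPS)` runs the three placeholder steps in sequence. In a real pipeline each function would wrap an actual build, test, or deployment command, and the list of completed steps would tell you where a rollout stopped.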
Built and rolled out critical bug fix in two hours - in the middle of business hours
Jesper Skelmose Mathiassen adds that SRE is effective in two ways. First, it eliminates classic updating, which is handled by "one man with a bunch of batch files and all processes on the backbone." Second, you become less dependent on service windows, because SRE makes it possible to roll out updates quickly and easily on systems in operation.
“One day we were made aware of a critical error in the server application of the public libraries' administration system. In less than two hours, a bug fix was built and rolled out to all libraries in the middle of opening hours. This could only be done because we had a fully automated roll-out process in place”, notes Jesper Skelmose Mathiassen.
Traditionally, the price of bug fixes rises explosively once a system is up and running, while SRE helps to push resource consumption down so far that the price approaches that of updating systems that are not in operation. This also makes SRE ideal as an element in both continuous delivery and continuous deployment, and it ensures a much shorter time to market, to the delight of both developers and customers, who have become accustomed to having their solutions always fully updated.
Explore all edges of Site Reliability Engineering
"Google is probably the leading exponent of Site Reliability Engineering. Surely it is practical for those who have invented the concept. But they also have an insane amount of services operating 24/7, which they cannot just shut down - and apparently, they use SRE to send out about 80,000 updates a day. Spotify, Netflix, and many others use it too; they cannot just shut down for four hours to patch some corner of the system, because then millions of people would be furious", says Jesper Skelmose Mathiassen.
SRE is also used for building monitoring components that can track exactly those elements of a running system that need to be watched. This provides a solid foundation for working with proactive performance monitoring and for constantly keeping an eye on whether a system lives up to its performance goals.
Often, developers and operations people only hear about problems when the core functions of a system stop working. A hypothetical example could be if citizens across the country suddenly cannot borrow books at the libraries. If, on the other hand, you use SRE as a starting point for incorporating targeted monitoring, you - with real-time metrics in hand - quickly become aware if book lending suddenly starts to take a little longer than usual. This provides a basis for targeted analysis and error correction in time.
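A check of this kind could, for instance, compare recent response times against a baseline. The sketch below is an assumption for illustration only: the function name, the use of medians, and the 1.5x tolerance are hypothetical choices, not a description of how Systematic's monitoring works.

```python
from statistics import median
from typing import Sequence


def is_slower_than_usual(
    baseline_ms: Sequence[float],
    recent_ms: Sequence[float],
    tolerance: float = 1.5,
) -> bool:
    """Return True when the recent median response time exceeds
    the baseline median by more than the given tolerance factor."""
    if not baseline_ms or not recent_ms:
        raise ValueError("both baseline and recent samples are required")
    return median(recent_ms) > tolerance * median(baseline_ms)
```

With made-up numbers, `is_slower_than_usual([200, 210, 190, 205], [400, 390, 410])` flags a slowdown, while recent samples of around 220 ms against the same baseline would not. The median is used here because a single slow request should not trigger an alert on its own.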
“So there are many applications with Site Reliability Engineering, and we are constantly challenging the application methodology. So even though it sounds like the Earth's biggest cliché to say that no two days are the same in our team, it is actually true”, says Jesper Skelmose Mathiassen.
If you are considering a change of job and could see a career at Systematic, the company is currently looking for new colleagues for its IT positions.
See all the open IT positions via this link.