Mark Smalley gave a great talkshop about cooperation between development and operations. A talkshop, as he described it, is a combination of talking and workshopping. The goal was to understand the expectations of both disciplines, development and operations. Discussing this with 20 specialists led to interesting results. In general, both sides agreed that operations needs to become part of the development team to enable better communication and mutual understanding. Today QA is a natural part of the development team, but ‘historically’ there has been a wall between these two roles. Developers used to drop features over this wall to testers, just as many teams around the world still drop releases over the wall to operations. Mark also plans to publish an overview of these results soon.
I really liked that several talks highlighted the important difference between fragile, resilient and antifragile systems. The concept is based on Nassim Nicholas Taleb’s book Antifragile and it describes how every object and organism lies on a spectrum from fragile to antifragile.
Looking at this in the context of software, fragile systems break when there is chaos and disorder. Resilient systems can be compared to a phoenix from Greek mythology, a bird rising from its ashes every time it is destroyed. Amazon Elastic Load Balancing combined with Auto Scaling is an example of a resilient setup: it monitors the health of instances and spins up new ones if existing ones happen to fail. These are systems that restore their initial state in case of a failure. Antifragile systems take this a step further: they not only restore their initial state but also grow stronger from failure. This is often compared to a Hydra with numerous heads. Each time one is cut off, two grow back. To achieve antifragility, failures are not just expected, they are injected into production. This concept has been widely adopted by Netflix and has led to the development of the Simian Army toolset.
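To make the idea of injecting failures a bit more concrete, here is a minimal Python sketch. It is not how Netflix's tooling works; `FlakyService`, `call_with_retry` and `lookup_price` are hypothetical names made up for this illustration. A wrapper makes a configurable fraction of calls fail on purpose, so you can check whether the calling code copes with it.

```python
import random


class FlakyService:
    """Wraps a callable and makes a fraction of calls fail on purpose."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()
        self.injected = 0  # how many failures were injected so far

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def call_with_retry(service, attempts, *args):
    """Call the service, retrying on ConnectionError up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return service(*args)
        except ConnectionError as err:
            last_error = err
    raise last_error


def lookup_price(item):
    # Stand-in for a real dependency (hypothetical).
    return {"book": 12}.get(item, 0)


# Roughly 30% of calls fail; the caller is expected to survive via retries.
flaky = FlakyService(lookup_price, failure_rate=0.3, rng=random.Random(42))
try:
    print("price:", call_with_retry(flaky, 5, "book"))
except ConnectionError:
    print("gave up after 5 attempts")
print("failures injected:", flaky.injected)
```

The point of such a wrapper is less the wrapper itself than what it reveals: if the retry logic, timeouts or fallbacks are missing, an injected failure shows it long before a real outage does.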
I believe you don’t need hundreds of servers to explore antifragility. Occasionally dropping the network link between the primary and secondary data centers while data is being synchronized simply injects a failure that can actually happen. Third-party integrations offer an opportunity as well. What if an external system starts misbehaving and your requests get no response for several minutes, not even a 404 or 500 status code, because the connection is just stuck? These are all situations one could simulate to ensure they are properly detected and handled.
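The stuck-connection case is easy to try out locally. The Python sketch below is only an illustration under my own assumptions: `start_stuck_server` and `call_with_timeout` are hypothetical helpers, and the half-second timeout is arbitrary. It starts a local server that accepts a connection and then never answers, and a client that gives up after a timeout instead of waiting forever.

```python
import socket
import threading


def start_stuck_server():
    """Start a local server that accepts a connection but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    held = []  # keep accepted sockets open so the connection stays stuck

    def accept_and_stall():
        try:
            conn, _ = server.accept()
            held.append(conn)  # no send, no close: the client just waits
        except OSError:
            pass  # server socket was closed before a client connected

    threading.Thread(target=accept_and_stall, daemon=True).start()
    return server, held


def call_with_timeout(host, port, timeout_s):
    """Send a request; return 'timeout' if no response arrives in time."""
    with socket.create_connection((host, port), timeout=timeout_s) as client:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            data = client.recv(1024)
        except socket.timeout:
            return "timeout"
        return "response" if data else "closed"


server, held = start_stuck_server()
host, port = server.getsockname()
result = call_with_timeout(host, port, timeout_s=0.5)
print(result)  # prints "timeout"

# Clean up the sockets used for the simulation.
for conn in held:
    conn.close()
server.close()
```

Without the timeout the `recv` call would block indefinitely, which is exactly the behaviour you want to discover in a test rather than in production.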
Many more ideas were presented and I’m really looking forward to applying them in my work. All of the talks should soon be available online as well. In November Topconf will be organizing its biggest conference in Tallinn, which will also have links to DevOps through the Sustainable Development and Operations track. Check out the full list of tracks of Topconf Tallinn 2016.