Advanced Distributed Systems Design with SOA, part 3 of N

(This is part 3 of my notes from the ADSD-SOA course. You'll find part 2 here.)


5. The Topology won't change

... unless a server goes down and is replaced (a premise on which cloud services are generally built; see, for example, what an Azure container looks like), or is moved to a different subnet ... or we're serving wireless and mobile users (see Google's Mobile Performance from the Radio Up) ... 

So ... if our target endpoint (e.g. a server) is no longer where we expect it to be, what happens? Well, first off: nothing! And for a significant amount of time, too! The defaults for both OpenTimeout and SendTimeout in WCF are a minute each! Connections to SQL Server? On-premises, the default timeout is 15 seconds; if you're connecting to an Azure SQL database, it's 30 seconds. Imagine having a thread blocked waiting on such a connection while the client has already disconnected. It's quite a common scenario. Now multiply that scenario by 10 clients, 100 ... 1000 clients. 
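
As a rough illustration (my own sketch, not from the course) of taking control of those defaults in C#, here's what explicitly dialing the timeouts down can look like. The server name, database name and timeout values below are placeholders; pick numbers that fit your environment and retry strategy.

using System;
using System.Data.SqlClient;
using System.ServiceModel;

public static class TimeoutDefaults
{
    public static BasicHttpBinding CreateBinding()
    {
        // WCF defaults both OpenTimeout and SendTimeout to one minute.
        return new BasicHttpBinding
        {
            OpenTimeout = TimeSpan.FromSeconds(5),
            SendTimeout = TimeSpan.FromSeconds(10),
            ReceiveTimeout = TimeSpan.FromSeconds(10)
        };
    }

    public static string CreateConnectionString()
    {
        // SqlConnection defaults to 15 seconds on-premises (30 for Azure SQL).
        return new SqlConnectionStringBuilder
        {
            DataSource = "my-server.example.com", // placeholder
            InitialCatalog = "MyDatabase",        // placeholder
            ConnectTimeout = 5
        }.ConnectionString;
    }
}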

So how do we come to terms with this? Well, first off, we need to be aware of the problem. Test your system for these scenarios: disconnect clients and terminate servers. Consider using resilient protocols (e.g. multicast) and discovery protocols, even though security experts tend to frown on them because they enable sniffing. Implementing these protocols and stress testing them takes a lot of time, however, so take your project into consideration - is this resilience a real requirement of the system, or is the system in the process of being rewritten anyway?
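
To make the "test these scenarios" point concrete, here's a hypothetical xUnit sketch of my own: point a client at an address nothing is listening on (the moral equivalent of a terminated server) and assert that our own timeout kicks in quickly rather than the framework default.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class TopologyChangeTests
{
    [Fact]
    public async Task Call_to_missing_server_fails_within_our_own_timeout()
    {
        // 192.0.2.1 is a TEST-NET address; nothing should be listening there.
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

        var stopwatch = Stopwatch.StartNew();
        await Assert.ThrowsAnyAsync<Exception>(
            () => client.GetAsync("http://192.0.2.1/orders"));
        stopwatch.Stop();

        // If this fails, a disappearing server will block callers for far too long.
        Assert.True(stopwatch.Elapsed < TimeSpan.FromSeconds(10));
    }
}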


6. The administrator will know what to do

Be wary of relying on one person, or a small set of people, to manually keep tweaking the production environment and applying patches on our system's behalf. What happens if our dedicated administrator takes ill or gets promoted? Instead, try to design your system to be upgraded in small steps, supporting partial deployments. Your goal should be a system that is highly available during upgrades. It's the natural extension of continuous integration: continuous deployment. To succeed in this space, we first need to have thought about versioning.
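
What "thinking about versioning" can mean in practice is giving the contracts that cross a deployment boundary an explicit version, and keeping changes additive so old and new endpoints can coexist during a rolling upgrade. The type and property names below are made up for illustration; they are not from the course.

public class OrderPlaced
{
    // Bump only for breaking changes; purely additive changes keep the version.
    public int SchemaVersion { get; set; } = 2;

    public string OrderId { get; set; } = "";

    // Added in v2. Nullable, so v1 producers can omit it and
    // v1 consumers (which never read it) keep working mid-upgrade.
    public string? CustomerSegment { get; set; }
}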

Now, the curious thing about CI/CD is that the larger the project, the less we tend to think about it. That's really dangerous, since in those scenarios we're unlikely to still have the original crew around by the time we hit production.

So, what do we usually do? We log everything, right? As long as we log everything, things will be fine, we tell ourselves. Unfortunately, this rarely ends up being very useful. Who will read these logs? What do they contain? The typical developer log contains what a developer expects to see during debugging: local values and stack traces. What good does that do an administrator reading them months later? I'm not arguing that logging is bad, but rather that if we choose to employ logging, we need to design the logs in cooperation with the staff (or systems) that will read them. As with other parts of our system, we need to test our logs to ensure that they emit relevant information, which in turn means that we - early on in our development - need to connect, communicate and get along with our DevOps/admins.
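
A small, made-up illustration of the difference, using ILogger from Microsoft.Extensions.Logging (the class, message and runbook name are hypothetical): a debug-style line only its author can love, next to a line that tells the reader what happened, to which business entity, and what is expected of them.

using Microsoft.Extensions.Logging;

public class PaymentHandler
{
    private readonly ILogger<PaymentHandler> logger;

    public PaymentHandler(ILogger<PaymentHandler> logger) => this.logger = logger;

    public void HandleDeclined(string orderId, decimal amount)
    {
        // Developer-style: local values, only meaningful while debugging.
        logger.LogDebug("Entered HandleDeclined, amount={Amount}", amount);

        // Ops-style: what happened, to which order, and what to do about it.
        logger.LogWarning(
            "Payment of {Amount} for order {OrderId} was declined. " +
            "No automatic retry is scheduled; see the 'declined payments' runbook.",
            amount, orderId);
    }
}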

Upgrade strategies

Having some kind of queuing system between your components can help during an upgrade scenario, since your messages can safely rest in the queue while you are replacing a component. However, there are no magical solutions here. You need to ensure that both versions of your component (the old and the new) can consume the messages in the queue (or the RPC commands, if you opt out of queuing), and, where necessary, ignore messages not meant for the current version.
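
A rough sketch of such a tolerant consumer, continuing the hypothetical versioned contract from section 6 (the message types and handler are stand-ins for whatever broker and handlers you actually use): it handles both the old and the new shape of a command and quietly skips anything not meant for it.

public interface IMessage
{
    int SchemaVersion { get; }
}

public record PlaceOrderV1(string OrderId) : IMessage
{
    public int SchemaVersion => 1;
}

public record PlaceOrderV2(string OrderId, string Currency) : IMessage
{
    public int SchemaVersion => 2;
}

public class OrderConsumer
{
    public void Consume(IMessage message)
    {
        switch (message)
        {
            case PlaceOrderV1 v1:
                Place(v1.OrderId, currency: "USD"); // old implicit default
                break;
            case PlaceOrderV2 v2:
                Place(v2.OrderId, v2.Currency);
                break;
            default:
                // Not meant for this version of the component -
                // skip it, or return it to the queue for the newer version.
                break;
        }
    }

    private void Place(string orderId, string currency)
    {
        // actual handling elided
    }
}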

(Next up: Transport cost isn't a problem & the Network is Homogeneous)
