As VM automation becomes increasingly prevalent in cloud environments, the issue of abstraction becomes more important. Consider, if you will, an infrastructure in which the creation and management of VMs is fully automated. Now put all those applications, data and VMs into one big, self-sufficient cloud that is constantly shifting workloads around due to load balancing and other automated processes. Then add in cloud applications, plugins, security tools and anything else that could possibly run in that environment. Then connect it all up so that every part of the infrastructure is interdependent and communicates through a broker. And for fun’s sake, let’s assume there is a memory leak on one of the servers and you start losing VMs.
All of a sudden, any application, plugin, NIC or endpoint could be the culprit, and because everything is now connected, everything is quickly affected. We used to segregate processes, resources and applications onto separate physical hardware. Now desktops, infrastructure, applications and resources all sit in the same space and talk to each other. If your cloud is interrupted, your services are impacted more severely than ever before: one faulty server can take the entire global organization down, not just a department in a branch office.
If a memory leak does occur, the key question is “Where do I begin?”. It usually starts with a call to the vendor, who will expect that you have at least narrowed the problem down to a basic layer: network, hypervisor, storage and so on. When you set out to figure out where the issue lies, you will most likely begin with a management dashboard. From there you essentially have to work your way down, layer by layer, eliminating variables one by one until you find the origin. So unless you caught the memory leak in its early stages, how could you possibly pinpoint the cause without spending significant time running diagnostics in each area of the infrastructure? And what if these VMs are located halfway around the world?
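The layer-by-layer elimination described above can be sketched as a simple triage loop. This is a hypothetical illustration, not a vendor tool: the layer names and health-check functions are assumptions standing in for whatever diagnostics your environment actually exposes.

```python
# Hypothetical sketch of layer-by-layer elimination: run a health check
# for each layer of the stack in order and stop at the first failure.
from typing import Callable, Optional

# Ordered from the bottom of the stack up; adjust to your environment.
LAYERS = ["network", "storage", "hypervisor", "guest_os", "application"]

def find_faulty_layer(checks: dict) -> Optional[str]:
    """Run each layer's health check in order; return the first layer
    that fails, or None if every layer passes."""
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None

# Example: simulate a fault surfacing at the hypervisor layer.
checks = {
    "network": lambda: True,
    "storage": lambda: True,
    "hypervisor": lambda: False,  # this layer fails its health check
    "guest_os": lambda: True,
    "application": lambda: True,
}
print(find_faulty_layer(checks))  # hypervisor
```

Even a crude script like this makes the point: without automated checks at each layer, every one of those eliminations is a manual diagnostic session, possibly against hosts on the other side of the world.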
The problem is that unless we can easily access diagnostics and have strong visibility into each area of the virtual infrastructure, troubleshooting time increases significantly. If the issue escalates to the point where the entire cloud or virtualized environment crashes and cannot be restored, or has to be taken offline for repair, time and cost (including the impact of downtime on your organization) suddenly become critical.
So you must be thinking the only answer is “Don’t automate”, right? Well, it would make this post pretty uneventful if I didn’t at least offer a shred of advice, so here it is: the more you automate, the more important it is to have visibility into all areas of your environment. But how do you do that? The fix we would all love is proper dashboards that get into the nitty-gritty, but sadly those aren’t coming any time soon. In the meantime, one way to gain some visibility into your infrastructure is to use probes, VMware’s vProbes for example, which plug into all layers of the virtual infrastructure to provide the added visibility required to keep a cloud environment solid. But the best advice is to ensure your organization has proven redundancy plans in place. It is crucial for large cloud environments to be able to take segments offline for troubleshooting without impacting the infrastructure as a whole.
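As a small illustration of what probe data buys you, here is a sketch of the kind of heuristic a monitoring layer could apply to per-host memory samples to flag a leak suspect early. The threshold, window size and the “sustained growth” rule are all assumptions for the example, not part of vProbes or any specific product.

```python
# Illustrative leak heuristic over periodic memory-usage samples (e.g. GB).
# Flags a host when usage climbs steadily, rather than fluctuating with load.

def looks_like_leak(samples, window=5, min_growth=0.05):
    """Return True when the last `window` samples never decrease and
    total growth over that window exceeds min_growth (as a fraction
    of the window's starting value)."""
    if len(samples) < window:
        return False  # not enough data to judge
    recent = samples[-window:]
    rising = all(b >= a for a, b in zip(recent, recent[1:]))
    growth = (recent[-1] - recent[0]) / recent[0]
    return rising and growth > min_growth

healthy = [4.0, 4.1, 3.9, 4.0, 4.1]  # flat with load noise: not a leak
suspect = [4.0, 4.4, 4.9, 5.5, 6.2]  # steady climb: worth investigating
print(looks_like_leak(healthy), looks_like_leak(suspect))  # False True
```

The point is not this particular heuristic but the principle: if probes are feeding you per-layer data continuously, you can catch the leak while it is one misbehaving host, instead of discovering it when VMs start disappearing across the cloud.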
So, simply put: before you move towards a unified cloud environment, make sure your transition plan includes redundancy and business continuity planning.