Today, after almost a year and a half of uptime, one of my very large KVM hosts (512 GB of RAM) running a clone of RHEL 6.6 (Springdale Linux) suffered a kernel panic. It took down 15 KVM guests and a shared RAID6 data pool with it. All guests were in production, used by our grant sponsors. I was able to restore all services within 2.5 hours of the crash.
However, the director of my lab lost it, and I feel really lucky that I will still be going to work tomorrow. Essentially, he put the blame on me for the kernel crash, which I am still investigating, but more importantly he asked me to implement a robust, redundant VM infrastructure that should be much more reliable (an uptime of 1.5 years, one crash, and 2.5 hours of downtime is unacceptable, in his words).
We are an academic lab with very limited resources, and I will be the first to admit that I have only a rudimentary knowledge of virtualization. I am aware of the hot migration techniques for Xen, and today I read https://alteeve.ca/w/2-Node_Red_Hat_...rial_-_Archive
which seems extremely complicated.
Can anybody suggest a simple way to achieve a high level of redundancy and reliability for a virtual server infrastructure? I would appreciate any suggestions and reading material.