At 9:22am on Friday, Drew's GroupWise system experienced a failure resulting in a complete disruption of email service. All services were restored by 9:50am, just under 30 minutes after service was interrupted.
We apologize for the outage and wish to assure the Drew community of our continued commitment to maintaining high levels of service. We recognize the value these services have to the community and consider any disruption of a major service during the business day to be unacceptable. CNS is committed to maintaining 24x7 availability for email and other enterprise services within reasonable constraints. To that end, we are taking specific action in response to this morning's outage. The purpose of this message is to provide, for those who are interested, additional technical detail about what occurred and about the further actions we are taking.
The GroupWise system has had an excellent availability record since it was installed. Barring environmental and other issues that have affected all of Drew's systems simultaneously, GroupWise has been continuously available since most users at Drew began using it last spring. GroupWise has been implemented at Drew specifically so that it can automatically recover from most routine problems. Like many other services on campus, the GroupWise system uses a technology known as server clustering to ensure that services can continue to operate even if individual servers within the cluster fail. This, along with other high-availability technology in use and proactive monitoring, helps to ensure a high degree of reliability.
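For those interested in the underlying concept, here is a deliberately simplified sketch, in Python, of how heartbeat-based failover works in a server cluster. The node and service names are hypothetical, and real clustering software is considerably more sophisticated; this is an illustration of the idea, not the software we run.

    import time

    HEARTBEAT_TIMEOUT = 10  # seconds of silence before a node is presumed down

    # Hypothetical cluster state: when each node last checked in, and which
    # node currently hosts each service.
    last_heartbeat = {"node1": time.time(), "node2": time.time()}
    services = {"post-office": "node1", "mail-transfer": "node2"}

    def record_heartbeat(node):
        """Called whenever a heartbeat arrives from a cluster node."""
        last_heartbeat[node] = time.time()

    def check_cluster():
        """Migrate services away from any node whose heartbeat has gone silent."""
        now = time.time()
        for node, seen in last_heartbeat.items():
            if now - seen > HEARTBEAT_TIMEOUT:
                survivors = [n for n, s in last_heartbeat.items()
                             if n != node and now - s <= HEARTBEAT_TIMEOUT]
                for svc, host in services.items():
                    if host == node and survivors:
                        services[svc] = survivors[0]  # restart on a healthy node

    # Simulate node1 going silent by backdating its last heartbeat.
    last_heartbeat["node1"] = time.time() - (HEARTBEAT_TIMEOUT + 1)
    check_cluster()
    print(services)  # both services now run on node2

Because a surviving server picks up a failed server's work automatically, users typically never notice that anything went wrong.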
For the past several weeks we have been tracking a minor issue with the individual GroupWise servers, for which we have an open support call with the vendor. While we do not yet understand all of the circumstances surrounding the exact cause of Friday morning's failure, the steps we were taking to troubleshoot this minor issue were in part responsible.
In a rare case of the clustering technology working against the availability of the system, a communications issue, which we believe was caused by our troubleshooting, tripped a safety mechanism built into the clustering software that is designed to prevent data corruption. As a result, all servers were forcibly evicted from the GroupWise cluster simultaneously. Because the entire cluster was shut down, restoring service took additional time: we had to perform what is known as a "cold startup" of the cluster.
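To illustrate why a communications problem can remove every node at once, the following simplified Python sketch shows one common form of such a safety mechanism, a quorum check. The details here are our own illustrative assumptions rather than the vendor's actual implementation: each node keeps running only while it can see a majority of the cluster, and when a network fault hides all of its peers, it shuts itself down rather than risk two nodes writing to the same mail store.

    CLUSTER_NODES = ["node1", "node2", "node3"]

    def reachable_peers(node):
        """Peers this node can currently communicate with. Stubbed here to
        model a total communications failure; a real cluster probes the
        network and/or shared disk."""
        return set()

    def has_quorum(node):
        """A node may keep running only if it sees a majority of the cluster."""
        visible = len(reachable_peers(node)) + 1  # peers plus the node itself
        return visible > len(CLUSTER_NODES) / 2

    for node in CLUSTER_NODES:
        if not has_quorum(node):
            # Without quorum, stopping is the safe choice: better a visible
            # outage than silent corruption of the shared mail store.
            print(f"{node}: lost quorum, leaving the cluster")

When the fault affects cluster communications as a whole, every node reaches the same conclusion at the same time and the entire cluster goes down, which is the availability-for-data-safety trade-off described above.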
What we are doing
We are taking several actions as a result of the Friday morning GroupWise failure:
- We have asked the vendor for a more conservative method of collecting the diagnostic information they require to troubleshoot our other issue. In the meantime, we have suspended our attempts to collect that information using the technique we had been using.
- We will be reconfiguring the GroupWise cluster to shorten recovery times if a cold startup of the cluster is ever required. Restoration of services took longer than anticipated Friday morning in part because the cluster did not start up in an optimal configuration. While cold startups of the cluster should be extremely rare, we will nonetheless work to improve recovery in that situation; a simplified illustration of the idea appears after this list.
- Within the coming weeks we will be announcing brief periods of downtime after business hours to install software patches for both operating system and GroupWise server components. Please stay tuned to your Drew email for these announcements.
- As always, we will continue to proactively monitor the performance of the system and keep the community informed of our efforts.
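Returning to the cluster reconfiguration mentioned in the second item above: much of a cold startup's duration comes down to the order in which cluster resources are brought online. The sketch below, again in Python and with hypothetical resource names, illustrates the general idea of starting resources in dependency order so that nothing comes up before what it needs and has to fail and retry.

    # Each resource lists its prerequisites; starting prerequisites first
    # avoids the failed starts and retries that lengthen recovery.
    DEPENDS_ON = {
        "shared-storage": [],
        "post-office": ["shared-storage"],
        "mail-transfer": ["post-office"],
        "web-access": ["post-office"],
    }

    def startup_order(deps):
        """Topologically sort resources so prerequisites start first."""
        order, started = [], set()

        def start(resource):
            if resource in started:
                return
            for prereq in deps[resource]:
                start(prereq)
            started.add(resource)
            order.append(resource)

        for resource in deps:
            start(resource)
        return order

    print(startup_order(DEPENDS_ON))
    # ['shared-storage', 'post-office', 'mail-transfer', 'web-access']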
Once again, thank you for your patience and understanding.