Drew Community  

02-24-2007, 01:55 AM
E. Axel Larsson (Moderator)

About Friday's GroupWise Outage

At 9:22am on Friday, Drew's GroupWise system experienced a failure resulting in a complete disruption of email service. All services were restored by 9:50am, just under 30 minutes after service was interrupted.

We apologize for the outage and wish to assure the Drew community of our continued commitment to maintaining high levels of service. We recognize the value of these services to the community and consider any disruption of a major service during the business day unacceptable. CNS is committed to maintaining 24x7 service for email and other enterprise services within reasonable constraints. To this end, we are taking specific action in response to Friday morning's outage. This message provides, for those who are interested, additional technical details about what occurred and about the further actions we are taking.

The GroupWise system has had an excellent availability record since it was installed. Barring environmental and other issues that have affected all of Drew's systems simultaneously, GroupWise has been continuously available since most users at Drew began using it last spring. Drew's GroupWise implementation is designed to recover automatically from most routine problems. Like many other services on campus, the GroupWise system uses a technology known as server clustering to ensure that services can continue to operate even if individual servers within the cluster fail. This, along with other high-availability technology and proactive monitoring, helps to ensure a high degree of reliability.

What happened
For the past several weeks we have been tracking a minor issue with the individual GroupWise servers, for which we have an open support call with the vendor. While we are still unsure of the exact cause of the Friday morning failure, the steps we were taking to troubleshoot this minor issue were in part responsible.

In this rare case, the clustering technology worked against the availability of the system. A communications issue, which we believe was caused by our troubleshooting, tripped a safety mechanism in the clustering software that is designed to prevent data corruption. As a result, all servers were forcibly evicted from the GroupWise cluster simultaneously. Because the entire cluster was shut down, restoring services took additional time: we had to perform what is known as a "cold startup" of the cluster.
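For readers curious how a protective mechanism can take down every server at once, here is a toy sketch in Python. It is a generic illustration only, not a description of the vendor's clustering software: the majority-visibility rule, the node names (gw1, gw2, gw3), the timing values, and the functions make_cluster and surviving_nodes are all assumptions made up for this example. The idea it models is that each node leaves the cluster when it can no longer hear from most of its peers, so that two isolated halves never write to the same data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster member and the time it last heard from each peer."""
    name: str
    last_seen: dict = field(default_factory=dict)  # peer name -> heartbeat time


def make_cluster(names, heartbeat_time):
    """Build nodes that all heard from every peer at heartbeat_time."""
    return [Node(n, {p: heartbeat_time for p in names if p != n}) for n in names]


def surviving_nodes(nodes, now, tolerance):
    """Return the names of nodes that stay in the cluster.

    Hypothetical rule: a node stays only if it has heard from a strict
    majority of the cluster (counting itself) within `tolerance` seconds.
    Otherwise it leaves rather than risk corrupting shared data.
    """
    total = len(nodes)
    survivors = []
    for node in nodes:
        fresh = sum(1 for t in node.last_seen.values() if now - t <= tolerance)
        if (fresh + 1) * 2 > total:  # +1 counts the node itself
            survivors.append(node.name)
    return survivors


names = ["gw1", "gw2", "gw3"]  # hypothetical server names

# Normal operation: heartbeats are recent, so every node stays.
cluster = make_cluster(names, heartbeat_time=100.0)
print(surviving_nodes(cluster, now=103.0, tolerance=8.0))

# Communications fault: no heartbeats arrive anywhere, so every node
# concludes it is isolated and the whole cluster shuts down together.
print(surviving_nodes(cluster, now=120.0, tolerance=8.0))
```

In this model a single failed server is harmless, because the remaining nodes still see a majority. A fault in the communication path itself looks, to every node, like the failure of all its peers, which is why the same rule that protects the data can also evict the entire cluster.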

What we are doing
We are taking several actions as a result of the Friday morning GroupWise failure:
  1. We have asked the vendor for a more conservative method of collecting the diagnostic information they need to troubleshoot our other issue. In the meantime, we have suspended our attempts to collect this information using the previous technique.
  2. We will be reconfiguring the GroupWise cluster to shorten recovery if a cold startup of the cluster is ever required again. Restoration of services took longer than anticipated Friday morning in part because the cluster did not start up in an optimal configuration. While cold startups of the cluster should be extremely rare, we will nonetheless be working to improve recovery times for that situation.
  3. Within the coming weeks we will be announcing brief periods of downtime after business hours to install software patches for both operating system and GroupWise server components. Please stay tuned to your Drew email for these announcements.
  4. As always, we will continue to proactively monitor the performance of the system and keep the community informed of our efforts.
Once again, thank you for your patience and understanding.
__________________
E. Axel Larsson
Systems Architect and Director of the Enterprise Technology Center