Originally Posted by E. Axel Larsson
We had a problem this evening with one of our three iChain servers. The iChain servers sit in front of all of our web-based services and provide single sign-on, allowing you to log in once with your Drew uLogin ID and password and reach all of Drew's services, such as Webmail, CampusWeb, Moodle, and this site.
It looks like one of the three machines started to develop a memory issue around 8:00 pm, which continued until the machine was rebooted at around 12:30 am. During that time, that machine was not authenticating new users, and anyone trying to log in through it would receive a 500 Internal Server Error. Because only one of the three load-balanced machines was affected, the problem would not have impacted all users; most likely, closing your browser and logging in again would have sent your session to a different, working machine.
Since none of the services on the iChain machine actually crashed, the failure was not detected by our monitoring system, which saw the services as still running. The affected machine was restarted shortly after we received a user report of the problem.
We apologize for the outage. I am currently looking into options to enable our monitoring system to detect this type of problem in the future.
We have now implemented some additional service monitoring so that we will get an alert notification if this specific issue occurs again.
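For anyone curious what this kind of check can look like in general, here is a minimal sketch in Python. It is not the monitoring we actually deployed; the function names and the URLs in the usage note are hypothetical. The idea is simply to request a page from each server directly, bypassing the load balancer, and to treat a 5xx response as a failure even though the server is still up and answering.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still
        # exactly what we want to inspect, so return it instead.
        return err.code


def is_healthy(status):
    """Treat any 5xx response as a failure, even though the server answered."""
    return not (500 <= status <= 599)


def check_servers(urls, fetch=fetch_status):
    """Probe each server directly and return (url, status) for failing ones.

    A status of None means the request itself failed (connection refused,
    timeout, DNS error), which is the classic "service is down" case.
    """
    failing = []
    for url in urls:
        try:
            status = fetch(url)
        except OSError:
            failing.append((url, None))
            continue
        if not is_healthy(status):
            failing.append((url, status))
    return failing
```

Called with something like `check_servers(["https://sso1.example.edu/login", "https://sso2.example.edu/login"])` (placeholder addresses), it returns a list of the servers that need attention, which a scheduler or alerting tool could then act on. Probing each machine individually matters here: a check that goes through the load balancer could land on a healthy machine and miss the broken one.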