Explanation for recent outages

Posted by Joey Day

We’re terribly sorry for the recent outages. The first one (Friday night) was apparently caused by a power event at the datacenter. The second (this morning) was planned, though I received no advance warning, to repair some failing components that were thought to have contributed to the power event. Here are the reports I received from Goleman Networks for both outages:

Friday’s outage:

Date: Friday, January 25, 2008
Time: ~9:00 PM MST
Duration: ~3 hours 30 minutes
Scope: SLC facility

Affected equipment:

All servers and equipment residing on our SLC network lost power starting at approximately 9:00 PM on Friday, January 25th, due to a power event at the datacenter.

Event description:

Starting at approximately 9:00 PM on January 25th, 2008, we began receiving down notifications indicating that equipment at the Salt Lake City co-location facility was unreachable. When we contacted the datacenter shortly after becoming aware of the situation, they confirmed that the center had experienced a power event. Power to the datacenter and service to our racks was restored at approximately 11:30 PM. A software problem on our core router then kept our network and our customers’ equipment unavailable until the configuration was restored at approximately 12:30 AM on January 26th, 2008.

We are waiting for a final report from the datacenter on the root cause of the power event and the steps being taken to mitigate this type of outage in the future. We will update this announcement as we receive more information.

Service Impact:

All systems located on the SLC network were unreachable during this outage.

Sunday’s outage:

Date: Sunday, January 27, 2008
Time: 12:00 AM MST
Duration: 10 hours 30 minutes
Scope: SLC facility

Event Summary:

All servers and equipment residing on our SLC network experienced a planned power outage at the datacenter starting at approximately 12:00 AM on Sunday, January 27th. The emergency power outage was necessary to replace a failing component in the facility’s UPS system that led to Friday night’s unplanned outage. A software error on our core router caused an extended network outage that lasted until the router was manually restarted this morning.

Event description:

The datacenter power engineers performed an emergency power outage Sunday morning at approximately 12:00 am to replace a failing component in the center’s UPS system that was identified as the cause of Friday night’s unplanned power event. Power to Goleman Networks’ racks was down for approximately 30 seconds, but our core router failed to return to service after power was restored. To further complicate matters, our team did not receive down notifications as we did for Friday’s event because the alerts had not been re-enabled, a mistake on our part. We were notified this morning that our network was unreachable and immediately began working to resolve the outage. Service was restored at approximately 10:30 am this morning. At approximately 11:00 am this morning we verified our core router’s ability to return to full service following a power outage; this test caused another outage lasting approximately five minutes while the core switch and router modules re-initialized. The router returned to full service without intervention after the configuration was updated to find the boot ROM image on a secondary flash card if the primary is unavailable.
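
The fallback behavior described above (try the primary flash card, then the secondary) can be sketched in a few lines of Python. This is only an illustration of the idea, not the router's actual configuration or software: the `BootSource` type, the `pick_boot_source` function, and the source names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BootSource:
    """One place a boot image could be loaded from (hypothetical model)."""
    name: str        # illustrative label, e.g. "primary-flash"
    available: bool  # whether the card and its image are readable at boot


def pick_boot_source(sources: Sequence[BootSource]) -> Optional[BootSource]:
    """Return the first available source in priority order, or None."""
    for source in sources:
        if source.available:
            return source
    return None


if __name__ == "__main__":
    # Assumed scenario: the primary card is unreadable after a power cycle.
    boot_order = [
        BootSource("primary-flash", available=False),
        BootSource("secondary-flash", available=True),
    ]
    chosen = pick_boot_source(boot_order)
    print(chosen.name if chosen else "no bootable image; manual restart needed")
```

With only a single source listed, an unavailable primary leaves nothing to boot from, which is consistent with the router needing a manual restart before the configuration change.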

All system monitors and alerts are now enabled so that any future events will be responded to immediately.
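
A small check run at the end of any maintenance window is one way to catch alerts that were left switched off. The sketch below assumes the alert states can be exported as a simple name-to-enabled mapping; the `disabled_alerts` function and the alert names are hypothetical and do not describe Goleman Networks' actual monitoring system.

```python
from typing import Dict, List


def disabled_alerts(alert_states: Dict[str, bool]) -> List[str]:
    """Return the names of alerts that are currently switched off, sorted."""
    return sorted(name for name, enabled in alert_states.items() if not enabled)


if __name__ == "__main__":
    # Hypothetical alert names and states, for illustration only.
    states = {"core-router-ping": False, "rack-power": True}
    for name in disabled_alerts(states):
        print(f"WARNING: alert '{name}' is still disabled")
```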

Service Impact:

All systems located on the SLC network were unreachable during this outage and experienced a loss of power for approximately 30 seconds at the beginning of the outage.
