ORCD Newsletter: October 26

Planned and Unplanned: A Note from Our Executive Director on the Events of October 14th, 2023

Life is full of unplanned activities, and nothing is more exciting than responding to events that are unplanned. This was the case during the evening of Saturday, October 14th, 2023. That was the night when all of our phones started to ring. All electrical power had unfortunately been lost to our flagship high performance computing datacenter, causing computing for each of the five MGHPCC member institutes to immediately, and somewhat abruptly, come to a halt.

As the machines sat quiet and alone in a now dark and silent room 88.2 miles from the ORCD offices, we knew we needed to build a plan to bring them back to life as quickly and safely as possible. Such unplanned situations are so rare (this was the first of its kind in over a decade) that forming good plans around them is inordinately difficult. Nevertheless, the ORCD team scrambled into action.

We knew what we needed to do. 

Unfortunately, HPC systems of the size we have here at MIT aren’t quite like our laptops; we can’t just turn them off and back on again and expect them to work. Things need to start in a very precise and predictable order. MIT has a huge number of extremely complex systems, and a very large number of them have grown organically over time, which has, to this point, made automation extremely difficult to achieve, especially after a hard crash.

Power was restored in the morning, and a whole bevy of MIT staff and contractors drove out and assembled on site to bring up and restart all the services. As we said, MIT has an awful lot of services. Many unique, artisanal services. What makes MIT a special place is our ability to engineer exotic systems and services. However, in the case of an unplanned power outage like this, it does create some unexpected challenges. What is connected to this machine? Which services does it provide? Is this a critical service? Also what’s that beeping noise, and where is it coming from?

Thousands of machines and hundreds of thousands of processors needed to be jump-started, with petabytes of storage split over a vast number of bespoke clusters, all loosely connected like spaghetti in a bowl. As we methodically restarted each and every boutique service for authentication, login, storage, and networking, along with all of the handcrafted operating systems, we were each reminded of the mind-blowing scale and complexity of the operation that ORCD is responsible for.



Our mission at ORCD is to make sure our research scientists have access to the very best possible computing available, and that weekend, we all learned a valuable lesson in exactly how much we at ORCD still need to do to harden and improve our processes and systems to plan for availability. Over ten pairs of human hands were needed to resuscitate the various computer systems and services. It took the best part of a day to restore availability, followed by a week of “loose ends” in which machines didn’t quite operate as they should and we worked with our community to understand exactly why they weren’t behaving.



No one wants to deal with unplanned activities, but if we don’t learn from unplanned events and use them as an opportunity to become better, faster, and stronger, we have missed an important opportunity to grow. This is where we are right now, growing fast and trying (literally) to keep the lights on. We also know the root cause of the outage; it was the first of its kind in over 10 years of our operation. It all started with an electrical bushing failing in an upstream power substation. (Wikipedia has a nice write-up if you’re interested in reading more about “bushings.”) Many complex things downstream from that broken bushing unfortunately also cascaded in failure, eventually resulting in us all seeing “ssh connection refused” responses to our terminal sessions.



Our MGHPCC facilities director wrote in his incident report:



“When utility power fails [Cuff: this was that pesky bushing going bad upstream], Uninterruptible Power Supply (UPS) units instantaneously provide power to selected racks and facility infrastructure, allowing equipment to continue functioning without interruption. The UPS units provide power long enough for backup generators to start up and synchronize with the UPS power. When a generator is ready to take over from a UPS unit, an Automatic Transfer Switch (ATS) manages the transition. When the Utility power failure occurred on Saturday evening, all generators and UPS units functioned properly. The transition from UPS to generator power for facility infrastructure also functioned properly. However, when the generator that supplies backup power to the Computer Room was ready to take over, the ATS did not allow the transition. The root cause was a safety interlock that was enabled during scheduled testing of the ATS earlier this year, and was not properly disabled when testing was completed. To avoid this condition in the future, we will add steps to the procedure for scheduled testing to include a double check of the state of all ATS interlocks after our testing contractor has finished their work.”

As you can see, even the systems that power our systems are not simple. They, like our computer systems, have a lot of moving parts and complex components that all have to function perfectly. It is a very delicate dance one needs to carry out to provide megawatts of power for giant, hungry computers. We are extremely grateful to the amazing MGHPCC facility and staff for making this feat possible each and every day.



In closing, we do want to sincerely thank every single member of our community for bearing with us, helping us diagnose issues, and having amazing patience with us as we all work together to build a better set of services for everyone on campus.



We could not do this without you, and we really do appreciate each and every one of you!

 

[Image: ORCD staff member Renée Hellenbrecht]

Meet the Team: Renée Hellenbrecht

Renée Hellenbrecht, our Program Administrator, comes to ORCD from the MIT Libraries, having worked as an administrative assistant in their technology directorate and a project coordinator in their administrative directorate.

After a high school illustration program at Rhode Island School of Design and a college internship at a recording studio convinced her to move away from the art and audio production worlds, Renée began working in events and non-profit administration. She spent nearly a decade as the operations manager of a small non-profit before coming to MIT. Renée holds a BA in Audio/Radio Production from Emerson College and an MS in Business Management from Lesley University.

In her free time, Renée enjoys baking, watching horror movies, taking her dog on adventures with her partner, telling herself that she will definitely be improving both her French and Python skills one of these days, and collecting far too many books on Tudor history and Japanese art.

Above the Fold

  • During the pandemic, Peter subscribed to a newsletter from Katelyn Jetelina, Your Local Epidemiologist. Her most recent letter has some good advice for those of us struggling with the events of the world.

What We’re Reading

  • Many ask ORCD, “Why don’t we just move all our research computing to the cloud?” Aside from the expense, a recent study from Oak Ridge shows the cloud may not be ready for us.
  • MIT’s secret creative writing group.
  • Finance as art.
  • How cats purr.
  • The largest map of the human brain.

Events Around Campus