ORCD Newsletter: April 2024

From Peter Fisher, ORCD Head: GPUs on ORCD Clusters

We at ORCD think a lot about Graphics Processing Units – GPUs – the uninspiring name for the magical devices that make AI tools possible. Many multi-CPU servers have GPUs – the Engaging cluster has 366 online, as you can find by typing the following truly absurd command on any ORCD cluster:

scontrol -a -o show node | grep -v Reason | grep gpu: | sed s'/.* Gres=gpu:\([^ ]*\).*/\1/' | sed s'/\(.*\)(.*/\1/' | sed s'/[^:]*:\([0-9]\).*/\1/' | awk '{x=x+$1}END{print x}'
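
For readers who would rather not decode the sed expressions, the same count can be sketched in Python. This is an illustrative equivalent, not an ORCD tool: the name `count_gpus` and the sample node lines are made up for the example, and it assumes the `Gres=gpu:[type:]count` field layout that the pipeline above relies on. Unlike the one-liner, it also reads multi-digit GPU counts.

```python
import re

# Matches "Gres=gpu:4" or "Gres=gpu:a100:4", optionally followed by "(S:0-1)".
# The optional group skips a GPU type name; the capture is the per-node count.
GRES_GPU = re.compile(r"Gres=gpu:(?:[^:\s(,]+:)?(\d+)")


def count_gpus(scontrol_output: str) -> int:
    """Sum GPU counts over node lines, skipping any line that contains
    'Reason' (the same filter as `grep -v Reason` in the pipeline)."""
    total = 0
    for line in scontrol_output.splitlines():
        if "Reason" in line:
            continue
        match = GRES_GPU.search(line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical sample in the one-node-per-line style of `scontrol -a -o show node`.
sample = "\n".join([
    "NodeName=node001 Gres=gpu:a100:4(S:0-1) State=IDLE",
    "NodeName=node002 Gres=gpu:2 State=MIXED",
    "NodeName=node003 Gres=gpu:v100:4 State=DOWN Reason=maintenance",
    "NodeName=node004 Gres=(null) State=IDLE",
])
print(count_gpus(sample))
```

On a cluster, the text passed to `count_gpus` would be the output of `scontrol -a -o show node`, captured for instance with `subprocess.run(..., capture_output=True, text=True).stdout`.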

Across all ORCD systems (Engaging, Satori, Openmind and Supercloud) there are more than 1400 GPUs of various generations. Of these, only 32 are currently the latest-generation H100 GPUs, although another 96 are in the process of being added to Engaging. All the H100s to date are PI-funded and have priority access for specific sub-groups.

Serious AI’ers need serious GPU support. Around MIT, the magic number from PIs seems to be 4 Nvidia H100 GPUs, on average, per researcher. From the Y report, MIT has about 1,000 graduate students in this space, which means about 4,000 GPUs, and at $30,000/GPU (see here), this rolls up to $120M. Moore, Dennard, and warranty coverage give this computing hardware a 5-year lifetime, so the renewal costs come to $24M/year. Each GPU takes up to 1 kW, so peak electrical power comes to 4 MW (MGHPCC has about 3 MW for MIT right now), and running them at an average of 50% power consumption comes to 21M kWh/year (including energy for cooling). Power in MGHPCC costs about $0.09/kWh, for a power bill of about $2M/year. In a perfect world, MIT would make a $120M investment in new hardware right away with a combined operation, maintenance, and renewal cost of $26M/yr. Here is the scaling relation if you want to adjust any of the numbers in this estimate:

[Image: scaling relation formula]
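
For anyone who wants to plug in their own numbers, the arithmetic in the estimate above can be sketched in a few lines of Python. This is a reconstruction from the figures quoted in the text, not the formula from the image. The function and parameter names are invented for illustration, and the cooling overhead of 1.2 is an assumption, inferred because 4,000 GPUs at 50% of 1 kW for 8,760 hours come to about 17.5M kWh while the text quotes 21M kWh including cooling.

```python
HOURS_PER_YEAR = 8760


def gpu_cost_estimate(
    researchers=1_000,
    gpus_per_researcher=4,
    price_per_gpu=30_000.0,   # dollars
    lifetime_years=5.0,
    kw_per_gpu=1.0,           # peak draw per GPU
    avg_power_fraction=0.5,   # average draw as a fraction of peak
    cooling_overhead=1.2,     # ASSUMPTION: inferred from 21M vs 17.5M kWh
    price_per_kwh=0.09,       # dollars
):
    """Return the capital, renewal, and power figures for a GPU fleet."""
    gpus = researchers * gpus_per_researcher
    capital = gpus * price_per_gpu
    renewal_per_year = capital / lifetime_years
    peak_mw = gpus * kw_per_gpu / 1000.0
    energy_kwh = (gpus * kw_per_gpu * avg_power_fraction
                  * HOURS_PER_YEAR * cooling_overhead)
    power_bill = energy_kwh * price_per_kwh
    return {
        "gpus": gpus,
        "capital": capital,
        "renewal_per_year": renewal_per_year,
        "peak_mw": peak_mw,
        "energy_kwh_per_year": energy_kwh,
        "power_bill_per_year": power_bill,
        "annual_total": renewal_per_year + power_bill,
    }


baseline = gpu_cost_estimate()
halved = gpu_cost_estimate(gpus_per_researcher=2)

for name, est in (("baseline", baseline), ("2 GPUs each", halved)):
    print(f"{name}: capital ${est['capital'] / 1e6:.0f}M, "
          f"annual ${est['annual_total'] / 1e6:.1f}M, "
          f"peak {est['peak_mw']:.1f} MW")
```

With the default arguments the sketch reproduces the figures in the text to rounding: $120M of hardware, $24M/year of renewal, roughly 21M kWh/year, and a power bill just under $2M/year.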

Maybe each researcher can get away with 2 GPUs instead of 4, or only 500 researchers instead of 1,000 need that level of service, or other factors make this a factor-of-two overestimate. Our researchers currently spend about $10M on new hardware and $5M on cloud services, so existing expenditures, in concert with a fundraising effort, could meet our renewal needs and grow a large shared system.

Our researchers cannot wait for the needed culture change and slow growth. That's why we have initiated a crucial fundraising effort to meet MIT’s near-term computing needs, especially for AI-oriented hardware along with matching fast storage and fast networking. All these, alongside professional staff, are needed to make systems fly for research. Your participation is vital in this endeavor.

Headlines

What we're reading