It turns out that, beyond the issues of power consumption and hardware monitoring, the main problem was indeed the lack of ECC memory. Virginia Tech itself states:
"On September 23, 2003 we turned it all on and began the arduous process of stabilizing and benchmarking this machine. After many more sleepless nights and countless grams of caffeine we finally reached our 10 teraflop goal.
"Well with the concept proven we now had to make sure we had a system capable of conducting scientific computation. We needed to upgrade the system to something with error correcting code (ECC) RAM. The Power Macs did not support it and the XServes were coming. So in January we tore the system down and started prepping for the XServes. And now they're here and we have our final system. The best is yet to come."
It seems rather odd that VT would build a 2,200-CPU cluster merely as a proof of concept and to hit 10 Tflop/s (and make the top ten list). One may wonder if they built it thinking it might be suitable for certain short-term usage even without ECC, but then later realized the lack of ECC memory was more problematic than they had predicted.
My guess, however, is that Apple had told them that the G5 Xserves were coming, and Virginia Tech indeed built the first System X mainly for benchmarking, expecting the G5 Xserves to appear just a few months later. Furthermore, the project's leader, Dr. Srinidhi Varadarajan, had stated early on that they would be moving to an ECC-capable system later. We just didn't realize how soon that would be. Unfortunately for Virginia Tech (and Apple), the new machines were delayed six months because of IBM's problems with its 90 nm process shrink.
However, it's all history now, since the new System X is finally up and running.