Tuesday, February 08, 2005

Cell and Apple?

This week at ISSCC 2005, IBM/Sony/Toshiba released some preliminary information on their new Cell processor, which is destined to be used in Sony's PlayStation 3, to be released in 2006.

What is Cell?

1) Cell has a single 64-bit, in-order PowerPC core which supports both VMX (aka Altivec) and simultaneous multithreading (SMT). Contrary to rumours, it is not based on the POWER5 core. Rather, it is based on a previously unused design discussed at ISSCC 2000, but it likely borrows some design concepts from POWER5. Power consumption of the PowerPC core is unknown.

2) Cell also has 8 independent "synergistic processing elements" (SPEs), which function akin to complex DSP cores. Each SPE has its own local 256 KB memory. With 8 SPEs plus another 512 KB of L2 cache, that's a total of 2.5 MB of on-die memory.

3) Cell's clock speed for the PS3 is unknown, but Cell can apparently hit 5.2 GHz at 1.3 V. Power consumption is of course high at that level, with each SPE drawing around 12 watts, but at 4 GHz it can probably run at 1.1 V, where each SPE may draw only 4 watts.

4) Cell's 9 cores are all connected by an "Element Interconnect Bus" (EIB), which runs at half the core clock speed. The EIB also connects to the 512 KB L2 cache, the bus interface controller (BIC), and the dual XDR memory controller. In other words, on-chip communication between the various cores and controllers is very fast: 2 GHz for a 4 GHz chip.

5) Cell has 234 million transistors and, fabricated on a 90 nm process, measures 221 mm². However, to reduce die area, power consumption, and cost, it's quite possible that a PS3 version of Cell would have fewer SPE cores, would be fabricated on a 65 nm process, or both.

In essence, this Cell chip is a new and lean PowerPC chip with Altivec support, and a bunch of non-Altivec vector CPUs tacked onto it (each with its own local memory), all connected by a very fast bus.

What does this mean for Apple?

Quite frankly, I'm not sure. It's a very high-clocked chip, and the PowerPC/Altivec core could likely be supported easily in OS X. However, the SPE cores would be new for Apple, and at this point might just be extra baggage. And considering the amount of die real estate they consume, that baggage would be awfully expensive.

If Apple were to consider using this chip complete with the SPE cores, it would take a significant amount of work to get everything in place in Apple's OS and software, as well as to get the developers on board. Without the SPE cores (and the funky busses) the chip seems like a more traditional design, but with supposedly some potentially significant limitations, despite the high clock speed. (Supposedly, because I'm not an engineer.)

Given the above, it still seems to me that the more likely scenario is that Apple will continue to use the G5 970 series chips, with higher clock speeds, further optimizations for lower power consumption, and perhaps dual cores and increased cache. Eventually, stripped-down, SMT-capable POWER5-derived cores with an integrated memory controller and Altivec could also be used.

Cell presents an interesting proposition for Apple for use in the future, but it doesn't seem to be in the cards in the near term.

One more thing... Rumours suggest that Xenon, the chip in Microsoft's Xbox 2, also uses a similar PowerPC core, except that Xenon has 3 of them and no SPE cores. This would mean that IBM is selling this Power Architecture core design to at least two major competing customers, while letting each decide how the overall chip should be implemented.

4 comments:

Anonymous said...

I agree with your assessment that CELL is a long shot for the Mac. Even though it has considerable promise, the CELL design still hasn't even proved that it can deliver the goods in a game console.

Brian

Anonymous said...

"Without the SPE cores (and the funky busses) the chip seems like a more traditional design, but with supposedly some potentially significant limitations, despite the high clock speed. (Supposedly, because I'm not an engineer.)"
Two things make the CELL's PPC core (hereafter referred to as "CELL") less desirable than the PPC core in the G5 (hereafter referred to as "G5").

The first is that, I think, the CELL has fewer functional units than the G5. While the G5 can execute a branch, two integer adds, two floating-point adds and multiplies, a vector permute, and some other operations all in the same cycle, the CELL can't do nearly as much.

The second, and this I'm sure of, is that the CELL is strictly in-order, while the G5 can do a ton of speculative out-of-order execution. Imagine a program that does a load that misses the L2 cache, followed by a bunch of instructions, some of which use the loaded value and many of which don't (this isn't at all unusual).

The G5 will start the load, remember all the instructions that need the loaded value, and execute all the ones that don't (keeping those results in a special place in case the load fails and it needs to undo them). When the load finally completes, it will execute everything that needed the load and start committing the results of the instructions that didn't.

The CELL will see the load and stall until the load completes. Once the load finishes, it will pick up from there.

It isn't quite that bad, since the CELL can switch to another thread (but only if one exists) while it waits for the load to finish, but that doesn't help most programs. It is also worse in some ways: while the G5 is running the rest of the code waiting for the load to finish, it can start a few more loads that it sees along the way, whereas the CELL will have to stall several more times.

In some cases sprinkling the code with the right prefetch instructions in the right places can ease this pain. In other cases it is too hard to figure out what addresses to prefetch early enough to help.

Basically, the gap between what the CELL can do in "the best case" and what it will end up doing on "any old code" will be much larger than on the G5.

And all of that is without considering what the 8 SPEs would be doing (they also strike me as CPUs with a huge gap between "best case" and "any old code"). It is easy to imagine code written specifically for the CELL running 100 times faster than code written "for a PPC", or even just "in plain old C", on that same CELL CPU.

Anonymous said...

What you guys keep forgetting:
What about the option of using the Cell as a coprocessor? Imagine Apple's Pro Suite with Cell-specific optimizations: FCP, Shake, Motion, DVD Studio Pro, even Logic! It's the PERFECT match for these programs!
Hell, it could be moved down into consumer space later on, and iLife could significantly profit from it, too! Imagine this: iDVD could actually become USABLE on anything but high-end G5s! ;-) Encoding a full DVD would be a matter of minutes with the Cell, not a matter of hours as it is now.

Apple would just supply the APIs for the co-processor and whoever wants to use it can do so. Current code continues to run fine on the standard G5 (no, they won't ever make G4s with this, not with the age-old MAXbus arch! ;-)

Another argument for the coprocessor version is that the memory architectures available for PCs now and next year (even DDR3) would starve the Cell big time, especially because it relies so heavily on streaming from RAM. So it would need its own fast XDR memory anyway!

One more thing: The Xbox2-CPU isn't called Xenon. That's the Xbox2 development name for the whole system, the CPU carries a different name.

Eug said...

Yes, thanks for correcting me. Xenon is not the name of the next-gen Xbox's CPU specifically. Xenon is the name for the entire system.