IBM's 970
Power4-lite, and the future of the PowerPC?
     By: David K. Every
Kind: Article
Created: 2002-10-31 06:27:41
Size: 21 KB
 
People have been asking me about the 970 architecture and what it likely means to the Mac platform. I wasn’t able to go to this MPF (Microprocessor Forum) where the details were released, but I’ve read what information I could get via the Web. I also requested information from IBM a couple of times, but so far haven’t been able to get anything beyond what they have on their Power4 (which I’d already followed and got the basics of). So the information I have is thinner than what I prefer before writing a deep technical analysis.

Ars Technica did a pretty good article at http://arstechnica.com/cpu/02q2/ppc970/ppc970-1.html. If that article were complete, I’d feel no need to write more on the subject.

While I generally like Hannibal’s ability to explain technical concepts, he has historically put a very Intel-ish spin on things, while minimizing points that reflect on the design flaws of the x86. In other words, his points are technically accurate but a little misleading, and aren’t always the complete story. Read his article alone, and you would likely walk away with a completely different conclusion than if you factored in all the facts, or also read my own points (which counterbalance some of his).

Let’s look at some areas of the processors.


For example, you can read the entire article and not leave realizing that right now the vector unit (AltiVec) in a PPC (PowerPC) is so good, or the vector unit(s) in a P4 are so bad, that at 1/2 to 1/3 the clock speed the PPC can still keep up with and usually beat a P4. In other words, a PPC at 1 GHz is currently competitive with, if not outright superior to, a P4 at 3 GHz for vectors. And vectors are very important to media, streaming, graphics and many of the things that you need speed for.

The end result is that the 970 should conservatively cut the difference in clock ratios from 1 GHz : 3 GHz to about 2:3.5 or 2:4. No matter how you look at it, that is a big improvement for the PPC. More important, in the same time frame I expect to see improvements in vectorizing compilers, in the libraries that Apple has already created to help programs (and the programmers, Apple included, using them), in the software that uses them, and most importantly in the cache and memory subsystems, which should radically help the PPC’s ability to keep AltiVec fed (the only thing keeping the PPC from beating the P4 even worse in vector performance).

So no matter how you cut it, the PPC is already superior to the P4 in vectors, and the 970 is an impressive jump forward.


One area where the PowerPC has not kept up is in cache and memory subsystems. The value of these systems is overrated, and the differences are overvalued. You hear things like the P4 has a memory bus running at 800 MHz and the PowerPC’s only runs at 133 MHz. But if you get the facts, the PowerPC’s is a wider bus (meaning it transfers more each cycle). When the PPC uses DDR and has double-pumped memory, people quote the lower number, but for the P4 they always quote the higher number. People don’t mention the latency of RDRAM, and so on. There are facts behind what is being said, just not complete facts.
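
The width-versus-clock point can be put in rough numbers. The sketch below is only illustrative: the bus widths, clocks and transfers per clock are assumed example configurations, not measurements of any particular P4 or PowerPC system, and peak bandwidth says nothing about latency.

```python
# Peak bandwidth = bus width (bytes) x clock (MHz) x transfers per clock.
# All configurations here are assumed examples, not figures from the article.

def peak_bandwidth_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Return theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return (width_bits // 8) * clock_mhz * transfers_per_clock

# A narrow, fast channel: 16 bits wide, 400 MHz clock, two transfers per
# clock (the kind of setup that gets quoted as "800 MHz").
narrow_fast = peak_bandwidth_mb_s(16, 400, 2)

# A wide, slow bus: 64 bits wide, 133 MHz clock, one transfer per clock.
wide_slow = peak_bandwidth_mb_s(64, 133, 1)

# The same wide bus with double-pumped (DDR) memory.
wide_ddr = peak_bandwidth_mb_s(64, 133, 2)

print(narrow_fast, wide_slow, wide_ddr)
```

Under these assumptions the "800 MHz" channel moves about 1.5 times what the "133 MHz" bus does, not 6 times, and the double-pumped wide bus moves more than either.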

Another thing to remember is that cache matters a lot. 98% of instructions are executed from cache. So if you have four times the bus bandwidth, that translates loosely into 4 times the performance for only the 2% of instructions that you are executing from memory, implying that 4 times faster is really only like 8% better. Now that’s misleading too; in truth it is more complex than that. While that is true in instruction count, in time those non-cache accesses take far longer, and in the real world, much of the time you are executing from cache you are also prefetching other instructions and so on. And while 98% average hit rates are not unusual, there are many surges where programs and processors are very dependent on just moving things around. But the point is that a bus that is 4 times faster will probably only translate to 25-50% better in the real world; those stressing their advantage seldom point that out.
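
A back-of-the-envelope model shows why counting instructions understates the effect while "4 times faster" overstates it. This is a sketch under loud assumptions: only the 98% hit rate comes from the text; the cycle costs are made-up illustrative values, and a 4 times faster bus is simplified to mean each miss costs a quarter of the time.

```python
# Average cost per access = hit_rate * cache_cost + miss_rate * memory_cost.

def average_cycles_per_access(hit_rate, cache_cycles, memory_cycles):
    """Weighted average cost of an access, in cycles."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles

HIT_RATE = 0.98        # from the text
CACHE_CYCLES = 1.0     # assumed cost of a cache hit
MEMORY_CYCLES = 25.0   # assumed cost of going out to memory

slow_bus = average_cycles_per_access(HIT_RATE, CACHE_CYCLES, MEMORY_CYCLES)
# Simplification: a 4x faster bus makes every miss 4x cheaper.
fast_bus = average_cycles_per_access(HIT_RATE, CACHE_CYCLES, MEMORY_CYCLES / 4)

speedup = slow_bus / fast_bus
print(f"slow bus: {slow_bus:.3f} cycles, fast bus: {fast_bus:.3f} cycles")
print(f"overall improvement: {(speedup - 1) * 100:.0f}%")
```

With these assumed costs the overall gain lands in the 25-50% range; a more expensive miss pushes it higher, a cheaper one pushes it lower.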

The PowerPC has better cache hinting, and the better ISA (instruction set) seems to mean better use of cache. But the 970 takes this to the next level, with twice the instruction cache, twice the L2 cache, and a fast bus to the memory controller. And the PowerPC could already share cache between multiple processors better than the P4 can. So there are a lot of improvements.

All that being said, the PPC has not kept up in memory and bus performance. The Pentium and AMD busses have done a far better job of moving more data. The PowerPC’s better instruction set, better cache hinting, and other things have helped a little. But memory and bus architecture is one area where the PPC has fallen behind.

The good news is that the 970 will apparently radically alter this, and will leap the PowerPC’s bus and memory bandwidth from worst-in-breed to best-in-breed. Now there are many busses, and I don’t think we have the full story there. I think the 970’s 900 MHz processor and cache bus is only a small part of the story; I’m more concerned with all the other I/O, and with how well the computer works as a system rather than just as a processor. But the point is that if bus bandwidth is what’s been holding the PowerPC back, then we’re going to see a radical leap forward.


In processors you really hate pipe stalls. What happens is that your processor is trying to run ahead of itself, and pre-execute many instructions at the same time. Instructions are basically setting or getting values, doing math or computations on those values, and then branching forward or backward depending on some test (conditional). Branches are ugly, because they can go either way: either pass or fail the test. So the part of the processor that is “reading ahead” has to guess which way things are going to go. When it guesses wrong, you basically dump all your pipes (and some cache, and waste some memory bandwidth, and so on), and start over. The other side of this is that the bigger your pipe, the bigger the impact of these hits. These hits add up and can be a pretty big deal in processor performance.

After reading the Ars Technica article, I don’t think the average user walks away fully grasping that the G4 has a simple branch prediction unit because it only needs a simple one. It has a shallow pipe, so the penalties for mispredicted branches are smaller. The P4 has a huge pipe, so its penalties are huge, and thus it has a far more complex branch prediction unit. And they still seem to be roughly equally effective in the real world, though stats measuring just that aren’t often published. But the 970 has a shorter and more efficient pipe than the P4, combined with a far more powerful branch prediction unit. That to me says better and better. The Ars article starts making excuses for why it needs that, and there are some reasons why the IBM designers cared, but I don’t think most people left the Ars article realizing that the 970 is just flat out better at branch prediction, which means more efficient use of resources.
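
The trade-off between pipe depth and predictor quality can be sketched with a simple cost model. Every number below is a hypothetical placeholder (branch frequency, mispredict rates, and penalties are not published figures for the G4, P4 or 970), and the flush penalty is simplified to a fixed cycle count per mispredicted branch.

```python
# Cycles lost per instruction =
#   fraction of instructions that are branches
#   * fraction of those branches that are mispredicted
#   * cycles thrown away on each misprediction (grows with pipe depth).

def cycles_lost_per_instruction(branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles wasted per instruction on mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles

BRANCH_FRACTION = 0.20  # assumed: one instruction in five is a branch

# Shallow pipe, simple predictor (hypothetical: 10% misses, 7-cycle penalty).
shallow_simple = cycles_lost_per_instruction(BRANCH_FRACTION, 0.10, 7)

# Deep pipe, complex predictor (hypothetical: 4% misses, 20-cycle penalty).
deep_complex = cycles_lost_per_instruction(BRANCH_FRACTION, 0.04, 20)

# Medium pipe, strong predictor (hypothetical: 4% misses, 12-cycle penalty).
medium_strong = cycles_lost_per_instruction(BRANCH_FRACTION, 0.04, 12)

for name, lost in [("shallow + simple", shallow_simple),
                   ("deep + complex", deep_complex),
                   ("medium + strong", medium_strong)]:
    print(f"{name}: {lost:.3f} cycles lost per instruction")
```

With these placeholder values the shallow pipe with a simple predictor and the deep pipe with a complex one waste about the same amount, while a shorter pipe paired with a strong predictor wastes noticeably less.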



ISA means instruction set architecture. Back when programmers used to program in assembler this mattered a lot. Nowadays, compilers hide assembler behind higher-level languages so it matters less. But still, the better and cleaner your instruction set, the easier it is to write good code, or good compilers. The Intel proponents know this, but also know that their instruction set is an ugly, 30-year-out-of-date, hacked-together kludge, so they never seem to mention this important fact.

Compilers hide some of the ugliness, and can help reduce the inefficiencies, but in the end there is a loss. As processors get bigger and bigger, this loss is minimized, because the transistors required to translate the anachronistic instructions into better ones take up less and less of the total chip. But it still matters. Even the PowerPC instruction set, which is about 3 generations better than the x86’s, still has inefficiencies and ugliness in it. Yet the fact is that the instruction set of the PowerPC is still far better than the x86’s. The PPC has more registers, can use those registers better, is far easier to program, and so on.

How much this superiority translates to the real world is debatable. There might be a slight advantage for the older x86 instruction set, in that it can theoretically be more efficient with memory (do more in each instruction). But in practice, while there might be some “code creep” (increase) using the PowerPC, the inefficiencies in the x86 instruction set cause at least as much loss, as you need to add instructions to get around limitations or lack of registers and so on. This probably contributes to why the PowerPC seems to require less power, fewer transistors (manufacturing costs) and fewer designers to get the same performance.



IPC is instructions per cycle: how much the processor can get done in each clock. The PowerPCs are far better at this than Intel processors, especially the P4. This is why PowerPCs running at slower clock rates can still compete with faster Pentium 4s. The P4s are running fast, but the PowerPCs are getting more work done.
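
The relationship between clock rate and IPC is simple multiplication, which the sketch below spells out. The IPC of 1.0 used for the faster-clocked chip is an arbitrary assumption for illustration, not a measured value for any real processor.

```python
# Work done per second = instructions per cycle * cycles per second.

def instructions_per_second(ipc, clock_hz):
    """Throughput in instructions per second."""
    return ipc * clock_hz

def ipc_needed_to_match(other_ipc, other_clock_hz, own_clock_hz):
    """IPC a chip needs at own_clock_hz to equal the other chip's throughput."""
    return other_ipc * other_clock_hz / own_clock_hz

GHZ = 1_000_000_000
ASSUMED_FAST_CHIP_IPC = 1.0  # hypothetical baseline, not a measurement

# A 1 GHz chip keeping pace with a 3 GHz chip...
needed_1_vs_3 = ipc_needed_to_match(ASSUMED_FAST_CHIP_IPC, 3 * GHZ, 1 * GHZ)
# ...versus a 2 GHz chip keeping pace with a 4 GHz chip.
needed_2_vs_4 = ipc_needed_to_match(ASSUMED_FAST_CHIP_IPC, 4 * GHZ, 2 * GHZ)

print(needed_1_vs_3, needed_2_vs_4)
```

In this model a chip at a third of the clock has to do three times the work per cycle to break even, and narrowing the clock gap to 2:4 drops that requirement to twice the work per cycle.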

One of the ways you get more work done each clock is to have more internal units, and to keep those units fed. You feed them through cache and bus architecture, and by using those units more efficiently. The latter is accomplished by OOOE, or out-of-order execution.

Basically, you have many units of different types (complex integer, simple integer, complex float, simple float, branch, vectors, and so on); but instructions can come in clusters, all wanting to use units of the same type at the same time. So you can either have all the other units sitting around twiddling their anthropomorphic thumbs, or you can reorder the instructions (the code) to better use all those units at once. You just advance some things that you’re going to need to do in the future anyway, do them earlier, and then remember the results for later when it is their turn (you reorder them and put them back in place). This helps smooth out the rough spots and balance out the clusters or surges, so you use the processor more evenly and get more instructions per second done. The bigger the reorder buffers, the more you can smooth.
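
A toy scheduler makes the clustering problem concrete. This is a deliberately simplified sketch, not a model of the 970 or the P4: it assumes one unit of each type, every instruction takes one cycle, and there are no data dependencies between instructions. The unit names and the example program are invented for illustration.

```python
def cycles_in_order(program, units):
    """Issue strictly in program order; stop each cycle at the first
    instruction whose unit type is already busy."""
    cycles = 0
    i = 0
    while i < len(program):
        free = dict(units)  # every unit is free again at the start of a cycle
        while i < len(program) and free[program[i]] > 0:
            free[program[i]] -= 1
            i += 1
        cycles += 1
    return cycles

def cycles_out_of_order(program, units, window):
    """Each cycle, look at the next `window` pending instructions and
    issue any (oldest first) whose unit type is still free."""
    pending = list(program)
    cycles = 0
    while pending:
        free = dict(units)
        issued = []
        for idx, kind in enumerate(pending[:window]):
            if free[kind] > 0:
                free[kind] -= 1
                issued.append(idx)
        for idx in reversed(issued):  # delete from the back so indexes stay valid
            del pending[idx]
        cycles += 1
    return cycles

# One unit of each (hypothetical) type, and a program that arrives in clusters.
UNITS = {"int": 1, "fp": 1, "vec": 1}
PROGRAM = ["int"] * 3 + ["fp"] * 3 + ["vec"] * 3

in_order = cycles_in_order(PROGRAM, UNITS)
small_window = cycles_out_of_order(PROGRAM, UNITS, window=4)
large_window = cycles_out_of_order(PROGRAM, UNITS, window=9)
print(in_order, small_window, large_window)
```

In this toy case the clustered program takes 7 cycles in order, 5 with a four-entry window, and 3 with a window large enough to see the whole program, which is the "bigger buffers smooth more" effect in miniature.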

Now the 970 has more units that can be executing things at once. Most of those units appear to be better designed, so they work better. The 970 also has a better ISA than the P4, which means that the instructions on the outside more closely match what the internals of the processor require, so less reordering is probably necessary and more can be done in the compiler ahead of time (with less work). And on top of being better at that, the 970 is better still by being able to reorder more instructions than the P4. Again, better and better.

The Ars Technica article spent a lot of time sort of confusing the matter. They point out that the 970 groups instructions into chunks, and that it dispatches in ways to keep those units filled, and they spent a lot of time explaining bubbles and inefficiencies. They didn’t really explain that those bubbles happen in all processors, or get to the final point.

So while the P4 can have 126 pseudo-instructions in flight at once, the 970 can have 200. Yes, there is a little less efficiency in the way that instructions are grouped, and so sometimes it puts in do-nothings (nops or no-operations). But let’s go out on a limb and assume there’s as much as 40% inefficiency, or that it is 60% efficient; that would still leave about 120 useful instructions running (at some stage of completion) at any given time, roughly on par with Intel’s design. On top of that, the comparison assumes 100% efficiency for the P4, which is just not the real world.

When you look at the P4, you realize there are bubbles and stalls as well. On top of that, branch prediction is also important to make sure that you’re executing the right instructions (and the 970 is better at that as well). On top of that, in the P4’s case, you have the inefficiencies of the worse ISA converting single instructions into many more IOPs (internal instructions), where the 970 is far closer to 1:1. So of those internal instructions, the 970 is completing more real instructions at any given time. And so on, and so forth. So the Ars article uses smoke and mirrors to obfuscate the fact that the PPC G4 is already far better at IPC than the P4, and the 970 takes that up another level or two.



There are many ways to mislead people. Early in the last century there were two leaders: one was an honest vegetarian and artist who wasn’t promiscuous, didn’t drink much compared to his contemporaries, and led his people out of a depression by creating all sorts of charity and work programs. The other leader of the time hung out with crooks, broke the law, tried to seize even more power in his nation in what could almost be described as a failed coup, misled the people, exploited situations and power for personal gain or the gain of his family, and was a chain-smoking alcoholic cripple who cheated on his wife. These true statements would make you think that Hitler was a better person than, say, FDR (Roosevelt); oh yeah, I forgot to tell you that the first one was responsible for one of the worst genocides in history and the other was trying to stop him. Sometimes partial facts don’t tell the whole story. The value isn’t in raw facts, but in how well you interpret them.

The Pentium can do some things, like execute “4.5” integer instructions at once, which sounds good. If you hear that, and then hear that the 970 can only do 2, you’d come away misled, thinking that the Pentium would be better at instructions per second or better at integers. But the truth is that if you are trying to execute many integer instructions each second, then you’re going to use the AltiVec (vector) unit anyway. That can execute 3 instructions at once, each of which does the work of up to 16 or more sub-instructions (2 ALUs that can each be working on up to 16 values, plus a permute or reorder-shift unit, all working at once). So in the 970, you use the integer units for less (just some scalar math, and flow control), because you have a much better vector unit; and in fact the 970 will just blow the doors off the P4 at most integer instructions per second, or most things you would use the integer unit for (like just moving things around, and so on).
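
The “up to 16 values at once” figure follows from the width of an AltiVec register, which is 128 bits. The sketch below just does that arithmetic; the workload size of 1000 values is an arbitrary example, and real code also pays for loads, stores and alignment, which this ignores.

```python
VECTOR_BITS = 128  # width of one AltiVec register

def lanes(element_bits):
    """How many values of a given size fit in one vector register."""
    return VECTOR_BITS // element_bits

def vector_ops_needed(n_values, element_bits):
    """Vector instructions needed to touch n_values once (ceiling division)."""
    per_op = lanes(element_bits)
    return -(-n_values // per_op)

N_VALUES = 1000  # arbitrary example workload

for bits in (8, 16, 32):
    print(f"{bits}-bit values: {lanes(bits)} per instruction, "
          f"{vector_ops_needed(N_VALUES, bits)} vector ops "
          f"instead of {N_VALUES} scalar ops")
```

The 16-at-once case is the 8-bit one, the kind of data found in pixels and audio samples; for 32-bit integers the same register holds 4 values.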

My only problem with the Ars articles (Hannibal’s especially) is their conclusions. They often lay out the facts, but skip important ones, or spin things, and leave people thinking that the x86 is really great stuff. But then that seems to conflict with the real world: how come a processor like the G4 can do so well at nearly keeping up with, and often beating, the P4, with such a primitive architecture and at 1/3rd the clock rate? The answer is in the omissions.

Don’t get me wrong; I think the P4 is a more sophisticated processor than the G4. And I think for more things the P4 will be faster. But not nearly as much as most PC advocates think or will mislead you into thinking. I do think Apple stresses their strong points and hides their weak ones, just like Intel or the PC advocates do. So it isn’t black and white. But in the end, the propaganda and misinformation machine of Intel is far more effective than Apple’s is, which is proven by how misinformed people are, or how surprised they are when they actually use a Mac to do work.



So far the Power4 stomps the Pentium in floating point, and competes seriously on scalar integers. But the 970 is likely to improve both. It is also going to make an even bigger impact by adding in AltiVec, where there aren’t good benchmarks (other than Photoshop or other real-world tests) to reflect that.

IBM has traditionally been very conservative with their benchmarking and performance numbers, and traditionally overdelivers. Intel has traditionally been misleading and overstated their numbers, and been sued for that a few times. Estimates seem to be that, worst case, the 970 will be at close to parity by next year. Which most likely means that in the real world, the IBM-based solutions should be far better.

Intel only measures raw performance of things they are good at (highly optimized benchmark code). The real world is that Intel processors are usually running Windows, less optimized commercial applications, and so on. I suspect that OS X and Linux running on the 970 are going to be far more efficient OSes to run things on than Windows. Add to this that you’re probably going to have better compilers and apps, and you end up with better still real-world performance.

So where is the 970 better? Just in instruction set, use of transistors, power usage, cache, cost to design and complexity, cost to manufacture, bus speed, branch prediction, number of units, out of order execution, instructions per cycle, multiprocessing, ease of tuning and hinting, integer vectors, floating point vectors, 64 bit support, pipeline design, and probably a few things that I forgot or don’t know of.

And Apple and their OS seem more ready to use more processors at once, and IBM is likely to produce multi-core versions of the processor in the future; because the processor produces less heat, uses fewer transistors, and is better at MP, it will be easier to do so. Plus IBM is the cutting-edge company when it comes to manufacturing process; again, more likely to go multi-core sooner.

What is the P4 going to be better at? Possibly a little in instruction density; programs take less space because they don’t support 64 bits and use a denser, older CISC-based instruction set design. The P4 will be better in raw clock speed. And the P4 will likely support hyper-threading, a technique that allows the processor to try to execute two whole threads of execution at the same time. So far hyper-threading has netted them very little boost in the real world, and a few losses; and IBM did that first and better, but abandoned it as not being worth it. But it still has the potential to give them some boost. A true MP or multi-core design could do far better, though, and that is something IBM and Apple have demonstrated they are more aggressive with.

We are speculating about a processor that is a year out, and about the actions of Apple Computer. Apple hasn’t always been the most predictable or wise. In fact, my biggest concerns are in how Apple is going to react. Traditionally, Apple managers have a history of thinking marketing. Marketing logic tells them not to exploit things to the maximum possible in the first pass and cause a demand spike that they can’t keep up with; instead they have often crippled or limited first releases to limit demand, so that they could improve things in later versions and smooth out the spikes in product evolution that are natural. This makes perfect business sense in the now, as a way to limit growth to manageable levels, but it also hurts perception, as people sometimes feel Apple is holding back. And there certainly is some justification for thinking that they are doing exactly that (Performas and other fiascos come to mind). This is a new Apple. Let’s hope they learned from older mistakes and are going to just let hardware designers do their jobs and make the best machines they can, instead of having marketing design the hardware. And let’s hope they realize that not being able to meet demand hurts business in the short term, but having the most wanted and hard-to-get machines can help you in perception and demand in the long term.

In the end the 970 has a lot of potential, but we’ll have to see how well that potential is realized. Apple and IBM have the ability to make their whole CHRP dream come true, only a decade later than originally intended. If they do make nice, this could have serious positive ramifications for both companies and, more important, the computing industry at large. I’d love to see IBM and Apple producing UNIX boxes that scale from the low end all the way up to IBM’s mainframes, or at least higher-end servers. This could do more for UNIX and computing than Linux has. There is the potential to beat the Wintel camp, not just in performance, which is becoming less and less critical by itself, but in getting work done and creating powerful solutions. There’s still a ton of bias, misinformation, and egos to get over before this dream can become a reality. But even in the worst case, if those dreams aren’t realized, this seems to be a major step forward for Apple and its customers. And for the Wintel camp, competition is likely to result in more innovation and value. So let’s keep our fingers crossed and hope that next year nets us some real cool toys, and a serious change in the computer market.

Copyright 2003 DKE • All rights reserved • www.iGeek.com