PPC crosscompilation benchmarks

I'm frustrated with the lack of hard benchmarks to allow comparison between x86 and Apple hardware.  Most published Mac benchmarks want to talk about Photoshop filters, media encoding, and so on.  The Power Mac G4 page says the 733 MHz G4 is up to 57% faster than a 1.5 GHz Pentium 4---with the footnote "*Based on a suite of performance tests using Adobe Photoshop 6.0."  Apple's G4 page talks about supercomputing.  And those are good and useful numbers if that's what you care about doing.

I don't do supercomputing.  I don't do Photoshop.  My media processing is limited to MP3 playback, and one MP3 encoding tool (with very particular parameters).  My two time sink games are Diablo 2 and Counter-Strike, which aren't 3D floating point monsters.

I care a lot more about integer code.  Stuff like web browsers, mail readers, editors---heck, anything with a lot of cycles going into the interface.  And I'm a hacker too, so I care about shells, compilers, interpreters, emulators,  network servers, and so on.

I don't have a good way of measuring UI performance, especially across platforms.  But I do have a number of big "rebuild everything" compile tasks that I want to be fast.  Almost all of them are hosted on Linux, which is good because Linux runs on most hardware I run into.  While these tasks are no substitute for a broad, principled benchmark like SPEC CPU2000, they do exercise many different features and can serve as a rough proxy for integer performance.  And besides, I already had my tasks ready and packaged :-)

(Update: added cross-only totals and bogohurts numbers; the implications weren't clear enough in the summary tables.)

Tables

 
                      ------ install-egcs -------  ------ install-glibc ------  -------- cross-gcc --------  --- total ---
Machine                wall  (sec)  user  sys cpu    wall  (sec)  user  sys cpu   wall  (sec)  user  sys cpu    wall   user
bandwagon              6:14    374   328   32 96%    7:25    445   352   82 97%   2:29    149   137   10 98%     968    817
family-values          4:36    276   219   31 90%    5:35    335   242   71 93%   1:38     98    85   10 97%     709    546
style-over-substance  11:53    713   619   84 98%   13:15    795   548  235 98%   3:59    239   207   28 98%    1747   1374
ohlfs                  8:51    531   464   64 99%    9:47    587   399  182 98%   2:59    179   157   22 99%    1297   1020

(wall is shown as m:ss and as seconds; user and sys are CPU seconds; cpu is utilization)

 
Wall clock performance relative to P3/733 baseline (bigger is better)

Machine               CPU family   MHz  install-egcs  install-glibc  cross-gcc  total  cross-only  "P3 Bogohurts"  "Cross P3 Bogohurts"
bandwagon             P3           733          1.00           1.00       1.00   1.00        1.00             733                   733
family-values         Athlon      1200          1.36           1.33       1.52   1.37        1.37            1001                  1006
style-over-substance  G3           450          0.52           0.56       0.62   0.55        0.57             406                   421
ohlfs                 G4           533          0.70           0.76       0.83   0.75        0.78             547                   568

 
User time performance relative to P3/733 baseline (bigger is better)

Machine               CPU family   MHz  install-egcs  install-glibc  cross-gcc  total  cross-only  "P3 Bogohurts"  "Cross P3 Bogohurts"
bandwagon             P3           733          1.00           1.00       1.00   1.00        1.00             733                   733
family-values         Athlon      1200          1.50           1.45       1.61   1.50        1.50            1097                  1096
style-over-substance  G3           450          0.53           0.64       0.66   0.59        0.65             436                   475
ohlfs                 G4           533          0.71           0.88       0.87   0.80        0.88             587                   645

Oversimplified chart

(For people who want charts and graphs: Ken Fox is working on some more graphs---thanks!)

How to read the tables

I use the /usr/bin/time tool to measure statistics on each build process.  First, it reports wall clock time---looking at the clock on the wall, how long did the process take?  It then reports how much CPU time was charged to the build process by the kernel's time accounting.  This is divided into user process time and system time; the latter is time spent in the kernel on the process's behalf.  You can think of user time as how fast things could run if the kernel was infinitely fast.

Total CPU utilization as a percentage reports what part of the wall clock time was actually spent directly working on this process, either in user space or in the kernel; this will be less than 100% because of other running processes or waiting for disk I/O.
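
For example, here's roughly what GNU time prints for bandwagon's install-egcs run (the make target here is just a stand-in for my actual build script, and I've trimmed the trailing memory fields):

    $ /usr/bin/time make install-egcs
    328.00user 32.00system 6:14.00elapsed 96%CPU (...)

The first two numbers are user and system CPU seconds, then wall clock time and utilization.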

I don't entirely trust the kernel's CPU time accounting; it's had problems in the distant past.  The wall clock times should be trustworthy, though.

The bottom two tables summarize the raw data from the first table.  I decided to use bandwagon, a P3/733, as a baseline.   "cross-only" is the total of the two tests that should be roughly equal work on any architecture: install-glibc and cross-gcc.  See below for details.

In addition to relative performance metrics, I've added a bogus clock speed rating.  It's just the machine's performance multiplied by the baseline machine's 733 MHz clock speed.  It suggests how fast a Pentium 3 you'd need to equal the speed of the machine being tested.  It's bogus in a lot of different ways; for instance, I doubt a P3/850 would reach a score of 850, and a P3/450 should do better than a 450.  But it's fun to pretend performance scales linearly with clock speed....
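
For example, ohlfs's user-time number works out like this:

    # relative performance = baseline user time / test machine user time
    #   817 / 1020 = 0.80
    # bogohurts = relative performance * baseline clock speed
    #   0.80 * 733 MHz = ~587 "P3 bogohurts"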

Oh yes.  I've run these benchmarks multiple times on each machine.  The results were easily within 3% of the original each time.  Really, I should have averaged them all, but it didn't seem worth it with so little variation.

Choice of workloads

"Build the Linux kernel" is not a good portable benchmark; the set of drivers built will vary from architecture to architecture.  "Build gcc" suffers the same problem; different architectures have different code generators.

In general, timing native builds across platforms does not measure equal work.  Some architectures require more optimization work to be done at compile time in return for improved run time performance.  The only people who should care about native build performance are developers who are planning to work directly on the platform.  (Yes, that's me sometimes.)

Instead of compiling source to native binaries, we can compile source to binaries for a single chosen platform.  If every platform under consideration produces identical (say) SPARC object files, the same amount of work has gone into their production.  (This is the basis of benchmark 176.gcc in SPEC CPU2000.  Shame SPEC benchmarks aren't free.)

Compiling for a processor other than the host machine is called cross-compiling, and it's most often seen in the embedded world.  I work on the Linux VR project, an ongoing port of the Linux kernel to embedded MIPS processors such as the NEC VR41xx series and Toshiba TX39xx series.  I mostly work with the VTech Helio and Agenda VR3 PDAs.  Most people don't have powerful little-endian MIPS-based machines available, so most development is performed on standard desktop Linux boxes and cross-compiled to produce MIPS binaries for the devices.
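
From the user's point of view the mechanics are simple---you just invoke a different compiler driver.  A sketch (hello.c is a stand-in for real source):

    # Native compile: the output runs on the machine doing the compiling
    gcc -o hello hello.c
    # Cross compile: the output is a little-endian MIPS binary, whatever the host
    mipsel-linux-gcc -o hello hello.c
    file hello     # reports something like: ELF 32-bit LSB executable, MIPS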

I've done work on the compiler/assembler/linker toolchain itself for Linux VR devices, as well as on the GNU C library.  Often changes to one part of the system would require a rebuild of everything from scratch; waiting around for my 300MHz Cyrix box to chug through it all gave me a lot of time to contemplate performance and upgrades....

It would be nice to have some easy benchmarks that don't depend on Linux.  This would remove the inevitable complaints that "Mac hardware was designed to run Mac OS!"/"Sun hardware was designed to run Solaris!" etc.  I think a really big, complicated LaTeX document could be a good start; TeX is available for most platforms and operating systems and produces identical output everywhere.

Ground rules and assumptions

Tests are all run on the latest Debian Linux 2.2, with the recommended stable kernel.  This keeps down the number of variables.

No special compiler options are given.  Most of the binaries that ship with the Debian system are built like that.  Also, shipped binaries have to run on every flavor of the processor family, so you can't use stuff like -march=i686 because some of the opcodes aren't available on other x86 implementations.  Quick tests show various optional G3 optimizations in gcc don't seem to help much, and AltiVec is mostly useless for these kinds of programs.  (I haven't decided how strongly I believe in this no-flags rule.)
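
Concretely, the no-flags rule means building the way the packages do by default, and not sneaking in CPU-specific options.  The CFLAGS line below is a hypothetical example of what's disallowed:

    # Allowed: whatever flags the package picks on its own
    ./configure && make
    # Disallowed under the no-flags rule:
    make CFLAGS="-O2 -march=i686"   # i686-only opcodes won't run on older x86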

I don't care about non-gcc compilers.  I don't get to use them.  It seems like few people use non-default compilers to build apps, which is why I'm not that excited about the fantastic numbers Intel's compiler gets---most people are still using VC++...  And remember, Mac OS X ships with gcc as the compiler, so the generated code quality should be comparable.  If Apple's gcc has PPC optimizations the FSF tree doesn't have, well, I'll get them when Apple contributes them back to the FSF.

The Linux kernel on PPC probably receives less tuning than the x86 kernel.  Counting userspace times should compensate for that somewhat.

The tests: what they do

install-egcs

Unpack GNU binutils, the egcs-1.0.3a version of gcc, and glibc 2.0.7 (sic) from upstream source RPMs.  Apply standard and custom MIPS patches.  Configure binutils and gcc to produce little-endian Linux/MIPS binaries ("mipsel-linux"); build and install them.

This measures native builds.  It should only be of interest to people who are planning to do development directly on the platform.
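
In outline, the steps look something like this (a sketch only; the real script also handles the RPM unpacking, patching, and install prefixes I'm eliding):

    # binutils first, as the cross assembler/linker for the target:
    (cd binutils-* && ./configure --target=mipsel-linux && make && make install)
    # then the compiler itself, which uses the binutils just installed:
    (cd egcs-* && ./configure --target=mipsel-linux && make && make install)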

install-glibc

Using the cross-compiler produced previously (mipsel-linux-gcc), configure and build the mipsel-linux version of glibc.  Install it to a temporary directory, fix up a few files, and cpio it to the destination.

This test is the same amount of work on any platform.

glibc has a complex build system; just the make process accounts for a significant portion of execution time.  There are lots of compiles of little fifteen-line files with a single function in them.  Lots of shell stuff gets kicked off too.  The kernel does a lot of work.
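
Very roughly, the steps look like this (a sketch; the staging and destination paths are placeholders, and the "fix up a few files" part is elided):

    # Build glibc using the cross-compiler from the previous test:
    CC=mipsel-linux-gcc ./configure mipsel-linux
    make
    # Stage into a temporary directory rather than the live system:
    make install install_root=/tmp/glibc-root
    # ...fix up a few files, then cpio the tree to the destination:
    (cd /tmp/glibc-root && find . | cpio -pdm /dest)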

cross-gcc

Using mipsel-linux-gcc, cross-compile gcc itself.  The end product is a mipsel-linux binary of a compiler that produces mipsel-linux code.  The product should run on a Linux VR device with enough memory.

This test is the same amount of work on any platform.

gcc's build process, although still complicated, is closer to the way most applications are built.  Most of the CPU time goes to the compiler.
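
Schematically (a sketch---the exact configure triplets and options vary by version):

    # Build a gcc that *runs* on mipsel-linux and *generates* mipsel-linux code:
    ./configure --build=i586-linux --host=mipsel-linux --target=mipsel-linux
    make
    # The resulting binaries are MIPS executables; they can't run on the build
    # machine, but should run on a Linux VR device with enough memory.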

The boxes

bandwagon is a Dell Optiplex 110; a P3/733 with 192M of RAM.

style-over-substance is an iMac DV+; a G3/450 with 320M of RAM.

ohlfs is a Power Mac G4; a G4/533 with 512M of RAM.  It's actually a dual processor box, but only 1 CPU is used because it runs a uniprocessor kernel.  Out of the box, the build tests do not really take advantage of a second processor, and there are some performance penalties just for running a multiprocessor kernel; it seemed more fair to compare it as a uniprocessor. I need to do some more work to get meaningful dual processor numbers.

family-values is a no-name PC; an Athlon 1.2GHz (200MHz external) on an ASUS A7V with 256M of RAM.  Its IDE interface is crippled (long story) and only manages 3MB/sec in PIO mode---this box eats a lot of overhead talking to disk.

(Tests on decoy (Celeron 450) and red-herring (P2/266) are blocking on OS upgrades.)

Commentary

This is not a disk-intensive benchmark if the machines have enough memory; these do. All machines have IDE multiple mode turned on; all except family-values have IDE DMA on.  family-values is hurt by poor disk performance; witness its low CPU utilization numbers.
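
For reference, those IDE settings are per-drive and controlled with hdparm; something like the following, with the device name as a placeholder:

    hdparm -m -d /dev/hda        # query multiple mode and DMA settings
    hdparm -m16 -d1 /dev/hda     # enable 16-sector multiple mode and DMA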

The PPC machines spend a lot more time in the kernel compared to the x86 boxes.  (Or at least it's accounted that way.)  I don't understand this, and that's the biggest reason I'm including usertime-only tables.  Perhaps the 2.4 kernels will do better and reduce the wall clock time.

bandwagon was $500 with shipping from the Dell refurb store; it would cost approximately $750 delivered, new.  family-values is on loan to me; it was assembled by mwave.com.  You could buy the same configuration for ~$800 today.  style-over-substance came from eBay, and would cost at least $900 new---that's the cheapest Mac you can buy.  ohlfs doesn't belong to me; a stripped single-CPU version costs $2070 at the Apple store.  Let's ignore memory prices.  Also, I've got enough monitors, graphics cards, CD-ROM drives, operating system licenses, and other parts attached to my network that I don't *want* even the options bundled with most of these boxes.

So here are the $-per-bogohurt numbers for userspace performance:
 
Machine               Processor     Approximate price  Userspace $/bogohurt
bandwagon             P3/733                     $750                 $1.02
family-values         Athlon 1200                $800                 $0.72
style-over-substance  G3/450                     $900                 $2.06
ohlfs                 G4/533                    $2070                 $3.53
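
The arithmetic is just the approximate price divided by the userspace bogohurts column from the earlier table; for example:

    # bandwagon:            $750 / 733 bogohurts = $1.02 per bogohurt
    # style-over-substance: $900 / 436 bogohurts = $2.06 per bogohurt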

(Originally I wasn't going to put this table in, but somebody pointed out that I was already going to generate many flames, so I might as well get in all the iffy tables.)

Maximizing bogohurts/$ is not the goal of most computer buyers.  family-values was designed specifically to 0wn the crosscompile workload for low dollars, which is why it has dual 45G IBM IDE drives, no keyboard or mouse, and no CD-ROM drive.  It also runs quite hot, and its IDE configuration problems sucked up a bunch of my time.

style and ohlfs run Mac OS X a zillion times faster than the x86 boxes, giving an infinite advantage in price/performance at that workload.  That's why I bought that iMac.

One of the motivations for these benchmarks was that I got sick of all the people claiming that the PPC enjoyed vast performance advantages over equiv clocked x86 processors.  Claims like "most powerful laptop in the world" and "twice as fast as a Pentium" turn off some of Apple's potential customers, even if the claims are properly qualified down in the fine print.  "This is the only hardware allowed to run Mac OS X" was good enough to make me a customer.

Speculation

I hope to get access to a G4/733 box shortly.  At most it should scale linearly with clock speed, so family-values has nothing to fear.

Before I started looking at this, my gut feeling was that PPC should be ~20% faster than equiv clocked Pentiums.  Now my gut feeling is that gcc's code generator kinda sucks.

How to replicate

I'd love more reports.  If you're proficient with Linux and interested in downloading 30M of compressed sources to get 12 numbers, you can read how to play.


Jay Carlson
nop@nop.com