Linux (still) trumps OS X
p2p news feature / p2pnet: We recently ran a feature item which suggested, heretically, that Linux was faster than Apple’s OS X for statistics.
“A running issue has been that they don’t think it is fair to compare OS X with Linux because the latter is a specialty operating system,” the author, Jasjeet Sekhon, associate professor, Travers Department of Political Science Survey Research Center at UC Berkeley, told p2pnet at the time.
Sekhon has since followed up his first study. “Notwithstanding extensive algorithmic improvements (many of which were suggested by a generous and talented programmer working at Apple), a Linux performance advantage remains, although it’s substantially smaller,” he says.
Linux versus Mac OS X on Intel Dual Core
By Jasjeet Sekhon - UC Berkeley
In early May I posted benchmarks comparing Linux and OS X on a
MacBook Pro running my R
packages (I later added Windows XP benchmarks). In one of the
original benchmarks, both Linux and Windows XP were more than twice as
fast as OS X. And in a second (more representative) benchmark, Linux
was about 20% faster. The benchmarks were posted on
Digg (http://digg.com) and a variety of other high-traffic
websites such as OSnews.
This attention generated a lot of comments and suggestions.
With the help of a variety of developers working at Apple and elsewhere, the large OS X performance gap previously reported here has been significantly reduced. The most important improvement is the use of a more efficient algorithm which relies on optimized
BLAS (http://www.netlib.org/blas/faq.html) to perform key
matrix operations. This change increased the performance of the code
on all platforms. The performance gap was further closed by compiling
and linking R on OS X against
Doug Lea’s malloc (called dmalloc for short; see
http://g.oswego.edu/dl/html/malloc.html).
However, a Linux speed advantage remains which varies with the
size of the dataset used. For example, the gap ranges from 0% for a
small dataset to 10% for what is a medium size dataset for the
algorithm in question. The gap shrinks again to 0% for a large
dataset. The performance gap is much greater if the default OS X
malloc is used notwithstanding the new algorithm: the gap goes from
essentially zero for a small dataset, to 40% for a medium one, and up
to 50% for a large one. Therefore, I recommend that R for OS X be
linked against dmalloc just as it is for Windows. At the very least,
packages such as the Scythe
Statistical Library should be so compiled.
The default malloc on OS X, like the default malloc on
Windows XP, causes a large performance degradation relative to the
default mallocs on Linux and Solaris.
R (http://www.r-project.org) developers use the default
system malloc on every operating system but Windows. It turns out
that this decision is a bad one in the case of OS X because its memory
allocator makes system calls more often (at 15KB and larger
allocations) than dmalloc (at 256KB and larger). Indeed, as suggested
by Kazushige Goto,
the performance of this code could possibly be further improved by
avoiding mmap altogether. GNU malloc, for example, always makes the
munmap system call for large allocations which results in page faults
with every such allocation. Calls to mmap can be avoided on Linux by
setting two environment variables:
MALLOC_TRIM_THRESHOLD_ (http://www.llnl.gov/LCdocs/linux/index.jsp?show=s7)
to -1 and MALLOC_MMAP_MAX_ to 0. It is unfortunate that it is not
possible to do something similar with OS X’s default malloc because it
would help alleviate the performance issue. I am less clear on why
Linux performs better than OS X with dmalloc. Also, I have been
unable to find an article justifying the 15KB threshold for the switch
to the kernel’s virtual memory system. Is this really the optimal
threshold for today’s computers? For example, dmalloc’s mmap
threshold used to be 128KB but it was increased to 256KB as computers
have changed: compare version
2.6.6 (http://g.oswego.edu/pub/misc/malloc-2.6.6.c) with
2.8.3 (http://g.oswego.edu/pub/misc/malloc.c), searching each for
DEFAULT_MMAP_THRESHOLD.
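To make the two thresholds concrete, here is a minimal sketch that works out how large a square matrix has to be before a single allocation crosses each one, and that builds the environment for the two glibc settings mentioned above. It assumes matrices are stored as contiguous 8-byte doubles with no header overhead; the function names are hypothetical and not part of R, Scythe or either allocator. Note that glibc reads the two variable names with a trailing underscore.

```python
import math
import os

KB = 1024
OSX_THRESHOLD = 15 * KB         # figure quoted in the text for OS X's default malloc
DLMALLOC_THRESHOLD = 256 * KB   # figure quoted in the text for Doug Lea's malloc
DOUBLE_BYTES = 8                # assumption: elements are 8-byte doubles


def smallest_square_matrix_over(threshold_bytes):
    """Smallest n such that an n x n matrix of doubles needs >= threshold_bytes."""
    # 8 * n * n >= T  is equivalent to  n * n > (T - 1) // 8
    return math.isqrt((threshold_bytes - 1) // DOUBLE_BYTES) + 1


def env_without_mmap(base=None):
    """Copy of an environment with the two glibc malloc settings from the text."""
    env = dict(os.environ if base is None else base)
    env["MALLOC_TRIM_THRESHOLD_"] = "-1"  # value given in the text
    env["MALLOC_MMAP_MAX_"] = "0"         # value given in the text
    return env


if __name__ == "__main__":
    print(smallest_square_matrix_over(OSX_THRESHOLD))       # 44
    print(smallest_square_matrix_over(DLMALLOC_THRESHOLD))  # 182
```

Under these assumptions a 44 x 44 matrix already reaches the 15KB threshold, while the 256KB threshold is not reached until 182 x 182, which is consistent with medium-sized problems being hit hardest. The environment returned by `env_without_mmap()` could be passed to something like `subprocess.run(["Rscript", "GenMatch2.R"], env=env_without_mmap())` on Linux.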
While the new BLAS version of the code is faster on all platforms, the
quality of BLAS implementations is not constant across operating
systems. Goto’s BLAS
implementation is currently the fastest and it is not yet available
for x86 OS X. So, a direct comparison between OS X and Linux using
the best BLAS library is not currently possible. For the purposes of
these comparisons, the
non-threaded ATLAS (http://math-atlas.sourceforge.net/) BLAS
were used on both OS X and Linux. It turns out that the default OS X
BLAS libraries in the vecLib Framework provide somewhat better
performance for this code on OS X, so the OS X default BLAS
implementation is used while the default
Ubuntu (http://packages.ubuntulinux.org/dapper/devel/atlas3-sse2-dev)
ATLAS BLAS libraries are used for Linux.
Linux performance improves if Goto’s BLAS are used instead of the
ATLAS BLAS. Goto’s work is described on an Apple webpage
(http://www.apple.com/education/science/profiles/vatech/optimization.html)
because he added a custom BLAS for the Apple supercomputer at Virginia
Tech. Virginia Tech was able to achieve their excellent
performance by using a kernel level memory manager which provides
physically contiguous memory yielding both high and consistent
performance.
The benchmarks are based on one of my statistical software packages
for R:
Matching (Multivariate and Propensity Score
Matching Software). The package uses C++ code extensively. The
benchmark scripts vary only by the sample size of the dataset being
examined. The data sizes are: 445,
890 (GenMatch2.R), 1780 and
5340 (GenMatch5.R) observations. The script with 445
observations is exactly the same as the original
Benchmark 1 (GenMatch1.png), but it is now run with the new
algorithm which makes use of optimized BLAS. Each script runs the
benchmark three times and the best runtime of the three is recorded.
Each script is executed 100 times and the average times are reported
below.
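The timing procedure just described (best of three within a script, then the average over 100 executions) can be sketched as follows. This is only an illustration under stated assumptions: it times a Python callable in-process rather than launching separate R scripts, and `best_of`, `benchmark` and `workload` are hypothetical names, not part of the Matching package.

```python
import time
from statistics import mean


def best_of(fn, repeats=3):
    """Run fn `repeats` times and return the fastest wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)


def benchmark(fn, executions=100, repeats=3):
    """Average, over `executions` runs, of the best-of-`repeats` time."""
    return mean(best_of(fn, repeats) for _ in range(executions))


def workload():
    # Hypothetical stand-in for one benchmark run (not the Matching code).
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"mean best-of-3 time: {benchmark(workload, executions=20):.6f} s")
```

Taking the best of three filters out one-off disturbances within a run, while averaging over many executions keeps the run-to-run variation that the p-values reported below rely on.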
The computer being used is:
| Label | OS and Chip |
|---|---|
| OS X Core Duo | Tiger (http://www.apple.com/macosx/tiger) on MacBook Pro (http://www.apple.com/macbookpro), Intel 2.16GHz Dual Core, 2GB RAM |
| Linux Core Duo | Ubuntu Linux (Dapper Drake, http://www.ubuntu.com/testing/dapperbeta?highlight=%28beta%29) with i686-SMP kernel on MacBook Pro, Intel 2.16GHz Dual Core, 2GB RAM. Note: Xorg server running with GNOME (http://www.gnome.org) |
Unlike the original Benchmark 1 results
(which were obtained using exactly this script), this benchmark
results in only small differences. There is no difference in
performance between Linux and the OS X code when dmalloc is used. And
the difference between Linux and the code with the OS X malloc is
small and statistically significant only at the 10% level: the p-value
based on the empirical distribution over 100 simulations is 0.09.
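The text does not say exactly how the p-value was computed from the 100 runs. One common way to obtain a p-value from an empirical distribution is a permutation test on the two sets of runtimes, and the sketch below assumes that approach; it is not necessarily the procedure used for the figures reported here. The function name and the sample timings are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, draws=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean runtime."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(draws):
        rng.shuffle(pooled)            # random relabelling of the runs
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, for illustration only.
    linux = [1.00, 1.02, 0.99, 1.01, 1.00, 0.98]
    osx = [1.03, 1.01, 1.04, 1.00, 1.05, 1.02]
    print(permutation_p_value(linux, osx))
```

The p-value is the share of random relabellings that produce a gap at least as large as the one observed, so a small value means the platform labels matter.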
As we increase the sample size, differences begin to become more
pronounced.

Linux is now 20% faster than the default OS X version (p-value=0.00)
and 10% faster than the OS X version using dmalloc
(p-value=0.00).

There is some evidence that the difference between Linux and OS X with
dmalloc is either asymptoting at about 10% or possibly even
shrinking—from 1.11 times as slow as Linux in the previous benchmark
to 1.08 times as slow now. The next simulation will help to nail this
down. In any case, the gap between Linux and the default OS X malloc
version has doubled and OS X is now about 1.4 times slower than Linux.
In an attempt to answer the asymptoting vs shrinking gap question, a
benchmark was run with 5340 observations (12 times the original
dataset).

So it appears that the Linux advantage over OS X using dmalloc was
only present for a given range of dataset size and at this point it
has disappeared once again. The gap between OS X (default malloc) and
Linux continues to grow.
I was a bit surprised at the degree to which the malloc implementation
is still making a difference, but small memory allocations are still
being frequently made by my code. The efficiency of the malloc
implementation had a large impact on the original version of the code
because malloc was being called in the inner loop by the
Scythe Statistical Library (http://scythe.wustl.edu/). This
library does not require that the dimensions of matrices be explicitly
declared. In my updated algorithm most, but not all, matrix
dimensions are declared at the time of variable creation and key
operations now make use of system
BLAS (http://www.netlib.org/blas/) libraries. These changes
make the code faster on all platforms. However, there is a tradeoff.
The Scythe Statistical Library
allows for the quick development of C++ code that is faster than R (an
interpreted language). One can certainly write more efficient C++
code. But code using Scythe is clear and straightforward to write,
read and debug especially for R programmers because it uses a similar
syntax. Therefore, it is reasonable for developers to use Scythe even
though there are more efficient alternatives.
The claims that I had previously made about the inefficiency of OS X
system calls relative to Linux still hold, although they are not an
issue for these benchmarks, at least judging from profiling the
code. As an aside, I would dearly love to have the equivalent of
Shark (http://developer.apple.com/tools/sharkoptimize.html)
for Linux. As for system calls, as one email correspondent noted,
compare Darwin system call handling
(http://cvs.opendarwin.org/cgi-bin/cvsweb.cgi/src/xnu/bsd/dev/ppc/systemcalls.c?rev=1.3&content-type=text/x-cvsweb-markup)
with the Linux equivalent
(http://lxr.linux.no/source/arch/ppc/kernel/entry.S); the Tiger version is available at
http://www.opensource.apple.com/darwinsource/10.4.6.ppc/xnu-792.6.70/bsd/dev/ppc/systemcalls.c.
AnandTech (http://www.anandtech.com/) has conducted a series of benchmarks
relevant to this issue (Part I, http://www.anandtech.com/mac/showdoc.aspx?i=2436, and Part II),
and Darwin is slower than the 2.6 Linux kernel in these benchmarks,
sometimes by several times.
Suggestions
If you have any suggestions or if you think something here is
erroneous, please contact me.
Links
AnandTech’s OS X benchmarks:
“No more mysteries: Apple’s G5 versus x86, Mac OS X versus Linux”
(http://www.anandtech.com/mac/showdoc.aspx?i=2436) and
“No more mysteries, part two”
(http://www.anandtech.com/mac/showdoc.aspx?i=2520&p=8).
Also See:
Linux was faster - Linux trumps OS X, May 8, 2006
May 27th, 2006 at 6:57 pm
> … OS X against Doug Lea’s malloc (called dmalloc for short).
The hyperlinks (and the “for short” name) are incorrect. dmalloc is a debugging memory allocation library written by Gary Watson, designed to detect allocation overruns, underruns, leaks, and memory arena corruption. The hyperlinks provided do point to this excellent tool.
However, Doug Lea’s optimized malloc() library, often referred to as dlmalloc (note ‘l’), is available at ftp://g.oswego.edu/pub/misc/malloc.c
I suggest you fix your hyperlinks, and change all occurrences of “dmalloc” to “dlmalloc” in your article.
May 28th, 2006 at 1:49 am
Just a loose thought: shouldn’t the title read “dlmalloc betters malloc when extensively using a particular application”? What I get from this isn’t “use Linux, it’s faster”, I get “use dlmalloc (on large applications), it’s faster”…
I’m a little out of my depth to comment on this aspect, but I wonder if the small (default) memory allocation for malloc in OS X ties in with Apple’s (default) approach of optimising for size rather than speed?
May 28th, 2006 at 5:34 pm
Congratulations. You have definitively proven that you can write inefficient MacOS X code, and nothing more.
May 28th, 2006 at 7:27 pm
It would have been nice if you had linked to the “Ridiculous Fish” blog article where the malloc issue was first explained:
http://ridiculousfish.com/blog/archives/2006/05/16/36/
May 28th, 2006 at 9:04 pm
Um, perhaps the OS X alloc is tuned for good performance on the Obj-C runtime? Which, btw, uses memory pools most of the time. My implication being that the malloc’s tuned for use as a secondary allocator, below the pools.
By all means, different apps should tweak out their memory allocators as needed. Of course, that’s why the STL lets you choose which allocator to use (and write your own) for every container you wish!
May 29th, 2006 at 5:48 pm
first, it is commendable that you have the intellectual honesty to admit that you did not initially understand how your libraries/code used memory resources (to the detriment of accurately representing os/x).
yes, apple’s profiling tool - shark - like most other apple tools, is a damn sight better than linux tools.
second, you should rephrase your comparison as between Darwin (the mach/bsd xnu kernel) and os/x proper. There is much more to os/x than just the opensource component; while Darwin is a fine *nix platform, it is the value-added elements (like cocoa) which make os/x truly sing.
finally, it would be nice to see some sustained and systematic follow-up on this issue … since you started this wildfire you have an obligation to help put it out … remember: Smokey Says Don’t Play with Matches
cheers: zahadum