Ten Things Apple Did To Make Mac OS X Faster

© Amit Singh. All Rights Reserved. Written in May 2004

Introduction

The performance of computer hardware typically increases monotonically with time. Even if the same could be said of software, the rate at which software performance improves is usually very slow compared to that of hardware. In fact, many might opine that there is plenty of software whose performance has deteriorated consistently with time. Moreover, it is rather difficult to establish an objective performance metric for software as complex as an operating system: a "faster OS" is a very subjective, context-dependent phrase.

An operating system's architecture has a much greater longevity than that of common hardware. Operating system researchers do not come up with new, much faster algorithms as consistently or frequently as hardware updates happen. Nevertheless, those involved in "producing" operating systems -- researchers, designers, implementers, and even marketeers -- have the arduous task of ensuring that the associated performance curves keep going up. There are not many viable players in the OS market (some might argue, even if rhetorically, that essentially there's only one). Still, it is a very tough market, and OS vendors must "improve" their systems incessantly.

Now, given that you are not likely to run into earth-shattering algorithmic breakthroughs in every OS release cycle, how do you make your system faster? The problem has a multi-pronged solution:

Example: Mac OS X

This document discusses ten things that Apple did (beyond initial/fundamental OS design and implementation) to improve Mac OS X's performance. Some of these are simply good ideas and obvious candidates for implementation; some are guidelines or tools for developers to help them create high-performance applications, while some are proactive attempts at extracting performance from strategically chosen quarters. Consider the following a sampling of such optimizations, in no particular order:

Summary
1. BootCache

Mac OS X uses a boot-time optimization (effectively a smart read-ahead) that monitors the pattern of incoming read requests to a block device (the boot disk) and sorts the pattern into a "playlist", which is used to cluster reads into a private cache. This "boot cache" is then used to satisfy incoming read requests whenever possible. The scheme also measures the cache hit rate and stores the request pattern in a "history list" so that it can adapt on future boots. If the hit rate is too low, caching is disabled.
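The playlist idea can be illustrated with a minimal sketch. This is not Apple's BootCache code: the struct layout, the function names, and the hit-rate threshold parameter are all assumptions made for illustration. It shows only the general technique of sorting recorded reads by disk offset, merging ranges that overlap or touch, and giving up when the hit rate is too low.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One recorded read request: a byte range on the boot device.
 * Hypothetical layout, not the real BootCache structure. */
struct read_extent {
    uint64_t offset;
    uint64_t length;
};

/* qsort comparator: ascending by starting offset. */
static int compare_by_offset(const void *a, const void *b)
{
    const struct read_extent *x = a;
    const struct read_extent *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort the recorded reads by offset and merge, in place, any ranges that
 * overlap or touch. Returns the number of extents in the resulting playlist.
 */
size_t build_playlist(struct read_extent *reads, size_t count)
{
    if (count == 0)
        return 0;
    assert(reads != NULL);

    qsort(reads, count, sizeof(reads[0]), compare_by_offset);

    size_t out = 0; /* index of the extent currently being grown */
    for (size_t i = 1; i < count; i++) {
        uint64_t end = reads[out].offset + reads[out].length;
        if (reads[i].offset <= end) {
            /* Overlapping or adjacent: extend the current extent if needed. */
            uint64_t new_end = reads[i].offset + reads[i].length;
            if (new_end > end)
                reads[out].length = new_end - reads[out].offset;
        } else {
            reads[++out] = reads[i];
        }
    }
    return out + 1;
}

/* Keep the cache only if the hit rate reaches an assumed minimum percentage. */
int cache_worth_keeping(uint64_t hits, uint64_t requests,
                        unsigned min_hit_percent)
{
    if (requests == 0)
        return 0;
    return hits * 100 >= (uint64_t)min_hit_percent * requests;
}
```

Merging neighboring requests is what turns many small scattered reads into a few large sequential ones, which is where a read-ahead scheme of this kind would get its benefit.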

The loadable (sorted) read pattern is stored in /var/db/BootCache.playlist. Once this pattern is loaded, the cache comes into effect. The entire process is invisible to users.

This feature is only supported on the root device. Further, it requires at least 128 MB of physical RAM before it is enabled (automatically).

The kernel extension implementing the cache resides in /System/Library/Extensions/BootCache.kext, while Contents/Resources/BootCacheControl within that directory is the user-level control utility (it lets you load the playlist, among other things).

The effectiveness of BootCache can be gauged from the following: in a particular update to "Panther", a reference to BootCacheControl was broken. BootCache is started (via BootCacheControl, the control utility) in /etc/rc, and a prefetch tag is inserted (unless the system is booting in safe mode). /etc/rc looks for BootCacheControl in the Resources directory of the BootCache.kext bundle, as well as in /usr/sbin, and finds it in the former (it doesn't exist in the latter). However, another program (loginwindow.app) accesses /usr/sbin/BootCacheControl directly, and does not find it. For what it's worth, making BootCacheControl available in /usr/sbin, say via a symbolic link, reduces the boot time (measured from clicking on the "Restart" confirmation button to the point where absolutely everything has shown up on the system menu) from 135 seconds to 60 seconds on one of my machines.

2. Kernel Extensions Cache

There may be close to a hundred kernel extensions that are loaded on a typical Mac OS X system, and perhaps twice as many residing in the system's "Extensions" folder(s). Kernel extensions may have dependencies on other extensions. Rather than scan all these every time the system boots (or worse, every time an extension is to be loaded), Mac OS X uses caching for kernel extensions, and the kernel itself.

There are three types of kernel/kext caches used in this context:

3. Hot File Clustering

Hot File Clustering (HFC) aims to improve the performance of small, frequently accessed files on HFS Plus volumes. This optimization is currently used only on boot volumes. HFC is a multi-staged clustering scheme that records "hot" files (except journal files, and ideally quota files) on a volume, and moves them to the "hot space" on the volume (0.5% of the total filesystem size located at the end of the default metadata zone, which itself is at the start of the volume). The files are also defragmented. The various stages in this scheme are DISABLED, IDLE, BUSY, RECORDING, EVALUATION, EVICTION, and ADOPTION. At most 5000 files, and only files less than 10 MB in size are "adopted" under this scheme.

The "metadata zone" referred to in the above description is an area on disk that may be used by HFS Plus for storing volume metadata: the Allocation Bitmap File, the Extents Overflow File, the Journal File, the Catalog File, Quota Files, and Hot Files. Mac OS X 10.3.x places the metadata zone near the beginning of the volume, immediately after the volume header.

HFC (and the metadata zone policy) are used only on journaled HFS Plus volumes that are at least 10 GB in size.
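The limits quoted above can be collected into a few predicates. The following is only a sketch of those stated numbers, not the HFS Plus implementation: the function and macro names are hypothetical, and it assumes that "GB" and "MB" mean binary units (2^30 and 2^20 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Limits taken from the text; binary units are an assumption. */
#define HFC_MIN_VOLUME_BYTES (10ULL << 30) /* volume of at least 10 GB */
#define HFC_MAX_FILE_BYTES   (10ULL << 20) /* files must be under 10 MB */
#define HFC_MAX_FILES        5000u         /* at most 5000 adopted files */

/* HFC applies only to journaled volumes that are large enough. */
int hfc_volume_eligible(uint64_t volume_bytes, int is_journaled)
{
    return is_journaled && volume_bytes >= HFC_MIN_VOLUME_BYTES;
}

/* The hot space is 0.5% (1/200) of the total filesystem size. */
uint64_t hfc_hot_space_bytes(uint64_t filesystem_bytes)
{
    return filesystem_bytes / 200;
}

/* Could one more file of this size be adopted into the hot space? */
int hfc_can_adopt(uint64_t file_bytes, unsigned adopted_so_far,
                  int is_journal_file)
{
    assert(adopted_so_far <= HFC_MAX_FILES); /* the cap is never exceeded */
    if (is_journal_file)
        return 0;
    if (file_bytes >= HFC_MAX_FILE_BYTES)
        return 0;
    return adopted_so_far < HFC_MAX_FILES;
}
```

For example, on an 80 GB filesystem the hot space under this reading would be 0.4 GB.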

Note that what constitutes the set of hot files on your system will depend on your usage pattern over a few days. If you are doing extensive C programming for a few days, say, then it is likely that many of your hot files will be C headers. You can use hfsdebug to explore the working of Hot File Clustering.

% sudo hfsdebug -H -t 10
# Top 10 Hottest Files on the Volume
rank  temperature  cnid     path
   1          537  7453     Macintosh HD:/usr/share/zoneinfo/US/Pacific
   2          291  7485     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.128
   3          264  7486     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.160
   4          204  7495     Macintosh HD:/private/var/db/netinfo/local.nidb/Store.96
   5          204  2299247  Macintosh HD:/Library/Receipts/iTunes4.pkg/Contents\
                            /Resources/package_version
   6          192  102106   Macintosh HD:/usr/include/mach/boolean.h
   7          192  102156   Macintosh HD:/usr/include/mach/machine/boolean.h
   8          192  102179   Macintosh HD:/usr/include/mach/ppc/boolean.h
   9          188  98711    Macintosh HD:/usr/include/string.h
  10          178  28725    Macintosh HD:/%00%00%00%00HFS+ Private Data/iNode1038632980
3365 active Hot Files.

4. Working Set Detection

The Mach kernel uses physical memory as a cache for virtual memory. When new pages must be brought in as a result of page faults, the kernel has to decide which of the pages currently in memory to reclaim. For an application, the kernel should ideally keep in memory those pages that will be needed very soon.

In the Utopian OS, one would know ahead of time the pages an application references as it runs. There have been several algorithms that approximate such optimal page replacement. Another approach is to make use of the locality of reference of processes. According to the Principle of Locality, a process refers to a small, slowly changing subset of its set of pages. This subset is the Working Set. Studies have shown that the working set of a process needs to be resident (in-memory) in order for it to run with acceptable performance (that is, without causing an unacceptable number of page faults).

The Mac OS X kernel incorporates a subsystem (let us call it TWS, for Task Working Set) for detecting and maintaining the working sets of tasks. This subsystem is integrated with the kernel's page fault handling mechanism. TWS builds and maintains a profile of each task's fault behavior. The profiles are per-user, and are stored on-disk, under /var/vm/app_profile/. This information is then used during fault handling to determine which nearby pages should be brought in.

Several aspects of this scheme contribute to performance:

For a user with uid U, the application profiles are stored as two page cache files: #U_names and #U_data under /var/vm/app_profile/ (#U is the hexadecimal representation of U).
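As a small illustration of this naming scheme, the sketch below builds the two path names for a given uid. The helper name is hypothetical, and the exact hexadecimal formatting (lowercase, no zero padding) is an assumption; the text above says only that the prefix is the hexadecimal representation of the uid.

```c
#include <assert.h>
#include <stdio.h>

#define PROFILE_DIR "/var/vm/app_profile/"

/*
 * Write the profile path for `uid` into buf, where suffix is "names" or
 * "data". Returns 0 on success, or -1 if the buffer is too small.
 * Assumes lowercase, unpadded hex, which the article does not confirm.
 */
int profile_path(char *buf, size_t buflen, unsigned int uid,
                 const char *suffix)
{
    assert(buf != NULL && suffix != NULL);
    int n = snprintf(buf, buflen, "%s%x_%s", PROFILE_DIR, uid, suffix);
    return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}
```

Under these assumptions, uid 501 (0x1f5) would map to /var/vm/app_profile/1f5_names and /var/vm/app_profile/1f5_data.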

The "names" file, essentially a simple database, contains a header followed by profile elements:

typedef unsigned int natural_t;
typedef natural_t vm_size_t;

struct profile_names_header {
    unsigned int number_of_profiles;
    unsigned int user_id;
    unsigned int version;
    off_t        element_array;
    unsigned int spare1;
    unsigned int spare2;
    unsigned int spare3;
};

struct profile_element {
    off_t        addr;
    vm_size_t    size;
    unsigned int mod_date;
    unsigned int inode;
    char         name[12];
};

The "data" file contains the actual working sets.

5. On-the-fly Defragmentation

When a file is opened on an HFS Plus volume, the following conditions are tested:

If all of the above conditions are satisfied, the file is relocated -- it is defragmented on-the-fly.

File contiguity (regardless of file size) is promoted in general as a consequence of the extent-based allocation policy in HFS Plus, which also delays actual allocation. Refer to Fragmentation In HFS Plus Volumes for more details.

6. Prebinding

Mac OS X uses a concept called "prebinding" to optimize Mach-O (the default executable format) applications to launch faster (by reducing the work of the runtime linker).

The dynamic link editor resolves undefined symbols in an executable (and dynamic libraries) at run time. This activity involves mapping the dynamic code to free address ranges and computing the resultant symbol addresses. If a dynamic library is compiled with prebinding support, it can be assigned a predefined (preferred) address range. This way, dyld can use predefined addresses to reference symbols in such a library. For this to work, libraries cannot have preferred addresses that overlap. Apple marks several address ranges as either "reserved" or "preferred" for its own software, and specifies allowable ranges for third-party (including end users') libraries to use to support prebinding.
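The non-overlap requirement is easy to state in code. The sketch below is not part of dyld or of any Apple tool: the struct and function names are hypothetical, and it simply checks a list of preferred address ranges for collisions, assuming 32-bit addresses and non-empty ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A library's preferred load range (hypothetical representation). */
struct addr_range {
    uint32_t start;
    uint32_t size;
};

/* Two half-open ranges [start, start + size) overlap if each begins
 * before the other ends. Ends are widened to 64 bits to avoid wraparound. */
int ranges_overlap(struct addr_range a, struct addr_range b)
{
    uint64_t a_end = (uint64_t)a.start + a.size;
    uint64_t b_end = (uint64_t)b.start + b.size;
    return a.start < b_end && b.start < a_end;
}

/*
 * Return the index of the first library whose preferred range collides
 * with an earlier one, or -1 if all ranges are disjoint.
 */
int first_conflict(const struct addr_range *libs, size_t count)
{
    assert(libs != NULL || count == 0);
    for (size_t i = 1; i < count; i++)
        for (size_t j = 0; j < i; j++)
            if (ranges_overlap(libs[i], libs[j]))
                return (int)i;
    return -1;
}
```

A library flagged by such a check could not be prebound at its preferred address and would need a different range.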

update_prebinding is run to (attempt to) synchronize prebinding information when new files are added to a system. This can be a time-consuming process even if you add or change a single file, because all libraries and executables that might dynamically load the new file must be found. Package information is used to help with this, and the process is further optimized by building a dependency graph. Eventually, redo_prebinding is run to prebind files appropriately.

Prebinding is the reason you see the "Optimizing ..." message when you update the system, or install certain software.

/usr/bin/otool can be used to determine if a binary is prebound:

# otool -hv /usr/lib/libc.dylib
/usr/lib/libc.dylib:
Mach header
      magic cputype cpusubtype filetype ncmds sizeofcmds flags
   MH_MAGIC     PPC        ALL    DYLIB    10       1940 \
   NOUNDEFS DYLDLINK PREBOUND SPLIT_SEGS TWOLEVEL
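The PREBOUND entry in this output corresponds to a single bit in the flags word of the Mach-O header. The following sketch tests that bit directly in a buffer holding the first bytes of a file. It is a simplification that handles only a thin, 32-bit, big-endian (PowerPC-style) header; fat binaries and other byte orders are ignored. The constants mirror MH_MAGIC and MH_PREBOUND from <mach-o/loader.h> but are defined locally under different names so the code compiles anywhere, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MACHO_MAGIC_32    0xfeedfaceu /* same value as MH_MAGIC */
#define MACHO_FLAG_PREBOUND 0x10u     /* same value as MH_PREBOUND */
#define MACHO_HEADER_SIZE 28          /* seven 32-bit fields */
#define MACHO_FLAGS_OFFSET 24         /* flags is the seventh field */

/* Read a 32-bit big-endian value from a byte buffer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Return 1 if the header has the prebound flag set, 0 if it does not,
 * and -1 if the buffer is not a thin 32-bit big-endian Mach-O header.
 */
int macho_is_prebound(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < MACHO_HEADER_SIZE)
        return -1;
    if (read_be32(buf) != MACHO_MAGIC_32)
        return -1;
    return (read_be32(buf + MACHO_FLAGS_OFFSET) & MACHO_FLAG_PREBOUND) != 0;
}
```

The five flags shown by otool above (NOUNDEFS, DYLDLINK, PREBOUND, SPLIT_SEGS, TWOLEVEL) combine to the flags value 0xb5, which has this bit set.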

7. Helping Developers Create Code Faster

Mac OS X includes a few optimizations that benefit developers by making development workflow -- the edit-compile-debug cycle -- faster. Some of these were introduced with Mac OS X Panther.

% cat foo.h
#define FOO 10
% cat foo.c
#include "foo.h"
#include <stdio.h>

int
main()
{
    printf("%d\n", FOO);
}
% ls foo.*
foo.c foo.h
% gcc -x c-header -c foo.h
% ls foo.*
foo.c foo.h foo.gch
% gcc -o foo foo.c
% ./foo
10
% rm foo.h
% gcc -o foo foo.c
% ./foo
10

8. Helping Developers Create Faster Code

Apple provides a variety of performance measurement and debugging tools for Mac OS X. Some of these are part of Mac OS X, while many others are available if you install the Apple Developer Tools. As one would expect, Apple encourages its own developers, as well as third-party developers, to write code that conforms to its performance guidelines.

As mentioned earlier, perceived performance is quite important. For example, it is desirable for an application to display its menu bar and to start accepting user input as soon as possible. Reducing this initial response time might involve deferring certain initializations or reordering the "natural" sequence of events, etc.
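Deferring an initialization can be as simple as performing it on first use rather than at launch. The sketch below shows that generic pattern only; the "dictionary" subsystem and all function names are hypothetical stand-ins, and the code is not thread-safe.

```c
#include <assert.h>

/* A stand-in for some subsystem that is expensive to set up. */
struct dictionary {
    int loaded;
    int load_count; /* how many times the slow path has run */
};

static struct dictionary g_dictionary;

/* Placeholder for slow work such as reading a large file from disk. */
static void load_dictionary(struct dictionary *d)
{
    d->loaded = 1;
    d->load_count++;
}

/*
 * Launch path: the UI would be brought up here without touching the
 * dictionary. Returns whether the dictionary is currently loaded.
 */
int app_launch(void)
{
    return g_dictionary.loaded;
}

/* First caller pays the setup cost; later callers get the cached state. */
struct dictionary *get_dictionary(void)
{
    if (!g_dictionary.loaded)
        load_dictionary(&g_dictionary);
    assert(g_dictionary.loaded);
    return &g_dictionary;
}
```

The trade-off is that the cost moves to the first operation that needs the subsystem, so this fits best when that operation is uncommon or can tolerate a short delay.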

Mac OS X Tools

Mac OS X includes several common GNU/Unix profiling/monitoring/dissecting tools, such as gprof, lsof, nm, top, vm_stat, and many more, such as:

Refer to Apple's documentation for these tools for more details.

Performance Measurement Tools

CHUD Tools

The Computer Hardware Understanding Development (CHUD) Tools package, an optional installation, provides tools such as the following:

9. Journaling in HFS Plus

While modern filesystems are often journaled by design, journaling came to HFS Plus rather late. Apple retrofitted journaling into HFS Plus as a supplementary mechanism layered on the existing workings of the filesystem, with Panther being the first version to have journaling turned on by default.

On a journaled HFS Plus volume, file object metadata and volume structures are journaled, but not file object data (fork contents, that is). The primary purpose of the journal is to make recovery faster and more reliable, in case a volume is unmounted uncleanly, but it may improve the performance of metadata operations.

10. Instant-on

Apple computers do not hibernate. Rather, when they "sleep", enough devices (in particular, the dynamic RAM) are kept alive (at the cost of some battery life, if the computer is running on battery power). Consequently, upon wakeup, the user perceives instant-on behavior: a very desirable effect.

Similarly, by default the system tries to keep network connections alive even if the machine sleeps. For example, if you log in (via SSH, say) from one PowerBook to another, and both of them go to sleep, your login session should stay alive within the constraints of the protocols.

Epilogue

Using Mac OS X as an example, we looked at a few kinds of optimizations that "OS people" (particularly those involved in creating an end-user system) adopt to improve performance. The integration of all such optimizations is perhaps even more important than the optimizations themselves. The end result should be a perceptible improvement in performance. A desirable manifestation of such improvement would be a faster workflow for the end-user.

It must be noted that most, if not all, of the optimizations listed here are not specific to Mac OS X. Microsoft uses similar techniques to make Windows faster. In particular, techniques similar or equivalent to (but not limited to) BootCache, Hot File Clustering, and Working Set Detection/Maintenance are also used in Windows.