Next: Future Work Up: An Enhanced Disk-Caching NFS Previous: NFS Cache Cleaner

   
Benchmarks and Results

To evaluate the performance of our NFS client, we measured elapsed time and RPC traffic for a suite of 13 benchmarks, listed in Table [*]. Each benchmark was run once by itself, and once with 4 copies running in parallel from different top-level directories. All 13 single runs were executed first, followed immediately by the 13 parallel runs; the disk cache was not emptied in between. We compare three configurations: the standard Linux 2.0.27 NFS client, our enhanced NFS client, and the local disk using ext2fs.


 
Table: Descriptions of the benchmark suite used in performance evaluation.
The first five tests are the five phases of the Andrew benchmark suite:

andrew: making directories
    Creates an empty 21-directory hierarchy.
andrew: copying files
    Fills the created directory structure by copying 71 files into it.
andrew: statting
    Recurses through the directories twice, generating a stat system call for every file.
andrew: intensive reading
    Recurses through the directories twice, searching every file for a given string.
andrew: cpu intensive
    Builds a moderately sized package within the directory structure.

The next five tests work with large files:

untar big package
    Untars a 4.3MB archive containing approximately 400 files and directories.
repeated ls-ing
    Lists directories totaling 500 files, four times.
read a big file
    Reads a 6.1MB file.
read a big file again
    Reads the same 6.1MB file again.
copy a big file
    Makes a copy of a different 6.1MB file.

The last three tests perform smaller reads and writes:

random reads
    Performs 1,000 reads, ranging in size from 1 to 2,663 bytes, from randomly chosen locations within a 4.3MB file.
small writes
    Performs 2,000 writes, appending the integers 1 to 2,000 (as strings) to a file.
small reads and writes
    Like the small writes test, but rewinding and rereading the entire file between each write (and performing only 1,000 writes). This test is intended to show worst-case performance for our caching scheme: since the read cache is invalidated on a write, all the work of storing reads on local disk is wasted.
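To make the last two tests concrete, here is a minimal sketch of the ``small writes'' and ``small reads and writes'' loops. This is illustrative Python, not the original benchmark code (which the paper does not show); file paths are supplied by the caller, and the counts follow the descriptions above.

```python
def small_writes(path, n=2000):
    # "small writes": append the integers 1..n (as strings) to the
    # file, one small write per integer.
    with open(path, "a") as f:
        for i in range(1, n + 1):
            f.write(str(i))
            f.flush()

def small_reads_and_writes(path, n=1000):
    # "small reads and writes": like small_writes, but reread the
    # whole file between writes. A client that invalidates its read
    # cache on every write gets no benefit from having stored the
    # earlier reads on local disk.
    with open(path, "a+") as f:
        for i in range(1, n + 1):
            f.write(str(i))
            f.flush()
            f.seek(0)
            f.read()
```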

For all the benchmarks we used an isolated 10Mbit ethernet local area network with four hosts connected via an eight port Kingston EtherX workgroup hub. The server was a 486/DX4-100 with 40MB of memory, an ISA bus, and a 3Com 3c509 ethernet card running a standard Linux 2.0.29 kernel from a RedHat 4.0 distribution. The test partition we exported was from a 1GB IDE drive. No other disk activity was taking place on the server during the test.

The test client was a Pentium 200 with 20MB of memory available, a PCI bus with an Adaptec 2940UW SCSI controller, an Intel EtherExpress 100 ethernet card, and a 4GB wide SCSI disk. It ran either a Linux 2.0.27 kernel patched with upgraded EtherExpress 100 and Adaptec 2940UW drivers (this configuration was also used when benchmarking local disk performance) or our enhanced kernel with the same patches and our improved NFS client module. The cache directory for our enhanced client was on the same local disk used when measuring local disk performance.

The two remaining machines on the network (a 486/DX4-75 and a Pentium 166, both with 3c509 ethernet cards) were used to generate additional network traffic to slow down the network. The ping utility was used to flood the network with ICMP packets between those two hosts. Additionally, the server used ping to flood the network, simulating the load of handling multiple hosts. Even so, this only increased the worst-case round-trip time to about 4ms (the average round-trip time as reported by ping was 0.7ms). The slower machine was used as the server to more accurately represent the round-trip times on a reasonably loaded server. The network environment used for our measurements was still far faster than the target network of the University of Washington's computer science department.

For the two NFS benchmark runs (standard and enhanced) we used a 5 second time-out for cached file attributes (chosen because that is the timeout the BSD implementation uses [Mac91, p. 54]). We used a 30 second time-out for directory attributes as in the original Sun implementation [SGK+85]. Note that increasing the expiration time further increases pressure on the cache since entries persist for almost twice as long.
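The expiry rule these time-outs imply is simple: cached attributes may be used only while they are younger than the per-type limit. The following sketch illustrates the check; the entry layout (a dict with `is_dir` and `fetched_at`) is hypothetical, not the kernel's actual data structure.

```python
import time

ATTR_TIMEOUT_FILE = 5.0   # seconds; the BSD choice used in our runs
ATTR_TIMEOUT_DIR = 30.0   # seconds; as in the original Sun implementation

def attrs_fresh(entry, now=None):
    # entry is a hypothetical dict with 'is_dir' and 'fetched_at'
    # (a time.time() stamp taken when the getattr reply arrived).
    if now is None:
        now = time.time()
    timeout = ATTR_TIMEOUT_DIR if entry["is_dir"] else ATTR_TIMEOUT_FILE
    return (now - entry["fetched_at"]) < timeout
```

A stale entry forces a fresh getattr RPC before the attributes may be trusted again.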

Though our client machine actually had 256MB of main memory available, we constrained it (via a lilo configuration option) to use only the first 20MB. This limits the size of the VFS-level memory page buffering and reduces its effect on the runs. Additionally, it simulates the more realistic scenario of devoting the majority of main memory to the actual workload (few users have 200MB of main memory available for caching disk pages). To further reduce the effect of VFS-level page buffering, we read almost 8MB of unrelated files from the local disk between each pair of the 26 benchmarks; those pages evict the pages from the prior test that might otherwise benefit the following benchmark stage.
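The 20MB constraint can be imposed with the kernel's mem= boot parameter. A minimal lilo.conf sketch follows; the image name, label, and root device are illustrative, not our actual configuration.

```
# /etc/lilo.conf (illustrative fragment)
image=/boot/vmlinuz-2.0.27
    label=nfs-test
    root=/dev/sda1
    append="mem=20M"    # restrict the kernel to the first 20MB of RAM
```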

All of the remote procedure call data was collected using tcpdump running in raw mode on the client. That data was then analyzed after the benchmarks completed.
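tcpdump can write raw packets to a file with its -w flag for offline analysis. A sketch of such an invocation, assuming NFS traffic carried over UDP on the standard port 2049 (the trace file name is illustrative):

```
# capture raw NFS packets on the client; analyze after the benchmarks
tcpdump -w nfs-trace.out udp port 2049
```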

Figures [*] and [*] show elapsed time for each benchmark on the standard Linux 2.0.x NFS client, our enhanced NFS client, and the local disk using ext2fs. Figures [*] and [*] compare the RPC requests generated by the standard client with those generated by our enhanced client.


  
Figure: Elapsed time for single benchmarks
\includegraphics[height=8in]{time.single.eps}


  
Figure: Elapsed time for multiple copies of the benchmarks running in parallel
\includegraphics[height=8in]{time.parallel.eps}


  
Figure: RPC activity during single benchmarks
\includegraphics[height=8in]{rpc.single.eps}


  
Figure: RPC activity during parallel benchmarks
\includegraphics[height=8in]{rpc.parallel.eps}

Figures [*] and [*] illustrate that our enhanced client never generates more RPC traffic than the standard client. In particular, the figures show that whenever a file is cached and re-read, all of the read RPCs are eliminated. The figures also demonstrate the tremendous advantage asynchronous writing provides for the ``small writes'' and ``small reads and writes'' benchmarks.
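The source of the asynchronous-writing advantage can be seen in a toy model: buffer small application writes on the client and ship them to the server in large chunks, so thousands of writes collapse into a handful of write RPCs. This sketch is for illustration only and is not the client's actual implementation; the chunk size and interface are invented.

```python
class WriteBehindFile:
    # Toy write-behind buffer: accumulate small writes locally and
    # flush them to the server in large chunks, instead of issuing
    # one synchronous write RPC per application write.
    def __init__(self, rpc_write, chunk=8192):
        self.rpc_write = rpc_write   # callable(offset, data): one write RPC
        self.chunk = chunk
        self.buf = bytearray()
        self.base = 0                # server file offset of buf[0]
        self.rpcs = 0                # write RPCs issued so far

    def write(self, data):
        self.buf += data
        if len(self.buf) >= self.chunk:
            self.flush()

    def flush(self):
        if self.buf:
            self.rpc_write(self.base, bytes(self.buf))
            self.rpcs += 1
            self.base += len(self.buf)
            self.buf.clear()
```

Replaying the 2,000 appends of the ``small writes'' test through this buffer issues only a few write RPCs rather than 2,000.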

The reduction in RPCs is reflected in Figures [*] and [*]. For all tests except the parallel ``andrew: copying files'' and ``random reads,'' our enhanced client outperformed the standard NFS client. In those two benchmarks we were only 3% and 8% slower, respectively; the difference is attributable to the overhead of caching files to local disk, and to measurement error. In all other tests (and notably all single tests) we outperform the standard NFS client by as much as a factor of 14 when reading already-cached files, and by almost a factor of 100 for ``small writes x 4.''

Much of the benefit in tests that did not profit directly from local disk caching came from reduced getattr RPCs, both because of our larger attribute cache and because of our fix for a performance bug in the standard NFS client that prevented it from exploiting attributes returned as a side effect of other RPCs. For one of the most realistic benchmarks (though not one we specifically targeted with our enhancements), ``andrew: cpu intensive x 4,'' we observed better than a 10% performance improvement over the standard NFS client.
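The piggybacked-attributes fix amounts to updating the attribute cache from every RPC reply that carries attributes, not only from explicit getattr replies. NFSv2 read replies do carry the file's current attributes; the reply and cache layouts below are invented for illustration and are not the kernel's actual structures.

```python
import time

attr_cache = {}  # file handle -> (attributes, time fetched)

def note_attrs(fhandle, attrs):
    # Record attributes regardless of which RPC carried them, so a
    # later stat can be answered from the cache with no getattr RPC.
    attr_cache[fhandle] = (attrs, time.time())

def handle_read_reply(fhandle, reply):
    # An NFSv2 read reply includes the file's current attributes; the
    # fix is simply to feed that piggybacked copy into the cache too.
    note_attrs(fhandle, reply["attrs"])
    return reply["data"]
```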

As mentioned previously, real-world networks are often far slower than the network we used for benchmarking. The relative performance of our enhanced client becomes even more impressive when the relative cost of a remote procedure call increases due to more network traffic or a more distant connection between client and server.


Greg Badros
1998-04-23