This system was implemented and saw limited use on a small number of Sun-4 machines (40MHz SPARCstation II).
The goal of this work is to improve overall file system performance, at least under certain circumstances, and to improve it enough to justify the extra complexity. For this method to really work, the overhead it adds must therefore be small.
We have carried out several measurements aimed at evaluating how well our mechanism meets these goals.
The overhead between switches is that of performance monitoring. We found the added cost of timing every rfscall() too small to measure. The cost of computing medians could be significant, since we retain 300 values, but we implemented a fast incremental median algorithm that requires only a negligible fraction of the time spent in nfs_lookup(). The kernel data structures are less negligible: retaining 300 latency measurements costs about 2KB per file system, the expansion being due to the extra pointers that must be maintained to make the incremental median algorithm work. The extra fields in struct vfs, struct vnode, and struct file are small, with the exception of the DFT, which is large. Each per-filesystem DFT currently has 60 slots, which occupy a total of 1KB-2KB on average.
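The incremental median idea can be sketched as follows. This is an illustrative reconstruction, not the kernel implementation: it keeps the most recent samples both in arrival order and in sorted order, so each new latency measurement updates the median with a binary search rather than a full re-sort. The class and parameter names are hypothetical.

```python
import bisect
from collections import deque

class WindowedMedian:
    """Running median over the most recent `window` latency samples.
    Hypothetical sketch: a deque records arrival order (so the oldest
    sample can be retired) while a parallel sorted list makes the
    median a simple index lookup."""

    def __init__(self, window=300):
        self.window = window
        self.order = deque()   # samples in arrival order
        self.ranked = []       # the same samples, kept sorted

    def add(self, x):
        if len(self.order) == self.window:
            # Retire the oldest sample before admitting the new one.
            old = self.order.popleft()
            self.ranked.pop(bisect.bisect_left(self.ranked, old))
        self.order.append(x)
        bisect.insort(self.ranked, x)

    def median(self):
        n = len(self.ranked)
        mid = n // 2
        if n % 2:
            return self.ranked[mid]
        return (self.ranked[mid - 1] + self.ranked[mid]) / 2
```

The memory cost the text mentions (about 2KB per file system for 300 samples) corresponds to the extra bookkeeping structures beyond the raw values, here represented by the second, sorted copy of the window.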
Our measured overall switch time is approximately 3 seconds: the time between the request for a new file system and the moment the new file system is mounted (messages 1-8 in Figure 3.3). This is comparable to the time our facility needs to mount a file system whose location is already encoded in Amd's maps (about 1-2 seconds in our environment), suggesting that most of the switch time goes to the mount operation itself.
The overhead after a switch consists mostly of doing equivalence checks outside the kernel; the time to access the vfs of the replacement file system and the DFT during au_lookuppn() is immeasurably small. Only a few milliseconds are devoted to calling checksumd: 5-7 msec if the checksum is already computed. This call to checksumd is made once and need not be repeated so long as a record of equivalence remains in the DFT. If checksums have to be computed, the comparison takes about 5-6 msec more per 1MB of data already loaded in memory.
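The equivalence test checksumd performs can be sketched as below. This is a hypothetical reconstruction under stated assumptions: the function name, the choice of MD5, and the 1MB chunk size are illustrative, and the real daemon caches computed checksums so the per-file cost is paid only once.

```python
import hashlib

def files_equivalent(path_a, path_b, chunk=1 << 20):
    """Decide whether two candidate files are identical by comparing
    whole-file digests, reading in 1MB chunks. Illustrative sketch of
    a checksum-based equivalence check, not the paper's checksumd."""
    def digest(path):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                h.update(block)
        return h.digest()

    return digest(path_a) == digest(path_b)
```

Once a pair of files has been found equivalent, the result is recorded in the DFT so subsequent lookups avoid the checksum entirely.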
A major issue is how long to cache DFT entries that indicate equivalence. Being stateless, NFS does not provide any sort of server-to-client cache invalidation. Not caching at all ensures that files on the replacement file system are always equal to those on the master copy, but the repeated comparisons somewhat defeat the purpose of using the replacement. We suppose that most publicly-exported read-only file systems change their contents rarely, and thus that one should cache to the maximum extent. Accordingly, we manage the DFT cache by LRU.
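An LRU-managed DFT of the size described above (60 slots per filesystem) can be sketched as follows. The class name, key type, and stored value are hypothetical; the point is only the eviction policy the text names.

```python
from collections import OrderedDict

class DFTCache:
    """Sketch of an LRU-managed Duplicate File Table: keys stand in for
    file identities, values record whether the replacement copy was
    found equivalent. When full, the least recently used entry is
    evicted. Illustrative only; not the kernel data structure."""

    def __init__(self, slots=60):
        self.slots = slots
        self.entries = OrderedDict()

    def lookup(self, key):
        if key not in self.entries:
            return None            # miss: an equivalence check is needed
        self.entries.move_to_end(key)  # hit: mark as recently used
        return self.entries[key]

    def insert(self, key, equivalent):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = equivalent
        if len(self.entries) > self.slots:
            self.entries.popitem(last=False)  # evict the LRU entry
```

With no server-driven invalidation available, eviction by capacity is the only mechanism that bounds how stale an equivalence record can become in heavy use.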
As mentioned above, switching instability is all but eliminated by disallowing switches more often than once every 5 minutes.
In one experiment, an Emacs process was started from a filesystem on a slow server due to be replaced. We simulated the slowing of the server by artificially raising the round-trip times of the NFS lookup operations going to it. In the midst of our editing within Emacs, the trigger ratio was reached. A flurry of activity ensued, and a replacement was found and mounted. Within only a few seconds the open vnode for the Emacs process was replaced by one on the replacement filesystem, and the process was released from its hung state, letting us resume editing.
Since we map the uid of all outgoing NFS calls to a replacement filesystem to that of ``nobody,'' we avoid the security problems that would arise if users could access files owned by others in a different administrative domain. We chose suitable defaults for the variables of our system so that no changes need be made; however, if necessary, privileged users and system administrators can use our control facilities to tune system parameters.
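The credential-squashing rule amounts to the following. This is a hypothetical sketch: the function, the (uid, gid) pair representation, and the numeric value of ``nobody'' (65534 on many systems, but system-dependent) are all illustrative.

```python
NOBODY_UID = 65534  # conventional "nobody"; the actual value varies by system
NOBODY_GID = 65534

def outgoing_credentials(cred, to_replacement_fs):
    """Sketch of the uid-mapping rule: any NFS call bound for a
    replacement filesystem is issued with the credentials of
    ``nobody''; calls to ordinary filesystems are unchanged.
    `cred` is a hypothetical (uid, gid) pair."""
    if to_replacement_fs:
        return (NOBODY_UID, NOBODY_GID)
    return cred
```

The effect is that a client in a foreign administrative domain can reach only files the replacement server exports to everyone, regardless of any coincidental overlap in user ID numbers.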
Most of the files in our facility reside on read-only file systems. However, sometimes one can be surprised. For example, Emacs is written to require a world-writable lock directory. In this directory Emacs writes files indicating which users have which files in use. The intent is to detect and prevent simultaneous modification of a file by different processes. A side effect is that the ``system'' directory in which Emacs is housed (at our installation, /usr/local) must be exported read-write.
Deployment of our file service spurred us to change Emacs. We wanted /usr/local to be read-only so that we could mount replacements dynamically. Also, at our facility there are several copies of /usr/local per subnet, which defeats Emacs' intention of using /usr/local as a universally shared location. We rewrote Emacs to write its lock files in the user's home directory because (1) for security, our system administrators wish to have as few read-write system areas as possible, and (2) in our environment by far the likeliest scenario of simultaneous modification is between two sessions of the same user, rather than between users.
Note also that it is not necessary for the whole filesystem being switched to be exported read-only; only the parts actually requested must be, provided the operating system supports such partial exports. For example, Solaris 2.x [SMCC90c] allows arbitrary parts of a filesystem to be exported with different permissions, whereas SunOS 4.x [SMCC90d] only allows sibling subdirectories of a filesystem to be exported with different permissions.
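The sibling-subdirectory case can be illustrated with a hypothetical SunOS-style /etc/exports fragment (the directory names and client list are invented for illustration):

```
# Two sibling subdirectories of the /usr/local filesystem, exported
# with different permissions: binaries read-only, the lock area
# read-write to a restricted client list.
/usr/local/bin    -ro
/usr/local/lock   -rw=trusted-clients
```

With the Emacs change described above, the read-write export of the lock area becomes unnecessary and the entire tree can be exported read-only.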
The vfs and vnode interfaces in the kernel greatly simplified our work. In particular, hot replacement proved far easier than we had feared, thanks to the vnode interface. The special out-of-kernel RPC library was also a major help. Nevertheless, work such as ours makes painfully obvious the benefits of implementing file service outside the kernel. The length and difficulty of the edit-compile-reboot-debug cycle, and the primitive debugging tools available for the kernel, were truly debilitating. Recent developments in kernel technologies such as layered kernel modules in Solaris 2.x [SMCC92a] and multi-server systems such as the GNU Hurd [Bushnell94] or the CMU ``US'' server would have been a tremendous help to us.
RLP was designed in 1983, when the evils of over-broadcasting were not as deeply appreciated as they are today and when there were few multicast implementations. Accordingly, RLP is specified as a broadcast protocol. A more up-to-date protocol would use multicast, which would both waste less effort (broadcasts bother hosts that lack an RLP daemon) and reach many more RLP daemons. Not surprisingly, we encountered considerable resistance from our bridges and routers when trying to propagate an RLP request; a multicast RLP request would travel considerably farther.
RLP uses UDP by default. Since UDP does not guarantee delivery, there is a practical limit on how large a single packet can be and still be delivered reliably; large packets are more likely to be delivered incorrectly and to require retransmission. Also, most applications that use UDP avoid exceeding 8KB, because most UDP implementations were written to match the default buffer sizes of NFS and rarely let one approach the protocol limit of 64KB, as described in [Stevens94]. The information needed for our attribute-guided search for filesystems could have reached these limits. A better protocol, using TCP for example, would remove these limitations. It would also be desirable to send out only minimal information and then, if asked, exchange further information with remote resource servers on a ``need to know'' basis; a hierarchical organization such as that used successfully in [Dyer88,Mockapetris87a,Mockapetris87b,SMCC93,Noor94] might be more suitable.
Finally, the fact that we used only RLP's ``catch-all'' message format (see Section 4.1), and not the formats it was primarily designed around, is further evidence of its unsuitability for this work.
NFS is ill-suited for ``cold replacement'' (i.e., new opens on a replacement file system) caused by mobility, but is well suited for ``hot replacement'' because of its statelessness.
NFS' lack of cache consistency callbacks has long been bemoaned, and it affects this work since there is no way to invalidate DFT entries. Since we restrict ourselves to slowly-changing read-only files, the danger is assumed to be limited, but is still present. Most newer file service designs include cache consistency protocols. However, such protocols are not necessarily a panacea. Too much interaction between client and server can harm performance, especially if these interactions take place over a long distance and/or a low bandwidth connection. See [Tait92] for a design that can ensure consistency with relatively little client-server interaction.
The primary drawback of using NFS for mobile computing is its limited security model. Not only can a client from one domain access files in another domain that are made accessible to the same user ID number, but even a well-meaning client cannot prevent itself from doing so, since there is no good and easy way to tell when a computer has moved into another uid/gid domain.
Our work was based on version 2 of the NFS protocol. Version 3 [Pawlowski94,SMCC94] fixes some of the problems of version 2: for example, it allows the use of TCP, dynamically adjusted buffer sizes, and asynchronous writes, all of which would improve performance over wide-area networks.
NFS version 3 also provides better support for security: