Next: 4. Implementation Up: Discovery and Hot Replacement Previous: 2. Background

Subsections

3. Design

The key issues we see in this work are:

1.: Is a switching mechanism really needed? Why not use the same file systems no matter where you are?
2.: When and how to switch from one replica to another.
3.: How to ensure that the new file system is an acceptable replacement for the old one.
4.: How to ensure consistency if updates are applied across different replicas.
5.: Fault tolerance: how to protect a client from server unavailability.
6.: Security: NFS is designed for a local ``workgroup'' environment in which the space of user IDs is centrally controlled.

These issues are addressed below. Not all implementation details are mentioned in this section. The rest of the implementation which did not have direct impact on the design of our system is detailed in Chapter 4.

3.1 Demonstrating the Need

We contend that adaptive client-server matchups are desirable because running file system operations over many network hops is bad for mobile and/or wide area computing in three ways: increased latency, increased failures, and decreased scalability. It is hard to ascertain exact failure rates and load on shared resources without undertaking a full-scale network study; however, we were able to gather some key data to support our claim. We performed a simple study to measure how latency increases with distance.

First, we used the traceroute program^3.1 to gather <hop-count, latency> data points measured between a host at Columbia and several other hosts around the campus, city, region, and continent. The results are listed in Table 3.1.

Hostname	NFS Time	Ping Time	No. Hops
	(seconds)	(mSec)
`ground.cs.columbia.edu`	4.55	4	1
`tune.cs.columbia.edu`	8.60	6	2
`sol.ctr.columbia.edu`	10.68	9	3
`omnigate.clarkson.edu`	339.26	482	10
`gatekeeper.dec.com`	97.90	198	12
`mvb.saic.com`	157.58	287	13
`pit-manager.mit.edu`	138.01	158	13
`wuarchive.wustl.edu`	401.60	979	18
`zaphod.ncsa.uiuc.edu`	183.06	227	19

Latencies were measured by a Columbia host, which is a Sun-4/75 equipped with a microsecond resolution clock. The cost of entering the kernel and reading the clock is negligible, and so the measurements are accurate to a small fraction of a millisecond.

We used the ping program with 1024-byte packets because the default size of ping packets (56 bytes) is too small to mimic NFS traffic.

Next, we mounted NFS file systems that are exported Internet-wide by certain hosts. We measured the time needed to copy 1MB from these hosts using a 1KB block size. We took measurements with two different datagram sizes: 40 bytes and 8KB, hypothesizing that distance degrades performance worse when packets are large. Forty bytes is the minimum packet sent by traceroute; eight kilobytes was chosen since it is the size of NFS block transfers in our facility. A typical result is plotted in Figure 3.1.

**Figure 3.1:** NFS Read Latency vs. Network Hop Count (1KB block size)
$\begin{figure} \rule{\linewidth}{1pt} \epsfxsize=\linewidth \epsffile{latency-vs-hops.ps}\rule{\linewidth}{1pt}\end{figure}$

Latency jumps by almost two orders of magnitude at the tenth hop, which represents the first host outside Columbia. Each plotted point is the median of 10 trials. We performed similar studies, measuring latency to 25 hosts that are each a number of hops away from Columbia; all results were similar to those plotted in Figure 3.1.

We conclude that (current) NFS performance over long distances is poor; too many UDP packets are lost resulting in retransmission requests of whole blocks. However, if we could constrain ourselves to using NFS servers in our ``neighborhood,'' we could keep performance reasonable.

3.2 When to Switch

We have modified the kernel so that rfscall() measures the latency of every NFS lookup and maintains a per-filesystem data structure storing a number of recently measured latencies.

**Figure 3.2:** Variability and Latency of NFS Operations
$\begin{figure} \rule{\linewidth}{1pt} \epsfxsize=\linewidth \epsffile{nfs-op-variability.ps}\rule{\linewidth}{1pt}\end{figure}$

We chose to time the lookup operation rather than any other operation or mixture of operations for two reasons. The first is that lookup is the most frequently invoked NFS operation. We experimented and found that other calls did not generate enough data points to accurately characterize latency. The second reason is that lookup exhibits the least performance variability of the common NFS operations.^3.2 Limiting variability of measured server latencies is important in our work, since we want to distinguish transient changes in server performance from long-term changes. It took lots of experimentation to come to a meaningful and stable measurement mechanism. At the outset of our work, we measured variances in the latency of the most common NFS operations and discovered huge swings, shown in Figure 3.2, even in an extended LAN environment that has been engineered to be uniform and not to have obvious bottlenecks. The measured standard deviations were 1027 msec for all NFS operations, 2547 msec for read, and only 596 msec for lookup.

After addition of each newly measured lookup operation, the median latency is computed over the last 30 and 300 calls. We compute medians because medians are relatively insensitive to outliers. We take a data point no more than once per second, so during busy times these sampling intervals correspond to 30 seconds and 5 minutes, respectively.

The signal to switch is when, at any moment, the short-term median latency exceeds the long-term median latency by a factor of 2. This policy provides insurance against anomalies like ping-pong switching between a pair of file systems: a file system can be replaced no more frequently than every 5 minutes. Looking for a factor of two difference between short-term and long-term medians is our attempt to detect a change in performance which is substantial and ``sudden,'' yet not transient. The length of the short-term and long-term medians as well as the ratio used to signal a switch are heuristics chosen after experimentation in our environment.^3.3 All these parameters can be changed from user level through a debugging system call that we have added; see Section 4.2.

3.3 Locating a Replacement

When a switch is triggered, rfscall() starts a non-blocking RPC^3.4 out to our user-level process named nfsmgrd. The call names the guilty file server, the root of the file system being sought, the kernel architecture, the current median (round-trip time) values, and any mount options affecting the file system. Nfsmgrd uses this information to compose and broadcast an RLP request. The file system name keys the search, while the server name is a filter: the search must not return the same file server that is already in use.

The RLP message is received by the nfsmgrd at other sites on the same broadcast subnet. To formulate a proper response, an nfsmgrd must have a view of mountable file systems stored at its site and also mounted file systems that it is using -- either type could be what is being searched for. Both pieces of information are trivially accessible through /etc/fstab, /etc/exports, and /etc/mtab.

The nfsmgrd at the site that originated the search uses the first response it gets; we suppose that the speed with which a server responds to the RLP request gives a hint about its future performance. (The Sun Automounter [Callaghan89] makes the same assumption about replicated file servers.) If a read-only replacement file system is available, nfsmgrd instructs Amd to mount it^3.5 and terminates the out-of-kernel RPC, telling the kernel the names of the replacement server and file system. The flow of control is depicted in Figure 3.3.

**Figure 3.3:** Flow of Control During a Switch
$\begin{figure} \rule{\linewidth}{1pt} \epsfxsize=\linewidth \epsffile{flow-ctl-switch.ps}\rule{\linewidth}{1pt}\end{figure}$

3.4 Using the Replacement

Once a replacement file system has been located and mounted, all future attempts to open files on the replaced file system will be routed to the replacement whenever the two files are identical. Also, in all cases for which it is possible, open files on the replaced file system will be switched to their equivalents on the replacement. We describe these two cases in Sections 3.4.2 and 3.4.3, respectively.

3.4.1 Relevant Changes to Kernel Data Structures

In order to accommodate file system replacement, we have added some fields to three important kernel data structures: struct vfs, which describes mounted file systems; struct vnode, which describes open files; and struct file, which describes file descriptors.

3.4.1.1 `struct vfs`

See Figure 2.1 for a complete listing of this structure, and Section 2.1.1 for explanation of important fields in the structure as it existed before our changes. The fields we added to struct vfs are:

vfs_mnt_path is valid in every vfs; it stores the absolute pathname of the file system's mount point. It is used to locate a vfs based on its mount-point name, and in the construction of full pathnames from the root to the actual file being replaced (concatenating this field with the relative pathname of the file from mount-point).
vfs_replaces is valid in the vfs structure of the replacement file system; it points to the vfs structure of the file system being replaced.
vfs_replaced_by is valid in the replaced file system's vfs struct; it points to the vfs structure of the replacement file system. When a replacement file system is mounted, our altered version of nfs_mount() sets the replaced and replacement file systems pointing to each other.
vfs_nfsmgr_flags is valid for any NFS file system. One flag indicates whether the file system is managed by nfsmgrd and is turned on by our nfsmgr_ctrl() system call. The idea is to provide the administrators of the client host the ability to control which filesystems are managed by our code, and suspend any use of it when needed.
Another flag indicates whether a file system switch is in progress. Part of the filesystem switching process involve making asynchronous out-of-kernel RPCs. The calls invoke local daemons, which may call other local daemons and remote servers. Obviously the initiating process is blocked until all answers are returned to the kernel, or if timeouts are reached. That opens the possibility that another process might hit a threshold, and cause another search for a replacement filesystem to the same one in progress. To avoid this problem, only one process is allowed to be the (inadvertent) initiator of a filesystem search; all other processes see that the flag is on, and block until the flag is off.
vfs_median_info contains almost all of the pertinent information about the performance of the file system, including the 300 most recent nfs_lookup() response times, the 30 most recent measures, the trigger ratio, many pointers in support of the double-threading nature of this data structure, and more.
The field vfs_dft is the Duplicate File Table (DFT). This per-filesystem table lists which files in the replacement file system have been compared to the corresponding file on the original file system mounted by Amd. Only equivalent files can be accessed on the replacement file system. The mechanism for making comparisons is described in Section 3.4.2.
The size of the DFT is fixed (but changeable) so that new entries inserted will automatically purge old ones. This is a simple method to maintain ``freshness'' of entries.
The DFT is a hash table whose entries contain a file pathname relative to the mount point, a pointer to the vfs structure of the replacement file system, and an extra pointer for threading the entries in insertion order. This doubly-threaded data structure permits fast lookups keyed by pathname and quick purging of older entries.
vfs_queue_size tells how many entries were allocated in the DFT. It is used especially when a DFT needs to be purged; we improve speed by not freeing and then re-allocating kernel memory for DFT entries.

3.4.1.2 `struct vnode`

See Figure 2.2 for a complete listing of this structure, and Section 2.1.2 for explanation of important fields in the structure as it existed before out changes. The two fields we added to struct vnode are:

v_relpath is valid in every vnode which may be replaced. It stores the relative pathname of the file from the root of the mounted filesystem. It is used to construct full pathnames of two files to be compared, and in determining the filename of an opened file during the process of hot replacement. See Sections 3.4.2 and 3.4.3.
v_last_used is valid for every NFS vnode. This is the last time that rfscall() made a remote call on behalf of this vnode. This information is used in our ``hot replacement''.

3.4.1.3 `struct file`

See Figure 2.3 for a complete listing of this structure, and Section 2.1.3 for explanation of important fields in the structure as it existed before our changes. The only field we added to struct file is:

f_path is valid in every open file on an NFS filesystem. It is very similar to the v_relpath field added to the vnode structure. This field is necessary to distinguish among several opened file-descriptors to the same hard-linked file, when the open() calls used different path names. See Section 2.1.3 for details of the interaction between the file and vnode structures in the SunOS 4.x kernel.

3.4.2 After Replacement: Handling New Opens

When Amd mounts a file system it makes a symlink from the desired location of the file system to the mount point. For example, /u/foo would be a symlink pointing to the real mount point of /n/bar/u/foo; by our local convention, this would indicate that server bar exports /u/foo. Users and application programs know only the name /u/foo.

The information that bar exports a proper version of /u/foo is placed in Amd's mount-maps by system administrators who presumably ensure that the file system bar:/u/foo is a good version of whatever /u/foo should be. Therefore, we regard the information in the client's Amd mount-maps as authoritative, and consider any file system that the client might mount and place at /u/foo as a correct and complete copy of the file system. We call this file system the master copy, and use it for comparison against the replacement file systems that our mechanism locates and mounts.

The new open algorithm is shown in Figure 3.4. After a replacement file system has been mounted, whenever name resolution must be performed for any file on the replaced file system, the file system's DFT is first searched for the relative pathname.

**Figure 3.4:** New Open Algorithm
$\begin{figure} \rule{\linewidth}{1pt} \begin{tex2html_preform}\begin{verbatim}op... ...r copy; }\end{verbatim}\end{tex2html_preform} \rule{\linewidth}{1pt}\end{figure}$

If the DFT contains an entry for the pathname, then the file on the replacement file system has already been compared to its counterpart on the master copy. A field in the DFT tells if the comparison was successful or not. If not, then the rest of the pathname has to be resolved on the master copy. If the comparison was successful, then the file on the replacement file system is used; in that case, name resolution continues at the root of the replacement file system.

If the DFT contains no entry for the pathname, then it is unknown whether the file on the replacement file system is equivalent to the corresponding file on the master copy. The two files must be compared.

To test equivalence, au_lookuppn() calls out of the kernel to nfsmgrd, passing it the two host names, the name of the file system, and the relative pathname to be compared. Au_lookuppn() gathers all this information as follows:

1.: The filesystem switch trigger happens while resolving a pathname, finding that the v_vfsmountedhere field of that current vnode is not NULL, indicating that this is a mount point. The vfs_mnt_path of this vfs is recorded as the mount point of the master filesystem.
2.: If the vfs_replaced_by field of the vfs of that vnode is not NULL, there is a replacement filesystem for this vfs. The vfs_mnt_path of it is recorded as the mount point of the replacement filesystem.
3.: Since au_lookuppn() did not complete resolving the full pathname, the unresolved part is the relative (to the mount point) pathname to the file being resolved.

Each of the two mount point names (one per host) is prepended to the name of the file to be compared, yielding two full pathnames to (hopefully) the same file, but on two different filesystems.

At this point a partial DFT entry is constructed, and a flag in it is turned on to indicate that there is a comparison in progress and that no other process should initiate the same comparison.^3.6

**Figure 3.5:** Flow of Control During File Comparison
$\begin{figure} \rule{\linewidth}{1pt} \epsfxsize=\linewidth \epsffile{flow-ctl-compare.ps}\rule{\linewidth}{1pt}\end{figure}$

Nfsmgrd then applies, at user level, whatever tests might be appropriate to determine whether the two files are equivalent. This flow of control is depicted in Figure 3.5. A more complete example of our new pathname resolution scheme is shown in Figure 3.6. Presently, we are performing file checksum comparison: nfsmgrd calls a checksumd daemon on each of the file servers, requesting the checksum of the file being compared. Checksumd, which we have written for this work, computes MD4 file checksums [Rivest91] on demand and then stores them for later use; checksums can also be pre-computed and stored.

**Figure 3.6:** An Example of the New Pathname Resolution Algorithm
$\begin{figure} \rule{\linewidth}{1pt} \epsfxsize=\linewidth \epsffile{path-repl.ps}\rule{\linewidth}{1pt}\end{figure}$

Nfsmgrd collects the two checksums, compares them, and responds to the kernel, telling au_lookuppn() which pathname to use, indicating the file on the replacement file system if possible. Au_lookuppn() completes the construction of the DFT entry, unlocks it, and marks which vfs is the proper one to use whenever the same pathname is resolved again.

In this fashion, all new pathname resolutions are re-directed to the replacement file system whenever possible.

Note that the master copy could be unmounted (e.g., Amd by default unmounts a file system after a few minutes of inactivity^3.7), and this would not affect our mechanism. The next (new) use of a file in that file system would cause some master copy to be automounted, before any of our code is encountered.

3.4.3 After Replacement: Handling Files Already Open

When a file system is replaced, it is possible that some files will be open on the replaced file system at the moment when the replacement is mounted. Were these processes to continue to use the replaced file system, several negative consequences might ensue. First, since the replacement is presumed to provide faster response, the processes using files open on the replaced file systems experience worse service. Second, since the total number of mounted file systems grows as replacements happen, the probability rises that some file system eventually becomes unavailable and causes processes to block. Further, the incremental effect of each successive file system replacement operation is reduced somewhat, since files that are open long-term do not benefit from replacement. Finally, kernel data structures grow larger as the number of mounted file systems climbs. Motivated by these reasons, we decided to switch open files from the replaced file system to the replacement file system whenever the file on the replacement file system is equivalent to that on the master copy.

Although this idea might at first seem preposterous, it is not, since we restrict ourselves to read-only file systems. We assume that files on read-only file systems^3.8 change very infrequently and/or are updated with care to guard against inconsistent reads.^3.9 Whether operating conditions uphold this assumption or not, the problem of a file being updated^3.10 while being read exists independently of our work, and our work does not increase that danger.

We allow for a replacement file system to be itself replaced. This raises the possibility of creating a ``chain'' of replacement file systems. Switching vnodes from the old file system to its replacement limits this chain to length two (the master copy and the current replacement) in steady state. For example, if host A is replaced by host B, and host B is replaced by C, we ``short circuit'' the replacement information stored in the vfs structure and allow for C to directly replace A; see Sections 3.4.1.1 and 4.2. This saves us unnecessary comparisons and delays while we have to contact more hosts. After all, host B has already been determined as ``bad'', so there is no need to use it any longer.

Hot replacement requires knowing pathnames. Thanks to our changes, the vfs structure records the pathname it is mounted on and identifies the replacement file system; also, the relative pathname of the file is stored in the file table entry. This information is extracted, combined with the host names, and passed out to nfsmgrd to perform comparison, as described above. If the comparison is successful, the pathname on the replacement file system is looked up, yielding a vnode on the replacement file system. This vnode simply replaces the previous vnode in all entries in the open file table. This results in a switch the next time a process uses an open file descriptor.

The ``hot replacement'' code scans through the global open file table, keying on entries by (a pointer to the) vfs. Once an entry is found that uses the file system being replaced, a secondary scan locates all other entries using the same vnode. In a single entry into the kernel (i.e., ``atomically''), all file descriptors pointing to that vnode are switched, thereby avoiding complex questions of locking and reference counting. A table scan is necessary because while vnodes point to their vfs, the reverse is not true.

Hot replacement is made possible by the statelessness of NFS and by the vfs/vnode interfaces within the kernel. Since the replaced server keeps no state about the client, and since the open file table knows only a pointer to a vnode, switching this pointer in every file table entry suffices to do hot replacement.

An interesting issue is at which time to perform the hot replacement of vnodes. Since each file requires a comparison to determine equivalence, switching vnodes of all the open files of a given file system could be a lengthy process. The four options we considered are:

1.: Switch as soon as a replacement file system is mounted (the early approach).
2.: Switch only if/when an RPC for that vnode hangs (the late approach).
3.: Switch if/when the vnode is next used (the ``on-demand'' approach).
4.: Switch whenever a daemon instructs it to (the ``flexible'' approach).

The decision to switch earlier or later is affected by the tradeoff that early switching more quickly switches files to the faster file system and improves fault tolerance by reducing the number of file systems in use, but possibly wastes effort. Vnode switching is a waste if the switched vnode will not be used again. Early switching also has the disadvantage of placing the entire delay of switching onto the single file reference that is unlucky enough to be the next one.

We chose the ``flexible'' approach of having a daemon make a system call into the kernel which then sweeps through the open file table and replaces some of the vnodes which can be replaced. We made this choice for three reasons. First, we lacked data indicating how long a vnode lingers after its final use. Second, we suspected that such data, if obtained, would not conclusively decide the question in favor of an early or late approach. Third, the daemon solution affords much more flexibility, including the possibility of more ``intelligent'' decisions such as making the switch during an idle period.

We emphasize that the system call into the kernel switches ``some'' of the vnodes, since it may be preferable to bound the delay imposed on the system by one of these calls. Two such bounding policies that we have investigated are, first, switching only N vnodes per call, and, second, switching only vnodes that have been accessed in the past M time units. Assuming that file access is bursty (a contention supported by statistics [Ousterhout85,Ruemmler93]), the latter policy reduces the amount of time wasted switching vnodes that will never be used again. We are currently using this policy of switching only recently used vnodes; this policy makes use of the v_last_used field that we added to the vnode structure.

Note that if the time since last use is chosen such that vnodes used since the last run of the daemon are switched, then processes will not remain hung indefinitely waiting for a remote file system (assuming that the file causing the hang can be replaced). Since rfscall() will have timestamped the vnode at the beginning of the hung call, the next run of the daemon will cause an attempt to switch the vnode of the hung call. If we want to make sure processes are ``unhung'' sooner we could increase the frequency of the daemon's run.

However, vnodes that cannot be replaced still pose a rather difficult problem. They can easily bring a machine down, especially if the processes that are hung on the RPC call are vital to the stability of the system.

3.5 Security

The NFS security model is the simple uid/gid borrowed from UNIX, and is appropriate only in a ``workgroup'' situation where there is a central administrative authority. Transporting a portable computer from one NFS user ID domain to another presents a security threat, since processes assigned user ID X in one domain can access exported files owned by user ID X in the second domain.

Accordingly, we have altered rfscall() so that every call to a replacement file system has its user ID and group ID both mapped to ``nobody'' (i.e., value -2). Therefore, only world-readable files on replacement file systems can be accessed. Of course any files left owned by ``nobody'' will be at risk.

3.6 Code Size

Counting blank lines, comments, and debugging support, we have written close to 11,000 lines of C. More than half is for user-level utilities: 1200 lines for the RLP library and daemon, 3200 for nfsmgrd, 700 lines for checksumd, and 1200 lines for a control utility (called nfsmgr_ctl). New kernel code totals 4000 lines, of which 800 are changes to SunOS, mostly in the NFS module. The remaining 3200 lines comprise the four modules we have added: 880 lines to deal with storing and computing medians; 780 lines are the ``core NFS management'' code, which performs file system switching, pathname storage and replacement, and out-of-kernel RPC; 540 lines to manage the DFT; and 1000 lines to support the nfsmgr_ctl system call.

The nfsmgr_ctl system call allows query and control over almost all data structures and parameters of the added facility. We chose a system call over a kmem program for security. This facility was used heavily during debugging; however, it is meant also for system administrators and other interested users who would like to change these ``magic'' variables to values more suitable for their circumstances. See Section 4.2 for more details.

Next: 4. Implementation Up: Discovery and Hot Replacement Previous: 2. Background

Erez Zadok
1999-02-17