2. Background

Our work is implemented in and on SunOS 4.1.2. We have changed the kernel's client-side NFS implementation, and outside the operating system we have made use of the Amd automounter and the RLP resource location protocol. Each is explained briefly below.

   
2.1 NFS

Particulars about the NFS protocol and implementation are widely known and published [Blaze92,Hitz94,Juszczak89,Juszczak94,Keith90,Keith93,Kleiman86,Macklem91,Pawlowski94,Rosen86,Rosenthal90,Sandberg85a,Sandberg85b,Schaps93,Srinivasan89,Stein87,Stern92,Sun85,Sun86,Sun89,Walsh85,Watson92].

For the purpose of our presentation, the only uncommon facts that need to be known are:

  • Translation of a path name to a vnode is done mostly within a single procedure, called au_lookuppn(), that is responsible for detecting and expanding symbolic links and for detecting and crossing mount points.

  • Much of our code is triggered at the point during pathname translation where a mount point is crossed.

  • The procedure in which an NFS client makes RPCs to a server is rfscall(). All of the NFS operation-specific functions (such as nfs_close(), nfs_statfs(), nfs_getattr(), etc.) eventually call rfscall() with an operation code and an opaque data structure to be interpreted by the NFS server. Rfscall() calls an out-of-kernel RPC routine; it also times out that call and prints the infamous message ``NFS server XXX not responding -- still trying''. A schematic of this call path is sketched below.
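The fragment below is only a schematic sketch of how an operation-specific routine funnels into rfscall(); it is not the SunOS 4.1.2 source. Only the names rfscall() and nfs_getattr() come from the discussion above; every type, argument list, and operation code shown here is a simplified stand-in.

    struct vnode;                  /* stand-in forward declarations */
    struct vattr;
    struct ucred;

    /*
     * rfscall(): send one RPC carrying an operation code and opaque
     * request/reply data, retrying on timeout and printing
     * "NFS server XXX not responding -- still trying".
     */
    extern int rfscall(struct vnode *vp, int opcode,
                       void *request, void *reply, struct ucred *cr);

    #define RFS_GETATTR 1          /* hypothetical operation code */

    /*
     * Each operation-specific function marshals its arguments into an
     * opaque structure and eventually calls rfscall().
     */
    int
    nfs_getattr(struct vnode *vp, struct vattr *vap, struct ucred *cr)
    {
            char reply[128];       /* opaque reply, decoded by the client code */

            (void) vap;            /* a real routine would fill *vap from the reply */
            return rfscall(vp, RFS_GETATTR, (void *)0, reply, cr);
    }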

We have made substantial alterations to au_lookuppn(), and slight alterations to rfscall(), nfs_mount(), nfs_unmount(), and copen().

We added two new system calls: one for controlling and querying the added structures in the kernel (nfsmgr_ctrl()), and the other for debugging our code (nfsmgr_debug()). Additional minor changes in support of debugging were made to ufs_mount() and tmp_mount().
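As a rough illustration of how these calls might be issued from user level, the wrappers below assume system call numbers chosen purely for this sketch; the real numbers and argument conventions belong to our kernel changes and are not reproduced here.

    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical syscall numbers -- illustrative only. */
    #define SYS_nfsmgr_ctrl   181
    #define SYS_nfsmgr_debug  182

    int
    nfsmgr_ctrl(int cmd, void *arg, int arglen)
    {
            /* control and query the structures we added to the kernel */
            return syscall(SYS_nfsmgr_ctrl, cmd, arg, arglen);
    }

    int
    nfsmgr_debug(int level)
    {
            /* set the debugging level of our kernel code */
            return syscall(SYS_nfsmgr_debug, level);
    }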

Finally, we added fields to three major kernel data structures: vfs and vnode structures and the open file table. Below we show these modified structures and describe their most relevant fields.

   
2.1.1 struct vfs

A vfs is the structure for a Virtual File System [Kleiman86]. A singly-linked list of such structures exists in the kernel; its head is the global rootvfs -- a hand-crafted structure for the root filesystem. This structure was substantially modified (see Figure 2.1), which is not surprising, since most of our work is related to managing filesystems as a whole.


  
Figure 2.1: Modified struct vfs

The fields of interest are:

  • vfs_next is a pointer to the next vfs in the linked list.

  • vfs_op is a pointer to a table of function pointers. This table may hold pointers to UFS functions, NFS functions, PCFS, HSFS, and so on. When the vnode interface needs to mount the file system, it calls whichever field of struct vfsops is designated for the mount operation. That is how the transition from the vnode level to a filesystem-specific level is made; see also Section 6.1.1.3.

  • vfs_vnodecovered is the vnode on which this filesystem is mounted.

  • vfs_flag contains bit flags for characteristics such as whether this filesystem is mounted read-only, if the setuid/setgid bits should be turned off when exec-ing a new process, if sub-mounts are allowed, etc.

  • vfs_data is a pointer to opaque data specific to this vfs and the type of filesystem this one is. For an NFS vfs, this would be a pointer to struct mntinfo (located in <nfs/nfs_clnt.h>) -- a large NFS-specific structure containing such information as the NFS mount options, NFS read and write sizes, host name, attribute cache limits, whether the remote server is down or not, and more.

The fields specific to our work are described in Section 3.4.1.1.
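As a trimmed reconstruction from the fields discussed above (types simplified; the real SunOS header contains more fields, plus the additions shown in Figure 2.1), struct vfs looks roughly like this:

    struct vfsops;                 /* table of filesystem-specific operations */
    struct vnode;

    struct vfs {
            struct vfs     *vfs_next;          /* next vfs in the kernel's linked list */
            struct vfsops  *vfs_op;            /* function-pointer table (UFS, NFS, PCFS, ...) */
            struct vnode   *vfs_vnodecovered;  /* vnode this filesystem is mounted on */
            int             vfs_flag;          /* read-only, nosuid, sub-mounts allowed, ... */
            char           *vfs_data;          /* fs-private data, e.g. struct mntinfo for NFS */
            /* ... our additions, described in Section 3.4.1.1 ... */
    };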

   
2.1.2 struct vnode

This structure was only slightly modified; see Figure 2.2. A vnode exists for each open file or directory. The parts of the kernel that access vnodes directly are the filesystem modules, so a vnode is a representation of an open file from the filesystem's point of view. Only one vnode exists for each open file, no matter how many processes have opened it, or even if the file has several names (via hard or symbolic links).


  
Figure 2.2: Modified struct vnode

Structure fields relevant to our work are:

  • v_flag contains bit flags for characteristics such as whether this vnode is the root of its filesystem, if it has a shared or exclusive lock, whether pages should be cached, if it is a swap device, etc.

  • v_count is incremented each time a new process opens the same vnode.

  • v_vfsmountedhere, if non-null, points to the vfs that is mounted on this vnode. Such a vnode is thus a directory that serves as the mount point of a mounted filesystem.

  • v_op is a pointer to a table of function pointers. This table may hold pointers to UFS functions, NFS functions, PCFS, HSFS, and so on. When the vnode interface needs to open a file, it calls whichever field of struct vnodeops is designated for the open operation. That is how the transition from the vnode level to a filesystem-specific level is made; see also Section 6.1.1.3.

  • v_vfsp is a pointer to the vfs to which this vnode belongs. If v_vfsmountedhere is non-null, the filesystem pointed to by v_vfsp is also said to be the parent of the one mounted here.

  • v_type is used to distinguish between a regular file, a directory, a symbolic link, a block/character device, a socket, a Unix pipe (fifo), etc.

  • v_data is a pointer to opaque data specific to this vnode. For an NFS vnode, this might be a pointer to struct rnode (located in <nfs/rnode.h>) -- a remote filesystem-specific structure containing such information as the file handle, owner, user credentials, file size (client's view), and more.

The fields specific to our work are described in Section 3.4.1.2.
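A similarly trimmed reconstruction of struct vnode, again with simplified types and covering only the fields above, looks roughly like this:

    struct vnodeops;               /* table of filesystem-specific operations */
    struct vfs;

    struct vnode {
            unsigned short    v_flag;           /* root of fs, locking, caching, swap, ... */
            unsigned short    v_count;          /* reference count of opens */
            struct vfs       *v_vfsmountedhere; /* vfs mounted on this directory, if any */
            struct vnodeops  *v_op;             /* function-pointer table (UFS, NFS, PCFS, ...) */
            struct vfs       *v_vfsp;           /* vfs to which this vnode belongs */
            int               v_type;           /* VREG, VDIR, VLNK, VCHR, ... */
            char             *v_data;           /* fs-private data, e.g. struct rnode for NFS */
    };

The vnode-to-filesystem transition mentioned under v_op is, schematically, a call through this table -- something of the form (*vp->v_op->vn_open)(...) for the open operation (the exact field names of struct vnodeops are not shown above).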

   
2.1.3 struct file

This structure was also only slightly modified; see Figure 2.3. A file structure exists for each file opened by a process. The kernel modules that access this structure directly are those that handle processes and user contexts; a struct file is therefore a representation of an open file from the user's and the process's points of view. The interactions between struct file and struct vnode are demystified after a brief explanation of the relevant fields in this structure.


  
Figure 2.3: Modified struct file

Fields of use to us are:

  • f_flag contains bit flags for characteristics such as whether this file is readable/writable/executable by the current process, if it was created new or opened for appending, [non]blocking modes, and many more.

  • f_type indicates whether this file is a ``real'' vnode or just a network socket.

  • f_count is incremented for each process referring to the same file in the Global Open File Table.

  • f_data is a pointer to opaque, type-specific data: a vnode or a socket, depending on f_type.

  • f_offset is the offset into the file.

The fields specific to our work are described in Section 3.4.1.3.
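In the same spirit, a trimmed sketch of the open-file entry (types simplified, and only the fields above shown) is:

    #include <sys/types.h>

    struct file {
            int      f_flag;       /* readable/writable, append, non-blocking, ... */
            short    f_type;       /* vnode or socket */
            short    f_count;      /* number of references to this entry */
            char    *f_data;       /* vnode or socket pointer, depending on f_type */
            off_t    f_offset;     /* current offset into the file */
            /* ... our additions, described in Section 3.4.1.3 ... */
    };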

There is only one Global Open File Table in the kernel. It has a limited size, with some provisions to extend it dynamically if need be. Each u (user-specific) structure has an array of pointers to its open files; these u.u_pofile_arr[idx] entries are pointers into the global open file table.

When two different processes open the same file (by name or by link), they get two different struct file entries in the global open file table. Each file structure contains an f_offset field, so that each process can maintain a different offset. Each file structure, however, will have an f_data field that points to the same vnode.

The vnode structure contains the flags needed for performing advisory locking [SMCC90a,SMCC90b], and has a reference count of how many processes opened it.

Things get more complicated when a process opens a file and then forks. The child inherits the same file structure pointer that the parent has. That means that if the child seeks elsewhere in the file, the parent's offset moves too, since both processes share the same f_offset field!

The last missing piece of information is how the kernel tells that more than one process is sharing the same entry in the global file table. The answer is that each file structure contains an f_count field -- a reference count similar to, but distinct from, the one in the vnode structure.
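These sharing rules can be observed entirely from user level. The short program below (a present-day sketch, independent of our kernel changes) opens the same file twice and then forks: the child's seek on the first descriptor is visible to the parent, because parent and child share that file structure, while the second descriptor keeps its own f_offset even though both descriptors name the same vnode.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd1 = open("/etc/passwd", O_RDONLY);   /* first open: one file entry */
            int fd2 = open("/etc/passwd", O_RDONLY);   /* second open: another entry,
                                                          same vnode, its own f_offset */
            if (fd1 < 0 || fd2 < 0)
                    return 1;

            if (fork() == 0) {
                    /* The child inherits pointers to the SAME two entries,
                     * so seeking fd1 here moves the parent's offset too. */
                    lseek(fd1, 100, SEEK_SET);
                    _exit(0);
            }
            wait(NULL);

            printf("fd1 offset = %ld (expect 100, shared across fork)\n",
                   (long) lseek(fd1, 0, SEEK_CUR));
            printf("fd2 offset = %ld (expect 0, separate file entry)\n",
                   (long) lseek(fd2, 0, SEEK_CUR));
            return 0;
    }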

   
2.2 RLP

We use the RLP resource location protocol [Accetta83] when seeking a replacement file system. RLP is a general-purpose protocol that allows a site to send broadcast or unicast request messages asking either of two questions:

1.
Do you (recipient site) provide this service?

2.
Do you (recipient site) know of any site that provides this service?

A service is named by the combination of its transport service (e.g., TCP), its well-known port number as listed in /etc/services, and an arbitrary string that has meaning to the service. Since we search for an NFS-mountable file system, our RLP request messages contain the NFS transport protocol (UDP [rfc0768]), the port number (2049), and service-specific information such as the name of the root of the file system.
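The sketch below shows the general shape of such a broadcast query from user level. The payload is deliberately not the real RLP encoding (that is defined in [Accetta83]); it merely carries the pieces of information listed above, and the destination port is assumed to be the ``rlp'' entry (39/udp) in /etc/services.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define RLP_PORT 39            /* assumed /etc/services entry for RLP */

    int
    main(void)
    {
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            int on = 1;
            struct sockaddr_in to;
            char request[256];

            setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

            /* Illustrative payload only -- not the RLP wire format. */
            snprintf(request, sizeof(request),
                     "do-you-provide? transport=udp port=2049 fsroot=/u/foo");

            memset(&to, 0, sizeof(to));
            to.sin_family = AF_INET;
            to.sin_port = htons(RLP_PORT);
            to.sin_addr.s_addr = htonl(INADDR_BROADCAST);

            sendto(s, request, strlen(request), 0,
                   (struct sockaddr *) &to, sizeof(to));
            close(s);
            return 0;
    }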

   
2.3 Amd

Amd [Pendry91,Stewart93] is a widely-used automounter daemon. Its most common use is to demand-mount file systems and later unmount them after a period of disuse; however, Amd has many other capabilities.

Amd operates by mimicking an NFS server. An Amd process is identified to the kernel as the ``NFS server'' for a particular mount point. The only NFS calls for which Amd provides an implementation are those that perform name resolution: lookup, readdir, and readlink. Since a file must have its name resolved before it can be used, Amd is assured of receiving control during the first use of any file below an Amd mount point. Amd checks whether the file system mapped to that mount point is currently mounted; if not, Amd mounts it, makes a symbolic link to the mount point, and returns to the kernel. If the file system is already mounted, Amd returns immediately.
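The decision Amd makes on each such name-resolution request can be summarized by the schematic below; every structure and helper function here is a hypothetical stand-in for Amd's internals, not its actual source.

    #include <stddef.h>

    struct amd_fs {
            int   mounted;            /* is the mapped filesystem mounted yet? */
            char *mount_point;        /* where it is (or will be) NFS-mounted */
            char *link_name;          /* the name under the Amd-served directory */
    };

    extern struct amd_fs *amd_map_lookup(const char *name);  /* consult the mount-maps */
    extern int amd_mount(struct amd_fs *fs);                 /* mount it on demand */
    extern int amd_symlink(const char *target, const char *linkname);

    /* Invoked for lookup, readdir, and readlink requests below an Amd
     * mount point.  Returns the name Amd answers with (a symbolic link
     * to the real mount), or NULL if the name is not in the maps. */
    const char *
    amd_resolve(const char *name)
    {
            struct amd_fs *fs = amd_map_lookup(name);

            if (fs == NULL)
                    return NULL;
            if (!fs->mounted) {
                    amd_mount(fs);
                    amd_symlink(fs->mount_point, fs->link_name);
                    fs->mounted = 1;
            }
            return fs->link_name;
    }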

The following example, taken from our environment, illustrates Amd's operation. Suppose /u is designated as the directory in which all user file systems live; Amd services this directory. At startup time, Amd is told that its private mount point (for the NFS filesystems it will mount) is /n. If any of the three name-binding operations mentioned above occurs for any file below /u, then Amd is invoked. Amd consults its maps, which indicate that /u/foo is available on server bar. This file system is then mounted locally at /n/bar/u/foo, and /u/foo is made a symbolic link to /n/bar/u/foo. (Placing the server name in the name of the mount point is purely a configuration decision, and is not essential.)

Our work is not dependent on Amd; we use it for convenience. Amd typically controls the (un)mounting of all file systems on the client machines on which it runs, and there is no advantage to our work in circumventing it and performing our own (un)mounts.

2.3.1 How Our Work Goes Beyond Amd

Amd does not already possess the capabilities we need, nor is our work a simple extension to Amd. Our work adds at least three major capabilities:

1.
Amd keeps a description of where to find to-be-mounted file systems in ``mount-maps.'' These maps are written and maintained by administrators, and they are static in the sense that Amd has no facility for automated, adaptive, unplanned discovery and selection of a replacement file system.

2.
Because it is only a user-level automount daemon, Amd has limited means to monitor the response of rfscall() or any other kernel routine.

Many systems provide a tool, like nfsstat, that returns timing information gathered by the kernel. However, nfsstat is inadequate for our purposes: it is not as accurate as our measurements, and it reports a weighted average response time rather than measured response times. Our method is additionally less sensitive to outliers, and it measures both short-term and long-term performance.

3.
Our mechanism provides for transparently switching open files from one file system to its replacement.

Amd might be considered the more ``natural'' place for our user-level code, since Amd makes similar mount decisions based on its own criteria. Some coding could have been saved, and some speedups gained, had we placed our user-level management code inside Amd. However, we saw two main problems with this approach:

1.
Amd is maintained by different people, and we would have to continually keep Amd and our programs in sync.

2.
Not everyone uses Amd as their automounter; some sites use no automounter at all. By placing our code inside Amd, we would have forced other administrators to run and maintain Amd as well.

