Next: 5. Evaluation Up: Discovery and Hot Replacement Previous: 3. Design

4. Implementation

4.1 RLP

We implemented only the parts of the Resource Location Protocol which we needed:

RLP_MSG_WHO_ANYWHERE_PROVIDES
RLP_MSG_DOES_ANYONE_PROVIDE
RLP_MSG_THEY_PROVIDE

See Section 2.2 for an introduction to RLP.

We use RLP in its ``miscellaneous'' message format, which lets the client send an arbitrary byte-stream that is meaningful only to a server that can decode it. The fields in this arbitrary byte-stream are as follows:

1.: msg_size: the total size of this message.
2.: hostname: the hostname of the server whose filesystem is now considered bad and which we are trying to replace. If the bad server itself receives this RLP request, it is forbidden from answering, so that it cannot be picked as its own replacement.
3.: filesystem: the name of the filesystem for which a replacement is sought. The name of the filesystem (for example /usr/gnu) is insufficient to describe everything that is in that filesystem. That is why we also relied on comparing individual files to determine their existence and equivalence. See Section 3.4.2.
4.: flags: arbitrary flags describing the replaced filesystem. This field was not in use in our work, and is there for future expansion. One possible use for it was to tell the remote resource server that the replaced filesystem needs to be mounted with special privileges (say the anon=0 option). A remote server might then decide not to reply to such a client for security reasons.
5.: architecture: the host architecture of the replaced filesystem, used to advise the client whether the server is likely to supply the same type of files it needs. The server reports back whether its architecture is identical to that of the client. The client may then choose to use or decline that server.
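
The five fields above can be sketched as a C structure. This is only an illustration: the text does not give field widths, byte order, or the wire encoding, so the fixed-width strings, the RLP_NAME_MAX constant, the structure name, and the helper functions below are all assumptions, not our actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RLP_NAME_MAX 64  /* assumed fixed field width; not from the text */

/* Hypothetical in-memory layout of the opaque RLP payload. */
struct rlp_fs_request {
    uint32_t msg_size;                    /* total size of this message */
    char     hostname[RLP_NAME_MAX];      /* server considered bad */
    char     filesystem[RLP_NAME_MAX];    /* e.g. "/usr/gnu" */
    uint32_t flags;                       /* unused, reserved for expansion */
    char     architecture[RLP_NAME_MAX];  /* architecture the client needs */
};

/* Fill in a request. Returns 0 on success, -1 if a string does not fit. */
int rlp_request_init(struct rlp_fs_request *req, const char *bad_host,
                     const char *fs, const char *arch)
{
    assert(req != NULL);
    if (strlen(bad_host) >= RLP_NAME_MAX || strlen(fs) >= RLP_NAME_MAX ||
        strlen(arch) >= RLP_NAME_MAX)
        return -1;
    memset(req, 0, sizeof *req);
    req->msg_size = (uint32_t)sizeof *req;
    strcpy(req->hostname, bad_host);
    strcpy(req->filesystem, fs);
    strcpy(req->architecture, arch);
    return 0;
}

/* Server side: the bad host must not offer itself as a replacement. */
int rlp_may_answer(const struct rlp_fs_request *req, const char *my_hostname)
{
    return strcmp(req->hostname, my_hostname) != 0;
}

/* Server side: report whether our architecture matches the request. */
int rlp_same_arch(const struct rlp_fs_request *req, const char *my_arch)
{
    return strcmp(req->architecture, my_arch) == 0;
}
```

A server would call rlp_may_answer() before replying at all, and put the result of rlp_same_arch() in its reply so the client can accept or decline it.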

4.2 Management and Control Facilities

We considered several ways to query and control various features of our system. One mechanism which would not have required any kernel modifications was to use the kernel memory access mechanism, /dev/kmem, and write a program that ``walks'' the different structures in the kernel, retrieving information as needed and very carefully modifying others. There were several problems with this approach:

1.: Security: it is better for the kernel to protect itself against malicious processes rather than rely on user-level programs not corrupting vital kernel data. A program that can read kernel memory might be abused or misused into reading or modifying parts of the kernel for which it was not designed.
2.: Atomicity: some of the management and control operations we allow must run to completion while no other process can access the data being modified (i.e., under an exclusive lock). Otherwise kernel data would be left in a corrupt state. The kernel provides facilities that make system calls more atomic, and it is much easier to lock out parts of our code at the appropriate spl level while that data is being modified.^4.1
3.: Portability: some operating systems such as CMU's ``UX'' server for Mach-3 don't have a kmem interface, making any future port of our system more difficult. Eventually a system-call mechanism would have had to be used.

Accordingly we decided to add a new system call to the kernel. A system call is considered a far ``cleaner'' mechanism than kmem. (However, it does require access to kernel source.)

Our system call is designed with an ``object oriented'' programming style in mind. The calling convention of nfsmgr_ctrl() takes three arguments:

1.: obj: is the code for the object we want to manipulate or query.
2.: cmd: is the command code we want to apply to the object.
3.: args: is a pointer to a control structure which contains the necessary information which needs to be passed to the kernel. It also provides allocated space in the user-space for operations which need to return data back to the user process.

Following are the various ``objects'' which could be manipulated using the system call, and the commands which could be applied to them:

1. NMO_NONE: no (N)FS (M)anager (O)bject needs to be manipulated. This one exists mostly for trapping errors, and serves as the ``null'' call.

2. NMO_DFT: manage the Duplicate File Table. The allowed operations are:

NMC_READ: this (N)FS (M)anager (C)ommand will read the full contents of the DFT into user space. A client we wrote using this system call displays the full DFT in tabular form.
NMC_WRITE: this command will overwrite arbitrary entries in the DFT.
NMC_ADD: add entries to the DFT.
NMC_DEL: remove entries from the DFT, given a pathname.
NMC_CLEAR: clear a single entry from the DFT, given a table index.
NMC_RESET: clear all the entries from the DFT.

3. NMO_RFSI: manage the RFSI, the Replacement File System Information. To facilitate the ability to query a filesystem and find a replacement for it, each vfs contains pointers to the filesystem it replaces and to the one that replaces it -- the fields vfs_replaces and vfs_replaced_by in the vfs structure; see Figure 2.1. The operations allowed on the RFSI are identical to those for NMO_DFT.

4. NMO_DFT_SIZE: Size of the DFT. Allowed operations are:

NMC_READ: return the current number of entries in the DFT.
NMC_GETMAX: return the size of the allocated DFT. This is the maximum number of entries that could fit in the DFT.
NMC_SETMAX: set the maximum size of the DFT. This could be used to extend or truncate (via kernel ``realloc'' routines) the length of the DFT.

5. NMO_RFSI_SIZE: Size of the RFSI. The only allowed operation is NMC_READ which returns the current size of the RFSI. The maximum size of the RFSI cannot be controlled externally. It is equal to the number of vfs structures that exist in the kernel, and grows or shrinks with them.

6. NMO_MGMT: NFS management flag per filesystem. The only three operations allowed here are NMC_READ, NMC_SETON, and NMC_SETOFF. They return the current value of this flag, set it to on, or turn it off, respectively.

7. NMO_MEDIAN_SET: Median set is the long-term queue of measured round-trip times of the nfs_lookup operations. Allowed operations are:

NMC_READ: read the current median value of the long queue.
NMC_GETMIN: read the current number of medians stored in the queue. The queue gets reset each time a replacement is made, and no new replacements are made until the queue is full again. This tells us how many data points we have already accumulated.
NMC_GETMAX: return the value of the most recent median entered, the one at the very top of the queue.

8. NMO_MEDIAN_SUBSET: Median subset is the short-term queue of measured round-trip times. Allowed operations are identical to those allowed on the full-size median queue (NMO_MEDIAN_SET).

9. NMO_TRIGGER_RATIO: Trigger ratio between the median of the short-term queue and the long-term one. Allowed operations are NMC_READ for checking the current ratio, and NMC_WRITE for changing it.

10. NMO_SWITCH_NOW: Forced switching flag. This controls a bit in the vfs structure's field vfs_nfsmgr_flags, as described in Section 3.4.1.1. The only three operations allowed here are NMC_READ, NMC_SETON, and NMC_SETOFF. They return the current value of this flag, set it to on, or turn it off, respectively. Setting this flag to on will force a switching of this filesystem the next time an NFS lookup operation is performed on it.
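
The object/command pairs listed above can be summarized as a small validity table. The sketch below is a user-space illustration only: the numeric values of the NMO_ and NMC_ codes, the NMO_COUNT and NMC_COUNT sentinels, the NM_BIT macro, and the function nfsmgr_cmd_allowed() are assumptions made for the example, and we assume here that NMO_NONE accepts no commands.

```c
#include <assert.h>

/* Object codes, in the order listed above (numeric values assumed). */
enum nm_obj {
    NMO_NONE, NMO_DFT, NMO_RFSI, NMO_DFT_SIZE, NMO_RFSI_SIZE, NMO_MGMT,
    NMO_MEDIAN_SET, NMO_MEDIAN_SUBSET, NMO_TRIGGER_RATIO, NMO_SWITCH_NOW,
    NMO_COUNT  /* hypothetical sentinel */
};

/* Command codes (numeric values assumed). */
enum nm_cmd {
    NMC_READ, NMC_WRITE, NMC_ADD, NMC_DEL, NMC_CLEAR, NMC_RESET,
    NMC_GETMIN, NMC_GETMAX, NMC_SETMAX, NMC_SETON, NMC_SETOFF,
    NMC_COUNT  /* hypothetical sentinel */
};

#define NM_BIT(c) (1u << (c))

/* Table operations shared by the DFT and the RFSI. */
#define NM_TABLE_OPS (NM_BIT(NMC_READ) | NM_BIT(NMC_WRITE) | NM_BIT(NMC_ADD) | \
                      NM_BIT(NMC_DEL) | NM_BIT(NMC_CLEAR) | NM_BIT(NMC_RESET))
/* Operations shared by the two on/off flags. */
#define NM_FLAG_OPS  (NM_BIT(NMC_READ) | NM_BIT(NMC_SETON) | NM_BIT(NMC_SETOFF))
/* Operations shared by the two median queues. */
#define NM_QUEUE_OPS (NM_BIT(NMC_READ) | NM_BIT(NMC_GETMIN) | NM_BIT(NMC_GETMAX))

/* One bitmask of permitted commands per object. */
static const unsigned nm_allowed[NMO_COUNT] = {
    [NMO_NONE]          = 0,  /* assumed: the null call takes no command */
    [NMO_DFT]           = NM_TABLE_OPS,
    [NMO_RFSI]          = NM_TABLE_OPS,
    [NMO_DFT_SIZE]      = NM_BIT(NMC_READ) | NM_BIT(NMC_GETMAX) | NM_BIT(NMC_SETMAX),
    [NMO_RFSI_SIZE]     = NM_BIT(NMC_READ),
    [NMO_MGMT]          = NM_FLAG_OPS,
    [NMO_MEDIAN_SET]    = NM_QUEUE_OPS,
    [NMO_MEDIAN_SUBSET] = NM_QUEUE_OPS,
    [NMO_TRIGGER_RATIO] = NM_BIT(NMC_READ) | NM_BIT(NMC_WRITE),
    [NMO_SWITCH_NOW]    = NM_FLAG_OPS,
};

/* Return 1 if cmd may be applied to obj, 0 otherwise. */
int nfsmgr_cmd_allowed(int obj, int cmd)
{
    assert(NMC_COUNT <= 32);  /* every command must fit in the bitmask */
    if (obj < 0 || obj >= NMO_COUNT || cmd < 0 || cmd >= NMC_COUNT)
        return 0;
    return (int)((nm_allowed[obj] >> cmd) & 1u);
}
```

A table of this kind lets a system call reject an invalid pair, such as NMC_SETMAX on NMO_RFSI_SIZE, before it touches any kernel data.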

The last argument of nfsmgr_ctrl() is used not just to return values to the user, but also to pass the kernel whatever information it requires. Many of the operations listed above, such as the DFT operations, are specific to a particular filesystem. For these operations, the name of the filesystem (its mount point) must be passed to the kernel. The system call searches for the vfs whose vfs_mnt_path field matches that name and, if one is found, applies the requested operation to the DFT of that vfs.
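
The mount-point lookup described above might look roughly as follows. This is a user-space sketch, not kernel code: struct vfs is reduced to a stand-in holding only vfs_mnt_path and a link pointer, and the layout of struct nfsmgr_args and the helper nfsmgr_find_vfs() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's vfs structure. */
struct vfs {
    const char *vfs_mnt_path;  /* mount point, as named in the text */
    struct vfs *vfs_next;      /* next mounted filesystem in the list */
};

/* Hypothetical control structure passed as the third argument. */
struct nfsmgr_args {
    const char *mnt_path;  /* in: mount point of the target filesystem */
    void       *buf;       /* out: user-space buffer for returned data */
    size_t      buflen;    /* size of buf in bytes */
};

/* Return the vfs mounted at args->mnt_path, or NULL if there is none. */
struct vfs *nfsmgr_find_vfs(struct vfs *head, const struct nfsmgr_args *args)
{
    struct vfs *v;

    assert(args != NULL);
    if (args->mnt_path == NULL)
        return NULL;
    for (v = head; v != NULL; v = v->vfs_next) {
        if (v->vfs_mnt_path != NULL &&
            strcmp(v->vfs_mnt_path, args->mnt_path) == 0)
            return v;
    }
    return NULL;
}
```

An operation that is specific to one filesystem would fail with an error when this lookup returns NULL, rather than guess which vfs was meant.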

4.3 Debugging Facilities

The nfsmgr_debug() system call is very simple. It passes an integer to the kernel and returns a status. The integer is a bit-mask for turning on debugging (mostly using printf()s) in various parts of our code. Of course, before this mechanism could work, we had to wrap the debugging code with the appropriate bit-mask tests.
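
The bit-mask wrapping can be illustrated in user space as follows. The category names, their bit values, the NFSMGR_DPRINTF macro, and the functions nfsmgr_debug_set() and nfsmgr_debug_enabled() are all invented for this sketch; the text does not say which bits our code actually defines.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug categories, one bit each. */
#define NFSMGR_DBG_RLP    0x01  /* RLP message handling */
#define NFSMGR_DBG_DFT    0x02  /* Duplicate File Table operations */
#define NFSMGR_DBG_SWITCH 0x04  /* filesystem switching */

static int nfsmgr_debug_mask = 0;  /* all debugging off by default */

/* Stand-in for nfsmgr_debug(): install a new mask and return a status. */
int nfsmgr_debug_set(int mask)
{
    nfsmgr_debug_mask = mask;
    return 0;
}

/* Return 1 if the given single category bit is currently enabled. */
int nfsmgr_debug_enabled(int bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0);  /* exactly one bit set */
    return (nfsmgr_debug_mask & bit) != 0;
}

/* Print only when the category bit is set in the current mask. */
#define NFSMGR_DPRINTF(bit, ...) \
    do { \
        if (nfsmgr_debug_enabled(bit)) \
            fprintf(stderr, __VA_ARGS__); \
    } while (0)
```

With the mask set to NFSMGR_DBG_DFT | NFSMGR_DBG_SWITCH, a call such as NFSMGR_DPRINTF(NFSMGR_DBG_RLP, "got reply\n") prints nothing, while the same call with NFSMGR_DBG_DFT does print.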

Erez Zadok
1999-02-17