Next: 5. Evaluation
Up: Discovery and Hot Replacement
Previous: 3. Design
4. Implementation
4.1 RLP
We implemented only the parts of the Resource Location Protocol which we
needed:
- RLP_MSG_WHO_ANYWHERE_PROVIDES
- RLP_MSG_DOES_ANYONE_PROVIDE
- RLP_MSG_THEY_PROVIDE
See Section 2.2 for an introduction to RLP.
We use RLP in its ``miscellaneous'' message format, which lets the client
send an arbitrary byte-stream that is meaningful only to a server able to
decode it. The fields carried in that byte-stream are as follows:
1. msg_size: the total size of this message.
2. hostname: the hostname of the server whose filesystem is now considered
   bad and which we are trying to replace. If the bad server itself receives
   this RLP request, it is forbidden from answering, so that the same bad
   host is never picked as its own replacement.
3. filesystem: the name of the filesystem for which a replacement is
   sought. The name of the filesystem (for example /usr/gnu) is insufficient
   to describe everything that is in that filesystem. That is why we also
   rely on comparing individual files to determine their existence and
   equivalence. See Section 3.4.2.
4. flags: arbitrary flags describing the replaced filesystem. This field
   was not used in our work and is there for future expansion. One possible
   use is to tell the remote resource server that the replaced filesystem
   needs to be mounted with special privileges (say the anon=0 option). A
   remote server might then decide not to reply to such a client for
   security reasons.
5. architecture: the host architecture of the replaced filesystem, used to
   advise the client whether the server is likely to supply the same type of
   files it needs. The server reports back whether its architecture is
   identical to that of the client. The client may then choose to use or
   decline that server.
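
The field list above might be laid out as in the following C sketch. It is
illustrative only: the struct name, the fixed buffer sizes, and the helpers
rlp_fill_request(), rlp_may_answer(), and rlp_same_arch() are assumptions,
since the text does not give the wire encoding or the field widths. Hostnames
are compared byte for byte here, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer sizes; the source does not specify field widths. */
#define RLP_HOSTNAME_MAX 64
#define RLP_FSNAME_MAX   1024
#define RLP_ARCH_MAX     16

/* Illustrative layout of the byte-stream carried in the RLP request. */
struct rlp_fs_request {
    uint32_t msg_size;                    /* total size of this message */
    char     hostname[RLP_HOSTNAME_MAX];  /* host now considered bad */
    char     filesystem[RLP_FSNAME_MAX];  /* e.g. "/usr/gnu" */
    uint32_t flags;                       /* unused, reserved for expansion */
    char     architecture[RLP_ARCH_MAX];  /* architecture of the replaced fs */
};

/* Fill a request. Returns 0 on success, -1 if a string does not fit. */
int rlp_fill_request(struct rlp_fs_request *req, const char *bad_host,
                     const char *fs, const char *arch)
{
    assert(req != NULL && bad_host != NULL && fs != NULL && arch != NULL);
    if (strlen(bad_host) >= sizeof req->hostname ||
        strlen(fs) >= sizeof req->filesystem ||
        strlen(arch) >= sizeof req->architecture)
        return -1;

    memset(req, 0, sizeof *req);
    req->msg_size = (uint32_t)sizeof *req;
    strcpy(req->hostname, bad_host);
    strcpy(req->filesystem, fs);
    strcpy(req->architecture, arch);
    return 0;
}

/* A server must not offer itself as the replacement for itself. */
int rlp_may_answer(const struct rlp_fs_request *req, const char *my_hostname)
{
    return strcmp(req->hostname, my_hostname) != 0;
}

/* Server-side report: nonzero if the architectures are identical. */
int rlp_same_arch(const struct rlp_fs_request *req, const char *my_arch)
{
    return strcmp(req->architecture, my_arch) == 0;
}
```

A server receiving the request would first call rlp_may_answer() with its own
hostname and stay silent on a zero result. Otherwise it would report the
result of rlp_same_arch() back to the client, which makes the final choice.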
4.2 Management and Control Facilities
We considered several ways to query and control various features of our
system. One mechanism which might not have required any kernel
modifications was to use the kernel memory access mechanism, /dev/kmem, and
write a program that ``walks'' the different structures in the kernel,
retrieving information as needed and very carefully modifying others. There
were several problems with this approach:
1. Security: it is better for the kernel to protect itself against
   malicious processes than to rely on user-level programs not corrupting
   vital kernel data. A program that can read kernel memory might be abused
   or misused into reading or modifying parts of the kernel for which it was
   not designed.
2. Atomicity: some of the management and control operations we allow
   require that the operation run to completion and that no other process
   access the data being modified (i.e., an exclusive lock). Otherwise
   kernel data would be left in a corrupt state. The kernel provides
   facilities that make system calls more atomic, and it is a lot easier to
   lock out certain parts of our code while they are being modified by
   raising the appropriate spl level.
3. Portability: some operating systems, such as CMU's ``UX'' server for
   Mach-3, don't have a kmem interface, making any future port of our system
   more difficult. Eventually a system-call mechanism would have had to be
   used.
Accordingly we decided to add a new system call to the kernel. A system
call is considered a far ``cleaner'' mechanism than kmem. (However, it
does require access to kernel source.)
Our system call is designed with an ``object oriented'' programming style
in mind. The calling convention of nfsmgr_ctrl() takes three arguments:
1. obj: the code for the object we want to manipulate or query.
2. cmd: the command code we want to apply to the object.
3. args: a pointer to a control structure which contains the information
   that needs to be passed to the kernel. It also provides allocated space
   in user space for operations which need to return data to the user
   process.
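
A minimal sketch of this calling convention follows. It is a user-space
stand-in rather than the kernel code: the argument types, the numeric values
of the codes, the layout of struct nfsmgr_args, and the errno-style return
values are all assumptions, since the text names only the three arguments and
the symbolic codes. Only the null object and one flag object are modelled.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical numeric codes; the source names the symbols, not the values. */
enum { NMO_NONE = 0, NMO_MGMT = 1 };
enum { NMC_READ = 0, NMC_SETON = 1, NMC_SETOFF = 2 };

/* Hypothetical control structure passed as the third argument. */
struct nfsmgr_args {
    const char *mnt_path;  /* filesystem (mount point) the call applies to */
    void       *buf;       /* user-space buffer for returned data */
    size_t      buf_len;   /* size of buf in bytes */
    size_t      ret_len;   /* bytes actually written to buf */
};

/* Stands in for per-filesystem kernel state in this mock. */
static int mock_mgmt_flag;

/* User-space stand-in for the system call: 0 on success, else an errno. */
int nfsmgr_ctrl(int obj, int cmd, struct nfsmgr_args *args)
{
    if (obj == NMO_NONE)
        return 0;                      /* assumed behavior of the null call */
    if (obj != NMO_MGMT || args == NULL)
        return EINVAL;

    switch (cmd) {
    case NMC_SETON:
        mock_mgmt_flag = 1;
        return 0;
    case NMC_SETOFF:
        mock_mgmt_flag = 0;
        return 0;
    case NMC_READ:
        /* Data flows back through the space the caller allocated. */
        if (args->buf == NULL || args->buf_len < sizeof mock_mgmt_flag)
            return EINVAL;
        memcpy(args->buf, &mock_mgmt_flag, sizeof mock_mgmt_flag);
        args->ret_len = sizeof mock_mgmt_flag;
        return 0;
    default:
        return EINVAL;
    }
}
```

A caller would point buf at an int, set buf_len to its size, and issue
nfsmgr_ctrl(NMO_MGMT, NMC_READ, &args) to fetch the flag. The same structure
carries the mount point into the call and the result back out.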
Following are the various ``objects'' which could be manipulated using the
system call, and the commands which could be applied to them:
1. NMO_NONE: no (N)FS (M)anager (O)bject needs to be manipulated. This one
   exists mostly for trapping errors, and serves as the ``null'' call.
2. NMO_DFT: manage the Duplicate File Table. The allowed operations are:
   - NMC_READ: this (N)FS (M)anager (C)ommand will read the full contents
     of the DFT into user space. A client we wrote using this system call
     displays the full DFT in tabular form.
   - NMC_WRITE: this command will overwrite arbitrary entries in the DFT.
   - NMC_ADD: add entries to the DFT.
   - NMC_DEL: remove entries from the DFT, given a pathname.
   - NMC_CLEAR: clear a single entry from the DFT, given a table index.
   - NMC_RESET: clear all the entries from the DFT.
3. NMO_RFSI: manage the RFSI, the Replacement File System Information. To
   make it possible to query a filesystem and find a replacement for it,
   each vfs contains pointers to the filesystem it replaces and to the one
   that replaces it (the fields vfs_replaces and vfs_replaced_by in the vfs
   structure; see Figure 2.1). The operations allowed on the RFSI are
   identical to those for NMO_DFT.
4. NMO_DFT_SIZE: size of the DFT. Allowed operations are:
   - NMC_READ: return the current number of entries in the DFT.
   - NMC_GETMAX: return the size of the allocated DFT. This is the maximum
     number of entries that could fit in the DFT.
   - NMC_SETMAX: set the maximum size of the DFT. This could be used to
     extend or truncate (via kernel ``realloc'' routines) the length of the
     DFT.
5. NMO_RFSI_SIZE: size of the RFSI. The only allowed operation is NMC_READ,
   which returns the current size of the RFSI. The maximum size of the RFSI
   cannot be controlled externally. It is equal to the number of vfs
   structures that exist in the kernel, and grows or shrinks with them.
6. NMO_MGMT: NFS management flag per filesystem. The only three operations
   allowed here are NMC_READ, NMC_SETON, and NMC_SETOFF. They return the
   current value of this flag, set it to on, or turn it off, respectively.
7. NMO_MEDIAN_SET: the median set is the long-term queue of measured
   round-trip times of the nfs_lookup operations. Allowed operations are:
   - NMC_READ: read the current median value of the long queue.
   - NMC_GETMIN: read the current number of medians stored in the queue.
     The queue gets reset each time a replacement is made, and no new
     replacements are made until the queue is full again. This tells us how
     many data points we have already accumulated.
   - NMC_GETMAX: return the value of the most recent median entered, the
     one at the very top of the queue.
8. NMO_MEDIAN_SUBSET: the median subset is the short-term queue of measured
   round-trip times. Allowed operations are identical to those allowed on
   the full-size median queue (NMO_MEDIAN_SET).
9. NMO_TRIGGER_RATIO: trigger ratio between the median of the short-term
   queue and that of the long-term one. Allowed operations are NMC_READ for
   checking the current ratio, and NMC_WRITE for changing it.
10. NMO_SWITCH_NOW: forced switching flag. This controls a bit in the vfs
    structure's field vfs_nfsmgr_flags, as described in Section 3.4.1.1. The
    only three operations allowed here are NMC_READ, NMC_SETON, and
    NMC_SETOFF. They return the current value of this flag, set it on, or
    turn it off, respectively. Setting this flag to on will force a
    switching of this filesystem the next time an NFS lookup operation is
    performed on it.
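
The pairing of objects and commands listed above can be summarized as a
lookup table. The sketch below is one way to encode it, not the kernel's
actual code: the enum values, the bit-mask representation, and the helper
nfsmgr_cmd_allowed() are assumptions. The table contents follow the list
above, with NMO_NONE accepting no commands.

```c
#include <stddef.h>

/* Object and command names are from the text; numeric values are assumed. */
enum nfsmgr_obj {
    NMO_NONE, NMO_DFT, NMO_RFSI, NMO_DFT_SIZE, NMO_RFSI_SIZE, NMO_MGMT,
    NMO_MEDIAN_SET, NMO_MEDIAN_SUBSET, NMO_TRIGGER_RATIO, NMO_SWITCH_NOW,
    NMO_COUNT
};

enum nfsmgr_cmd {
    NMC_READ, NMC_WRITE, NMC_ADD, NMC_DEL, NMC_CLEAR, NMC_RESET,
    NMC_GETMAX, NMC_SETMAX, NMC_GETMIN, NMC_SETON, NMC_SETOFF,
    NMC_COUNT
};

#define BIT(c) (1u << (c))

/* Commands accepted by the table-like objects (DFT and RFSI). */
#define TABLE_CMDS  (BIT(NMC_READ) | BIT(NMC_WRITE) | BIT(NMC_ADD) | \
                     BIT(NMC_DEL) | BIT(NMC_CLEAR) | BIT(NMC_RESET))
/* Commands accepted by the two median queues. */
#define MEDIAN_CMDS (BIT(NMC_READ) | BIT(NMC_GETMIN) | BIT(NMC_GETMAX))
/* Commands accepted by the on/off flags. */
#define FLAG_CMDS   (BIT(NMC_READ) | BIT(NMC_SETON) | BIT(NMC_SETOFF))

/* One bit mask of permitted commands per object. */
static const unsigned nfsmgr_allowed[NMO_COUNT] = {
    [NMO_NONE]          = 0,
    [NMO_DFT]           = TABLE_CMDS,
    [NMO_RFSI]          = TABLE_CMDS,
    [NMO_DFT_SIZE]      = BIT(NMC_READ) | BIT(NMC_GETMAX) | BIT(NMC_SETMAX),
    [NMO_RFSI_SIZE]     = BIT(NMC_READ),
    [NMO_MGMT]          = FLAG_CMDS,
    [NMO_MEDIAN_SET]    = MEDIAN_CMDS,
    [NMO_MEDIAN_SUBSET] = MEDIAN_CMDS,
    [NMO_TRIGGER_RATIO] = BIT(NMC_READ) | BIT(NMC_WRITE),
    [NMO_SWITCH_NOW]    = FLAG_CMDS,
};

/* Returns 1 if cmd may be applied to obj, 0 otherwise. */
int nfsmgr_cmd_allowed(int obj, int cmd)
{
    if (obj < 0 || obj >= NMO_COUNT || cmd < 0 || cmd >= NMC_COUNT)
        return 0;
    return (nfsmgr_allowed[obj] & BIT(cmd)) != 0;
}
```

A dispatcher could call nfsmgr_cmd_allowed() before acting and reject any
pair outside the table, which keeps the validity rules in one place.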
The last argument of nfsmgr_ctrl() is used not just to return values to
the user, but also to pass the kernel whatever information it requires.
Many of the operations listed above, such as DFT operations, are specific to
a particular filesystem. For these operations, the name of the filesystem
(mount point) must be passed to the kernel. The system call searches for the
vfs with the same name in its vfs_mnt_path field and, if one is found,
applies the requested operation to the DFT of that vfs.
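
The mount-point lookup described above amounts to a linear search over the
kernel's vfs structures. The sketch below shows the idea with a reduced
stand-in structure. Only vfs_mnt_path is taken from the text; the linked-list
field vfs_next, the vfs_dft pointer, and the function nfsmgr_find_vfs() are
assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for the kernel's vfs structure. */
struct vfs {
    struct vfs *vfs_next;      /* assumed: next mounted filesystem */
    const char *vfs_mnt_path;  /* mount point, as named in the text */
    void       *vfs_dft;       /* assumed: this filesystem's DFT */
};

/* Return the vfs mounted at mnt_path, or NULL if there is none. */
struct vfs *nfsmgr_find_vfs(struct vfs *head, const char *mnt_path)
{
    struct vfs *vp;

    if (mnt_path == NULL)
        return NULL;
    for (vp = head; vp != NULL; vp = vp->vfs_next) {
        if (vp->vfs_mnt_path != NULL &&
            strcmp(vp->vfs_mnt_path, mnt_path) == 0)
            return vp;
    }
    return NULL;
}
```

When the search returns NULL, a system call built this way would report an
error to the caller instead of touching any table.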
4.3 Debugging Facilities
The nfsmgr_debug() system call is very simple. It passes an integer
to the kernel, and returns a status back. The integer is a bit-mask for
turning on debugging (mostly using printf()s) at various parts of
our code. Of course, before this mechanism could work, we had to wrap the
relevant parts of the debugging code with the appropriate bit-mask tests.
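
The bit-mask scheme can be sketched as follows. This is a user-space
illustration under stated assumptions: the category names and values, the
NFSMGR_DPRINTF macro, and the helper nfsmgr_debug_enabled() are hypothetical,
and nfsmgr_debug() here is a stand-in that only stores the mask and returns a
status of 0.

```c
#include <stdio.h>

/* Hypothetical debug categories; the text does not list the actual bits. */
#define NFSMGR_DBG_RLP     0x01u
#define NFSMGR_DBG_DFT     0x02u
#define NFSMGR_DBG_SWITCH  0x04u

/* Current debug mask; in the real system this would live in the kernel. */
static unsigned nfsmgr_debug_mask;

/* Stand-in for the system call: install a new mask and return a status. */
int nfsmgr_debug(int mask)
{
    nfsmgr_debug_mask = (unsigned)mask;
    return 0;
}

/* Returns 1 if debugging is turned on for the given category bit. */
int nfsmgr_debug_enabled(unsigned bit)
{
    return (nfsmgr_debug_mask & bit) != 0;
}

/* Wrap each debugging printf() in a test of the matching bit. */
#define NFSMGR_DPRINTF(bit, ...) \
    do { \
        if (nfsmgr_debug_mask & (bit)) \
            printf(__VA_ARGS__); \
    } while (0)
```

With this wrapper, a call such as NFSMGR_DPRINTF(NFSMGR_DBG_DFT, "dft: %d
entries\n", n) prints only while the DFT bit is set, and passing 0 to
nfsmgr_debug() silences all categories.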
Erez Zadok
1999-02-17