Next: 2 Background Up: Fast Indexing: Support for Previous: Fast Indexing: Support for

1 Introduction

Size-changing algorithms (SCAs) are those that take as input a stream of data bits and produce output of a different number of bits. These SCAs share one quality in common: they are generally intended to work on whole streams of input data, from the beginning to the end of the stream. Some of the applications of such algorithms fall into several possible categories:

Compression:

Algorithms that reduce the overall data size to save on storage space or transmission bandwidths over networks. For example, a compression file system that works with NFS [Sandberg, Goldberg, Kleiman, Walsh, and LyonSandberg et al.1985,Pawlowski, Juszczak, Staubach, Smith, Lebel, and HitzPawlowski et al.1994] can reduce the amount of data that is transmitted over the network, thus saving on precious network bandwidth at the expense of additional CPU processing to compress and decompress data. If the CPU is fast enough, such a network-based compression file system can actually perform better than a non-compressing one; this is a common technique in network communications over slow links [Engan, Casner, and BormannEngan et al.1999]. In this category we consider non-lossy compression.

Encoding:

Algorithms that encode the data such that it has a better chance of being transferred, often via email, to their intended recipients. For example, Uuencode is an algorithm that uses only the simplest printable ASCII characters and no more than 72 characters per line. This is to ensure that uuencoded email or binaries can traverse the Internet even if they go through legacy email or networking systems that may not support the full ASCII set of characters or text lines longer than 72 characters.

In this category we also consider transformations to support internationalization of text, as well as Unicoding.

Encryption:

These are algorithms that transform the data so it is more difficult to decode it without an authorization--a decryption key. Encryption algorithms work in different modes. The simplest mode is Cipher Feedback (CFB) mode [SchneierSchneier1995]. This mode does not change the size of the data. As such, it is not the strongest algorithm because it provides potential attackers one more piece of information: the original input data size.

Stronger encryption modes can use Cipher Block Chaining (CBC), a mode that typically increases the size of the output [SchneierSchneier1995]. This algorithm is considered stronger because it does not tell you what the original input data size was. Furthermore, by encoding the input into more output bits, the input data becomes more randomized, increasing the brute-force search space.

Transparent encryption could be particularly useful if combined with a network-based file system such as NFS. In that case, the data that is transmitted over insecure networks is encrypted, making it more difficult to attackers to decode sniffed packets.

There are many useful user-level tools that use SCAs, such as uuencode, compress, and pgp [ZimmermanZimmerman1995]. These tools work on whole files and are often used manually by users. This poses additional inconvenience to users. When you encrypt or decompress a data file, even if you wish to access just a small part of that file, you still have to encode or decode all of it until you reach the portion of interest--an action that consumes many resources. SCAs do not provide information that can help to decode or encode only the portion of interest. Furthermore, running user-level SCA tools repeatedly costs in additional overhead as data must be copied between the user process and the kernel several times. User-level SCA tools are therefore neither transparent to users nor do they perform well.

Instead, it would be useful for a file system to support SCAs. File systems are:

transparent to users since they do not have to run special tools to use files, and
perform well since they often run in the kernel.

File systems have proven to be a useful abstraction for extending system functionality. Several SCAs (often compression) have been implemented as extensions to existing disk-based file systems [AyersAyers1997,Burrows, Jerian, Lampson, and MannBurrows et al.1992,NagarNagar1997]. Their disadvantages are that they only work with specific operating systems and file systems, and they only support those specific SCAs. Supporting general-purpose SCAs on a wide range of platforms was not possible.

Stackable file systems are an effective infrastructure for creating new file system functionality with minimal performance overhead and development cost [Guy, Heidemann, Mak, Page Jr., Popek, and RothmeierGuy et al.1990,Heidemann and PopekHeidemann and Popek1991,RosenthalRosenthal1990,Skinner and WongSkinner and Wong1993,Zadok, Badulescu, and ShenderZadok et al.1999,Zadok and NiehZadok and Nieh2000,ZadokZadok2001]. Stackable file systems can be developed independently and then stacked on top of each other to provide new functionality. Also, they are more portable and are easier to develop [Zadok and NiehZadok and Nieh2000]. For example, an encryption file system can be mounted on top of a native file system to provide secure and transparent data storage [Zadok, Badulescu, and ShenderZadok et al.1998]. Unfortunately, general-purpose SCAs have never been implemented in stackable file systems. The problem we set out to solve was how to support general-purpose SCAs in a way that is easy to use, performs well, and is available for many file systems.

We propose fast indexing as a solution for supporting SCAs in stackable file systems. Fast indexing provide a way to map file offsets between upper and lower layers in stackable file systems. Since the fast indexing is just a mapping, a lower-layer file system does not have to know anything about the details of the SCA used by an upper-level file system. We store this fast indexing information in index files. Each encoded file has a corresponding index file which is simply stored in a separate file in the lower-layer file system. The index file is much smaller than the original data file, resulting in negligible storage requirements. The index file is designed to be recoverable if it is somehow lost so that it does not compromise the reliability of the file system. Finally, fast indexing is designed to deliver good file system performance with low stacking overhead, especially for common file operations.

We have implemented fast indexing using stackable templates [Zadok, Badulescu, and ShenderZadok et al.1999,Zadok and NiehZadok and Nieh2000,ZadokZadok2001]. This allows us to provide transparent support for SCAs in a portable way. To demonstrate the effectiveness of our approach, we built and tested several size-changing file systems, including a compression file system. Our performance results show the following two points:

That fast index files have low overhead for typical file system workloads, only 2.3% over other null-layer stackable file systems.
That such file systems can deliver as much as five times better performance than user-level SCA applications.

This paper describes fast index files and is organized as follows. Section 2 reviews the stacking file-system infrastructure used for this work and discusses related work in SCA support in file systems. Section 3 details the design of the index file. Section 4 describes the usage of the index file in relation to common file operations and discusses several optimizations. Section 5 details our design for a consistent and recoverable index file. Section 6 summarizes important implementation issues. Section 7 describes the file systems we built using this work and evaluates our system. Finally, we present conclusions and discuss directions for future work.

Next: 2 Background Up: Fast Indexing: Support for Previous: Fast Indexing: Support for

Erez Zadok 2001-10-06