Symmetric parallel file systems coordinate sharing in a back end to present an identical view of storage across multiple front-end nodes. Stateless NFS file servers mesh well with cluster file systems to provide scalable remote access. But NFSv4, with its delegations, locks, and share reservations, requires shared server state to be effectively coordinated as well. Furthermore, realizing the scalability of cluster file systems also demands a solution to the single server bottleneck inherent to client/server architectures.
Peter handed out some additional notes that weren't in the tutorial handouts (because he was slightly late in submitting them to Usenix).
(As an aside, to me, Peter sounds vaguely like Norm MacDonald.)
NFS was implemented everywhere because it was so simple. That simplicity was fixed in NFSv4.
A cluster file system is one that is being provided by multiple redundant servers; every node in the server pool serves the same view.
The CITI lab is more than 20 years old. Because they have full-time staff, they need non-trivial amounts of funding. (Grad students are slave labor, but you actually have to pay staff.)
CITI began working on NFSv4 around the turn of the century, under contract from Sun and another vendor (Peter mentioned the other vendor, but I missed it), who wanted to promote it in their products.
I'm not allowed to code. I program in the English language.
NFSv3 was stateless; the server didn't have to maintain any state about clients. E.g., the server didn't keep track of what files were on what clients (unlike AFS). This was a perfect match for cluster filesystems.
You can shoot yourself in the... well, any body part you like.
Peter doesn't think that GFS is up to speed for building cluster filesystems that span multiple physical locations. Not yet, at any rate.
Don't underestimate the impact of open specifications. Open specs are significant because if you have something really great, and you've locked out your competitors, you can charge based on the value of the system. But if the specification is open, and anyone can build it, then you can only charge based on the cost of building the system. Peter thinks the shift towards open specs (which Sun championed) was as much responsible for the explosion in computing as Moore's Law was.
The promise of remote (distributed) applications hasn't truly been realized yet. Statelessness is actually the enemy of scalability. By maintaining state on the server, the server gains more control over the clients, and can better relinquish control to the clients. Because of the statelessness of NFS, NFS couldn't really get much work done on the clients. Caching is fundamental to good performance. CMU found that over half of all NFS client requests were getattrs (to maintain the consistency of the cache). Moreover, most of those requests were needless, because the cache was actually valid.
Disks used to be full [because they were expensive]. Well, actually, they're still full, but now they're a lot less expensive.
The upgrade from v2 to v3 in the mid 1990s solved two tiny problems. First problem: if a client writes to the server, when does the client have a guarantee of persistence? With v3, the server syncs the file before returning success/failure to the client's close request. (As though clients ever check that, but whatever.)
Vendors told CITI that a completely new protocol wouldn't fly; it had to be NFS. But that wouldn't fly, because the protocol was still under the control of Sun.
No one wanted to start from scratch in the IETF. It had to be an extension of the installed base... I guess for marketing and commercial reasons I can't speak to. I was actually the chair of the IETF committee, and I knew that nothing was going to happen, because everyone wanted the 'new' distributed filesystem to be NFS.
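The getattr observation above can be illustrated with a toy model. This is only a sketch under loud assumptions: the Server and Client classes are hypothetical, the client revalidates on every open (real NFS clients use attribute-cache timeouts and close-to-open rules instead), and modification time is the only attribute compared. It shows why a stateless server forces the client to keep asking whether its cache is still good, and how many of those questions turn out to be unnecessary when nothing has changed.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stateless server: it answers requests and never tracks clients."""
    files: dict = field(default_factory=dict)   # name -> (mtime, data)
    getattr_calls: int = 0
    read_calls: int = 0

    def write(self, name, data, mtime):
        self.files[name] = (mtime, data)

    def getattr(self, name):
        self.getattr_calls += 1
        return self.files[name][0]               # modification time only

    def read(self, name):
        self.read_calls += 1
        return self.files[name]


@dataclass
class Client:
    server: Server
    cache: dict = field(default_factory=dict)   # name -> (mtime, data)
    needless_getattrs: int = 0

    def open_and_read(self, name):
        if name in self.cache:
            # The server keeps no state about us, so we have to ask.
            if self.server.getattr(name) == self.cache[name][0]:
                self.needless_getattrs += 1      # the cache was valid all along
                return self.cache[name][1]
        # Cache miss, or the file changed on the server: fetch it again.
        self.cache[name] = self.server.read(name)
        return self.cache[name][1]


server = Server()
server.write("notes.txt", b"v1", mtime=1)
client = Client(server)
for _ in range(10):
    client.open_and_read("notes.txt")
```

In this run the file never changes, so every getattr after the first read only confirms what the client already had. A delegation-style design removes those round trips by letting the server promise to tell the client when the file changes.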
Compound RPCs are not atomic; requests in a compound RPC may be interleaved by the server with other RPCs.
Mandatory locking is optional in v4, but pretty much all vendors support it. Locks are non-blocking; clients must poll if they don't receive the lock. NFSv4 supports Windows lock semantics, which are included in the open() request and are atomic. The underlying filesystem on the server is only going to see requests for reservation locks from NFSv4. (Well, and Samba servers.)
v4 doesn't implement callbacks for cache invalidation (unlike AFS). Patterns of fileserver usage have been studied, and write-sharing (multiple clients having the same file open for writing) is exceedingly rare. When files are shared, it's virtually always read-sharing. (Serial write-sharing is more common, but that isn't technically sharing.)
A delegation is essentially a revocable file lock. A client that holds a delegation for a file can treat that file as a local file; it doesn't have to communicate with the server. The Linux NFSv4 client and server implementations are extremely aggressive in issuing and using delegations. Delegations make locking very cheap. (That was, in fact, the principal motive for creating delegations.) Delegations are only for files.
One of the most important things NFSv4 has is an extension mechanism. NFSv4.1 (currently in draft) includes directory delegations.
Leases are refreshed automatically on reads and writes. When a server crashes and reboots, there is a critical period (the lease expiration period) during which the server must exercise care, because there might be clients that haven't realized that the server crashed. (The server needs to be in this protective mode for at least the lease expiration period.)
The Sun ONC RPC always had a placeholder for a security layer. (The historic one has been auth-sys/auth-unix, which just passes uids around.) LIPKEY stands for Low Infrastructure Public Key Mechanism.
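To make the non-blocking lock and lease behavior described above concrete, here is a minimal sketch. Everything in it is hypothetical: LockServer, try_lock, and lock_with_polling are illustration names, time is a plain integer passed in by the caller, and the assumption that any request from a client renews its lease is a simplification. It is not the NFSv4 wire protocol, only the shape of it: a denied request returns immediately, the client polls, and a lock held by a client whose lease has lapsed is dropped.

```python
class LockServer:
    """Toy lock server: lock requests never block, they succeed or are denied."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = {}        # file name -> client id
        self.lease_expiry = {}  # client id -> time at which its lease lapses

    def renew(self, client, now):
        # Assumption: any request from a client renews its lease.
        self.lease_expiry[client] = now + self.lease_seconds

    def _expire(self, now):
        # Drop locks held by clients whose lease has lapsed.
        for name, client in list(self.holder.items()):
            if self.lease_expiry.get(client, 0) <= now:
                del self.holder[name]

    def try_lock(self, client, name, now):
        self._expire(now)
        self.renew(client, now)
        if self.holder.get(name, client) != client:
            return False        # denied: the caller has to poll again later
        self.holder[name] = client
        return True

    def unlock(self, client, name):
        if self.holder.get(name) == client:
            del self.holder[name]


def lock_with_polling(server, client, name, start, interval, max_tries):
    """Poll until the lock is granted; return the grant time, or None."""
    for attempt in range(max_tries):
        now = start + attempt * interval
        if server.try_lock(client, name, now):
            return now
    return None


server = LockServer(lease_seconds=10)
server.try_lock("A", "data.db", now=0)          # A holds the lock, then goes silent
granted_at = lock_with_polling(server, "B", "data.db",
                               start=1, interval=3, max_tries=5)
```

Here client B is denied at times 1, 4, and 7 and only gets the lock once A's lease has run out. The same timing is why a rebooted server has to be careful for a full lease period: until then it cannot know which clients still believe they hold state.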
For every file, the unix protection model splits the universe into three groups, and then determines which group you are in. We're all used to this, but it's non-intuitive, and it's not a good model.
Delegations have difficulty scaling based on the number of clients, due to memory consumed by delegations. (In Peter's testing, they were able to crush NFSv4 servers with relative ease.) Maybe the server can issue a generic request to a client to return any unused delegations? This is a very difficult problem to solve, and it really hasn't been solved yet (although CITI is researching it).
With NFSv4 and cluster filesystems, the fundamental strategy is "let the cluster file system do it", because it has to do it anyway.
The RFC suggests that the server should grant locks in request order, but doesn't tell us how to actually do that. These are not helpful suggestions.
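Returning to the Unix protection model mentioned above, a short sketch may show why it is called non-intuitive. The helper names permission_class and may_access are made up for this example, and the superuser override and ACLs are deliberately ignored. The point is that the caller is placed in exactly one class (owner, group, or other) and only that class's bits are consulted.

```python
import stat


def permission_class(file_uid, file_gid, uid, gids):
    """Return which of the three classes the caller falls into."""
    if uid == file_uid:
        return "owner"
    if file_gid in gids:
        return "group"
    return "other"


def may_access(mode, file_uid, file_gid, uid, gids, want):
    """want is a string made of 'r', 'w', 'x'. Root override is ignored here."""
    bits = {
        "owner": {"r": stat.S_IRUSR, "w": stat.S_IWUSR, "x": stat.S_IXUSR},
        "group": {"r": stat.S_IRGRP, "w": stat.S_IWGRP, "x": stat.S_IXGRP},
        "other": {"r": stat.S_IROTH, "w": stat.S_IWOTH, "x": stat.S_IXOTH},
    }[permission_class(file_uid, file_gid, uid, gids)]
    # Only the bits of the caller's own class are checked.
    return all(mode & bits[ch] for ch in want)


# Mode 0o047: owner gets nothing, group gets read, everyone else gets rwx.
owner_can_read = may_access(0o047, 1000, 100, uid=1000, gids={100}, want="r")
stranger_can_write = may_access(0o047, 1000, 100, uid=2000, gids={200}, want="w")
```

With that mode the file's owner cannot read the file, while a complete stranger can write to it, because the check stops at the first class that matches.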
In order to best resolve the locking problems, NFSv4 came up with a new kind of lock called a provisional lock. Peter is worried about the first provisional lock case (that the process can lose interest only through an external signal). Samba just accepts the race condition (page 52). In practice, it never happens, because all access from Windows clients is funneled through Samba.
Over a period of time, the bazaar people began to accept that maybe the Cathedral people (CITI) weren't so bad after all. But there are clearly differences in the point of view.
BreakLease leads to a race condition when run through VFS (page 53).
People have made much progress in lock management. What they're really lacking, though, is the Holy Grail of NFSv4 and cluster filesystems: transparent migration and load balancing. Peter doesn't know whether Solaris (Sun) is even working on this problem.
For vanilla v4, interoperability hasn't really been an issue for several years now. The bake-a-thons that occurred were very effective at rooting out interoperability problems. For cluster filesystems, the real problem isn't so much interoperability as plain old operability. Red Hat has been squawking loudly as of late, because they've found that some of the work CITI has been doing doesn't work for GFS. Some companies like working with CITI because CITI knows how to get things into the Linux kernel.
NFSv4 works very well for WAN access.
pNFS solves the problem of an NFS server being a bottleneck by separating the data and control paths.
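The separation of control and data paths can be sketched roughly as follows. All names here (MetadataServer, DataServer, store_striped, pnfs_read) are hypothetical, the round-robin striping is an assumption chosen for simplicity, and the "layout" is just a list of device indices rather than anything resembling the real pNFS layout types. The idea it illustrates is the one stated above: the client asks the metadata server once where the data lives, then reads from the data servers directly.

```python
class DataServer:
    """Data path: holds stripes and serves reads directly to clients."""

    def __init__(self):
        self.stripes = {}                 # (file name, stripe index) -> bytes
        self.reads = 0

    def read(self, name, index):
        self.reads += 1
        return self.stripes[(name, index)]


class MetadataServer:
    """Control path only: hands out layouts, never serves file data."""

    def __init__(self, data_servers, stripe_size):
        self.data_servers = data_servers
        self.stripe_size = stripe_size
        self.lengths = {}                 # file name -> length in bytes
        self.layout_requests = 0

    def get_layout(self, name):
        self.layout_requests += 1
        count = -(-self.lengths[name] // self.stripe_size)   # ceiling division
        # For each stripe, the id of the device (data server) that holds it.
        return [index % len(self.data_servers) for index in range(count)]


def store_striped(mds, name, data):
    """Demo setup helper: place stripes round-robin across the data servers."""
    mds.lengths[name] = len(data)
    for offset in range(0, len(data), mds.stripe_size):
        index = offset // mds.stripe_size
        device = mds.data_servers[index % len(mds.data_servers)]
        device.stripes[(name, index)] = data[offset:offset + mds.stripe_size]


def pnfs_read(mds, name):
    layout = mds.get_layout(name)                       # one control-path request
    return b"".join(
        mds.data_servers[device].read(name, index)      # data path, direct
        for index, device in enumerate(layout)
    )


servers = [DataServer() for _ in range(3)]
mds = MetadataServer(servers, stripe_size=4)
store_striped(mds, "big.dat", b"abcdefghijklmnopqrstuvwxyz")
contents = pnfs_read(mds, "big.dat")
```

In this toy run the metadata server handles a single request while the seven stripe reads are spread over three data servers, which is the bottleneck relief the design is after.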
Nobody really wants to standardize the storage device names for different vendors, which is why GET DEVICE LIST returns device IDs. But dealing with device IDs is problematic.
CITI hopes to have pNFS in an RFC in 2008.
Pillsbury actually threatened to sue over the name bake-off, because they trademarked the name a long time ago. That's why CITI uses the term bake-a-thon now.
Public key support in NFSv4 is still a work-in-progress.
PKINIT allows you to perform the initial Kerberos ticket exchange via public keys instead of symmetric encryption. (Meaning, you sign a request for a TGT with your private key; the Kerberos server verifies the signature, encrypts the TGT with your public key, and then sends the encrypted blob back to you. But once you have the TGT, all further encryption operations are standard (symmetric) Kerberos operations.)
The next challenge after NFSv4 will be the protocol that tells us how to incorporate NFSv4 into a global namespace.
If there's one thing to take away from this tutorial, it's that NFSv4 is complicated. There's a whole lot of stuff going on with NFSv4, and a whole lot of engineering effort that went into it (and continues to go into it).
Several people asked questions, mostly involving projects they were currently working on.
Since AFS had come up quite a bit throughout the presentation, I told one of the people who had been asking about it of my experiences with it at Pitt. This kicked off a general commiseration among Peter and several other people about the lost potential of AFS (a product that was largely killed by Transarc's marketing incompetence).