James' USENIX 2007 notes: tutorial M10

Next Generation Storage Networking
Jacob Farmer, Cambridge Computer Services

There has been tremendous innovation in the data storage industry over the past few years. Proprietary, monolithic SAN and NAS solutions are beginning to give way to open-system solutions and distributed architectures. Traditional storage interfaces such as parallel SCSI and Fibre Channel are being challenged by iSCSI (SCSI over TCP/IP), SATA (serial ATA), SAS (serial attached SCSI), and even Infiniband. New filesystem designs and alternatives to NFS and CIFS are enabling high-performance filesharing measured in gigabytes (yes, bytes, not bits) per second. New spindle management techniques are enabling higher-performance and lower-cost disk storage. Meanwhile, a whole new set of efficiency technologies are allowing storage protocols to flow over the WAN with unprecedented performance. This tutorial is a survey of the latest storage networking technologies, with commentary on where and when these technologies are most suitably deployed.
intro

afternoon session 1


When trying to figure out how fast your storage array is, you must
account for I/O processing.

Jacob oscillates on whether he thinks NAS is a good idea or not.  His
general rule of thumb is that if you have a non-trivial file serving
challenge, and someone wraps up everything you need and abstracts it
in a nice shiny device, then go for it.

NAS set backups back 10 years; NDMP moved them forward 5 years.

SNIA is trying to be vendor-neutral, but this can result in a
least-common-denominator approach.  Jacob refers to this as
vendor-neuter.

Fibre Channel is just SCSI in a star topology instead of a bus
topology.

Storage virtualization is simply redirecting virtual addresses to
physical addresses.  Arguably, almost all storage is virtualized
(e.g., RAID virtualizes).  So, Jacob reserves the term
"virtualized storage" for techniques that are new, novel,
and/or clever.
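
As a rough illustration of the address-redirection idea (my own toy
sketch, not anything Jacob showed; the class, method, and disk names
are all hypothetical), a virtualization layer boils down to a lookup
table that can be changed without the client noticing:

```python
class VirtualVolume:
    """Toy block-level virtualization layer (hypothetical names).

    A table maps each virtual block number to a (device, physical block)
    pair, so data can move without the client's addresses changing.
    """

    def __init__(self):
        self.table = {}  # virtual block -> (device name, physical block)

    def map_block(self, vblock, device, pblock):
        """Point a virtual block at a physical location."""
        self.table[vblock] = (device, pblock)

    def resolve(self, vblock):
        """Return the physical location currently backing a virtual block."""
        return self.table[vblock]


vol = VirtualVolume()
vol.map_block(0, "disk_a", 100)
vol.map_block(1, "disk_b", 7)

# "Migrate" virtual block 0 to another disk; the client still asks for block 0.
vol.map_block(0, "disk_c", 42)
print(vol.resolve(0))
```

RAID, snapshots, and volume migration can all be seen as variations on
what goes into that table and when it gets rewritten.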

SCSI is a channel protocol; it assumes the wire takes care of reliable
transmission.  When the wire drops the ball, SCSI breaks.

The real motivation for SATA was to make it easy for big PC vendors to
plug hard drives into PC motherboards.  All of the other features came
later, after PC vendors had gotten the ball rolling.

Fibre channel is a pain to scale.

Right now: if you're building on your own, stick with fibre channel.
Don't go with a SAS-based solution unless it's inside an encapsulated
system supplied and supported by a big-name vendor, and you test the
bejeezus out of it.

The difference between enterprise hard drives and workstation hard
drives is blurring; i.e., near-line storage drives.

According to Jacob, when you have your data sitting in a collection of
SATA drives for long periods of time, you'll find that small
corruptions occur over time.  This is because of differences in the
error rate.  Tolerance for rotational vibration is also an important
factor.

When Seagate calculates tolerance for rotational vibration, as long as
the drive is within tolerance 80% of the time, it is considered to be
within spec.  Jacob finds this alarming, because it means a drive
could be spending up to 20% of its time reseeking, and still be
considered to be within spec.

7200RPM drives are built significantly differently than 15K RPM drives.

Most people say they'll just do iSCSI in software on their
fileservers, but buy SNIC cards for their big database servers.  In
reality, this is backwards: database servers generally don't pump that
much actual data over the wire, while people tend to underestimate the
amount of data that their fileservers pump over the network.

Jacob likes to describe Infiniband as a Vulcan mind meld for your PC.
He expounds: Right now, I'm vibrating air with my vocal cords,
which is being picked up by little bones in your ear, which is
translated into electrical impulses, which is then interpreted by your
brain.  If you think about it, this communication protocol has a lot
of overhead.  If I could just core dump directly into your brain, we
could've been out of here hours ago.

The reason why Jacob thinks iSCSI is a win over fibre channel is
because you can take the money you would've spent on fibre channel
devices and instead buy more disk spindles.

Latency in iSCSI typically comes in two places: smaller MTUs, and
software I/O processing.  But the real latency you see is in the hard
drives; latency of the wire tends to pale in comparison (unless your
software I/O processing really sucks).
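
To put rough numbers on the drive-versus-wire comparison, here is a
back-of-the-envelope sketch.  The figures are my own assumptions, not
Jacob's: a 7200RPM drive, a 4 KiB block, and a 1 Gb/s link.  It counts
only average rotational latency (half a revolution) and ignores seek
time, switch hops, and protocol overhead.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def wire_time_ms(payload_bytes, link_bits_per_s):
    """Time to serialize a payload onto the link, in milliseconds."""
    return payload_bytes * 8 / link_bits_per_s * 1000.0


# Assumed figures for illustration only.
disk_ms = rotational_latency_ms(7200)   # rotation only, no seek time
wire_ms = wire_time_ms(4096, 1e9)       # one 4 KiB block on a 1 Gb/s link

print(f"disk rotation: {disk_ms:.2f} ms, wire: {wire_ms:.3f} ms")
```

Even before adding seek time, the platter accounts for milliseconds
while the wire accounts for tens of microseconds.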

One of Jacob's customers is heavily invested in a SAN solution from a
three-letter vendor.  They told him that they'd love to look at
simpler SAN technology, but they've invested $100K in training their
staff to understand the SAN.  Jacob replied that that's their problem:
they have a SAN that they have to spend $100K on training in order to
figure out how to use it.

People *will* use Microsoft Sharepoint as a fileserver.  You can go
from megabytes to gigabytes to terabytes with frightening ease.

afternoon session 2


NetApp was the company that came up with the alternative to
copy-on-write for snapshots.  Multiple other SAN vendors implement it
now.

Snapshotting can be used if you have a mostly-identical volume
partitioned over many hosts; just make a snapshot for each host.

Dell does not think highly of PowerVaults.  (At least, PowerVaults as
of a few years ago.)

All of the small vendors that were pushing the envelope with storage
virtualization were bought up by top-tier companies.  Either neat
things are on the way, or some VPs made some bad acquisitions...

Jacob believes that, at least for now, you need to thoroughly
understand the virtual storage system you're building in order to run
it properly, which is why he favors buying the software and assembling
the hardware yourself over buying a black box.

Jacob normally makes the argument that two cheapo systems provide
better redundancy than a single fault-tolerant system.
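
One way to see why this argument can hold is the standard availability
formula for redundant systems.  This is my own sketch; the 99% figure
is made up, and the formula assumes the two boxes fail independently,
which shared power, shared software bugs, or a shared admin can easily
violate.

```python
def parallel_availability(a, n=2):
    """Availability of n redundant systems, assuming independent failures.

    The combined system is down only when all n copies are down at once.
    """
    return 1.0 - (1.0 - a) ** n


# Hypothetical number: each cheap box is up 99% of the time.
cheap_pair = parallel_availability(0.99, 2)

print(f"pair of 99% boxes: {cheap_pair:.4%}")
```

Under those assumptions the pair is unavailable only when both boxes
are down together, so a single fault-tolerant system would need to
beat that combined figure on its own to come out ahead.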

Jacob doesn't like block-level virtualization in the switch: switches
are already too complicated, and virtualization offers more
advantages when the logic resides closer to the disks.

The latest compression appliances can get throughput on the order of
500MB/s, not 200MB/s.

WAN accelerators work because WAN throughput is constrained more by
latency than by bandwidth.  WAFS gateways tend to be simpler (usually
just caching).
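
The latency-over-bandwidth point can be illustrated with the usual
window/RTT bound for a single TCP connection: a sender can have at most
one window of data in flight per round trip.  The 64 KiB window and
the round-trip times below are my own assumed figures, not numbers
from the tutorial.

```python
def max_tcp_throughput_bps(window_bytes, rtt_s):
    """Upper bound on one TCP stream: one window of data per round trip."""
    return window_bytes * 8 / rtt_s


window = 64 * 1024  # assumed 64 KiB window (no window scaling)

lan = max_tcp_throughput_bps(window, 0.001)  # assumed 1 ms round trip
wan = max_tcp_throughput_bps(window, 0.080)  # assumed 80 ms round trip

print(f"LAN bound: {lan / 1e6:.1f} Mbit/s, WAN bound: {wan / 1e6:.2f} Mbit/s")
```

With the same window, the long-haul link tops out at a small fraction
of the LAN figure no matter how fat the pipe is, which is the gap that
WAN accelerators go after.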

Montillio makes the NFS/CIFS Hardware Accelerator Card.

Every time Jacob turns around, someone is innovating in the storage
acceleration space.

Gear 6 makes an accelerator that uses solid state storage and is massively optimized for I/O. It accelerates only reads, not writes, but it boosts read performance by insane amounts.

Check out their paper Server Virtualization: Avoiding The I/O Trap.

Q&A session

There were only a few questions, mostly related to specific projects some of the attendees were working on.

Alas, I really don't know enough about our SAN project to pick Jacob's brain for advice.

