When two nodes synchronize via rsync, there
is a significant amount of overhead.
Re-computation of Deltas
Consider when comparing files based upon time stamps. On each synchronization,
directory entry's timestamps are compared to determine which files to exchange.
When doing a synchronization between two hosts without any knowledge other than
the root paths on each end, this cost is unavoidable.
In a mirroring network (like those used in Fedora, Debian, etc.), a small number
of files are changed or added on each update, but updates occur frequently. If
the master server were to pre-calculate the changes, then every downstream
client would be able to know which files to request before connecting to an
upstream server.
If all clients are querying a tracker that can advertise new changes, then this
the reduced latency in each link will lead to overall reduced latency as well.
Consider a network with three tiers of clients. The tracker can advertise
changes to tier 1. When all tier 1 clients have the updates, it can begin to
advertise the updates to tier 2 clients. At each update rollout, downstream
nodes can efficiently transfer data from upstream nodes without ever having to
compare the contents of the archive on each end of the connection (a simple
bitfield of files will suffice).
Lack of a Checksum Cache
Fedora's sample rsync configuration
explicitly turns off the checksum feature of rsync. The rationale behind this
(please note that I agree with this rationale) is that it prevents remote users
from triggering a checksum operation over the entire tree. This operation
requires every file on the tree to be read from disk and run through the
checksum algorithm used.
Thus, in the interest of saving I/O subsystems mirrors disable checksum support.
Without checksum support, rsync falls back to using timestamps. For
security reasons (I'll explain in a later post), timestamps are less than
adequate for detecting changes.
Ignorance of Hard-links
By default, rsync will not transfer hard-links. For example,
/path/to/foo/file and /path/to/bar/file could link to the same file on the
server, but unless the client runs rsync with -H, the file will be
transferred twice.
This is a minor issue that has mostly been solved with proper use of rsync.
The parallel to this issue is duplicated data that the server does not know
about. rsync does not detect if two files on the server are the same.
Reliance Upon Picking a Fast Upstream
With rsync, the choice of upstream mirror is critical to maintaining an
keeping up-to-date with updates. A mirror will never see updates before its
upstream does. In the worst case, the leaf nodes in the network will always be
d (the depth of the tree) synchronizations behind (assuming new content is
pulled at each sync). If there is a critical update that must be pushed
immediately, a O(6d) latency (assuming four synchronizations per day) is not
acceptable.
What CHASM Will Change
In this section I will provide a brief overview of how CHASM addresses each of
the above issues.
CHASM's master node generates a manifest as part of the update process. Each
manifest describes the absolute state of the archive in terms of directories,
symbolic links, and files. Each directory or file object has an associated
traditional unix file mode (ugo-rwx). Files are specified by a cryptographic
hash (where the CH in CHASM comes from).
A client simply has to read the manifest and compare it to local data on the
system. Combine this with a local pool of files identified by their hash value
(like Git's .git/objects/ directory), and CHASM is able to quickly identify
which files in a manifest must be transferred. Even a mirror that is months out
of date could easily avoid re-transferring any data that is still valid.
The solution to the re-computation of deltas adequately provides a checksum
cache, and solves the issue of hard-links. As each file is identified by its
checksum, every duplicate of the file will be replaced with a hard-link as a
matter of course.
Mirrors can query for updates quite frequently without significant penalty. A
full manifest of the Fedora project takes up less than 100MB without
compression. Compression reduces this amount (more on this in a later post).
Creating a diff between manifests reduces the size further.
The Fedora infrastructure already includes support for detecting closest-mirrors
using BGP. We'd like to leverage this to allow clients to dynamically pull
updates from close neighbors so as to minimize bandwidth costs and hopefully
maximize throughput.
Looking to the Future
CHASM is still a work in progress. Everything I wrote was in the above document
was written in present tense for clarity. Many of the features exist as ideas,
and are not reflected in the code.
If you're looking to help, please consider submitting to our CHASM Nightly
Dashboard (nightly build
instructions).
Update: We're also looking for students and support in the Fedora Summer of
Code (CHASM)
Thanks,
--Robert