
The personal file cloud (or the challenge of syncing files across machines and devices)

Category: Linux   — Published by tengo on March 19, 2012 at 2:05 pm

Keeping files in sync is an ongoing struggle, and with most people owning a multitude of personal computing devices, it has become a ubiquitous problem. Just think of the nightmare of shuttling files back and forth between your home desktop, the office desktop, your laptop/netbook and your mobile phone. At some point, you will have that moment of "oh no, it's on my <name of the other machine>!". Let's investigate what's out there.

Related reading: Jon Stokes' take.

Dropbox:
- Syncs only one folder inside your "home folder", so you can only sync all your files with a few tricks
- Keeps version history, but what if you want to "forget" everything about a file?
- It's not open source, it doesn't run on your server, and you end up putting your data into the hands of people you don't know, even if it's encrypted data
- Makes a local copy of whole file-repository. Good as backup, bad if repository is BIG.

Roll your own Dropbox with rsync+inotify
- Linux only (at the moment)
- No GUI or Web-Frontend (at the moment)
- Indirect: a write to the local disk merely prompts the watcher daemon to copy the data afterwards (see the sketch after this list). This is what enables the graceful fallback, but it makes the whole thing feel like an add-on.
- Makes a local copy of whole file-repository. Good as backup, bad if repository is BIG.
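
A minimal sketch of such a watcher loop, assuming rsync and inotify-tools (for inotifywait) are installed; the local folder and the remote target are placeholders:

    #!/usr/bin/env python3
    # Minimal watcher loop: block until inotifywait reports a change in the
    # watched directory, then push a diff-only copy to the remote with rsync.
    # Assumes inotify-tools and rsync are installed; paths are placeholders.
    import subprocess

    WATCH_DIR = "/home/user/sync/"          # local folder to replicate (placeholder)
    REMOTE    = "user@example.org:sync/"    # rsync destination (placeholder)

    while True:
        # Block until something is written, created, deleted or moved (recursively).
        subprocess.call(["inotifywait", "-r", "-e",
                         "close_write,create,delete,move", WATCH_DIR])
        # Mirror the folder; rsync only transfers the changed deltas.
        subprocess.call(["rsync", "-az", "--delete", WATCH_DIR, REMOTE])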

Your files in version control
- Overly complicated
- No nice GUI
- Do you really need versioning for *all* your files? (A rough sketch of the approach follows this list.)
- Makes a local copy of whole file-repository. Good as backup, bad if repository is BIG.
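
For illustration, the whole approach boils down to a periodic commit-and-exchange loop like this (a rough sketch, assuming git is installed and the directory is already a clone with a configured remote; all names are placeholders):

    #!/usr/bin/env python3
    # Naive "files in version control" sync: commit everything and exchange it
    # with the remote every few minutes. Assumes git is installed and REPO is
    # already a clone with a configured remote; all names are placeholders.
    import subprocess, time

    REPO = "/home/user/files"   # placeholder working copy

    def git(*args):
        subprocess.call(["git"] + list(args), cwd=REPO)

    while True:
        git("add", "-A")
        git("commit", "-m", "autosync")   # does nothing if there are no changes
        git("pull", "--rebase")
        git("push")
        time.sleep(300)                   # every five minutes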

WebDAV
- Only a very basic Web frontend
- Versioning depends on the WebDAV provider. Very few have it.
- Mounting a WebDAV share requires additional software; WebDAV is not a "single solution" (though underneath it is just HTTP, as the sketch after this list shows).
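
That said, since WebDAV is plain HTTP underneath, a file can be pushed or fetched without any mount helper at all (a sketch using the requests library; the server URL and credentials are placeholders):

    # WebDAV is plain HTTP verbs: PUT uploads a file, GET fetches it back.
    # Sketch only; the server URL and credentials are placeholders.
    import requests

    BASE = "https://dav.example.org/files/"   # placeholder WebDAV share
    AUTH = ("user", "secret")                 # placeholder credentials

    # Upload a local file to the share.
    with open("notes.txt", "rb") as f:
        requests.put(BASE + "notes.txt", data=f, auth=AUTH).raise_for_status()

    # Fetch it back.
    r = requests.get(BASE + "notes.txt", auth=AUTH)
    r.raise_for_status()
    print(len(r.content), "bytes downloaded")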

Mounting a remote drive via sshfs
- Slow! Directory listings in particular are slow
- No diff-only behaviour as with rsync
- No support for xattribs/extended attributes, as the underlying sftp-server in OpenSSH doesn't support them.
(Update: There's an alternative: perlsshfs. It does support xattribs and works quite well, but with a small performance hit. A quick way to probe a mount for xattr support follows this list.)
- No versioning, as it behaves like a local block-device
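
To see whether a given mount actually supports extended attributes, it's enough to try setting one on a scratch file (a sketch; the mount point is a placeholder):

    # Probe a mount point for extended-attribute support by setting and reading
    # back a user.* attribute on a temporary file. On a plain sshfs mount this
    # typically fails; on perlsshfs it should succeed. Linux-only (os.setxattr).
    import os, tempfile

    MOUNTPOINT = "/mnt/remote"   # placeholder sshfs/perlsshfs mount

    fd, path = tempfile.mkstemp(dir=MOUNTPOINT)
    os.close(fd)
    try:
        os.setxattr(path, "user.test", b"1")
        print("xattr support:", os.getxattr(path, "user.test") == b"1")
    except OSError as e:
        print("no xattr support:", e)
    finally:
        os.remove(path)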

Mounting the "home folder" via NFS and similar
- No graceful fall-back in case of no Internet connectivity
- NFS, for example, can work over a WAN, but it requires a really fast/reliable connection!

"The perfect system"
- Rsync-like diff-only replication of data deltas
- Snappy user experience, even over slow, laggy, lossy or defunct network connections
- Behaves like a local block device: the drive is not being "watched"; the file operations themselves trigger actions, more or less.
- Seamless integration into operating systems, while additionally providing a Web frontend or a more sophisticated file browser/manager that exposes the additional (expert) features
- Strong file integrity checking, checksumming
- Full metadata support and integrity: atime, mtime, ctime, crtime, user/group, perm, ACL and xattribs/extended attributes!
- Mounting a remote and possibly gigantic repository/drive does not lead to syncing that whole repository to the local machine; it's more like a shadow or glass etching: data appears to be local but is not. But once files have been copied to the local machine (which might have taken some time), they remain local for a while, acting as a cache, and fade/disappear again if they are not used for some time.
- Saving large files directly to the remote machine over the (possibly slow) network can be slow! The system should know when a file should be copied/synced immediately and when it should write the file to local disk first and do the sync asynchronously later on. (It should know about the current network speed; a toy version of that decision follows this list.)
- Local replicas/caches are heavily encrypted, as is all communication over the WAN and the data at the block level on the remote server. Ideally, crypto keys on the server are kept in memory only for the duration of the session, so not even the server operator has the means to decrypt user data.
- Enterprise-ready. Companies and organizations should be able to host their own setup, completely private if desired.
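
As a toy illustration of the network-awareness point above: the decision could be as simple as comparing the estimated transfer time against a patience threshold (everything here, including the bandwidth probe and the threshold, is made up for the sake of the example):

    # Toy decision: write through to the remote right away, or park the file on
    # the local disk and sync asynchronously later. The bandwidth estimate and
    # the threshold are hypothetical placeholders.
    import os

    PATIENCE_SECONDS = 5.0   # how long a blocking save is allowed to take (made up)

    def estimated_bandwidth():
        # Placeholder: a real system would probe the link or keep a moving average.
        return 250_000.0   # pretend we measured ~250 kB/s

    def choose_strategy(path):
        eta = os.path.getsize(path) / estimated_bandwidth()
        if eta <= PATIENCE_SECONDS:
            return "sync now (blocking write-through)"
        return "store locally, sync asynchronously later"

    print(choose_strategy("/etc/hostname"))   # a tiny file -> sync now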

At what level should the system hook in?

The presented solutions are either a set of tools the user might use to mount a remote drive at various stages of boot, or more or less an (autostart) application that does the job purely from userspace, adding another mount/share (a special synchronized dir) into the user's existing file/folder structure. Either way, the user has to decide whether *all* or only *some* of his/her files should be synchronized.

So far it seems as if there is no easy solution that allows a user to have his/her whole (possibly large) user directory replicated across machines while at the same time behaving well under the fall-back condition of no Internet connectivity.

Also, the question is whether the system should keep the same set of data on all machines or only allow a "remote peek" into a possibly large remote repository. What if the netbook has only 500GB of storage while the remote machine/data you'd like to work with is several times bigger than that?
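
One possible answer to the 500GB-netbook problem is to treat the local copy as a bounded cache and evict the least recently used files once a size budget is exceeded (again only a sketch; the cache directory and the budget are placeholders, and files are assumed to still exist on the remote):

    # Keep a local cache directory below a size budget by deleting the least
    # recently accessed files first. The cache path and budget are placeholders.
    import os

    CACHE_DIR    = "/home/user/.filecloud-cache"   # placeholder local cache
    BUDGET_BYTES = 400 * 1024**3                   # leave headroom on a 500GB disk

    def evict_lru(cache_dir, budget):
        files = []
        for root, _dirs, names in os.walk(cache_dir):
            for name in names:
                p = os.path.join(root, name)
                st = os.stat(p)
                files.append((st.st_atime, st.st_size, p))
        total = sum(size for _atime, size, _path in files)
        for _atime, size, path in sorted(files):   # oldest access time first
            if total <= budget:
                break
            os.remove(path)   # safe: the authoritative copy lives on the remote
            total -= size

    evict_lru(CACHE_DIR, BUDGET_BYTES)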