Thursday, January 10, 2008

Total Saving

We lose data all the time.

Our computers hold a lot of data, but the most precious kind is the data we produce ourselves: the documents we write, the custom configurations, the bookmarks and browsing histories, the secret passwords, the things that would be lost on a clean reinstall. And it is this data that is in the greatest danger, precisely because we are constantly accessing and modifying it.

I save a file, accidentally overwriting another: poof, it is gone forever. I delete a file: gone. I engage in a session of heavy editing only to change my mind ten minutes later, but my editor's undo stack was cleared when I saved the file manually two minutes ago, and now I can't get the original version back. I go back in the undo log to see what the previous version looked like, I change a character by mistake, and now the redo log is gone and I've lost my changes. The software crashes, and the file is corrupted and cannot be recovered. I press Alt+F4 by mistake, and the words I wrote during the two minutes since I last pressed Ctrl+S disappear. I press Ctrl+A, Del, Ctrl+S, Ctrl+Q, and everything I've written is destroyed. Gone, all gone, even though it was on screen just a moment ago. The fatal keypresses cannot be undone.

Our current system of forcing the user to manually keep track of files is insane. It allows complete, irrevocable deletion of anything and everything, using short, simple commands that can be carried out by mistake. Giving a user complete freedom is a good thing, but there is real value in making commands that destroy data harder to carry out than three or four key combinations which, in another context or even just in another order, would be completely innocuous.

There are only three acceptable reasons for a system to ever lose data. The first is storage hardware failure. The second is freeing up space for other data, but hard disk sizes and costs have long reached a point where this is completely irrelevant for all the text and text-based data a person can produce in a lifetime. The third is intentional user deletion, which should be a special command never invoked in ordinary use and very, very unlikely to be invoked by accident.

Backup (as typically implemented) is not a solution. It can only be done every so often. It can only contain a limited number of versions. There is no reason why I should be able to lose all the data I created over the last few days or hours due to a software bug or user error or just by changing my mind and wanting to undo a lot of changes. Backup is a solution for hardware failure, not for user error. (Also, it requires dedicated hardware most people don't have, and effort most people won't put in.)

Version control, as used with code, or wiki pages, or DAV servers, is also not a good solution. It requires the user to manually pick checkpoints at which to commit their work, but this model of managing atomic changesets isn't cleanly applicable to (say) writing a book: there is no reason why I should not want to inspect and manage my change history with arbitrary granularity, the way I typed it. It requires the user to interrupt their work in order to issue a manual commit command, which generally involves entering a commit message (if there is no commit log, why bother with atomic changesets? just autocommit every second or so). It isn't well suited to environments where the user isn't editing a document of some sort, but is still creating important data (e.g., an IRC chat).

These environments are better suited to a model of automatically saving all data as soon as it comes in. It is my argument that all data is best managed using this model, with things like atomic changeset management layered on top if necessary. Today editors save when users tell them to, and maybe autosave every minute or two. There is no reason why they should not save after every word typed, or every character - save with the same granularity and features, in fact, as allowed by their in-memory undo/redo stack.
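
To make this concrete, here is a rough Python sketch of keystroke-granularity saving. None of it is any particular editor's API: the on_change hook is hypothetical, standing in for whatever change-notification mechanism an editor exposes.

    import json
    import time

    class Journal:
        """Append-only edit log: past entries are never rewritten."""

        def __init__(self, path):
            self.f = open(path, "a", encoding="utf-8")

        def record(self, op, pos, text):
            # For deletions, `text` is the removed text, so even a
            # deletion adds information instead of destroying it.
            entry = {"t": time.time(), "op": op, "pos": pos, "text": text}
            self.f.write(json.dumps(entry) + "\n")
            self.f.flush()  # one edit, one write; nothing waits in a buffer

    # journal = Journal("essay.journal")
    # editor.on_change(lambda op, pos, text: journal.record(op, pos, text))

Every insertion and deletion hits the disk the moment it happens, with exactly the granularity of the in-memory undo stack.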

I call this the total saving approach. It consists in large part of abandoning the optimizations that were necessary for the programs of 1970 but are today mostly unnecessary, premature, and truly the root of all evil:

  1. Always save every bit of data you have - it might be useful later on. Discarding some data is premature optimization for storage space.

  2. Save every bit of data as soon as you have it. Buffering data and writing it out later is just premature optimization for bandwidth.

  3. Decouple the problem of what to save and how to save it from the problem of how to display and process the saved data. Don't let your storage format and the choice of data to store be unnecessarily influenced by the way you're using the data. The way you use it might change tomorrow, but the data itself won't. Slaving storage to functionality is premature optimization for efficiency in a particular access pattern. (A code sketch of these principles follows this list.)
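
Continuing the hypothetical journal sketch from above: if the stream of raw edits is the storage format, the text on screen is just a derived view, and any historical revision falls out of the same replay function.

    import json

    def replay(journal_path, upto=None):
        """Rebuild the document by applying the first `upto` edits
        (or all of them); a past revision is just a shorter replay."""
        text = ""
        with open(journal_path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if upto is not None and i >= upto:
                    break
                e = json.loads(line)
                if e["op"] == "insert":
                    text = text[:e["pos"]] + e["text"] + text[e["pos"]:]
                elif e["op"] == "delete":
                    text = text[:e["pos"]] + text[e["pos"] + len(e["text"]):]
    return text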

If you have transient data too large to keep in RAM (like unlimited-size undo stacks), dump it to disk; don't discard it. If the document being edited lives on a slow or laggy store (like a remote wiki page), keep a copy on the local disk and synchronize it with the remote master copy as bandwidth permits. If the user travels the undo/redo stack a lot, let them open different versions as independent copies and keep a general undo/redo graph.
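
Here is a minimal sketch of such a general undo/redo graph - an illustration, not any existing editor's implementation. Undoing and then editing forks a new branch instead of clobbering the redo history, so every state that ever existed stays reachable:

    class Node:
        def __init__(self, text, parent=None):
            self.text = text
            self.parent = parent
            self.children = []

    class History:
        def __init__(self, text=""):
            self.root = self.cursor = Node(text)

        def edit(self, new_text):
            node = Node(new_text, parent=self.cursor)
            self.cursor.children.append(node)  # may open a second branch
            self.cursor = node

        def undo(self):
            if self.cursor.parent is not None:
                self.cursor = self.cursor.parent  # children stay reachable

        def redo(self, branch=-1):
            if self.cursor.children:
                self.cursor = self.cursor.children[branch]  # newest by default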

Never, ever, throw away data. Destroying information is the root of all evil.

But version control can teach us some things. Modern VCSs contain some very useful, advanced features: changeset management capabilities like many-way diffs, branching, merging, and cherry-picking, the blame (a.k.a. annotate) display, and other overlay views. The challenge is to integrate both these features and total saving into a universal framework that will work with all our data and applications.
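
As a taste of what that integration might look like, here is a toy blame display over a list of full-text revisions (oldest first), using nothing but the standard library's difflib; real VCSs use smarter matching, but the idea is the same. Each line of the newest revision is annotated with the revision that introduced it:

    import difflib

    def blame(revisions):
        """Return (revision_index, line) pairs for the newest revision."""
        lines = revisions[0].splitlines()
        origin = [0] * len(lines)
        for rev, text in enumerate(revisions[1:], start=1):
            new_lines = text.splitlines()
            new_origin = []
            matcher = difflib.SequenceMatcher(None, lines, new_lines)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == "equal":
                    new_origin.extend(origin[i1:i2])  # unchanged: keep blame
                else:
                    new_origin.extend([rev] * (j2 - j1))  # added/changed here
            lines, origin = new_lines, new_origin
        return list(zip(origin, lines))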

So to recap, a good editor (or other UI software; a future post will detail why I think almost every other type of program is a special kind of editor) would never lose data. It would let the user navigate all the revisions that ever existed for a document and use operations familiar from the version control world to manage changesets.

It would also manage documents for the user. Manually managing directories and choosing filenames is a bother, but the real reason it's bad is that it lets you then overwrite and delete those files. Obviously a file per document should exist somewhere, if only to let other programs access the files without invoking the editor program, but deleting or modifying them should not destroy data. (A FUSE-like approach might work here - a documentfs:/ mount would let you navigate the document store, locating files by various parameters, accessing different revisions and so forth. This is the filesystem-as-dynamic-DB-query-interface also seen in ReiserFS v4 and elsewhere, and is a topic for a dedicated post.)
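
For the curious, here is roughly what the skeleton of such a mount could look like with the fusepy binding - a read-only toy where the document store is just a dict mapping names to revision lists, and a path like /essay.txt@3 names revision 3 (the path scheme, like everything else here, is made up for illustration):

    import errno
    import stat
    import time
    from fuse import FUSE, FuseOSError, Operations

    class DocumentFS(Operations):
        def __init__(self, store):
            self.store = store  # {name: [rev0_bytes, rev1_bytes, ...]}

        def _lookup(self, path):
            name, _, rev = path.lstrip("/").partition("@")
            revs = self.store.get(name)
            if revs is None:
                raise FuseOSError(errno.ENOENT)
            return revs[int(rev)] if rev else revs[-1]  # default: latest

        def getattr(self, path, fh=None):
            now = time.time()
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o555, st_nlink=2,
                            st_ctime=now, st_mtime=now, st_atime=now)
            data = self._lookup(path)
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(data),
                        st_ctime=now, st_mtime=now, st_atime=now)

        def readdir(self, path, fh):
            return [".", ".."] + list(self.store)

        def read(self, path, size, offset, fh):
            return self._lookup(path)[offset:offset + size]

    # FUSE(DocumentFS(store), "/mnt/documentfs", foreground=True, ro=True)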

A good editor would sport another feature, which I'll provisionally call reversible, non-destructive editing. We've said that no change of any kind made to a document, including meta-changes like saving or renaming it, can destroy data. Also, every change can be un- and re-done. That means all the pesky Yes/No, OK/Cancel message boxes have no reason to exist. OK to overwrite this file? Of course it's OK to overwrite - if I change my mind I'll just press Ctrl+Z, or find the old data in the changelog tomorrow. Once the user really *feels* that data cannot be lost, I think tasks like writing won't look and feel as they do now. They would become a joyful, unburdened journey of exploration through the local vicinity of the space of all documents, always knowing we can retrace our path.

A good editor would also sport various other design features which will require their own blog posts to detail. This post is long enough as it is, so I'll stop here.

A final note: the ideas in this and future posts are far from being new inventions of mine. In many cases I'm even aware of the attribution (for instance, the idea of being able to undo any action and removing confirmation queries was described in a Google Tech Talk from maybe a year ago; unfortunately, I can't recall now which one it was). The problem is that, as far as I'm aware, most of them haven't been implemented beyond some proofs of concept. And I'm pretty sure they haven't all been implemented together (counting all the ideas I'll post about in the next few days).
