(Moved here from the introduction to the Total Saving post, because most people wouldn't be interested in this part.)
Most common programs available today are buggy and badly designed and implemented. But these are relatively minor issues - by dint of lifelong indoctrination we manage to tolerate and even ignore their failings in everyday use.
Bad OS and UI paradigms are far worse than mere bugs or design problems. Almost all end-user programs that exist today are built on many layers of fundamental architectural mistakes. Some of them made better sense when they were introduced with Unix and the Internet, thirty or forty years ago. All have remained unexamined and unchanged in mainstream software ever since.
I've had enough of bad computing paradigms. This is my declaration of war on stupid software. Every year we spend man-aeons on faster, safer, more featureful and glittering implementations of the same fundamentally broken designs. We build mousetraps so good that only intelligence-augmented rats can avoid them, but it's time to remember that what we really wanted to catch was a dragon.
This is the first in a planned series of posts that will outline the basic design issues as I see them with some common classes of end-user software - editors, file managers, browsers and the like - and in about eight months (when I plan to go to university) I'll start work on implementing these ideas in earnest. I feel I wouldn't really enjoy working on anything else before I build myself a set of better tools to do it with.
It's said that every aspiring C programmer wants to write their own OS. I prefer Python (which is still far from perfect) and would rather write my own UI - preferably, one that would allow a very high percentage of my computer use with just a few simple but powerful paradigms. Is that just a dream? Time will tell.
And so, on to the first post about Bad Software Paradigms: manual data management vs. Total Saving.
Friday, January 11, 2008
Space costs of Total Saving
Yesterday I discussed my Total Saving approach with a friend, who raised an objection: surely the storage costs would be prohibitively high?
In my last post on the subject I wrote: "hard disk sizes and costs have long reached a point where [removing user-created data to free space] is completely irrelevant for all the text and text-based data a person can produce in a lifetime." It's time to substantiate this claim with some hard numbers.
Suppose I type at 10 cps (characters per second). This is a rate few people can exceed even in ten-second bursts, but let us suppose I can keep it up indefinitely. I type at 10 cps, without stopping to think, eat, or sleep, for a year.
That gives us 10 (cps) * 3600*24*365 (seconds in a year) = 315,360,000 characters typed per year.
Suppose I store every keystroke in a log. A log record consists of the character typed, a timestamp, and maybe a few delimiters and such. A timestamp looks like "1200054174.237716", so we come to around 20 characters in all. This brings us to 6.3 GB of data per year.
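The arithmetic is trivial, but since I prefer Python anyway, here it is as a few lines of it (the constants are the assumptions stated above, not measurements):

```python
# Back-of-the-envelope cost of logging every keystroke, forever.
CPS = 10                         # sustained typing rate - generous, as argued above
SECONDS_PER_YEAR = 3600 * 24 * 365
RECORD_BYTES = 20                # char + "1200054174.237716"-style timestamp + delimiters

chars_per_year = CPS * SECONDS_PER_YEAR            # keystrokes per year
log_bytes_per_year = chars_per_year * RECORD_BYTES # raw log size per year

print(chars_per_year)                # 315360000 keystrokes
print(log_bytes_per_year / 10**9)    # ~6.3 GB of log per year
```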
HD costs are around 4GB per US$, so for $1.50 a year you can store all your data. Of course, you can't get HDs that small, so I'll put this another way: if you can afford to buy even the smallest disk produced today (probably 40GB), it will last you for six years. By then, the smallest disk will probably be large enough to last you the rest of your life. (With 2008 lifespans, anyway.) Other storage solutions, including online ones, offer prices not much higher (Gmail can store almost that much for free), and with data creation pegged at 200 bytes/sec, bandwidth won't be a problem.
Of course the real storage requirements would be smaller by an order of magnitude. We can start by saving one record per word typed, or per 10 characters, saving a lot of log record overhead. We can store the timestamps as time offsets, which would be far smaller, and use coarser resolution (we don't really need microsecond precision here). We can compress the whole thing, which ought to produce savings of 50%-60% at least. There are many higher-level techniques we can apply to further reduce the space used, like storing identical document-states only once, but the point is they're not really necessary: we can easily keep well below 1GB a year, probably well below 100MB in practice, and everyone who uses a computer can afford that.
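A quick sketch of the offset-plus-compression savings. The record format and the simulated inter-key delays here are made up for illustration, but the shape of the result shouldn't depend on them:

```python
import random
import zlib

random.seed(0)

# Simulate ~10,000 keystrokes with plausible inter-key delays.
t = 1200054174.237716
naive, compact = [], []
prev_ms = int(t * 1000)
for ch in "the quick brown fox " * 500:
    t += random.uniform(0.05, 0.3)
    # Naive record: absolute high-precision timestamp per keystroke.
    naive.append("%f\t%s\n" % (t, ch))
    # Compact record: millisecond offset from the previous keystroke.
    ms = int(t * 1000)
    compact.append("%d\t%s\n" % (ms - prev_ms, ch))
    prev_ms = ms

naive_raw = "".join(naive).encode()
compact_raw = "".join(compact).encode()

# Offsets alone shrink the log; generic compression shrinks it further.
assert len(compact_raw) < len(naive_raw)
assert len(zlib.compress(compact_raw)) < len(compact_raw)
```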
But wait, said my friend, this would be too inefficient. If the editor had to read and "simulate" a year's worth of keystrokes just to open a document, wouldn't that be too slow to be practical?
Not necessarily, but I'll grant that some documents and document-types require a faster approach. No fear - since our actual storage techniques allow random-access writing rather than just appending to a log, we can store the current document as a state-dump suitable for quick loading into an editor (like the complete text of a text file, to start with), while storing reverse diffs for the complete history. (This, too, has been demonstrated in VCSs and elsewhere.) Since we're only storing one dump (or one dump per period of time, say one per week), this shouldn't take up too much space.
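Here's a toy sketch of the state-dump-plus-reverse-diffs scheme, built on Python's difflib. The Document class and its methods are names I'm inventing for illustration, not any real editor's API - the point is only that the current text loads instantly while every past version stays recoverable:

```python
import difflib

def reverse_diff(new, old):
    """Opcodes that rebuild `old` from `new` (a 'reverse diff')."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, new, old).get_opcodes():
        if tag != "equal":
            ops.append((i1, i2, old[j1:j2]))  # replace new[i1:i2] with this text
    return ops

def apply_diff(text, ops):
    out, pos = [], 0
    for i1, i2, repl in ops:
        out.append(text[pos:i1])
        out.append(repl)
        pos = i2
    out.append(text[pos:])
    return "".join(out)

class Document:
    """Current state stored whole; history stored as reverse diffs."""
    def __init__(self, text=""):
        self.text, self.history = text, []

    def save(self, new_text):
        self.history.append(reverse_diff(new_text, self.text))
        self.text = new_text

    def version(self, n):
        """Reconstruct the n-th saved state by walking diffs backwards."""
        text = self.text
        for ops in reversed(self.history[n:]):
            text = apply_diff(text, ops)
        return text

doc = Document("hello world")
doc.save("hello brave world")
doc.save("hello brave new world")
assert doc.text == "hello brave new world"   # instant load, no replay needed
assert doc.version(0) == "hello world"       # full history still there
```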
Or would it? My friend pointed out that MS Office documents were huge, often running into the multiple megabytes. My natural reaction, of course, was to ridicule MS Office, but we did conduct an anecdotal-level test.
The biggest MS Word doc we could find was about 16MB large, and contained 33 pages. By removing all Visio drawings and embedded images, we reduced it to 31 pages of pure text (with some formatting and tables), but it was still 6.5MB large. We opened it with OO.o Writer 2.3 and saved as ODF (which disrupted a lot of formatting and layout finetuning), but that still took over 3MB. We finally copied-and-pasted the whole text into a new Writer document, choosing to paste as HTML. The resulting document was only 17KB large. Adding back the lost formatting presumably would make it grow, but nowhere near a multi-megabyte size. The large size of the first ODF might be explained at least in part by Writer preserving Word cruft that wasn't visible from the editor itself.
Of course, the ODF format used by Writer isn't very efficient, even zipped (neither is any other XML-based storage). 17KB, although compressed, is more than the raw text of the document uncompressed. But even so, it would be good enough for the purposes of total saving.
There's an assumption underlying this discussion: that the great majority of data produced by humans is textual in nature. Of course photos and sound and video recordings produce far more data than the amounts described here. Total saving can't be applied to them, at least not with current storage costs.
But even there, some things can be improved. When I record a movie, I may be producing multiple megabytes of data per second. When I edit a movie and apply a complex transformation to it, which changes all its frames at maybe a frame a second, it might seem that I'm producing a frameful of data per second, which can still come out on the order of MB/sec. But in reality, all the new data I've produced is the 100 or 200 bytes of processing instructions, divided by the ten hours it takes to transform the entire movie. The rest was accomplished by an algorithm whose output was fully determined by those instructions and the original movie. If our log stored only these, we could still reproduce the modified movie. (Of course, it would take us 10 hours.)
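To make the point concrete, here's a toy example. The "movie" is a list of frames of pixel values and the transformation is a trivial brightness adjustment, both stand-ins of my own invention - but the principle is real: log the instruction, and the output is recomputable whenever needed:

```python
def apply_instruction(frames, instruction):
    """A deterministic transform: same instruction + same input = same output."""
    delta = instruction["brightness_delta"]
    return [[min(255, max(0, px + delta)) for px in frame] for frame in frames]

# A tiny stand-in for a movie: two frames of three pixels each.
original = [[10, 200, 255], [0, 128, 64]]

# The instruction is all the *new* data the edit produced: a few dozen bytes.
instruction = {"op": "brighten", "brightness_delta": 30}

edited = apply_instruction(original, instruction)

# Our log stores only `instruction`; replaying it reproduces the edit exactly.
replayed = apply_instruction(original, instruction)
assert replayed == edited
```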
Put it another way: today we pass around CGI-rendered movies weighing GBs, when the input models used to render them might be an order of magnitude smaller (this is only an uneducated guess on my part: perhaps they are in reality an order of magnitude larger - but bear with me for a moment). These generated movies don't contain any more information than the models and software used to produce them; they merely contain more data. One day our computers will be able to generate these movies at the speed we watch them. And then only the input data will be passed around.
Today we download large PDFs, but we could make do with the far smaller inputs used to generate them if only all our recipients had the right software to parse them.
Real-time movie rendering may be a way off yet. But total saving of the complete input we're capable of sending through our keyboards, mice, and touchscreens isn't. We just need to write the correct software. This state of things will last until brain-machine interfaces become good enough to let us output meaningful data at rates far exceeding a measly ten characters a second, or even the few hundred bytes per second we can generate by recording the precise movement of pointing devices. Even then, the complete log of your mouse movements is rarely as interesting as the far smaller log of your mouse clicks and drags-and-drops. The whole MMI field is still in its infancy.
In theory, if you record all the inputs of your deterministic PC since a clean boot, you can recompute its exact state. In practice, total saving of text-based input will suffice - for now.
Thursday, January 10, 2008
Total Saving
We lose data all the time.
Our computers hold a lot of data, but the most precious kind is the data we produce ourselves: the documents we write, the custom configurations, the bookmarks and browsing histories, the secret passwords, the things that would be lost on a clean reinstall. And it is this data that is in the greatest danger, precisely because we are constantly accessing and modifying it.
I save a file, accidentally overwriting another: poof, it is gone forever. I delete a file: gone. I engage in a session of heavy editing only to change my mind ten minutes later, but my editor's undo stack was cleared when I saved the file manually two minutes ago, and now I can't get the original version back. I go back in the undo log to see what the previous version looked like, I change a character by mistake, and now the redo log is gone and I've lost my changes. The software crashes, and the file is corrupted and cannot be recovered. I press Alt+F4 by mistake, and the words I wrote during the two minutes since I last pressed Ctrl+S disappear. I press Ctrl+A, Del, Ctrl+S, Ctrl+Q, and everything I've written is destroyed. Gone, all gone, even though it was on screen just a moment ago. The fatal keypresses cannot be undone.
Our current system of forcing the user to manually keep track of files is insane. It allows complete, irrevocable deletion of anything and everything, using short and simple commands that can be carried out by mistake. Giving a user complete freedom is a good thing, but nevertheless there is definite value in making commands that destroy data more difficult to carry out than just three or four key combinations which, in another context or even just in another order, would be completely innocuous.
There are only three acceptable reasons for a system to ever lose data. The first is storage hardware failure. The second is freeing up space for other data, but hard disk sizes and costs have long reached a point where this is completely irrelevant for all the text and text-based data a person can produce in a lifetime. The third is intentional user deletion, which should be a special command never invoked in ordinary use and very, very unlikely to be invoked by accident.
Backup (as typically implemented) is not a solution. It can only be done every so often. It can only contain a limited number of versions. There is no reason why I should be able to lose all the data I created over the last few days or hours due to a software bug or user error or just by changing my mind and wanting to undo a lot of changes. Backup is a solution for hardware failure, not for user error. (Also, it requires dedicated hardware most people don't have, and effort most people won't put in.)
Version control, as used with code, or wiki pages, or DAV servers, is also not a good solution. It requires the user to manually pick checkpoints at which to commit their work, but this model of managing atomic changesets isn't cleanly applicable to (say) writing a book: there is no reason why I should not want to inspect and manage my change history with arbitrary granularity, the way I typed it. It requires the user to interrupt their work in order to issue a manual commit command, which generally involves entering a commit message (if there is no commit log, why bother with atomic changesets? just autocommit every second or so). It isn't well suited to environments where the user isn't editing a document of some sort, but is still creating important data (e.g., an IRC chat).
These environments are better suited to a model of automatically saving all data as soon as it comes in. It is my argument that all data is best managed using this model, with things like atomic changeset management layered on top if necessary. Today editors save when users tell them to, and maybe autosave every minute or two. There is no reason why they should not save after every word typed, or every character - save with the same granularity and features, in fact, as allowed by their in-memory undo/redo stack.
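A minimal sketch of what such an editor core might look like: an append-only journal that hits the disk on every keystroke, from which the document can always be replayed. The Journal and replay names are mine, and a real log would need to escape newlines and record far richer operations - this only shows that the model is a few lines of code, not rocket science:

```python
import io
import time

class Journal:
    """Append-only keystroke log: every edit is saved the moment it happens."""
    def __init__(self, stream):
        self.stream = stream             # any writable file-like object
    def record(self, op, arg):
        self.stream.write("%r\t%s\t%s\n" % (time.time(), op, arg))
        self.stream.flush()              # on stable storage before the next keystroke

def replay(stream):
    """Rebuild the document by replaying the journal from the start."""
    text = ""
    for line in stream:
        _, op, arg = line.rstrip("\n").split("\t", 2)
        if op == "insert":
            text += arg
        elif op == "backspace":
            text = text[:-1]
    return text

# Demo with an in-memory stream standing in for a log file.
log = io.StringIO()
journal = Journal(log)
for ch in "helloo":
    journal.record("insert", ch)
journal.record("backspace", "")          # even the typo and its fix are history

log.seek(0)
assert replay(log) == "hello"
```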
I call this the total saving approach. It consists in large part of abandoning the optimizations that were necessary for the programs of 1970, but which today are mostly unnecessary and premature and have truly become the root of all evil:
- Always save every bit of data you have - it might be useful later on. Discarding some data is premature optimization for storage space.
- Save every bit of data as soon as you have it. Buffering data and writing it out later is just premature optimization for bandwidth.
- Decouple the problem of what to save and how to save it, from the problem of how to display and process the saved data. Don't let your storage format and the choice of data to store be unnecessarily influenced by the way you're using the data. The way you use it might change tomorrow, but the data itself won't. Slaving storage to functionality is premature optimization for efficiency in a particular access pattern.
If you have transient data too large to keep in RAM (like unlimited-size undo stacks), dump it to disk; don't discard it. If the document being edited lives on a slow or laggy store (like a remote wiki page), keep a copy on the local disk and synchronize it with the remote master copy as bandwidth permits. If the user travels the undo/redo stack a lot, let them open different versions as independent copies and keep a general undo/redo graph.
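The general undo/redo graph deserves a sketch of its own. This is a hypothetical data structure, not any existing editor's (though undo trees have been demonstrated elsewhere): editing after an undo branches the history instead of destroying the redo path, so no state is ever lost:

```python
class UndoGraph:
    """Undo history as a graph, not a stack."""
    def __init__(self, state=""):
        self.states = [state]        # node id -> document state
        self.parent = {0: None}      # node id -> node it branched from
        self.current = 0

    def edit(self, new_state):
        self.states.append(new_state)
        node = len(self.states) - 1
        self.parent[node] = self.current  # branch off wherever we are
        self.current = node
        return node

    def undo(self):
        if self.parent[self.current] is not None:
            self.current = self.parent[self.current]
        return self.states[self.current]

    def goto(self, node):
        """Jump to any version that ever existed."""
        self.current = node
        return self.states[node]

g = UndoGraph("a")
g.edit("ab")
g.edit("abc")
g.undo()            # back at "ab"
g.edit("abX")       # in a stack model this would wipe out "abc"...
assert g.states[2] == "abc"      # ...but here the old branch survives
assert g.goto(3) == "abX"
```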
Never, ever, throw away data. Destroying information is the root of all evil.
But version control can teach us some things. Modern VCSs contain some very useful, advanced features: changeset management capabilities like many-way diffs, branching and merging and cherrypicking, the blame (aka annotate) display and other overlay views. The challenge is to integrate both these features and total saving into a universal framework that will work with all our data and applications.
So to recap, a good editor (or other UI software; a future post will detail why I think almost every other type of program is a special kind of editor) would never lose data. It would let the user navigate all the revisions that ever existed for a document and use operations familiar from the version control world to manage changesets.
It would also manage documents for the user. Manually managing directories and choosing filenames is a bother, but the real reason it's bad is that it lets you then overwrite and delete those files. Obviously a file per document should exist somewhere, if only to let other programs access the files without invoking the editor program, but deleting or modifying them should not destroy data. (A FUSE-like approach might work here - a documentfs:/ mount would let you navigate the document store, locating files by various parameters, accessing different revisions and so forth. This is the filesystem-as-dynamic-DB-query-interface also seen in ReiserFS v4 and elsewhere, and is a topic for a dedicated post.)
A good editor would sport another feature, which I'll provisionally call reversible, non-destructive editing. We've said that no change of any kind made to a document, including meta-changes like saving or renaming it, can destroy data. Also, every change can be un- and re-done. That means all the pesky Yes/No, OK/Cancel message boxes have no reason to exist. OK to overwrite this file? Of course it's OK to overwrite - if I change my mind I'll just press Ctrl+Z, or find the old data in the changelog tomorrow. Once the user really *feels* data cannot be lost, I think tasks like writing won't look and feel as they do now. They would become a joyful, unburdened journey of exploration through the local vicinity of the space of all documents, always knowing we can retrace our path.
A good editor would also sport various other design features which will require their own blog posts to detail. This post is long enough as it is, so I'll stop here.
A final note: the ideas in this and future posts are far from being new inventions of mine. In many cases I'm even aware of the attribution (for instance, the idea about being able to undo any action and removing confirmation queries was described on a Google Tech Talk from maybe a year ago; unfortunately I can't recall which one it was now). The problem is that as far as I'm aware most of them haven't been implemented beyond some POCs. And I'm pretty sure they haven't all been implemented together (counting all the ideas I'll post about in the next few days).