Friday, January 11, 2008

Space costs of Total Saving

Yesterday I discussed my Total Saving approach with a friend, who raised an objection: surely the storage costs would be prohibitively high?

In my last post on the subject I wrote: "hard disk sizes and costs have long reached a point where [removing user-created data to free space] is completely irrelevant for all the text and text-based data a person can produce in a lifetime." It's time to substantiate this claim with some hard numbers.

Suppose I type at 10 cps (characters per second). This is a rate few people can exceed even in ten-second bursts, but let us suppose I can keep it up indefinitely: 10 cps, without stopping to think, eat, or sleep, for a whole year.

That gives us 10 (cps) * 3600*24*365 (seconds in a year) = 315,360,000 characters typed per year.

Suppose I store every keystroke in a log. A log record consists of the character typed, a timestamp, and maybe a few delimiters and such. A timestamp looks like "1200054174.237716", so a record comes to around 20 bytes in all. That brings us to roughly 6.3 GB of data per year.
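
To make the arithmetic concrete, here is a minimal sketch in Python of what such a log record might look like. The layout (a tab-separated timestamp and character per line) is my own invention, not a prescription:

    import time

    def log_record(char, timestamp=None):
        # One keystroke as a line of text: "<timestamp>\t<char>\n".
        if timestamp is None:
            timestamp = time.time()
        return "%.6f\t%s\n" % (timestamp, char)

    record = log_record('a', 1200054174.237716)
    print(repr(record))    # '1200054174.237716\ta\n', 20 bytes

    bytes_per_year = 10 * 3600 * 24 * 365 * len(record)   # 315,360,000 keystrokes
    print(bytes_per_year / 1e9)                            # about 6.3 (GB)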

HD costs are around 4GB per US$, so, for about $1.50 a year you can store all your data. Of course, you can't buy HDs that small, so I'll put this another way: if you can afford to buy even the smallest disk produced today (probably 40GB), it will last you for six years. By then, the smallest disk will probably be large enough to last you the rest of your life. (With 2008 lifespans, anyway.) Other storage solutions, including online ones, aren't priced much higher (Gmail can store almost that much for free), and with data creation pegged at 200 bytes/sec, bandwidth won't be a problem either.

Of course the actual storage requirements would be smaller by an order of magnitude. We can start by logging every word typed, or every 10 characters, saving a lot of per-record overhead. We can store the timestamps as offsets from the previous record, which would be far smaller, and use coarser resolution (we don't really need microsecond precision here). We can compress the whole thing, which ought to produce savings of 50%-60% at least. There are many higher-level techniques we can apply to further reduce the space used, like only storing identical document-states once, but the point is that they're not really necessary: we can easily keep well below 1GB a year, probably well below 100MB in practice, and everyone who uses a computer can afford that.
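
As a rough illustration of how far just the simple tricks go (a sketch only, with an invented three-byte record layout), here is what delta-encoded timestamps plus zlib compression might buy us:

    import random, struct, zlib

    def pack_log(keystrokes):
        # One record per keystroke: a 2-byte millisecond offset from the
        # previous keystroke plus the character itself, 3 bytes instead of ~20.
        out, prev = [], None
        for t, ch in keystrokes:
            delta_ms = 0 if prev is None else min(int((t - prev) * 1000), 0xFFFF)
            out.append(struct.pack(">H", delta_ms) + ch.encode("utf-8"))
            prev = t
        return b"".join(out)

    # A scaled-down simulation: a hundred thousand keystrokes at 10 cps.
    keys = [(i * 0.1, random.choice("etaoin shrdlu")) for i in range(100000)]
    raw = pack_log(keys)
    print(len(raw) / len(keys))                  # 3.0 bytes per keystroke
    print(len(zlib.compress(raw)) / len(keys))   # well under 2 after compression

At anything under 2 bytes per keystroke, the 315 million keystrokes of our tireless typist fit comfortably under 1GB a year.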

But wait, said my friend, this would be too inefficient. If the editor had to read and "simulate" a year's worth of keystrokes just to open a document, wouldn't that be too slow to be practical?

Not necessarily, but I'll grant that some documents and document-types require a faster approach. No fear - since our actual storage techniques allow random-access writing rather than just appending to a log, we can store the current document as a state-dump suitable for quick loading into an editor (like the complete text of a text file, to start with), while keeping reverse diffs for the complete history. (This, too, has been demonstrated in VCSs and elsewhere.) Since we're only storing one dump (or one dump per period of time, say one per week), this shouldn't take up too much space.
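
Here is a toy sketch of that arrangement (the class and its methods are hypothetical, and Python's difflib stands in for a real diff engine): the latest state is kept whole for instant loading, and a stack of reverse diffs covers the history.

    import difflib

    def reverse_diff(new_lines, old_lines):
        # Edits that turn new_lines back into old_lines.
        sm = difflib.SequenceMatcher(None, new_lines, old_lines)
        return [(i1, i2, old_lines[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

    def apply_diff(lines, diff):
        out, pos = [], 0
        for i1, i2, replacement in diff:
            out.extend(lines[pos:i1])   # unchanged span
            out.extend(replacement)     # what the older version had here
            pos = i2
        out.extend(lines[pos:])
        return out

    class Document:
        # Current state stored whole; history as a stack of reverse diffs.
        def __init__(self, lines):
            self.lines, self.history = list(lines), []

        def save(self, new_lines):
            self.history.append(reverse_diff(new_lines, self.lines))
            self.lines = list(new_lines)

        def version(self, steps_back):
            lines = self.lines
            for diff in reversed(self.history[len(self.history) - steps_back:]):
                lines = apply_diff(lines, diff)
            return lines

    doc = Document(["hello world"])
    doc.save(["hello world", "a second line"])
    doc.save(["hello brave world", "a second line"])
    print(doc.lines)         # the current state, loaded instantly
    print(doc.version(2))    # ['hello world'], reconstructed from the diffs

Opening the document is just reading the current state; the diffs are only touched when someone actually asks for history.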

Or would it? My friend pointed out that MS Office documents were huge, often running into the multiple megabytes. My natural reaction, of course, was to ridicule MS Office, but we did conduct an anecdotal-level test.

The biggest MS Word doc we could find was about 16MB, and contained 33 pages. By removing all Visio drawings and embedded images, we reduced it to 31 pages of pure text (with some formatting and tables), but it was still 6.5MB. We opened it with OO.o Writer 2.3 and saved it as ODF (which disrupted a lot of formatting and layout fine-tuning), but that still took over 3MB. We finally copied and pasted the whole text into a new Writer document, choosing to paste as HTML. The resulting document was only 17KB. Adding back the lost formatting would presumably make it grow, but nowhere near a multi-megabyte size. The large size of the first ODF might be explained, at least in part, by Writer preserving Word cruft that wasn't visible from the editor itself.

Of course, the ODF format used by Writer isn't very efficient, even zipped (nor is any other XML-based storage): those 17KB, though compressed, are more than the raw, uncompressed text of the document. But even so, it would be good enough for the purposes of total saving.

There's an assumption underlying this discussion: that the great majority of data produced by humans is textual in nature. Photos, sound, and video recordings, of course, produce far more data than the amounts described here. Total saving can't be applied to them, at least not at current storage costs.

But even there, some things can be improved. When I record a movie, I may be producing multiple megabytes of data per second. When I edit a movie and apply a complex transformation to it, which changes all its frames at maybe a frame a second, it might seem that I'm producing a frameful of data per second, which can still come out on the order of MB/sec. But in reality, all the new data I've produced is the 100 or 200 bytes of processing instructions, divided by the ten hours it takes to transform the entire movie. The rest was accomplished by an algorithm whose output was fully determined by those instructions and the original movie. If our log stored only these, we could still reproduce the modified movie. (Of course, it would take us 10 hours.)
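
A back-of-the-envelope sketch of such a log entry (all field names invented) shows why it is so small: it's a recipe, not the result.

    import json

    # A hypothetical edit-log entry: together with the original footage and a
    # deterministic filter implementation, it reproduces the multi-gigabyte
    # output, at the cost of re-running the ten-hour transformation.
    edit = {
        "source": "vacation_raw.avi",     # which original movie
        "filter": "sepia",                # which deterministic transformation
        "params": {"strength": 0.8},
        "frames": [0, 215999],            # the range it applies to
    }
    print(len(json.dumps(edit)))          # on the order of 100 bytes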

To put it another way: today we pass around CGI-rendered movies weighing GBs, when the input models used to render them might be an order of magnitude smaller (this is only an uneducated guess on my part: perhaps they are in reality an order of magnitude larger - but bear with me for a moment). These generated movies don't contain any more information than the models and software used to produce them; they merely contain more data. One day our computers will be able to generate these movies at the speed we watch them, and then only the input data will be passed around.

Today we download large PDFs, but we could make do with the far smaller inputs used to generate them if only all our recipients had the right software to parse them.

Real-time movie rendering may be a way off yet. But total saving of the complete input we're capable of sending through our keyboards, mice, and touchscreens isn't. We just need to write the right software. This state of things will last until brain-machine interfaces become good enough to let us output meaningful data at rates far exceeding a measly ten characters a second, or even the few hundred bytes per second we can generate by recording the precise movement of pointing devices. Even then, the complete log of your mouse movements is rarely as interesting as the far smaller log of your clicks and drags-and-drops. The whole MMI field is still in its infancy.

In theory, if you record all the inputs of your deterministic PC since a clean boot, you can recompute its exact state. In practice, total saving of text-based input will suffice - for now.
