Tuesday, January 5, 2010

Datahand halves can use extender cable

A quick tip for fellow Datahand owners: the cable connecting the two halves is a standard 15-pin serial cable. I used a standard 3 meter extender cable and it works great. Just in case you're wondering, like I was before buying.

Saturday, January 2, 2010

How to enable DRI + XVideo with several video cards ("vga arb" problem)

I hope this post comes in handy for someone searching for info.

I have a Radeon HD4850 card, which works great with the free radeon (aka xf86-video-ati) driver - which is to say DRI and XVideo work great, and that's all I need. (I do my gaming in Windows.)

Recently I bought a second 4850 card and set them up in dual Crossfire mode. DRI and XV stopped working, and the xorg log said that DRI was not supported with "vga arb" (i.e. VGA arbitration between several video cards). I only wanted to use one card - no crossfire - but the mere presence of another card prevented DRI from working.

Some Googling revealed that this is a known condition, which holds for DRI with any two cards with any drivers (even when only one card is active). It's not going to be fixed soon, apparently, and presumably when it *does* get fixed it'll be for DRI2 only.

The solution? Disable the second card completely, on the PCI bus level. First I find out the PCI bus ID of the cards:

[danarmak@planet data]$ lspci|grep VGA
01:00.0 VGA compatible controller: ATI Technologies Inc RV770 [Radeon HD 4850]
02:00.0 VGA compatible controller: ATI Technologies Inc RV770 [Radeon HD 4850]
Note that the main bus IDs are 1 and 2, because these are PCIe cards, and each PCIe slot has its own PCI bridge.

And here's how to remove a card from the system:

echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
After executing that, the card is gone from lspci's output and the directory /sys/bus/pci/devices/0000:02:00.0 is gone as well. If you want it to reappear (without rebooting), do:

echo 1 > /sys/bus/pci/rescan
Googling will give you refs to other names in /sys, which I don't have. I assume the names changed in a recent kernel version. As of 2.6.32, the above 'remove' file is the one to use.
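
For anyone who wants to script this, here is a minimal Python sketch that picks the VGA devices out of lspci output and builds the matching sysfs path. It only prints the path rather than writing to it. The function names are made up for illustration, the PCI domain is assumed to be 0000 as in the paths above, and treating the first listed card as the one to keep is also just an assumption.

```python
import re

# Matches lines like "02:00.0 VGA compatible controller: ..."
VGA_LINE = re.compile(
    r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) VGA compatible controller", re.I)

def vga_addresses(lspci_output):
    """Return the bus:device.function address of every VGA device listed."""
    addresses = []
    for line in lspci_output.splitlines():
        match = VGA_LINE.match(line.strip())
        if match:
            addresses.append(match.group(1))
    return addresses

def remove_path(address, domain="0000"):
    """Build the sysfs 'remove' file path (the domain is assumed to be 0000)."""
    return "/sys/bus/pci/devices/%s:%s/remove" % (domain, address)

SAMPLE = """\
01:00.0 VGA compatible controller: ATI Technologies Inc RV770 [Radeon HD 4850]
02:00.0 VGA compatible controller: ATI Technologies Inc RV770 [Radeon HD 4850]
"""

if __name__ == "__main__":
    # Assumption: the first card listed is the one to keep.
    for addr in vga_addresses(SAMPLE)[1:]:
        print(remove_path(addr))
```

Writing "1" to the printed path (as root) is then the same step as the echo command above.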

Thursday, January 17, 2008

Quick hit: Btrfs

Btrfs (which I pronounce as Betterfs) is a new Linux filesystem with many of ZFS's design sensibilities. It's being developed by Chris Mason from Oracle and was announced about half a year ago - the lkml announcement thread has some interesting posts.

How did I miss this until now? :-? Regardless, this is great news for Linux and Linux users. With ZFS entering the BSDs and OSX, Linux looked like it was going to be left behind. This is some great tech - go read about its features. I can't wait for it to become stable/mainstream; sadly that will probably be another year or so. Filesystems need a lot of testing before production use.

Have I said I really like the features of btrfs (and ZFS) yet? If I had to write out the features of my dream filesystem, these would constitute a big part.

Friday, January 11, 2008

War on Bad Software Paradigms

(Moved here from the introduction to the Total Saving post, because most people wouldn't be interested in this part.)

Most common programs available today are buggy and badly designed and implemented. But these are relatively minor issues - by dint of lifelong indoctrination we manage to tolerate and even ignore their failings in everyday use.

Bad OS and UI paradigms are far worse than mere bugs or design problems. Almost all end-user programs that exist today are built on many layers of fundamental architectural mistakes. Some of them made better sense when they were introduced with Unix and the Internet, thirty or forty years ago. All have remained unexamined and unchanged in mainstream software ever since.

I've had enough of bad computing paradigms. This is my declaration of war on stupid software. Every year we spend man-aeons on faster, safer, more featureful and glittering implementations of the same fundamentally broken designs. We build mousetraps so good that only intelligence-augmented rats can avoid them, but it's time to remember that what we really wanted to catch was a dragon.

This is the first in a planned series of posts that will outline the basic design issues as I see them with some common classes of end-user software - editors, file managers, browsers and the like - and in about eight months (when I plan to go to university) I'll start work on implementing these ideas in earnest. I feel I wouldn't really enjoy working on anything else before I build myself a set of better tools to do it with.

It's said that every aspiring C programmer wants to write their own OS. I prefer Python (which is still far from perfect) and would rather write my own UI - preferably, one that would allow a very high percentage of my computer use with just a few simple but powerful paradigms. Is that just a dream? Time will tell.

And so, on to the first post about Bad Software Paradigms: manual data management vs. Total Saving.

Space costs of Total Saving

Yesterday I discussed my Total Saving approach with a friend, who raised an objection: surely the storage costs would be prohibitively high?

In my last post on the subject I wrote: "hard disk sizes and costs have long reached a point where [removing user-created data to free space] is completely irrelevant for all the text and text-based data a person can produce in a lifetime." It's time to substantiate this claim with some hard numbers.

Suppose I type at 10 cps (characters per second). This is a rate few people can exceed even in ten-second bursts, but let us suppose I can keep it up indefinitely. I type at 10 cps, without stopping to think, eat, or sleep, for a year.

That gives us 10 (cps) * 3600*24*365 (seconds in a year) = 315,360,000 characters typed per year.

Suppose I store every keystroke in a log. A log record consists of the character typed, a timestamp, and maybe a few delimiters and such. A timestamp looks like "1200054174.237716", so we come to around 20 characters in all. This brings us to 6.3 GB of data per year.

HD costs are around 4GB per US$, so for about $1.50 a year you can store all your data. Of course, you can't get HDs that small, so I'll put this another way: if you can afford to buy even the smallest disk produced today (probably 40GB), it will last you for six years. By then, the smallest disk will probably be large enough to last you the rest of your life. (With 2008 lifespans, anyway.) Other storage solutions, including online ones, offer prices not much higher (Gmail can store almost that much for free), and with data creation pegged at 200 bytes/sec, bandwidth won't be a problem.
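
The arithmetic above can be checked with a few lines of Python. This is only a restatement of the figures already given (10 cps, 20 bytes per record, 4GB per dollar, a 40GB disk); the one added assumption is that a gigabyte means 10^9 bytes, the way disk vendors count.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365   # non-stop typing, no leap day
CHARS_PER_SECOND = 10                # the sustained typing rate assumed above
BYTES_PER_RECORD = 20                # character + timestamp + delimiters
GB = 10.0 ** 9                       # assumption: decimal gigabytes
GB_PER_DOLLAR = 4.0                  # rough disk price used in the text
SMALLEST_DISK_GB = 40.0

chars_per_year = CHARS_PER_SECOND * SECONDS_PER_YEAR
bytes_per_second = CHARS_PER_SECOND * BYTES_PER_RECORD
gb_per_year = chars_per_year * BYTES_PER_RECORD / GB
dollars_per_year = gb_per_year / GB_PER_DOLLAR
years_on_smallest_disk = SMALLEST_DISK_GB / gb_per_year

print("characters per year: %d" % chars_per_year)
print("log bytes per second: %d" % bytes_per_second)
print("log size per year:   %.1f GB" % gb_per_year)
print("cost per year:       $%.2f" % dollars_per_year)
print("a 40GB disk lasts:   %.1f years" % years_on_smallest_disk)
```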

Of course the real storage room requirements would be smaller by an order of magnitude. We can start by saving every word typed, or every 10 characters, saving a lot of log record overhead. We can store the timestamps as time offsets, which would be far smaller, and use coarser resolution (we don't really need microsecond precision here). We can compress the whole thing, which ought to produce savings of 50%-60% at least. There are many higher-level techniques we can apply to further reduce the space used, like only storing identical document-states once, but the point is they're not really necessary: we can easily keep well below 1GB a year, probably well below 100MB in practice, and everyone who uses a computer can afford that.
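
To get a feel for these savings, here is a rough Python sketch that compares a naive per-keystroke log with a per-word log using millisecond time offsets, and then compresses the latter with zlib. Everything about it is an assumption made for illustration: the record formats, the synthetic keystroke timings (random gaps of 50-150 ms), and the repeated sample sentence, which compresses far better than real writing would.

```python
import random
import zlib

def synthetic_keystrokes(text, start=1200054174.237716, seed=0):
    """Return (timestamp, char) pairs with made-up gaps of 50-150 ms between keys."""
    rng = random.Random(seed)
    now = start
    pairs = []
    for ch in text:
        now += rng.uniform(0.05, 0.15)
        pairs.append((now, ch))
    return pairs

def naive_log(keystrokes):
    """One record per keystroke: char, tab, full timestamp, newline (20 bytes here)."""
    return "".join("%s\t%.6f\n" % (ch, ts) for ts, ch in keystrokes).encode("utf-8")

def compact_log(keystrokes):
    """One record per word: milliseconds since the previous record, then the word."""
    records = []
    word = ""
    prev_ms = 0          # so the first record carries the absolute time
    last_ts = None
    for ts, ch in keystrokes:
        word += ch
        last_ts = ts
        if ch == " ":    # a space closes the current word
            now_ms = int(round(ts * 1000))
            records.append("%d %s\n" % (now_ms - prev_ms, word))
            prev_ms, word = now_ms, ""
    if word:             # flush a trailing word that had no space after it
        records.append("%d %s\n" % (int(round(last_ts * 1000)) - prev_ms, word))
    return "".join(records).encode("utf-8")

def text_from_compact(data):
    """Recover the typed text from a compact log (per-key timing is lost)."""
    lines = data.decode("utf-8").split("\n")
    return "".join(line.split(" ", 1)[1] for line in lines if line)

text = "the quick brown fox jumps over the lazy dog " * 200
keys = synthetic_keystrokes(text)
naive = naive_log(keys)
compact = compact_log(keys)
packed = zlib.compress(compact, 9)
print(len(text), len(naive), len(compact), len(packed))
```

The four numbers printed are the raw text size, the naive log size, the per-word log size, and the compressed per-word log size, all in bytes.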

But wait, said my friend, this would be too inefficient. If the editor had to read and "simulate" a year's worth of keystrokes just to open a document, wouldn't that be too slow to be practical?

Not necessarily, but I'll grant that some documents and document-types require a faster approach. No fear - since our actual storage techniques allow random-access writing rather than just appending to a log, we can store the current document as a state-dump suitable for quick loading into an editor (like the complete text of a text file, to start with), while storing reverse diffs for the complete history. (This, too, has been demonstrated in VCSs and elsewhere.) Since we're only storing one dump (or one dump per period of time, say one weekly), this shouldn't take up too much space.
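
A state dump plus reverse diffs can be sketched in Python with the standard difflib module. This is only an illustration under simple assumptions: documents are plain strings, each save stores one reverse delta, and the names Document, reverse_delta and apply_delta are invented here. Real VCSs use their own delta formats.

```python
import difflib

def reverse_delta(new, old):
    """Describe how to turn `new` back into `old`, storing only text `new` lacks."""
    ops = []
    matcher = difflib.SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse new[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("text", old[j1:j2]))    # text that exists only in old
        # "delete": new[i1:i2] simply does not appear in old
    return ops

def apply_delta(new, ops):
    """Rebuild the older text from the newer text and a reverse delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(new[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

class Document:
    def __init__(self, text=""):
        self.current = text   # the state dump: loads instantly
        self.deltas = []      # deltas[-1] turns current into the previous revision

    def save(self, new_text):
        self.deltas.append(reverse_delta(new_text, self.current))
        self.current = new_text

    def revision(self, steps_back):
        """Return the text as it was `steps_back` saves ago (0 = current)."""
        if not 0 <= steps_back <= len(self.deltas):
            raise ValueError("no such revision")
        text = self.current
        for ops in reversed(self.deltas[len(self.deltas) - steps_back:]):
            text = apply_delta(text, ops)
        return text

doc = Document("hello")
doc.save("hello world")
doc.save("hello brave new world")
print(doc.revision(0))
print(doc.revision(2))
```

Opening the document only reads doc.current; the cost of replaying deltas is paid only when an old revision is actually requested.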

Or would it? My friend pointed out that MS Office documents were huge, often running into the multiple megabytes. My natural reaction, of course, was to ridicule MS Office, but we did conduct an anecdotal-level test.

The biggest MS Word doc we could find was about 16MB large, and contained 33 pages. By removing all Visio drawings and embedded images, we reduced it to 31 pages of pure text (with some formatting and tables), but it was still 6.5MB large. We opened it with OO.o Writer 2.3 and saved as ODF (which disrupted a lot of formatting and layout finetuning), but that still took over 3MB. We finally copied-and-pasted the whole text into a new Writer document, choosing to paste as HTML. The resulting document was only 17KB large. Adding back the lost formatting presumably would make it grow, but nowhere near a multi-megabyte size. The large size of the first ODF might be explained at least in part by Writer preserving Word cruft that wasn't visible from the editor itself.

Of course, the ODF format used by Writer isn't very efficient, even zipped (neither is any other XML-based storage). 17KB, although compressed, is more than the raw text of the document uncompressed. But even so, it would be good enough for the purposes of total storage.

There's an assumption underlying this discussion: that the great majority of data produced by humans is textual in nature. Of course photos and sound and video recordings produce far more data than the amounts described here. Total storage can't be applied to them, at least not with current storage costs.

But even there, some things can be improved. When I record a movie, I may be producing multiple megabytes of data per second. When I edit a movie and apply a complex transformation to it, which changes all its frames at maybe a frame a second, it might seem that I'm producing a frameful of data per second, which can still come out on the order of MB/sec. But in reality, all the new data I've produced is the 100 or 200 bytes of processing instructions, divided by the ten hours it takes to transform the entire movie. The rest was accomplished by an algorithm whose output was fully determined by those instructions and the original movie. If our log stored only these, we could still reproduce the modified movie. (Of course, it would take us 10 hours.)
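
A toy Python illustration of the same point: if the transformation is deterministic, logging the instruction is enough to regenerate the output. The "movie" here is just a list of small integer frames, and the "brighten" operation and its JSON encoding are invented for the example.

```python
import json

def apply_instruction(frames, instruction):
    """Deterministically transform every frame according to one logged instruction."""
    if instruction["op"] == "brighten":
        amount = instruction["amount"]
        return [[min(255, pixel + amount) for pixel in frame] for frame in frames]
    raise ValueError("unknown op: %r" % instruction["op"])

# A stand-in for the original movie: 100 tiny frames of 64 pixels each.
original = [[(x * y) % 256 for x in range(64)] for y in range(100)]

# What the user actually produced: one short instruction.
log_entry = json.dumps({"op": "brighten", "amount": 40})

# The edited movie is never stored; it is recomputed from the log on demand.
edited = apply_instruction(original, json.loads(log_entry))

print("instruction size: %d bytes" % len(log_entry))
print("pixels regenerated: %d" % sum(len(frame) for frame in edited))
```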

Put it another way: today we pass around CGI-rendered movies weighing GBs, when the input models used to render them might be an order of magnitude smaller (this is only an uneducated guess on my part: perhaps they are in reality an order of magnitude larger - but bear with me for a moment). These generated movies don't contain any more information than the models and software used to produce them; they merely contain more data. One day our computers will be able to generate these movies at the speed we watch them. And then only the input data will be passed around.

Today we download large PDFs, but we could make do with the far smaller inputs used to generate them if only all our recipients had the right software to parse them.

Real-time movie rendering may be a way off yet. But total saving of the complete input we're capable of sending through our keyboards, mice, and touchscreens isn't. We just need to write the correct software. This state of things will last until brain-machine interfaces become good enough to let us output meaningful data at rates far exceeding a measly ten characters a second, or even the few hundred bytes per second we can generate by recording the precise movement of pointing devices. Even then, the complete log of your mouse movements is rarely as interesting as the far smaller log of your mouse clicks and drags-and-drops. The whole MMI field is still in its infancy.

In theory, if you record all the inputs of your deterministic PC since a clean boot, you can recompute its exact state. In practice, total saving of text-based input will suffice - for now.

Thursday, January 10, 2008

Total Saving

We lose data all the time.

Our computers hold a lot of data, but the most precious kind is the data we produce ourselves: the documents we write, the custom configurations, the bookmarks and browsing histories, the secret passwords, the things that would be lost on a clean reinstall. And it is this data that is in the greatest danger, precisely because we are constantly accessing and modifying it.

I save a file, accidentally overwriting another: poof, it is gone forever. I delete a file: gone. I engage in a session of heavy editing only to change my mind ten minutes later, but my editor's undo stack was cleared when I saved the file manually two minutes ago, and now I can't get the original version back. I go back in the undo log to see what the previous version looked like, I change a character by mistake, and now the redo log is gone and I've lost my changes. The software crashes, and the file is corrupted and cannot be recovered. I press Alt+F4 by mistake, and the words I wrote during the two minutes since I last pressed Ctrl+S disappear. I press Ctrl+A, Del, Ctrl+S, Ctrl+Q, and everything I've written is destroyed. Gone, all gone, even though it was on screen just a moment ago. The fatal keypresses cannot be undone.

Our current system of forcing the user to manually keep track of files is insane. It allows complete, irrevocable deletion of anything and everything, using short and simple commands that can be carried out by mistake. Giving a user complete freedom is a good thing, but nevertheless there is definite value in making commands that destroy data more difficult to carry out than just three or four key combinations which, in another context or even just in another order, would be completely innocuous.

There are only three acceptable reasons for a system to ever lose data. The first is storage hardware failure. The second is freeing up space for other data, but hard disk sizes and costs have long reached a point where this is completely irrelevant for all the text and text-based data a person can produce in a lifetime. The third is intentional user deletion, which should be a special command never invoked in ordinary use and very, very unlikely to be invoked by accident.

Backup (as typically implemented) is not a solution. It can only be done every so often. It can only contain a limited number of versions. There is no reason why I should be able to lose all the data I created over the last few days or hours due to a software bug or user error or just by changing my mind and wanting to undo a lot of changes. Backup is a solution for hardware failure, not for user error. (Also, it requires dedicated hardware most people don't have, and effort most people won't put in.)

Version control, as used with code, or wiki pages, or DAV servers, is also not a good solution. It requires the user to manually pick checkpoints at which to commit their work, but this model of managing atomic changesets isn't cleanly applicable to (say) writing a book: there is no reason why I should not want to inspect and manage my change history with arbitrary granularity, the way I typed it. It requires the user to interrupt their work in order to issue a manual commit command, which generally involves entering a commit message (if there is no commit log, why bother with atomic changesets? just autocommit every second or so). It isn't well suited to environments where the user isn't editing a document of some sort, but is still creating important data (e.g., an IRC chat).

These environments are better suited to a model of automatically saving all data as soon as it comes in. It is my argument that all data is best managed using this model, with things like atomic changeset management layered on top if necessary. Today editors save when users tell them to, and maybe autosave every minute or two. There is no reason why they should not save after every word typed, or every character - save with the same granularity and features, in fact, as allowed by their in-memory undo/redo stack.
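
Saving after every character can be as simple as an append-only log that is flushed on each key. The Python sketch below is a minimal illustration, not a real editor backend: the JSON-lines record format and the "BACKSPACE" key name are assumptions, and a real implementation would have to handle many more keys and cursor movement.

```python
import json
import os
import tempfile
import time

class KeystrokeLog:
    """Append-only log that is flushed to disk after every single key."""

    def __init__(self, path):
        self.file = open(path, "a")

    def record(self, key):
        # One JSON line per key: [timestamp, key].
        self.file.write(json.dumps([time.time(), key]) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())   # ask the OS to push it to the disk now

    def close(self):
        self.file.close()

def replay(path):
    """Rebuild the document text by re-reading every logged key in order."""
    text = []
    with open(path) as log:
        for line in log:
            _timestamp, key = json.loads(line)
            if key == "BACKSPACE":     # hypothetical name for the backspace key
                if text:
                    text.pop()
            else:
                text.append(key)
    return "".join(text)

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "keys.log")
    log = KeystrokeLog(path)
    for key in ["h", "e", "l", "p", "BACKSPACE", "l", "o"]:
        log.record(key)
    log.close()
    print(replay(path))   # the mistyped "p" is gone from the text, not from the log
```

Because nothing is ever rewritten, a crash can lose at most the key being written at that moment.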

I call this the total saving approach. It consists in large part of abandoning the optimizations that were necessary for the programs of 1970, but which today are mostly unnecessary and premature and have truly become the root of all evil:

  1. Always save every bit of data you have - it might be useful later on. Discarding some data is premature optimization for storage space.

  2. Save every bit of data as soon as you have it. Buffering data and writing it out later is just premature optimization for bandwidth.

  3. Decouple the problem of what to save and how to save it, from the problem of how to display and process the saved data. Don't let your storage format and the choice of data to store be unnecessarily influenced by the way you're using the data. The way you use it might change tomorrow, but the data itself won't. Slaving storage to functionality is premature optimization for efficiency in a particular access pattern.

If you have transient data too large to keep in RAM (like unlimited-size undo stacks), dump it to disk; don't discard it. If the document being edited lives on a slow or laggy store (like a remote wiki page), keep a copy on the local disk and synchronize it with the remote master copy as bandwidth permits. If the user travels the undo/redo stack a lot, let them open different versions as independent copies and keep a general undo/redo graph.
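
A general undo/redo graph is easy to sketch: instead of a stack whose redo half is thrown away on the next edit, keep a tree of revisions. The Python class below is only an illustration with invented names, and it stores the full text of every revision to stay short; a real one would store diffs.

```python
class RevisionGraph:
    """Undo/redo history kept as a tree: editing after an undo starts a new branch."""

    def __init__(self, text=""):
        self.texts = [text]      # revision id -> full text (no diffs, for simplicity)
        self.parents = [None]    # revision id -> parent revision id
        self.head = 0            # revision currently shown in the editor

    @property
    def text(self):
        return self.texts[self.head]

    def edit(self, new_text):
        """Record a new revision as a child of the current one and move to it."""
        self.texts.append(new_text)
        self.parents.append(self.head)
        self.head = len(self.texts) - 1
        return self.head

    def undo(self):
        """Step to the parent revision, if there is one."""
        if self.parents[self.head] is not None:
            self.head = self.parents[self.head]

    def children(self, rev):
        return [r for r, parent in enumerate(self.parents) if parent == rev]

    def redo(self, branch=None):
        """Step to a child revision: the newest one unless `branch` names another."""
        kids = self.children(self.head)
        if kids:
            self.head = branch if branch in kids else kids[-1]

graph = RevisionGraph("Chapter 1")
good = graph.edit("Chapter 1: Dragons")
graph.undo()                      # look at the previous version...
oops = graph.edit("Chapter 1x")   # ...and change a character by mistake
graph.undo()
graph.redo(branch=good)           # the "lost" redo branch is still there
print(graph.text)
```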

Never, ever, throw away data. Destroying information is the root of all evil.

But version control can teach us some things. Modern VCSs contain some very useful, advanced features: changeset management capabilities like many-way diffs, branching and merging and cherrypicking, the blame (aka annotate) display and other overlay views. The challenge is to integrate both these features and total saving into a universal framework that will work with all our data and applications.

So to recap, a good editor (or other UI software; a future post will detail why I think almost every other type of program is a special kind of editor) would never lose data. It would let the user navigate all the revisions that ever existed for a document and use operations familiar from the version control world to manage changesets.

It would also manage documents for the user. Manually managing directories and choosing filenames is a bother, but the real reason it's bad is that it lets you then overwrite and delete those files. Obviously a file per document should exist somewhere, if only to let other programs access the files without invoking the editor program, but deleting or modifying them should not destroy data. (A FUSE-like approach might work here - a documentfs:/ mount would let you navigate the document store, locating files by various parameters, accessing different revisions and so forth. This is the filesystem-as-dynamic-DB-query-interface also seen in ReiserFS v4 and elsewhere, and is a topic for a dedicated post.)

A good editor would sport another feature, which I'll provisionally call reversible, non-destructive editing. We've said that no change of any kind made to a document, including meta-changes like saving or renaming it, can destroy data. Also, every change can be un- and re-done. That means all the pesky Yes/No, OK/Cancel message boxes have no reason to exist. OK to overwrite this file? Of course it's ok to overwrite - if I change my mind I'll just press Ctrl+Z, or find the old data in the changelog tomorrow. Once the user really *feels* data cannot be lost, I think tasks like writing won't look and feel as they do now. They would become a joyful, unburdened journey of exploration through the local vicinity of the space of all documents, always knowing we can retrace our path.
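
Here is a minimal Python sketch of what "overwriting can't destroy data" might look like in a document store. The DocumentStore class and its methods are hypothetical; the only idea it demonstrates is that save, delete and even undo all append to a history instead of replacing anything.

```python
class DocumentStore:
    """Every operation appends to a history; nothing is ever overwritten in place."""

    def __init__(self):
        self.history = {}   # name -> list of contents; None marks a deletion

    def save(self, name, content):
        # No "OK to overwrite?" prompt: the old content stays in the history.
        self.history.setdefault(name, []).append(content)

    def delete(self, name):
        self.history.setdefault(name, []).append(None)

    def read(self, name):
        versions = self.history.get(name, [])
        return versions[-1] if versions else None

    def undo(self, name):
        """Revert the last operation by re-appending the state that preceded it."""
        versions = self.history[name]
        previous = versions[-2] if len(versions) >= 2 else None
        versions.append(previous)

    def revisions(self, name):
        return list(self.history.get(name, []))

store = DocumentStore()
store.save("notes", "draft 1")
store.save("notes", "draft 2")    # an "overwrite"
store.delete("notes")             # a "delete"
store.undo("notes")               # changed my mind
print(store.read("notes"))
print(store.revisions("notes"))
```

Since undo is itself just another appended state, undoing an undo acts as a redo, and the full trail remains available for inspection.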

A good editor would also sport various other design features which will require their own blog posts to detail. This post is long enough as it is, so I'll stop here.

A final note: the ideas in this and future posts are far from being new inventions of mine. In many cases I'm even aware of the attribution (for instance, the idea about being able to undo any action and removing confirmation queries was described on a Google Tech Talk from maybe a year ago; unfortunately I can't recall which one it was now). The problem is that as far as I'm aware most of them haven't been implemented beyond some POCs. And I'm pretty sure they haven't all been implemented together (counting all the ideas I'll post about in the next few days).

Friday, November 30, 2007

New Israeli Copyright Law passed

A new Copyright Act was passed by the Knesset (the Israeli Parliament) last Sunday. The law can be read here (Hebrew - there's no official English version). It replaces a 1911 British law which was sorely in need of an update. (Extrapolating software copyright from state licensing of printing presses can lead to some truly weird results.) Blogosphere coverage is sparse (good news is no news), but here and here are a couple of other posts. The short version: a well-balanced law that creates new rights, freedoms and exceptions for the public instead of taking them away.

IANAL and the following is very, very far from being informed legal analysis. Call it informed wild legal speculation. I want to highlight some features of the new law, which appears to be reasonably balanced - probably as much as it can be while remaining in compliance with international treaties.

This is especially important in a time of ever-spreading DMCA-like laws. We had our own super-DMCA proposed last year by a member of the small left-wing party Meretz. Luckily, the government cared enough and was sane enough to pass this law instead.

Here are some of the major provisions of the new law. All translations to English are by me, who once again am not a lawyer. Also, I've omitted a lot of stuff which is either the same as in every other country, or only relevant to things other than software, books, music and movies.

Copyright terms last for life + 70 years, but music only gets 50 years, total (not life + 50, just 50 from the moment of publication). European-style "author's moral rights" are recognized. Non-exclusive copyright licenses don't have to be in writing and can be implied. Ideas, processes and methods, mathematical theories, facts, data and news are not copyrightable (but unique ways of expressing them are). The sample implementation that's supposed to accompany a patent request is not copyrightable.

Illegal copying for monetary profit or on a commercial scale is a criminal offense, punishable by up to 5 years in prison and fines; illegal copying for private use is a civil offense, punishable by up to 100,000 ILS in fines without proof of damages (but only once per "group of violations", not e.g. per each file copied). Crucially, selling illegal copies is a violation of the law, but buying or possessing them is not (although a court might order the illegal copies themselves to be handed over - I'm a bit unclear on this).

The following actions are among those explicitly allowed to any possessor of a legal copy:

  1. Fair use: for study, research, criticism, journalistic reporting, reviews, overviews, excerpting, and teaching in a recognized educational institution. The courts may add more, considering the effect that the use has had on the work's market value and so on.

  2. Incidental inclusion in another work, like snatches of background music in a film (but not the film's actual score).

  3. Photographing or otherwise recording architecture or a work of art that is displayed in a public location.

  4. Copying software (not other digital media) for backup, installation, and maintenance.

  5. Copying and modifying software for the following purposes:

    1. To allow the intended use of the software, through bugfixing and adaptations to other software.

    2. Security analysis and fixes.

    3. Reverse engineering for interoperability.

  6. Temporary copying which is required as part of a technological process to enable any otherwise legal use of the protected work, or to transport the work over a network (while creating temporary copies at intermediate network nodes), is permitted as long as the temporary copies have no significant commercial value of their own.

  7. A work of art can copy or derive from a previous work by the same author, even if the author has signed away the copyright for his previous work.

Also, mandatory licensing for music is possible (I don't know if it actually exists or will exist), and public libraries and state-recognized educational institutions are granted some pretty broad exemptions.

The passing of this law means we're reasonably safe from DMCA-like prohibitions on reverse engineering (and making compatible printer ink cartridges and garage remote controls): the law specifically allows reverse-engineering. In fact we can even modify our copies of proprietary software not to produce proprietary formats in the first place (heh), like modifying MS Word to export ODF by default. We can also fix bugs in said proprietary software when Microsoft refuses to.

The following parts are more speculative since they include some legal interpretations on my part, but time-, space-, and format-shifting (including removing DRM) seem to be allowed. The law is vaguely worded for my taste, but it does allow temporary copies "needed as part of technological processes for legitimate use of the protected works". That seems to include ripping CDs, needed for the legitimate use of playing purchased music on my mp3 player.

Also, shrinkwrap licenses and usage-restricting EULAs are probably irrelevant. Firstly, I don't need a special license for the incidental copying of software to my hard disk or RAM that is necessary to run it. Secondly, I might be allowed to modify the software to run even when I don't click 'I Agree', or to write my own EULA-less installer. The courts will have to rule on this, I guess.

Of course the law is far from perfect. But it's quite good, compared to what most other Western countries have. Apparently there's a further draft law being circulated by the government that would create a DMCA-like safe harbor provision for ISPs. That's the procedure whereby they remove your site because someone tips them off that you're violating copyright, without asking the accuser for proof or giving you a chance to defend yourself. But at least we shouldn't need to fear a DMCA-style anti-circumvention law now.