Monday, Sep 27, 2004

Pratap Ravindran

Archiving all that the Internet can hold is a gargantuan task. For something as precious as information, the effort is worth it. Here is a look at efforts in that direction.

CONSIDER this: According to an estimate made two years ago, a staggering 93 per cent of all new information is "born digital" - that is, the information originates in digital form.

This may not mean anything to those who believe that the Internet is basically about e-commerce, chatting, e-mail and porn (not necessarily in that order) — but the fact is that the Internet, initially conceived as a communications network, has grown into a forum for electronic publishing and a medium for the dissemination of diverse documents relating to the arts and the sciences and just about every discipline in between.

Inevitably, the archiving of the Internet has now come to be perceived as an imperative, although such an initiative is fraught with difficulties in view of the sheer sweep and diversity of the multimedia content (images, audio, video and so on) and the absence of an international standard and a common protocol, among other factors.

While various Internet archival projects have been initiated, the most talked about is The Internet Archive (TIA) established in 1996 as a non-profit organisation by Brewster Kahle, the founder and CEO of Alexa Internet.

Kahle says: "Universal access to all human knowledge is within our grasp. It could be one of the greatest achievements of all time." And a lot of people are listening...

Kahle breaks down TIA this way: The Library of Congress, which has the largest collection of books, has around 24 million books. One book takes up approximately one megabyte of storage. And so, to digitally store all the books in the Library of Congress would take a whopping 24 terabytes of storage space. If you want to store images of the pages, it would take a bit more, but that's still within the reach of TIA. You can store 24 terabytes of data in a stack of Linux boxes that takes up about a square meter of space and costs approximately $60,000. All doable numbers...
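The arithmetic behind those figures can be checked with a minimal sketch. The three inputs are the numbers quoted above; the decimal unit (one terabyte taken as one million megabytes) and the function names are assumptions made for illustration, and the per-terabyte and per-book costs are simply derived from the quoted totals rather than stated by Kahle.

```python
# Back-of-envelope check of the storage figures quoted in the article.
# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.

BOOKS = 24_000_000         # books in the Library of Congress (quoted)
MB_PER_BOOK = 1            # approximate size of one book as text (quoted)
CLUSTER_COST_USD = 60_000  # cost of the stack of Linux boxes (quoted)
MB_PER_TB = 1_000_000      # assumed decimal conversion factor


def library_storage_tb(books=BOOKS, mb_per_book=MB_PER_BOOK):
    """Total storage needed for the text of all books, in terabytes."""
    return books * mb_per_book / MB_PER_TB


def cost_per_tb(total_cost_usd, storage_tb):
    """Hardware cost per terabyte, derived from the quoted totals."""
    return total_cost_usd / storage_tb


def cost_per_book(total_cost_usd, books):
    """Hardware cost attributable to storing a single book."""
    return total_cost_usd / books


if __name__ == "__main__":
    total_tb = library_storage_tb()
    print(f"Storage needed: {total_tb:.0f} TB")
    print(f"Cost per TB:    ${cost_per_tb(CLUSTER_COST_USD, total_tb):,.0f}")
    print(f"Cost per book:  ${cost_per_book(CLUSTER_COST_USD, BOOKS):.4f}")
```

On these assumptions the storage hardware works out to a fraction of a cent per book, which is why the scanning cost discussed next dominates.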

There's, of course, the hassle of converting all the books to a digitised format.

They have to be scanned. There are projects already doing this and the Internet Archive is involved with them: the Million Book Project and Project Gutenberg, for instance. Scanning can be done in various ways. One is to do it by hand; that is, a person flips through the pages and "takes a picture" of the pages. This, according to Kahle, has been going on in India, where it costs about ten dollars a book. In western countries, where the cost of labour is higher, he says "they have set up so-called in-library scanning systems, where the local people do the scanning, and automated systems, which consist of robots and custom-made arrangements."

"The price is still a bit higher per book, but they are making progress and collecting experiences from India."

What about making the books available? Reading books on computer screens isn't all that great and so TIA has a project called Bookmobile. Bookmobiles are trucks that carry a satellite dish on the top and printing facilities inside. One Bookmobile costs around $15,000, including the vehicle. A Bookmobile downloads the book from the archive through a satellite connection, then prints and binds it. The whole thing costs about $1 per book (assuming the book is black and white and around 100 pages). The local people can do the whole thing with just a week's training. Also, as lending a book from the Harvard library costs, all in all, around $2, it might actually be cheaper to print and bind a book and to give it away.
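The cost comparison can be sketched in a few lines. The three dollar figures come from the paragraph above; the break-even count is an illustrative extrapolation that is not claimed in the article, and it assumes each printed book replaces exactly one library loan and ignores fuel, staff and satellite costs. The function names are hypothetical.

```python
import math

# Figures quoted in the article.
PRINT_COST_USD = 1.0          # print and bind one ~100-page black-and-white book
LOAN_COST_USD = 2.0           # all-in cost of one loan from the Harvard library
BOOKMOBILE_COST_USD = 15_000  # one Bookmobile, vehicle included


def saving_per_book(loan_cost=LOAN_COST_USD, print_cost=PRINT_COST_USD):
    """Saving when a printed give-away replaces a single library loan."""
    return loan_cost - print_cost


def books_to_recoup_vehicle(vehicle_cost=BOOKMOBILE_COST_USD,
                            loan_cost=LOAN_COST_USD,
                            print_cost=PRINT_COST_USD):
    """Books that must be printed before the savings cover the vehicle.

    Illustrative assumption: one printed book replaces one loan.
    """
    saving = saving_per_book(loan_cost, print_cost)
    if saving <= 0:
        raise ValueError("printing is not cheaper than lending")
    return math.ceil(vehicle_cost / saving)


if __name__ == "__main__":
    print(f"Saving per book: ${saving_per_book():.2f}")
    print(f"Books to recoup one Bookmobile: {books_to_recoup_vehicle():,}")
```

Under these simplifying assumptions, a Bookmobile would pay for itself after 15,000 printed books.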

Can you and I scan a book and make it available to others? Imaging, i.e., the actual scanning of the books, is allowed if you own a copy of the book. But making it available to others is a bit more complicated. First, there are the in-print books, which are strictly regulated. But there is also a huge body of copyrighted books that are permanently out of print; they are called the "orphans." Under current legislation, you just can't make them available, even though the books are never going to be in print again. To clarify the copyright issue with orphans, TIA has filed a lawsuit, Kahle vs. Ashcroft, better known as "Free the Orphans."

As for the Web, TIA has been archiving it since 1996, taking a full snapshot of the publicly available Web, all the text and all the pictures, every two months. A snapshot takes about 20 terabytes of storage space compressed, or around 50-60 terabytes uncompressed.

All that data is available through the Wayback Machine. With the Wayback Machine, you can, or at least try to, "surf the Web as it was."

It is visited by 150,000 people a day and it gets 8 million hits a day. The database is in the order of 300-400 terabytes compressed. It might be one of the largest databases around, but it doesn't run on Oracle. Instead, it runs on a cluster of Linux machines and the data is stored as flat files on the file system because that is the only way it can be made to scale.

Only about 2 million to 3 million audio recordings — mostly music — have ever been published for public consumption. The Internet Archive has begun to store digitised recordings of concerts as well and has about 15,000 shows in its database to date.

There are between 100,000 and 200,000 theatrical movies in existence, half of them from India, and TV broadcasts add up to about 20 terabytes a month.

On audio and music, Kahle figures that almost all published audio works are music. They consist of 2-3 million titles (78s, LPs, CDs etc), a doable number in terms of storage. But, he adds, published music is currently a highly litigated and restricted area, as a result of which TIA cannot give access to it. Instead, TIA has started to work on other areas of music. It has been working with bootleg recorders. "There's an ongoing tradition, started by the Grateful Dead, amongst jam bands that you are allowed to tape and trade recordings of live performances as long as you don't make a profit out of it."

The trading has moved to the Internet, but the traders have had problems with bandwidth. One concert typically takes up about a gigabyte when it's compressed without any data loss, and downloading or uploading can prove pretty heavy. So the TIA people are offering live recorders "unlimited storage, unlimited bandwidth, forever, for free." They've archived over 12,000 concerts, mostly from the "guys with guitars genre" and lately also from the "guys with mandolins genre." The only requirement is that the recordings are under a Creative Commons licence.

What about software? According to Kahle, it's estimated that there are roughly 50,000 packaged and released software titles around. "Technically, archiving the software is doable. We can rip the stuff, we can run them through emulators. You can replay all that amazing stuff from the early days, software for early Mac, Sol, Commodore 64, etc."

However, TIA has run into a major obstacle here: the Digital Millennium Copyright Act (DMCA), which, as Kahle puts it, is an "amazing piece of Soviet-era legislation where 'everything-is-illegal-unless-we-give-permission'," and which "is an antithesis of what the United States used to stand for."

"DMCA implies that ripping (i.e., making a digital copy of) the software is illegal. You are not allowed to break the copy protection, which a lot of the early software had, which effectively means that you can't archive it. The librarians went to the copyright office with a lot of preparation, briefs etc. They waved a floppy in their hands and said, paraphrased: 'Here's Lotus 1-2-3 on a floppy, rotting away. We are ready to make digital copies of this stuff and we want to be allowed to do this.'

The librarians won. They got a copyright exemption for two years. What The Internet Archive now needs is donations of the physical objects, for example, floppies, and digital copies of the software. People need to act fast with this. The exemption is for two years and the window of opportunity may shut. When the legal hearings come around again, we can show that the world didn't crater as the software industry thought. The exemption says that you are allowed to rip old software if you are 'an agent for' ... Only ripping is allowed; you can't publish the stuff on the Net."

The Web grows by about 20 terabytes of compressed data a month (one terabyte equals one trillion bytes). Although legal issues related to the storing and viewing of all this information persist, it is doable. And TIA is doing it.

Picture by Bijoy Ghosh

