
Sunday, October 3, 2010

Internet Archive: treasure trove of free books, articles + movies, concerts, recordings - Update

On January 24, I included most of the info just below about the Internet Archive's quite incredible, free text area.

The Internet Archive Text area has over 2 million free books and documents - Kindle formats included -- via ".mobi" or ".prc" files or even specifically labeled "Kindle."

Remember that the Kindle models (UK: K3) also read .txt and PDF files directly, although .prc or .mobi files may be more readable in font size than PDFs.  Many of the PDF versions are also much larger files, since they contain images of pages.  Those images are accurate renderings of the original pages, but they can't be searched as text or annotated unless the pages were put through OCR (optical character recognition) processing rather than just left as images.

There are also areas of files with somewhat less obvious value, but the main sub-collections include American Libraries, Canadian Libraries, Universal Libraries (Carnegie Mellon, governments of India, China, Egypt), and Project Gutenberg (another access point) -- and there are recent contributions from The Library of Congress, UCLA Scanning Center's special collections, etc.

  Additional sub-collections of books, articles, and other texts as well as a listing of All Collections (usually organized by topic) are accessible via this linked page, most recently added first.

I added this resource in January to the ongoing free and low-cost books posting.  Note that there is also free live music and audio linked on the home page.

The Washington Post did a story about the Internet Archive, which is based in San Francisco.  Rob Pegoraro took what he called a 'field trip' to the organization's new headquarters, which was once a Christian Science church.  He points out that while the site has long been known for its "WayBack Machine," which displays the initial days and ongoing development of websites, it's far more than that fun feature.

The Archive founder, Brewster Kahle, gave him a good tour of the place and the work they do there.  They recently moved from the Presidio, near the Marina, to a building in the Richmond neighborhood that looks remarkably like the logo they developed 16 years before (well, they did get to choose the building) and dates back to 1923, "the last year of the public domain"; most works created since then remain under copyright.

The photo at the right, just above, was taken by the richmondsfblog, which notes
' [The Internet Archive] is a non-profit that was founded in 1996 for the purpose of creating an Internet library.  Their goal is to offer permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.
...
The Internet Archive is also well known for its Open Library project, which seeks to create “one web page for every book ever published”. In other words, free and easy access to all published books online.'

They'll be hosting the "Books in Browsers" Conference there, October 21-22, co-sponsored by O'Reilly Media, with a demo and gathering at the SF headquarters on Oct. 21 from 6:30 - ~9:00 pm, an evening event that will be open to the public (unlike the conference itself).

As you'll see in the photos by ZYZZYVA, the main hall still looks like a church, which can indicate a reverence for the idea of a universal library, but they do plan to make it more of a library setting eventually.  Staff offices are on the lower level.  Other photos on that page are of a book-scanning device, the unobscured building, the main hall, and a room for gatherings.  Pegoraro describes the offices:
' Next door, the old Christian Science reading room has been turned into a scanning center, as part of the archive's mission to preserve print as well as pixels.  On each side, staffers were operating specialized scanners -- operated by pedals, like old sewing machines -- that photograph two pages of a book at a time.  In the center, other employees were running computer-driven microfilm scanners. "That looks like the 1900 census," Kahle said as he peered over one staffer's shoulder at a screenful of handwritten documents.

Poring over page after page in a room made hot by that accumulation of computing machinery seemed like it could get a tad repetitive.  I asked Kahle if there was a risk of burnout. Yes, he said, pointing to himself as an example of the wrong sort of person for that work: "I would get fired!"  But some employees, he said, have been there three years.

One of the archive's newer projects is a site called Open Library, which both catalogues books and provides access to electronic copies of them.  Anyone can download public-domain works, while visually impaired users can access text-to-speech versions of works through a program set up by the Library of Congress.  The archive is also working to set up a system for direct downloads of e-book loans. '
It turns out that the founder, Kahle, doesn't own an e-book reader.  I like, though, that Kahle objects, Pegoraro said, "to the way some libraries have begun to rely on Google's collections and 'de-accessioning' paper copies -- that is, trashing them"... and would "rather see libraries keep their original source material while also using the Internet to make that content available to more people. 'Let's not lose it all,' he said."

Last year, about 40 percent of the organization's income was contributed and 60 percent came from indexing and scanning services it provides to other libraries.

On preferred file formats for long-term storage, Kahle said that the archive uses FLAC (Free Lossless Audio Codec) for music, had adopted H.264 for video storage after trying five other formats, used JPEG for photos and employed a related format, JPEG 2000, for text-heavy images.  But he also said that for personal storage, PDF or nearly universally supported commercial formats -- even Microsoft Office -- would be fine, too.

UPDATE - Commenter Stbalbach points out that the Kindle does not support the JPEG 2000 format.  Kahle does say that they use JPEG for photos and the related JPEG 2000 "for text-heavy images."  Stbalbach advises: "To read PDF's from Internet Archive, you need to download the .djvu version and convert it to PDF.  Instructions can be found here:
http://www.archive.org/post/277920/pdfs-on-amazon-kindle"

There are very forthright and interesting responses by Kahle to questions in the Comments area, but, as this is the Internet, rudeness comes pretty quickly.  Comments were closed after a couple of days.  Here are some replies by Kahle:
' sniz15 asks: lossy formats for images. Any chance they mentioned why?

We used to store uncompressed TIFF (50MB/image) and then RAW (17MB/image), but when you are scanning 1000 books/day it gets big, and also, we (folks from Harvard, Library of Congress, UC and Internet Archive) did studies to find out what kind of degradation we get if we use jpeg-2000 at about 1MB per image and we found very very little.  So for mass scanning we are using jpeg-2000.  A big problem is that it is not supported in browsers.

Try zooming in on one of our books-- I hope you will find it pretty good: http://www.archive.org/stream/lifeofabrahamli2463tarb#page/n7/mode/2up

+++
dltj asks: what about the PDF/A?

Yes, PDF/A is better for long term access.  For our book scanning we were disappointed that it did not support an image layer as jpeg-2000 (or at least not originally) and we found that to be a dramatic enough improvement in quality per megabyte over a jpeg layer for books that we chose normal PDF.  We also don't think of this as the preservation format for these books.

What most end-users are doing is scanning documents with their scanner or taking office documents and writing them to disk.  For these purposes, PDF, we have found, works quite well.  PDF in this case is a container format that keeps metadata, images, and text together.  sometimes even has page numbers, chapter starts etc.

When users upload these to the Internet Archive for long term preservation, we use open source tools to process these files and adobe has not gone after those developers, so we are happy with the format.

+++
Hemisphire asks: Will older music sets in SHN be transferred to FLAC?

We are not migrating from user uploads from SHN to FLAC yet. They are pretty big, and SHN is still commonly supported. If people are finding this a problem, please let us know on the archive.org forums.

Posted by: brewster2 | May 19 '
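
To get a feel for why those per-image sizes matter at 1000 books/day, here is a rough back-of-the-envelope sketch in Python.  The 50MB, 17MB and 1MB figures and the 1000 books/day rate come from Kahle's reply above; the 300 pages per book is purely my own assumption for illustration (he gives no page count), and the names used are made up for this example.

```python
# Per-image sizes in MB, as quoted in Kahle's reply.
SIZES_MB = {"TIFF (uncompressed)": 50, "RAW": 17, "JPEG 2000": 1}

BOOKS_PER_DAY = 1000   # quoted in Kahle's reply
PAGES_PER_BOOK = 300   # ASSUMPTION for illustration only, not from the source


def daily_storage_tb(mb_per_image, books_per_day=BOOKS_PER_DAY,
                     pages_per_book=PAGES_PER_BOOK):
    """Estimate storage per day in terabytes (taking 1 TB = 1,000,000 MB)."""
    return mb_per_image * books_per_day * pages_per_book / 1_000_000


def daily_estimates():
    """Return a {format name: TB per day} mapping for the quoted sizes."""
    return {name: daily_storage_tb(mb) for name, mb in SIZES_MB.items()}


if __name__ == "__main__":
    for name, tb in daily_estimates().items():
        print(f"{name}: about {tb:.1f} TB/day")
```

Under that page-count assumption, uncompressed TIFF works out to roughly 15 TB a day versus about 0.3 TB for JPEG 2000.  A different page count changes the totals but not the 50-to-1 ratio between the two.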

By the way, the RichmondSFBlog added that:
"One of the more enjoyable archives on the site is old time radio programs. Click below to check out an episode of 1953's The Six Shooter, which brought James Stewart to the NBC microphone for a series of folksy Western adventures. This episode is called “The Coward”."
  If intrigued, go to the bottom of their page to play it.



2 comments:

  1. > Brewster: So for mass scanning we are using jpeg-2000

    Kindle does not support jpeg-2000!

    To read PDF's from Internet Archive, you need to download the .djvu version and convert it to PDF. Instructions can be found here:
    http://www.archive.org/post/277920/pdfs-on-amazon-kindle

  2. Stephen,
    Thanks for the heads-up on this. I incorporated it as an Update to the blog entry.

    I love Irfanview but do also have the pro version of Adobe Acrobat. I think most people don't have the latter, so the recommendations of the free Irfanview and the Bullzip.com one are very helpful.

