Friday, August 28, 2009

Google describes the book conversion process

This is an addendum to yesterday's article on how Kindle owners can read any of Google's million-plus public domain books now offered in ePub format, thanks to a simple 3-minute conversion process. That process also gives you control over the layout if you want it (such as adding a hyperlinked Table of Contents if the book doesn't already have one).

Here is some info from Google's August 26 statement, explaining what they've managed to do.

" Try doing a search for [Hamlet] on Google Books.  The first few results you'll get are "Full View" books — which means you can read the full text.  And, because the book is in the public domain, you can also download a copy of Hamlet in PDF form.

  Starting today, you'll be able to download these and over one million public domain books from Google Books in an additional format.  We're excited to now offer downloads in EPUB format, a free and open industry standard for electronic books. "

Following that, they explain further why they're offering ePub versions of the books in addition to the PDF versions released earlier.
" By adding support for EPUB downloads, we're hoping to make these books more accessible by helping people around the world to find and read them in more places.  More people are turning to new reading devices to access digital books, and many such phones, netbooks, and e-ink readers have smaller screens that don't readily render image-based PDF versions of the books we've scanned.

  EPUB is a lightweight text-based digital book format that allows the text to automatically conform (or "reflow") to these smaller screens.  And because EPUB is a free, open standard supported by a growing ecosystem of digital reading devices, works you download from Google Books as EPUBs won't be tied to or locked into a particular device.  We'll also continue to make available these books in the popular PDF format so you can see images of the pages just as they appear in the printed book. "
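
To make the "lightweight text-based" point concrete: an EPUB file is essentially a ZIP archive holding XHTML text files plus some metadata. Here is a minimal Python sketch of that idea. It assumes nothing about Google's own tooling. The file name `OEBPS/chapter1.xhtml` and the helper names are made up for illustration, and the tiny archive it builds leaves out the `META-INF/container.xml` and package files that a real, valid EPUB needs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical one-chapter book content, invented for this sketch.
CHAPTER_XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>To be, or not to be</p></body></html>"
)


def build_tiny_epub() -> bytes:
    """Build an EPUB-style ZIP archive in memory (not a complete, valid book)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # EPUB containers start with an uncompressed 'mimetype' entry.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("OEBPS/chapter1.xhtml", CHAPTER_XHTML)
    return buf.getvalue()


def read_mimetype(epub_bytes: bytes) -> str:
    """Return the content of the archive's 'mimetype' entry."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.read("mimetype").decode("ascii")


def chapter_text(epub_bytes: bytes) -> dict:
    """Map each (X)HTML entry in the archive to its plain text."""
    texts = {}
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith((".xhtml", ".html")):
                root = ET.fromstring(zf.read(name))
                # Join all text nodes and normalise the whitespace.
                texts[name] = " ".join("".join(root.itertext()).split())
    return texts


if __name__ == "__main__":
    book = build_tiny_epub()
    print(read_mimetype(book))
    print(chapter_text(book))
```

Because the text lives in the file as ordinary markup rather than as page pictures, a reading device is free to re-wrap it to whatever screen width it has. A real downloaded EPUB may need a more forgiving HTML parser than `xml.etree.ElementTree`, which only accepts well-formed XML.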

That is followed by more detail on how this was accomplished and just what was involved.  They caution us that the books go through automated scanning that is sophisticated but nevertheless sometimes can't interpret words that aren't clear on the original page.
" The process begins with a book that has been preserved by one of our library partners from around the world.  Google borrows the book ... Before returning the book in undamaged form, we take photographs of the pages.  Those images are then stitched together and processed in order to create a digital version of the classic book.

  This includes the difficult task of performing Optical Character Recognition on the page image in order to extract a text layer we can transform into HTML, or other text-based file formats like EPUB (if you're interested, you can read more about this process here). "

I'll add a bit from that linked page here about readying text for the smaller screens of mobile phones and e-readers.
" [We] extract the text from the page images so it can flow on your mobile browser just like any other web page.  This extraction process is known as Optical Character Recognition (or OCR for short).  The following example demonstrates the difference between page images and the extracted text:

=> "Because I made a blunder, my dear Watson— which is, I am afraid, a more common occurrence than anyone would think who only knew me through your memoirs...


The extraction of text from page images is a difficult engineering task. Smudges on the physical books' pages, fancy fonts, old fonts, torn pages, etc. can all lead to errors in the extracted text.  The example below shows the page image from the original manuscript for Alice's Adventures Under Ground.  In this extreme case, the extracted text is riddled with errors:
=> "lV~e.il!" .ÍAoHyU- AUte. U brstty/affc. su.it a. f o.tl as ~tk¿* , I s&O.IL .éfiiíjz tiotkun-) of-ttmlr1¿*y ¿i^n. sta¿rs ! Jfo» ura.ve ...
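
The gap between those two extracted samples can be put into rough numbers. The sketch below is my own crude heuristic, not anything Google describes: it counts what share of the tokens look like ordinary words. The function name `wordlike_ratio` and the punctuation list are assumptions for illustration, and the two sample strings are excerpts of the extracted text quoted above.

```python
import re

# Excerpts of the two extracted-text samples quoted above.
CLEAN = ("Because I made a blunder, my dear Watson\u2014 which is, "
         "I am afraid, a more common occurrence")
GARBLED = ('lV~e.il!" .ÍAoHyU- AUte. U brstty/affc. su.it a. f o.tl as '
           '~tk¿* , I s&O.IL .éfiiíjz tiotkun-) of-ttmlr1¿*y ¿i^n. '
           'sta¿rs ! Jfo» ura.ve')

# Letters only, with an optional apostrophe or hyphen between letter runs.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

# Punctuation trimmed from both ends of a token before testing it.
EDGE_PUNCT = ".,;:!?\"'()[]\u2014"


def wordlike_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for tok in tokens if WORD.fullmatch(tok.strip(EDGE_PUNCT)))
    return good / len(tokens)


if __name__ == "__main__":
    print("clean sample:  ", round(wordlike_ratio(CLEAN), 2))
    print("garbled sample:", round(wordlike_ratio(GARBLED), 2))
```

The Sherlock Holmes excerpt scores far higher than the handwritten Alice manuscript. A check like this only flags text that looks wrong. It can't tell a real word from a misread one, and it would unfairly penalise non-English books.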


They close with this:
" Imperfect OCR is only the first challenge in the ultimate goal of moving from collections of page images to extracted-text based books.  Our computer algorithms also have to automatically determine the structure of the book (what are the headers and footers, where images are placed, whether text is verse or prose, and so forth).  Getting this right allows us to render the book in a way that follows the format of the original book.

The technical challenges are daunting, but we'll continue to make enhancements to our OCR and book structure extraction technologies.  With this launch, we believe that we've taken an important step toward more universal access to books.

...If you do bump into some rough patches where the text seems, well, weird, you can just tap on the text to see the original page image for that section of text. "

That ability to click on the text (when reading on a computer monitor) or tap it (on a smartphone touch screen) and then see the original page in its image-scan form is amazing.  The Kindle doesn't do this.
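
Going back to the book-structure step in Google's statement (working out which lines are headers and footers), here is a toy Python sketch of one simple way such a check could work: treat a line as a running header or footer if it shows up as the first or last line on most pages. This is an assumption on my part, not Google's method. The function names, the 60% threshold, and the sample pages are all invented for illustration.

```python
from collections import Counter

# Invented sample data: each page is a list of text lines.
PAGES = [
    ["SAMPLE BOOK TITLE", "First page body text.", "1"],
    ["SAMPLE BOOK TITLE", "Second page body text.", "2"],
    ["SAMPLE BOOK TITLE", "Third page body text.", "3"],
]


def find_running_lines(pages: list[list[str]], min_share: float = 0.6) -> set[str]:
    """Return top/bottom lines that recur on at least min_share of the pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # Use a set so a one-line page is only counted once.
        for edge in {lines[0].strip(), lines[-1].strip()}:
            if edge:
                counts[edge] += 1
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_running_lines(pages: list[list[str]], running: set[str]) -> list[list[str]]:
    """Drop a page's first and/or last line when it is a known running line."""
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and body[0].strip() in running:
            body = body[1:]
        if body and body[-1].strip() in running:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


if __name__ == "__main__":
    running = find_running_lines(PAGES)
    print(running)
    print(strip_running_lines(PAGES, running))
```

Note that the page numbers survive, since each one appears only once. Catching those, along with verse versus prose and image placement, needs more than simple counting, which gives some idea of why Google calls the challenges daunting.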



Also see:
  A Million Free Google Books in ePub - for the Kindle
  Read foreign-language Google-books in English online