Mechanical Music Digest  Archives


Scanning & Digitizing Original Literature
By Terry Smythe

The recent discussion about PDF files is opening an opportunity for
many of our participants to become actively involved in making their
own personal collection of original literature freely available in
electronic form, in the spirit of preserving our unique slice of
musical heritage.

Like most collectors, I have acquired over some 40+ years a modest
collection of original literature.  A year ago, I wondered why I should
be the only one to view and appreciate them.  But how?  I knew about PDF
files, but had no idea how they were created, so I climbed up onto a
rather steep learning curve.  My background in government administration
was no help at all, but if I can do it, anybody can do it!

I learned there are two fundamentally different forms of PDF documents:

* From an existing computer file, such as Microsoft Word

* From an existing document where an original computer file, if any,
is no longer available.

From an existing computer file, there are a number of utilities out
there, many of them free, that will create PDF files by "printing" it
to a PDF file.  The print command is simply redirected into this utility
which creates a PDF file out of the document just created in a utility
such as Microsoft Word.  A Google Advanced search on '"Word to PDF"
download' or '"WordPerfect to PDF" download' or '"make PDF" download'
will turn up a blizzard of these utilities.

Some of these PDF creation utilities are free and most do work.  That's
the "up" side.  The "down" side is that many of these freebies attract
parasites that provoke unwanted popup ads, or monitor your keystrokes
to send to some marketeer harvesting this data without your knowledge
or concurrence.  Perhaps others on this discussion group have found
inoffensive PDF utilities known to be safe, and could let us all know.

Existing documents such as original literature from the turn-of-the-
century present both an opportunity and a totally different approach
to PDF files.  For about a year now, I have been scanning documents
from out of my personal collection, such as roll catalogs, promotional
pamphlets, periodicals, etc., and posting them off my web site in MMD
web space kindly made available to me by Robbie Rhodes and Jody Kravitz.
Thus far, 47 documents are now posted that may be freely downloaded.
See  http://members.shaw.ca/paud122/docs.htm

Ideally the documents so posted should be useful for more than just
visual pleasure.  If possible, the ability to search, sort, and edit
is desirable.  Some of these old documents make some of these
objectives possible; most do not.  At the very least, search capability
is essential.  PDF files emerging through a print option may or
may not have a search capability.

Some old documents, such as roll catalogs are sometimes found printed
in columnar format.  These are good candidates for conversion into
spreadsheets, rather than PDF format.  The end product fulfils all the
requirements of searching, sorting, and editing.
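
For anyone comfortable with a little scripting, here is a rough sketch of
how OCR'd columnar text might be turned into a spreadsheet-ready CSV file.
It assumes the OCR output keeps at least two spaces between columns; the
function names, column headings, and sample lines are all invented for
illustration and do not come from any real catalog.

```python
import csv
import io
import re


def catalog_lines_to_rows(lines):
    """Split each non-blank line into fields wherever two or more spaces occur."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the OCR pass
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows, header):
    """Return CSV text (header plus rows) that any spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample lines, not taken from any real catalog.
sample = [
    "1001   Example Waltz      A. Composer",
    "1002   Sample Rag         B. Writer",
]
rows = catalog_lines_to_rows(sample)
csv_text = rows_to_csv(rows, ["Roll No.", "Title", "Composer"])
print(csv_text)
```

Real OCR output is rarely this tidy, so expect to proofread the result
before relying on it for sorting.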

Initially I used an old copy of Adobe Acrobat 6 that came with an older
computer as bundled software.  I was not comfortable with it, as it
would often mishandle pages containing a mix of pics, line drawings and
text.  Then I tried a variety of OCR programs that came bundled with
a couple of flat bed scanners I've acquired over the years.  None of
them handled a mix of these kinds of requirements.

Then Robbie tipped me off about an OCR utility known as "Presto! OCR
Pro."  Presto is now available under the name of ABBYY FineReader;
Version 8 is current.  It does an outstanding job of recognition, even
down to tiny 4 point characters.

But like all OCR utilities, it has difficulty with older documents
containing a mix of fonts, point sizes, bold, italics, variable
layouts, faded ink, and colored or faded paper.  Many of the documents
that I have posted off my site fall into this category.  They are old,
fragile, faded and carry a mix of data formats.  Even ABBYY has trouble.

With ABBYY, I will OCR scan a page and send it to Microsoft Word.  If
the original is clean white with rich black ink, the scan is 100%
accurate.  If the original has tiny characters, faded ink, faded fragile
paper, ABBYY can do the job better than all others I have tried, but
under such conditions accuracy suffers, forcing extensive,
time-consuming editing.  There had to be a better way to share this unique
slice of our musical heritage.

I have found that Adobe's latest version 8 of Acrobat installs its own
"print" option directly into Microsoft Word.  This option makes it
possible to create searchable, moderately editable PDF documents from
an existing computer file.  With Acrobat 8 installed, Word now has
within it a "make PDF" clickable icon that sends the file into Acrobat
8, which then creates the most desirable form of PDF.

My early attempts at scanning some of my older original documents
directly into Acrobat 6 did not prove very useful.  With more current
scans, I decide up front whether to use ABBYY FineReader
into Word, then into PDF; or Acrobat 8 direct to PDF.  It's a judgment
call.

A recent example is a 300 page 1922 QRS Dealer catalog that is
currently posted off my site.  Although complete, it was in poor
physical condition with a very poor image to work with: tiny broken
characters, faded ink, faded paper, non-standard layout, etc., ad
nauseam.  I tried ABBYY, but an enormous amount of editing would have
been required.  So, I elected to use Acrobat.  Judge for yourself the
usefulness of this document.

Using Acrobat 8, I have found that a minimum 300 dpi is essential for
text that may be a target for searching.  As Acrobat 8 works its way
through a scanned page, it automatically runs its own internal OCR to
create its own internal dictionary of sorts.  It's not perfect, but
does a credible job, particularly if the original is in good visual
condition.
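
To give a feel for why resolution matters for small type, here is a small
sketch of the arithmetic.  It relies only on the standard definition of a
point as 1/72 inch; the letter-size page and the 4 point example are
assumptions chosen for illustration, and the function names are my own.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def char_height_pixels(point_size, dpi):
    """Approximate height in pixels of type of the given point size (1 pt = 1/72 inch)."""
    return point_size / 72 * dpi


for dpi in (150, 300):
    width, height = scan_pixels(8.5, 11, dpi)
    tiny = char_height_pixels(4, dpi)
    print(f"{dpi} dpi: {width} x {height} pixels, 4 pt type is about {tiny:.1f} pixels tall")
```

At 150 dpi, 4 point type is only about 8 pixels tall, which leaves any
OCR engine very little to work with; at 300 dpi it is roughly 17 pixels.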

Assuming a typical 35 page document, I will ordinarily scan the cover
at 150 dpi, then scan all other pages at 300 dpi.  When the document is
complete, I then send a copy of it to another partition or directory as
a backup, then execute Acrobat's "optimization" routine, which goes
through the whole of the document and compresses what and where it can.
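
The backup step can also be scripted.  What follows is only a sketch of
one way to do it, not part of Acrobat: the function name and the paths in
the usage note are hypothetical, and it refuses to overwrite an existing
backup so an earlier copy is never lost by accident.

```python
import shutil
from pathlib import Path


def backup_before_optimizing(pdf_path, backup_dir):
    """Copy pdf_path into backup_dir and return the path of the copy."""
    source = Path(pdf_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the backup folder if needed
    target = target_dir / source.name
    if target.exists():
        raise FileExistsError(f"Backup already exists: {target}")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if target.stat().st_size != source.stat().st_size:
        raise OSError("Backup size does not match the original")
    return target
```

A call might look like backup_before_optimizing("catalog.pdf", "D:/backups"),
with both names replaced by your own file and folder.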

As an example, the 1922 QRS catalog emerged as a 132 megabyte file, but
after optimization, emerged at 58 megs, a 56% reduction (to 44% of its
original size).  Its internal search dictionary does not compress, so
search capability does not degrade.
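
The arithmetic behind these size figures can be checked with a short
sketch: 58 MB is about 44% of 132 MB, which amounts to a reduction of
about 56%.  The function name is my own; the two sizes are the ones
quoted above.

```python
def size_reduction(original_mb, optimized_mb):
    """Return (percent of original size remaining, percent reduction)."""
    remaining = optimized_mb / original_mb
    return remaining * 100, (1 - remaining) * 100


remaining_pct, reduction_pct = size_reduction(132, 58)
print(f"{remaining_pct:.0f}% of original size, a {reduction_pct:.0f}% reduction")
```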

If the original document is in tolerably good visual condition, the PDF
file emerging from Acrobat allows a form of editing.  Text within may
be "selected, copied, and pasted" into your favorite editor.  The OCR
translation may not be perfect, but the editable result is tolerable.
This makes possible selective extracts for insertion into other documents
under construction.

Over the past couple of years, I have used a couple of different flat bed
scanners.  All have given me acceptable results.  All required
considerable handling of fragile original documents.  The nature of
typical flat bed scanners is such that extensive handling is
unavoidable.

Then recently Dave Kerr tipped me off about a marvelous flat bed
scanner that, for whatever reason, failed in the marketplace and was
terminated a few years ago.  It is an HP 4670 "See Thru" flat bed
scanner.  See http://www.fisheyesystems.com/products/details.asp?cat1=peripherals&id=186 or
http://www1.epinions.com/pr-Hewlett_Packard_HP_ScanJet_4670_Flatbed_Scanner/display_~reviews/sec_~opinion_list/pp_~2

The scanner itself is only 1/2" thick, comes only in 8 1/2" x 11" size.
It looks just like an open window.  It sits in a cradle at about a 15
degree angle.  For single sheets, all normally fall in correct
orientation every time by simple gravity.

Books and pamphlets are where this scanner really shines.  The
scanner can be removed from its cradle and placed on an open pamphlet
or book on desktop and the scan area is instantly visible.  The target
document has absolute minimum handling other than page turning.

For thick hard bound books, the trick is to support both sides of an
open book with a variety of shims such that both sides of the book are
reasonably level to one another.  I use a small stack of slim magazines.
The "valley" between pages is not a problem if a piece of cardboard is
slipped under the top couple of pages, thus holding the topmost pages
flat against the glass.

This is an amazing little scanner that I highly recommend.  New ones
can still be found on eBay, but used ones on eBay are quite reasonable
and turn up routinely.  Check it out.

My apologies for such a long document, but I hope that this will
inspire and provoke others to have another look at their collection of
original literature and enlarge upon what I have started.  I have found
it is not difficult.  This is an opportunity to share with others a unique
slice of our musical heritage.

Regards,
Terry Smythe
Winnipeg, Manitoba, Canada
http://members.shaw.ca/smythe/rebirth.htm


(Message sent Mon 19 Feb 2007, 15:29:32 GMT, from time zone GMT-0600.)

Key Words in Subject:  Digitizing, Literature, Original, Scanning


Unless otherwise noted, all opinions are those of the individual authors and may not represent those of the editors. Compilation copyright 1995-2024 by Jody Kravitz.
