Re: [vox-tech] OCR notes
On 2007-04-11, Dylan Beaudette wrote:
> Hi everyone,
>
> I am about to embark on an exciting adventure into the land of optical
> character recognition, processing nearly 1,000 documents and extracting
> numbers from them. I am interested in any anecdotal wisdom regarding:
>
> 1. efficient scanning parameters:
> DPI
> color / BW / grayscale
B&W, as high DPI as feasible.
> 2. pre-processing steps one might do with imagemagick
Clipping off borders is recommended.
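As a rough illustration of border clipping, here is a minimal Python sketch. It assumes the scan has already been loaded as a list of equal-length pixel rows; the function name and the margin value are hypothetical, not part of any tool mentioned in this thread. With ImageMagick the same effect is usually had from its -shave or -crop options.

```python
def clip_border(pixels, margin):
    """Return a copy of the image with `margin` pixels removed from every edge.

    `pixels` is a list of equal-length rows of gray values (assumed layout).
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    if 2 * margin >= height or 2 * margin >= width:
        raise ValueError("margin would remove the whole image")
    # Drop `margin` rows at top and bottom, then `margin` columns at each side.
    return [row[margin:width - margin] for row in pixels[margin:height - margin]]


# A 6x6 page with a one-pixel black scanner border around a white interior.
page = [[0] * 6] + [[0, 255, 255, 255, 255, 0] for _ in range(4)] + [[0] * 6]
clipped = clip_border(page, 1)
```

A fixed margin only works if every page sits in the same place on the scanner bed; otherwise the margin has to be picked per batch.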
> 3. any filtering that one might do to get ready for the OCR
Make sure there are no handwritten notes, post-it pieces, or other
miscellaneous cruft on the documents before scanning them. If the paper
is colored or there are ghost images (such as the back-side printing
showing through thin paper), scan in grayscale and then carefully reduce
to B&W with an appropriate hand-picked threshold. I think I used
pnmremap to do that the last time that need came up for me.
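The grayscale-to-B&W reduction can be sketched in a few lines of Python. This is only an illustration of the idea, not the netpbm command I used: it assumes 8-bit gray values (0 = black, 255 = white) held in a list of rows, and the function name and sample values are made up.

```python
def to_black_and_white(gray, threshold):
    """Map each gray value to 0 (black) if below `threshold`, else 255 (white)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# Assumed sample values: paper around 240, show-through ghosting around 200,
# real print around 40. A threshold between 40 and 200 keeps only the print.
scan = [
    [240, 200, 40],
    [40, 240, 200],
]
bw = to_black_and_white(scan, 128)
```

The threshold is the hand-picked part: too high and the ghost image survives, too low and thin strokes in the real text drop out.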
> I plan to use Google's new OCR project, ocropus, which currently uses
> the 'tesseract' engine. Naive attempts to OCR these documents are resulting in
> marginal accuracy, so any help is appreciated. Vertical and horizontal lines
> on the original documents are confusing the OCR, so removing them might be a
> start. I have thought about extracting each 'cell' of data with imagemagick,
> and then running the resulting mini-images through the OCR... that might be a
> last resort though...
Neat. I've never tried that. The only OCR engine I've successfully used
is gocr, which was pretty decent and worked out of the box with minimal
tweaking. I tried Clara but it seemed unstable and I gave up before I
could figure out how to make it work.
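On the ruled lines in the tables: one untested idea, sketched here in Python, is to blank any row or column that is mostly black before handing the page to the OCR engine. It assumes a deskewed B&W image stored as a list of rows (0 = black, 255 = white); the function name and the 0.8 cutoff are hypothetical and would need tuning on real scans.

```python
def remove_ruled_lines(bw, min_fraction=0.8):
    """Whiten every row and column whose share of black pixels is at least
    `min_fraction`. Returns a new image; the input is left unchanged."""
    height = len(bw)
    width = len(bw[0]) if bw else 0
    if height == 0 or width == 0:
        return [list(row) for row in bw]
    # Rows that are nearly solid black are treated as horizontal rules.
    line_rows = {y for y, row in enumerate(bw)
                 if row.count(0) >= min_fraction * width}
    # Columns that are nearly solid black are treated as vertical rules.
    line_cols = {x for x in range(width)
                 if sum(1 for row in bw if row[x] == 0) >= min_fraction * height}
    return [
        [255 if (y in line_rows or x in line_cols) else value
         for x, value in enumerate(row)]
        for y, row in enumerate(bw)
    ]


# A 5x5 grid with one horizontal rule, one vertical rule, and one ink pixel.
table = [[255] * 5 for _ in range(5)]
for i in range(5):
    table[2][i] = 0
    table[i][2] = 0
table[0][0] = 0
cleaned = remove_ruled_lines(table)
```

This also erases any character pixels that happen to sit on a rule, and it will miss lines on a skewed scan, so cutting out each cell may still turn out to be the more reliable route.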
--
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech