Linux Users' Group of Davis (LUGOD)
 
Page last updated:
2007 Apr 11 10:50

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

[vox-tech] OCR notes
Hi everyone,

I am about to embark on an exciting adventure into the land of optical
character recognition, processing nearly 1,000 documents and extracting
numbers from them. I am interested in any anecdotal wisdom regarding:

1. efficient scanning parameters:
DPI
color / BW / grayscale

2. pre-processing steps one might do with imagemagick

3. any filtering that one might do to get ready for the OCR

I plan to use Google's new OCR project, ocropus, which currently uses
the 'tesseract' engine. Naive attempts to OCR these documents are resulting in
marginal accuracy, so any help is appreciated. Vertical and horizontal lines
on the original documents are confusing the OCR, so removing them might be a
start. I have thought about extracting each 'cell' of data with imagemagick,
and then running the resulting mini-images through the OCR... that might be a
last resort though...
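
For what it's worth, the line-removal idea could be sketched roughly like
this. It is only an illustration, not anything ocropus, tesseract, or
imagemagick provides: it assumes the scan has already been thresholded to a
grid of 0 (paper) and 1 (ink), and the names long_runs, remove_ruled_lines,
and min_run are made up for the example. Any straight run of ink at least
min_run pixels long, along a row or a column, is treated as a table rule and
blanked out; shorter runs are assumed to belong to characters.

```python
def long_runs(values, min_run):
    """Return (start, end) index pairs for runs of ink at least min_run long.

    `end` is exclusive, so the run covers values[start:end].
    """
    runs = []
    start = None
    # A trailing 0 acts as a sentinel so a run touching the edge is closed.
    for i, v in enumerate(list(values) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


def remove_ruled_lines(pixels, min_run):
    """Return a copy of `pixels` with long horizontal/vertical runs erased.

    pixels  -- list of rows, each a list of 0 (paper) or 1 (ink)
    min_run -- hypothetical threshold; should be longer than any stroke
               that is part of a real character
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    cleaned = [row[:] for row in pixels]

    # Runs are always detected on the original image, so erasing a
    # horizontal rule does not break up a vertical rule that crosses it.
    for y in range(height):
        for start, end in long_runs(pixels[y], min_run):
            for x in range(start, end):
                cleaned[y][x] = 0

    for x in range(width):
        column = [pixels[y][x] for y in range(height)]
        for start, end in long_runs(column, min_run):
            for y in range(start, end):
                cleaned[y][x] = 0

    return cleaned
```

One caveat with this approach: where a digit touches or crosses a rule, the
shared pixels are erased too, so min_run probably needs tuning against the
scanning DPI, and skewed scans would need straightening first.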

thanks!

-- 
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech




LUGOD: Linux Users' Group of Davis
PO Box 2082, Davis, CA 95617
Contact Us

LUGOD is a 501(c)7 non-profit organization
based in Davis, California
and serving the Sacramento area.
"Linux" is a trademark of Linus Torvalds.
