Page last updated:
2007 Mar 20 23:10

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] How to tell if a pdf is text or image?
hajhouse wrote:
> On 2007-03-20, Alex Mandel wrote:
>> Does anyone know a way to inspect a PDF programmatically to tell whether it contains text or was scanned as an image?
>>
>> I basically just want to sort a directory with many thousands of PDFs.
>> I figured there must be something in the header or in the file info that either says whether it's an image or has text or, to be more complicated, gives a quick percentage of how much of the document is text, which I could use to set a sort threshold.
>>
>> Alternatively, if it can be done more easily on a PS file, there's no reason why I can't run pdf2ps on it and then decide how to sort.
>> It's really a one-time deal, so I'll take the overhead on that operation.
>
> What about converting the PDF files to PostScript and then running ps2ascii?



Well, I don't actually need the text; I just need to know whether it is text.
The idea is that once I separate them, all the ones that are images can then be run through OCR to produce text versions.
So my idea was either a yes/no answer or a rule such as: if the document is more than 20% (arbitrary) text, consider it text.
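
One way to approximate the "percentage of text" idea is sketched below. It is only an illustration, not something proposed in the thread: it assumes the `pdftotext` tool (from Xpdf/Poppler) is installed, and it defines "percent text" as the fraction of pages that yield any non-whitespace text, which is my own stand-in metric. The function names and the 0.2 default are hypothetical.

```python
import subprocess


def text_page_fraction(extracted: str) -> float:
    """Return the fraction of pages containing any non-whitespace text.

    Assumes `extracted` is pdftotext output, in which every page,
    including the last, is terminated by a form feed character.
    """
    pages = extracted.split("\f")
    # The form feed after the last page leaves an empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    if not pages:
        return 0.0
    with_text = sum(1 for page in pages if page.strip())
    return with_text / len(pages)


def is_text_pdf(path: str, threshold: float = 0.2) -> bool:
    """Treat the PDF as text if more than `threshold` of its pages have text."""
    # "-" asks pdftotext to write the extracted text to standard output.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return text_page_fraction(result.stdout) > threshold
```

A scan with no OCR layer should give a fraction of 0.0, so for these documents the exact threshold matters little.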

So far pdffonts tells me what fonts a file uses, and if the file is an image I get nothing after the header lines. So that might work if I write a program that saves the pdffonts output to a temporary file and checks whether it is longer than just the headers.
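
A minimal sketch of that check, assuming `pdffonts` is on the PATH and prints two header lines (the column titles and a row of dashes) before any font entries. The function names are hypothetical, and the output is captured directly rather than written to a temporary file.

```python
import subprocess

HEADER_LINES = 2  # column titles plus the row of dashes


def lists_fonts(pdffonts_output: str) -> bool:
    """Return True if the output has any lines beyond the header."""
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    return len(lines) > HEADER_LINES


def has_fonts(path: str) -> bool:
    """Run pdffonts on one PDF and report whether it lists any fonts."""
    result = subprocess.run(
        ["pdffonts", path],
        capture_output=True, text=True, check=True,
    )
    return lists_fonts(result.stdout)
```

Files for which `has_fonts` returns False would be the candidates for OCR. A scan that already carries an OCR text layer would list fonts, so this only separates the two groups when the scans have no OCR, as described here.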

I guess I should clarify: when I say image, I'm talking about PDFs that were made by scanning a document straight to TIFF with no OCR. I know none of them contain pictures, since they're all legal docs at a law firm.

Alex
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


