Linux Users' Group of Davis
Page last updated:
2007 Mar 23 03:42

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

> Well, I don't actually need the text; I just need to know whether it is text.
> The idea is that once I separate them, all the ones that are images can then be run through OCR to produce text versions.
> So my idea was either a yes/no answer or a rule such as: if the document is more than 20% (an arbitrary threshold) text, consider it text.
PDF is a page description format descended from PostScript. If you open the raw file in a text editor, you'll see plain-text PDF objects and operators interspersed with (possibly compressed) binary data. In principle, the only fully reliable way to tell what a page produces is to interpret its content. In practice, though, PDF is all machine-written, so you could probably learn to distinguish font-using PDFs from pure-image PDFs by examining the raw file.

You could look for embedded fonts, which show up in the file as /Font dictionaries and /FontFile streams. A document consisting only of scanned page images probably won't have any fonts embedded in it. Alternatively, if the scanned-paper PDFs are all made by one particular program, you might be able to identify characteristic PDF operator sequences that it emits.
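
As a rough illustration of the font check, here is a minimal Python sketch. The function names (classify_pdf_bytes, classify_pdf) and the "text" / "image" / "unknown" labels are made up for this example. It only scans the raw bytes for /Font, /FontFile and /Subtype /Image markers, so it is a heuristic and not a real PDF parser: files that pack their object dictionaries into compressed object streams (PDF 1.5 and later) will come back as "unknown", and a scan that already carries a hidden OCR text layer will be reported as "text".

```python
import re

# Name tokens that usually appear uncompressed in a PDF's object dictionaries.
FONT_RE = re.compile(rb"/Type\s*/Font\b|/FontFile[23]?\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess 'text', 'image', or 'unknown' from raw PDF bytes (heuristic only)."""
    if FONT_RE.search(data):
        # Any font dictionary or embedded font file suggests real text.
        return "text"
    if IMAGE_RE.search(data):
        # Image XObjects but no fonts: probably scanned pages.
        return "image"
    # Neither marker is visible, e.g. because the dictionaries are compressed.
    return "unknown"


def classify_pdf(path: str) -> str:
    """Read a file from disk and classify it with classify_pdf_bytes."""
    with open(path, "rb") as f:
        return classify_pdf_bytes(f.read())
```

Calling classify_pdf on each file in a directory and moving the "image" ones aside would give the yes/no split asked about above; anything reported as "unknown" would need a proper PDF tool to decide.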