l i n u x - u s e r s - g r o u p - o f - d a v i s
Page last updated:
2007 Mar 26 11:06

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

RE: Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

Hi Alex:

We do similar things here where I work.

We have been through this a few times.

The problem is that some PDFs are font info plus text (with script like
"draw a 12pt Arial font with text 'hello world' at 30,30").  Some PDFs
are images with 'hidden' text behind regions, with no font info (text
'hello world' occupies 30,30 through 90,30).

So by going after the font, you might miss some of the hidden text.

Here, we finally decided that running pdftotext (from poppler) and
checking whether there is any output is the easiest/fastest way (without
pulling apart the PDF with custom code).  If pdftotext exits with 0
status and has no output, then we decide that the PDF is image-based
(kinda kludgy, but it works for us)...
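
A rough sketch of that check in Python, in case it helps.  This is not
our actual code; the function names are made up for illustration, and
treating output that is only whitespace or form feeds as "no output" is
my assumption.  It expects pdftotext to be on the PATH.

```python
import subprocess


def looks_image_based(returncode: int, output: str) -> bool:
    """Rule of thumb from above: exit status 0 and no text output.

    Assumption: whitespace-only output (including form feeds) counts
    as "no output".
    """
    return returncode == 0 and not output.strip()


def classify_pdf(path: str) -> str:
    """Return 'image', 'text', or 'error' for the PDF at `path`."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        errors="replace",
    )
    if result.returncode != 0:
        # pdftotext could not process the file; don't guess.
        return "error"
    if looks_image_based(result.returncode, result.stdout):
        return "image"
    return "text"
```

Something like classify_pdf("scan.pdf") would then return one of the
three strings; the file name is only an example.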

This comes from my shaky knowledge of PDFs, so good luck and HTHO.



> >Well, I don't actually need the text, I just need to know if it is
> >text.
> >The idea is that once I separate them, all the ones that are images
> >can then be OCR corrected to text versions.
> >So my idea was either a yes/no answer or to say something like, if
> >the document is more than 20% (arbitrary) text, consider it text.
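
For the 20% idea, one possible sketch works from the pdftotext output
alone.  It assumes "percent text" means the fraction of pages that
yield any text, which is only one way to read it, and it relies on
pdftotext ending each page with a form feed (the default, i.e. without
-nopgbrk).  The function names are hypothetical.

```python
def text_page_fraction(pdftotext_output: str) -> float:
    """Fraction of pages whose extracted text is non-empty."""
    pages = pdftotext_output.split("\f")
    # Each page ends with a form feed, so the last piece is normally
    # an empty leftover rather than a real page; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    if not pages:
        return 0.0
    pages_with_text = sum(1 for page in pages if page.strip())
    return pages_with_text / len(pages)


def is_text_document(pdftotext_output: str, threshold: float = 0.20) -> bool:
    """True if more than `threshold` of the pages contain text."""
    return text_page_fraction(pdftotext_output) > threshold
```

The 0.20 default is just the arbitrary figure from the quoted message;
a per-character or per-area measure would need a different approach.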

"The most potent weapon in the hands of the oppressor is the 
mind of the oppressed."
-- Steven Biko
("White Racism and Black Consciousness", in I Write What I Like)


