LUGOD: Linux Users' Group of Davis
Page last updated:
2010 Aug 18 16:23

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] Suggestions for cleaning up repetitive HTML tags?



On Wed, 2010-08-18 at 10:48 -0700, Bill Kendrick wrote:
> I've come across some documents that are formatted in
> such a way that, when converted to HTML, they come out
> something like this:
> 
>   <font face="Arial">And</font> <font face="Arial">then</font>
>   <font face="Arial">they</font> <font face="Arial">looked</font>
> 
> or even worse:
> 
>   <font face="Arial">A</font><font face="Arial">n</font><font
>   face="Arial">d</font>
>   ...
> 
> 
> I've come up with a way, using PHP's DOMDocument system, to
> scrape a file clear of these, but it's very slow, and it's
> basically something that can be done on a stream of text
> (rather than having to worry about the document's structure).
> 
> I'm thinking of writing something in PHP or C to clean stuff
> like this up, but am wondering if anyone else has any experience
> and suggestions?
> 
> (And yes, I've used "htmltidy", but while that can merge _nested_
> styles, e.g., a "<font face="Arial"><font size=+1>" get
> combined into its own CSS stype, e.g., "<span class="c123">",
> it doesn't seem to be able to merge _consecutive_ styles,
> as shown in the examples above. :^/ )

Consider writing a SAX filter that just drops the offending <font> and
</font>.

Also consider using XPath, as in the following Ruby example (which uses
the Nokogiri XML library):

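require 'nokogiri'

# Merge each parent's first <font> child with the text of its following siblings.
def reform xml
  xml.xpath('//font[1]').each do |x|
    # Text to absorb: bare text siblings plus the text inside later <font> siblings.
    textnodes=x.xpath('(following-sibling::text() | following-sibling::font/text())')
    # Use content (not to_s) so entities are not escaped a second time by content=.
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    # Remove the now-empty <font> siblings.
    x.xpath('following-sibling::font').unlink
  end
  xml
end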

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> <b>More</b></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> More</test>')
puts reform(xml).to_xml

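# That last example probably does the wrong thing: the trailing " More" is
# pulled into the <font> element. To fix that, you might want the following
# more complicated version of the XPath, which only absorbs a bare text node
# when the node right after it is a <font>: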

def reform xml
  xml.xpath('//font[1]').each do |x|
    # Bare text siblings count only if the very next node is a <font>.
    textnodes=x.xpath('(following-sibling::text()[following-sibling::node()[1][self::font]] | following-sibling::font/text())')
    # Use content (not to_s) so entities are not escaped a second time by content=.
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    # Remove the now-empty <font> siblings.
    x.xpath('following-sibling::font').unlink
  end
  xml
end

#More hackage may be necessary depending on the exact structure of your data.
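
For the plain stream-of-text approach raised in the question, one
possible sketch is a regular-expression pass that removes a </font>
when the next thing after it (apart from whitespace) is an opening
<font> tag with exactly the same attributes. It needs no XML library.
The names FONT_SEAM and merge_consecutive_fonts are made up for
illustration, and the sketch assumes the <font> elements contain only
text (no nested tags) and that repeated tags are spelled identically.

```ruby
# Hypothetical pattern (not from the original post).
#   group 1: the opening <font ...> tag
#   group 2: its attribute string, reused via \2 for the next tag
#   group 3: the text inside the first element (no nested tags)
#   group 4: whitespace between the two elements
FONT_SEAM = %r{(<font\b([^>]*)>)([^<]*)</font>(\s*)<font\b\2>}

# Merge runs of consecutive <font> elements with identical attributes.
def merge_consecutive_fonts(html)
  merged = html.dup
  # Each pass joins non-overlapping pairs; gsub! returns nil once
  # nothing is left to join, which ends the loop.
  loop do
    break unless merged.gsub!(FONT_SEAM, '\1\3\4')
  end
  merged
end

puts merge_consecutive_fonts('<font face="Arial">And</font> <font face="Arial">then</font>')
puts merge_consecutive_fonts('<font face="Arial">A</font><font face="Arial">n</font><font face="Arial">d</font>')
```

Tags whose attributes differ in any way, even only in whitespace, are
left alone, so input that has been re-wrapped in the middle of a tag
would need normalizing first.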
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


