BeautifulSoup html5lib parsing strange phenomenon: is that a bug?

August 15, 2013

Python 2.6 + html5lib 0.99 + bs4

Running the following code throws an exception:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import html5lib

    html = '''
    <html>
    <head>
    <title> test </title>
    </head>
    <body>
    <div id="tcp">hello</div>
    </body>
    </html>
    '''
    cs = BeautifulSoup(html, "html5lib")
    # Walking the tree by hand reaches the div and prints its id
    print cs.contents[0].contents[2].contents[1]['id']
    # Searching for the same div with find() returns None
    main_tag = cs.find('div', id='tcp')
    print main_tag.text

    ####result####
    #tcp
    #Traceback (most recent call last):
    #  File "c:\users\xxxxxxxx\desktop\test.py", line 21, in <module>
    #    print main_tag.text
    #AttributeError: 'NoneType' object has no attribute 'text'

After removing the space between "<title>" and "test", the program runs successfully.

This is a known bug in bs4. See:

https://bugs.launchpad.net/beautifulsoup/+bug/1430633

There are situations in which bs4 generates a malformed tree. When that happens, "find" runs off the end of the tree and returns None.
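Until you are on a version where the bug is fixed, one defensive option is to check what "find" returned before touching its attributes. Here is a minimal sketch of that idea. The helper name find_text is hypothetical (it is not part of bs4), and it only assumes that the object passed in has a bs4-style find method:

```python
def find_text(soup, name, default=None, **attrs):
    """Return the text of the first matching tag.

    Hypothetical helper, not part of bs4. `soup` is assumed to be any
    object with a bs4-style find(name, **attrs) method.
    """
    tag = soup.find(name, **attrs)
    if tag is None:
        # find() found nothing (or ran off the end of a malformed tree),
        # so return the fallback instead of raising AttributeError.
        return default
    return tag.text
```

With the code from the question, find_text(cs, 'div', default='', id='tcp') would give back an empty string instead of raising AttributeError. Note that this only hides the symptom: the div is still not found, so it does not replace a real fix for the tree.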
