BeautifulSoup html5lib parsing strange phenomenon: is that a bug?



Environment: Python 2.6 + html5lib 0.99 + bs4

When I run the following code, it throws an exception:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import html5lib

    html = '''
    <html>
    <head>
    <title> test </title>
    </head>
    <body>
    <div id="tcp">hello</div>
    </body>
    </html>
    '''

    cs = BeautifulSoup(html, "html5lib")
    # Indexing through .contents reaches the div and prints its id
    print cs.contents[0].contents[2].contents[1]['id']
    # find() returns None here, so the next line raises
    main_tag = cs.find('div', id='tcp')
    print main_tag.text

    ####result####
    #tcp
    #Traceback (most recent call last):
    #  File "c:\users\xxxxxxxx\desktop\test.py", line 21, in <module>
    #    print main_tag.text
    #AttributeError: 'NoneType' object has no attribute 'text'

After removing the space between "<title>" and "test", the program runs successfully.

This is a known bug in bs4. See:

https://bugs.launchpad.net/beautifulsoup/+bug/1430633

There are situations in which bs4 generates a malformed tree. In that case "find" runs off the end of the tree and returns None.
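Since indexing through .contents in the script above still reached the div, one possible workaround is to search by walking .contents recursively and to check for None before using the result. The following is only a sketch: find_via_contents is a hypothetical helper, not part of bs4, and it assumes the .contents lists stay intact when "find" fails, which is what the output above suggests but does not prove.

```python
def find_via_contents(node, name, **attrs):
    """Depth-first search over .contents; returns the first match or None.

    Hypothetical helper, not part of bs4. It assumes only that tags expose
    .name, .get(key) and .contents, as bs4 Tag objects do.
    """
    for child in getattr(node, "contents", []):
        # Text nodes have no .contents, so they are skipped.
        if getattr(child, "contents", None) is None:
            continue
        # Compare the tag name and every requested attribute value.
        if child.name == name and all(
            child.get(key) == value for key, value in attrs.items()
        ):
            return child
        # Otherwise descend into this tag's children.
        found = find_via_contents(child, name, **attrs)
        if found is not None:
            return found
    return None

# Possible use with the script above (assumes bs4 and html5lib are installed):
#   main_tag = cs.find('div', id='tcp')
#   if main_tag is None:
#       main_tag = find_via_contents(cs, 'div', id='tcp')
#   if main_tag is not None:
#       print(main_tag.text)
```

The attribute comparison is a plain equality test, so it suits single-valued attributes such as id. Multi-valued attributes such as class are returned by bs4 as lists and would need a different comparison.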
