BeautifulSoup html5lib parsing strange phenomenon... is that a bug?
Python 2.6 + html5lib 0.99 + bs4
Running the following code throws an exception:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import html5lib  # must be installed for the "html5lib" parser below

    html = '''
    <html>
    <head>
    <title> test </title>
    </head>
    <body>
    <div id="tcp">hello</div>
    </body>
    </html>
    '''

    cs = BeautifulSoup(html, "html5lib")
    # Walking the tree through .contents reaches the div
    print cs.contents[0].contents[2].contents[1]['id']
    # Searching for the same div with find() returns None
    main_tag = cs.find('div', id='tcp')
    print main_tag.text

    ####result####
    #tcp
    #Traceback (most recent call last):
    #  File "C:\Users\xxxxxxxx\Desktop\test.py", line 21, in <module>
    #    print main_tag.text
    #AttributeError: 'NoneType' object has no attribute 'text'

After removing the spaces between "<title>" and "test", the program runs successfully.
This is a known bug in bs4. See:
https://bugs.launchpad.net/beautifulsoup/+bug/1430633
There are situations in which bs4 generates a malformed tree. "find" then runs off the end of the tree and returns None, even though the element is still reachable through .contents.
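
To illustrate how a search can run off the end of a tree while the element is still reachable through .contents, here is a minimal toy model. It is not bs4's actual code: the Node class and the helpers link, find_linear and find_recursive are hypothetical names invented for this sketch, and it assumes the malformation is a broken "next element" pointer.

```python
class Node(object):
    """Toy stand-in for a parse-tree node (hypothetical, not bs4's class)."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}
        self.contents = []        # child nodes, in document order
        self.next_element = None  # next node in document order

    def append(self, child):
        self.contents.append(child)
        return child


def link(root):
    """Set next_element on every node by walking the tree depth-first."""
    order = []

    def visit(node):
        order.append(node)
        for child in node.contents:
            visit(child)

    visit(root)
    for current, following in zip(order, order[1:]):
        current.next_element = following
    order[-1].next_element = None


def find_linear(root, name):
    """Search by following next_element pointers, starting below root."""
    current = root.next_element
    while current is not None:
        if current.name == name:
            return current
        current = current.next_element
    return None  # ran off the end of the chain


def find_recursive(node, name):
    """Search by descending through contents only."""
    for child in node.contents:
        if child.name == name:
            return child
        found = find_recursive(child, name)
        if found is not None:
            return found
    return None


# Build html > (head > title), (body > div#tcp)
html = Node("html")
head = html.append(Node("head"))
title = head.append(Node("title"))
body = html.append(Node("body"))
div = body.append(Node("div", {"id": "tcp"}))
link(html)

intact = find_linear(html, "div")

# Simulate the malformation (an assumption): the chain is cut after <title>.
title.next_element = None
broken = find_linear(html, "div")
via_contents = find_recursive(html, "div")

print(intact is div)
print(broken)
print(via_contents.attrs["id"])
```

In this sketch the pointer-based search finds the div while the chain is intact and returns None once the chain is cut, whereas the search through contents still succeeds. Until a fixed bs4 release is available, checking the result of find for None before using it, or walking .contents as the question's first print statement does, may serve as a workaround.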