parsing - None of the Beautiful Soup parsers are finding all matches (Python)
I am trying to do some simple parsing of an HTML file that contains unit test results in its body.
    import re
    import urllib2
    from bs4 import BeautifulSoup

    url = urllib2.urlopen('file:/randomstuff/results.txt').read()
    soup = BeautifulSoup(url, 'lxml')
    save = soup.body.findAll(text=re.compile("failed"))

The best I can get out of it is 1 instance of the text (when there are closer to 50), using lxml and html5lib. The other parsers find none. Is there any way I can work around the broken HTML?
An example of the body is this:
********* finished testing of logleveltypetest *********
 ********* start testing of apploggerconfigtest *********
 config: using qtest library 4.8.1, qt 4.8.1
 pass   : inittestcase
 pass   : testsetfromenvironment
 pass   : cleanuptestcase
 totals: 3 passed, 0 failed, 0 skipped  
The HTML looks like this:
<html>
  <head></head>
  <body>
    <pre style="word-wrap: break-word; white-space: pre-wrap;">
      "common unit test results"
      ...
      ...
    </pre>
  </body>
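One possible explanation, offered as a sketch rather than a confirmed diagnosis: all the results sit inside a single `<pre>` element with no child tags, so the parser sees them as one text node, and a text search with a regex can return that node at most once no matter how many lines inside it match. Under that assumption, searching the text line by line would give one hit per matching line. The helper name `matching_lines` and the sample string below are hypothetical; in Beautiful Soup 4 the real text could be taken from something like `soup.pre.get_text()`.

```python
import re


def matching_lines(text, pattern):
    """Return every line of `text` in which `pattern` is found."""
    regex = re.compile(pattern)
    return [line.strip() for line in text.splitlines() if regex.search(line)]


# Hypothetical stand-in for the text of the <pre> block
# (with bs4 this could come from soup.pre.get_text()).
sample = """\
********* start testing of apploggerconfigtest *********
pass   : inittestcase
totals: 3 passed, 0 failed, 0 skipped
totals: 5 passed, 2 failed, 0 skipped
"""

hits = matching_lines(sample, r"failed")
for line in hits:
    print(line)
```

Because the search runs over plain lines of text, it does not depend on which parser recovered the broken HTML, as long as the `<pre>` text itself is extracted.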
 
  