python - BeautifulSoup4 with lxml xml parser removes xmlns attributes from inline SVG in XHTML file
I have BeautifulSoup4 v4.6.0 and lxml v3.8.0 installed. I am trying to parse the following XHTML.
My code to parse it:

    from bs4 import BeautifulSoup

    xhtml_string = """
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body class="sgc-1">
    <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
    <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
    </svg>
    </body>
    </html>
    """

    soup = BeautifulSoup(xhtml_string, 'xml')

However, when I inspect the soup, it appears that BeautifulSoup has stripped xmlns="http://www.w3.org/2000/svg" and xmlns:xlink="http://www.w3.org/1999/xlink" from the <svg> tag, as well as the xlink prefix from the href attribute on the <image> tag.
That is, soup.prettify() returns the following:

    <?xml version="1.0" encoding="unicode-escape"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
     <head>
     </head>
     <body class="sgc-1">
      <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%">
       <image height="800" href="../images/cover.jpg" width="573"/>
      </svg>
     </body>
    </html>

I do not have the option to change the source XHTML, and from what I have seen the xmlns declarations are valid. Is there a way to make BeautifulSoup preserve the XHTML as is?
You should use the lxml parser instead of xml:

    soup = BeautifulSoup(xhtml_string, 'lxml')
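
One way to check this suggestion against your own file is to look at which attributes a given parser leaves on the <svg> tag. The snippet below is only a minimal sketch: the helper name svg_attrs and the trimmed-down XHTML constant are made up for illustration and are not part of the question or of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the XHTML from the question (illustrative only).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 600 800"
     xmlns:xlink="http://www.w3.org/1999/xlink">
<image height="800" width="573" xlink:href="../images/cover.jpg"></image>
</svg>
</body>
</html>"""


def svg_attrs(markup, parser="lxml"):
    """Return the attributes that `parser` leaves on the first <svg> tag.

    Hypothetical helper: returns an empty dict when no <svg> tag is found.
    """
    soup = BeautifulSoup(markup, parser)
    svg = soup.find("svg")
    return dict(svg.attrs) if svg is not None else {}
```

Calling svg_attrs(XHTML, "lxml") and svg_attrs(XHTML, "xml") side by side should show which of the xmlns declarations each parser keeps in your installed versions. Note that 'lxml' selects an HTML parser, and the Beautiful Soup documentation says HTML parsers convert tag and attribute names to lowercase, so it is worth checking whether names such as viewBox come back as viewbox before writing the document out again.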