python - BeautifulSoup4 with lxml XML parser removes xmlns attributes from inline SVG in XHTML file
I have BeautifulSoup4 v4.6.0 and lxml v3.8.0 installed. I am trying to parse the following XHTML.
My code to parse it:
    from bs4 import BeautifulSoup

    xhtml_string = """<?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
        </head>

        <body class="sgc-1">
          <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
            <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
          </svg>
        </body>
    </html>
    """

    soup = BeautifulSoup(xhtml_string, 'xml')

However, when I inspect the soup, it appears that BeautifulSoup has stripped the xmlns="http://www.w3.org/2000/svg" and xmlns:xlink="http://www.w3.org/1999/xlink" attributes from the <svg> tag, as well as the xlink prefix from the href attribute on the <image> tag.
That is, soup.prettify() returns the following:

    <?xml version="1.0" encoding="unicode-escape"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      </head>
      <body class="sgc-1">
        <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%">
          <image height="800" href="../images/cover.jpg" width="573"/>
        </svg>
      </body>
    </html>

I do not have the option to change the source XHTML, and from what I've seen the xmlns declarations are valid. Is there a way to make BeautifulSoup preserve the XHTML as is?
You should use the 'lxml' parser (lxml's HTML parser) instead of 'xml'.

    soup = BeautifulSoup(xhtml_string, 'lxml')
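One way to check the suggestion is to parse a fragment with 'lxml' and look the attributes up directly. This is only a minimal sketch: the markup is a shortened copy of the XHTML from the question (the XML declaration, the DOCTYPE, and a couple of attributes are left out as a simplification), and it assumes bs4 and lxml are installed as described in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's XHTML (assumption: the XML declaration
# and DOCTYPE are omitted to keep the sketch small).
xhtml_string = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head></head>
  <body class="sgc-1">
    <svg xmlns="http://www.w3.org/2000/svg" height="100%" version="1.1" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
      <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
    </svg>
  </body>
</html>"""

# 'lxml' selects lxml's HTML parser, which treats xmlns and xlink:href
# as ordinary attributes instead of resolving them as namespaces.
soup = BeautifulSoup(xhtml_string, 'lxml')

svg = soup.find('svg')
image = soup.find('image')

# Look up the attributes that went missing with the 'xml' parser.
print(svg.get('xmlns'))
print(svg.get('xmlns:xlink'))
print(image.get('xlink:href'))
```

One trade-off to keep in mind: according to the Beautiful Soup documentation, HTML parsers convert tag and attribute names to lowercase, so mixed-case SVG attributes such as viewBox will not keep their original casing with this parser.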