python - BeautifulSoup4 with lxml xml parser removes xmlns attributes from inline SVG in XHTML file

- February 15, 2010

I have BeautifulSoup4 v4.6.0 and lxml v3.8.0 installed. I am trying to parse the following XHTML.

My code to parse it:

from bs4 import BeautifulSoup

xhtml_string = """
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>

    <body class="sgc-1">
      <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
        <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
      </svg>
    </body>
</html>
"""

# Parse with lxml's XML parser
soup = BeautifulSoup(xhtml_string, 'xml')

However, when I inspect soup, it appears BeautifulSoup has stripped xmlns="http://www.w3.org/2000/svg" and xmlns:xlink="http://www.w3.org/1999/xlink" from the <svg> tag, as well as the xlink prefix from the href attribute on the <image> tag.

That is, soup.prettify() returns the following:

<?xml version="1.0" encoding="unicode-escape"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  </head>
  <body class="sgc-1">
    <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%">
      <image height="800" href="../images/cover.jpg" width="573"/>
    </svg>
  </body>
</html>

I do not have the option to change the source XHTML, and from what I've seen the xmlns declarations are valid. Is there a way to make BeautifulSoup preserve the XHTML as is?

You should use the lxml parser instead of xml.

soup = BeautifulSoup(xhtml_string, 'lxml')
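
As a minimal sketch of the suggestion above, the snippet below parses a trimmed copy of the question's XHTML with the 'lxml' parser and reads back the attributes that went missing with 'xml'. The shortened markup and the variable names svg and image are illustrative assumptions, and it presumes beautifulsoup4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Trimmed version of the XHTML from the question (illustrative only).
xhtml_string = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body class="sgc-1">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1">
      <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
    </svg>
  </body>
</html>"""

# 'lxml' selects lxml's HTML parser, which keeps xmlns declarations
# as ordinary attributes instead of interpreting them as namespaces.
soup = BeautifulSoup(xhtml_string, 'lxml')

svg = soup.find('svg')
image = soup.find('image')

# Read back the attributes the 'xml' parser had dropped.
print(svg.get('xmlns'))
print(svg.get('xmlns:xlink'))
print(image.get('xlink:href'))
```

Because 'lxml' treats the input as HTML rather than XML, mixed-case attribute names such as viewBox may come back lowercased, so it is worth checking the output if that matters downstream.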
