Python - BeautifulSoup4 with lxml xml parser removes xmlns attributes from inline SVG in XHTML file
I have beautifulsoup4 v4.6.0 and lxml v3.8.0 installed, and I am trying to parse the following XHTML. My code to parse it:
    from bs4 import BeautifulSoup

    xhtml_string = """
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body class="sgc-1">
    <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
    <image height="800" width="573" xlink:href="../images/cover.jpg"></image>
    </svg>
    </body>
    </html>
    """

    soup = BeautifulSoup(xhtml_string, 'xml')
However, when I inspect soup, it appears that BeautifulSoup has stripped the xmlns="http://www.w3.org/2000/svg" and xmlns:xlink="http://www.w3.org/1999/xlink" attributes from the <svg> tag, as well as the xlink prefix from the href attribute on the <image> tag.
That is, soup.prettify() returns the following:
    <?xml version="1.0" encoding="unicode-escape"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body class="sgc-1">
    <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%">
    <image height="800" href="../images/cover.jpg" width="573"/>
    </svg>
    </body>
    </html>
I do not have the option to change the source XHTML, and from what I've seen the xmlns declarations are valid. Is there a way to make BeautifulSoup preserve the XHTML as is?
You should use the 'lxml' parser (lxml's HTML parser) instead of 'xml' (lxml's XML parser).

    soup = BeautifulSoup(xhtml_string, 'lxml')
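To check whether the suggestion above has the intended effect, the attributes can be inspected directly after parsing. The following is only a minimal sketch: it assumes beautifulsoup4 and lxml are installed, and xhtml_fragment is a hypothetical, trimmed-down stand-in for the document in the question, not the original file. Results may differ between library versions.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the XHTML from the question.
xhtml_fragment = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 600 800">
<image height="800" width="573" xlink:href="../images/cover.jpg"></image>
</svg>
</body>
</html>"""

# 'lxml' selects lxml's HTML parser; 'xml' would select lxml's XML parser.
soup = BeautifulSoup(xhtml_fragment, "lxml")

svg = soup.find("svg")
image = soup.find("image")

# With the HTML parser, namespace declarations and prefixed attributes
# are kept as ordinary attributes under their literal names.
print(svg.get("xmlns"))
print(svg.get("xmlns:xlink"))
print(image.get("xlink:href"))
```

One trade-off to be aware of: an HTML parser is not case-preserving for attribute names, so mixed-case SVG attributes such as viewBox will likely come back lowercased. If exact round-tripping matters, it is worth comparing the serialized output against the source before relying on it.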