The --remove-html=tag only removes the tags themselves but not their contents

Open retifrav opened this issue 2 years ago • 0 comments

If I convert a StarDict dictionary to AppleDict format and set --remove-html=script to remove <script> tags, like this:

$ pyglossary ./original.ifo ./output \
    --read-format=Stardict --write-format=AppleDict \
    --remove-html=script

then the resulting output.xml indeed does not contain <script> tags, but the code that was enclosed in those tags is still present. The same happens for other tags such as <style>.
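The symptom can be reproduced in isolation. What follows is only a minimal sketch of "strip the tag markers, keep what was between them"; it is not PyGlossary's actual code, and `strip_tags_only` is a hypothetical helper written with the standard library for illustration:

```python
import re


def strip_tags_only(html: str, tag: str) -> str:
    """Remove the <tag ...> and </tag> markers, keeping the enclosed text.

    Hypothetical helper: it mimics the behavior reported in this issue,
    it is not taken from PyGlossary.
    """
    pattern = re.compile(rf"</?{re.escape(tag)}\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)


sample = "<div><script>var x = 1;</script><b>-abel</b></div>"
result = strip_tags_only(sample, "script")
print(result)  # the <script> markers are gone, "var x = 1;" is still there
```

With this kind of filtering the JavaScript or CSS source ends up as visible text in the dictionary entry, which matches what I see in output.xml.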

To illustrate this behavior, here's a fragment from the original.dict file:

<ordbokuibno><style>ordbokuibno grammar-s {font-family: sans-serif;}ordbokuibno span.b {font-weight: bold;}ordbokuibno .artikkeloppslagsord {font-size: 100%;font-weight: bold;}ordbokuibno span.oppsgramordklasse {color: seagreen; text-decoration: none;}ordbokuibno a.oppsgramordklasse {color: seagreen;text-decoration: none; border-bottom: 1px dotted;}ordbokuibno .henvisning {color: #557FBD;text-decoration: none;}ordbokuibno .henvisning:hover {color: #0000FF; cursor: pointer;text-decoration: underline;}ordbokuibno .etymtilvising {color: #557FBD;text-decoration: none; font-style: italic;}ordbokuibno .etymtilvising:hover {color: #0000FF; cursor: pointer; text-decoration: underline;}ordbokuibno .tilvising {font-style: italic;}ordbokuibno table.paradigmetabell {background-color: #FFFFFF;}ordbokuibno tr.nospace {border: 0;}ordbokuibno th {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 1px solid #557FBD;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nola {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-top: 1px solid #557FBD;border-bottom: 0px;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nobg {color: #557FBD;border-top: 0px;border-left: 0px;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: none;}ordbokuibno th.nobgnola {color: #557FBD;border-right: 1px solid #557FBD;border-top: 0;border-left: 0;border-bottom: 0;background: none;}ordbokuibno td.vanlig {color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: #fff;padding: 6px 6px 6px 12px;}ordbokuibno td.ledetekst {color: 
#557FBD;border-left: 1px solid #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 0;background: #fff;}ordbokuibno .grunnord {font-size: 106%}ordbokuibno .head {font-weight: bold;}ordbokuibno .klInf {text-align: left;display: none;padding: 3px; margin: 3px;border: 1px solid #557FBD;max-width: 600px;}ordbokuibno .klInfHead {text-decoration: underline;margin-bottom: 5px;}ordbokuibno .klInfText {margin-bottom: 2px;}ordbokuibno grammar-x {color: #2579bc; text-decoration: none; font-weight: bold;}ordbokuibno grammar-x:hover {cursor: pointer; text-decoration: underline;}ordbokuibno .paradigmetabell td {text-align: center;}</style><dic style="display:inline"><a name="ordbokuibno-art28"></a><div class="artikkelinnhold"> <span class="oppslagsord b" id="32">-abel</span> <a class="oppsgramordklasse" vise_fullformer="32">a5</a> (fr. <span style="font-style: italic">-able</span>, lat. <span style="font-style: italic">-abilis</span>)<span class="utvidet"> suffiks : i stand til, som kan, i ord som <a class="henvisning" href="diskutabel#ordbokuibno-art10293">diskutabel</a>, <a class="henvisning" href="durabel#ordbokuibno-art11214">durabel</a>, <a class="henvisning" href="kapabel#ordbokuibno-art28591">kapabel</a>, <a class="henvisning" href="variabel#ordbokuibno-art66683">variabel (II)</a>; jamfør <a class="henvisning" href="-lig#ordbokuibno-art35107">-lig (3)</a>, <a class="henvisning" href="-bar#ordbokuibno-art4475">-bar (2)</a></span></div></dic><grammar-c style="display:none"> <div id="32"><table class="paradigmetabell" cellspacing="0" style="margin: 25px;"><tr><th class="nobgnola"><span class="grunnord">-abel</span></th><th class="nola" colspan="3">Entall</th><th class="nola">Flertall</th></tr><tr><th class="nobg">&nbsp;&nbsp;</th><th>Hankjønn og hunkjønn</th><th>Intetkjønn</th><th>Bestemt form</th><th>&nbsp;&nbsp;</th></tr><tr><td class="ledetekst">a5</td><td class="vanlig">-abel</td><td class="vanlig">-abelt</td><td 
class="vanlig">-able</td><td class="vanlig">-able</td></tr></table></div></grammar-c><grammar-t style="display:none"><table style="border:1px solid #dddddd;border-spacing:3px;background-color:#FFF;box-shadow: 0px 2px 2px rgba(0,0,0,0.1);"><tr><td style="background-color:#DFE8F1; color:#2579bc;padding:6px 6px 6px 12px" align="left"><span style="font-weight:bold;font-size: 106%">Bøying i samsvar med gjeldende rettskriving:</span></td><td style="background-color:#DFE8F1;padding:6px 6px 6px 6px" width=60px align=center><grammar-x>close</grammar-x></td></tr><tr><td colspan=2><grammar-s></grammar-s></td></tr></table><p></p></grammar-t><script>(function () {var scr = document.getElementsByTagName('script');scr = scr[scr.length - 1];var art = scr.parentNode;var dic = art.getElementsByTagName('dic')[0];var gram = art.getElementsByTagName('grammar-c')[0];var tab = art.getElementsByTagName('grammar-t')[0];var show = art.getElementsByTagName('grammar-s')[0];var close = art.getElementsByTagName('grammar-x')[0];var opp = art.getElementsByClassName('oppsgramordklasse');for (var i = 0; i < opp.length; i++){if (opp[i].tagName === 'A'){opp[i].addEventListener("click", function(){var id = this.getAttribute("vise_fullformer");var divs = gram.getElementsByTagName('div');for (var j = 0; j < divs.length; j++){if (divs[j].getAttribute("id") === id){show.innerHTML = divs[j].outerHTML;dic.style.display = 'none';tab.style.display = 'inline';return true;}}});}}close.addEventListener("click", function(){tab.style.display = 'none';dic.style.display = 'inline';});})(); </script></ordbokuibno>

Here's the same fragment from output.xml when run with --remove-html=script:

<d:index d:value="-abel" d:title="-abel"/><h1>-abel</h1><ordbokuibno><style>ordbokuibno grammar-s {font-family: sans-serif;}ordbokuibno span.b {font-weight: bold;}ordbokuibno .artikkeloppslagsord {font-size: 100%;font-weight: bold;}ordbokuibno span.oppsgramordklasse {color: seagreen; text-decoration: none;}ordbokuibno a.oppsgramordklasse {color: seagreen;text-decoration: none; border-bottom: 1px dotted;}ordbokuibno .henvisning {color: #557FBD;text-decoration: none;}ordbokuibno .henvisning:hover {color: #0000FF; cursor: pointer;text-decoration: underline;}ordbokuibno .etymtilvising {color: #557FBD;text-decoration: none; font-style: italic;}ordbokuibno .etymtilvising:hover {color: #0000FF; cursor: pointer; text-decoration: underline;}ordbokuibno .tilvising {font-style: italic;}ordbokuibno table.paradigmetabell {background-color: #FFFFFF;}ordbokuibno tr.nospace {border: 0;}ordbokuibno th {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 1px solid #557FBD;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nola {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-top: 1px solid #557FBD;border-bottom: 0px;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nobg {color: #557FBD;border-top: 0px;border-left: 0px;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: none;}ordbokuibno th.nobgnola {color: #557FBD;border-right: 1px solid #557FBD;border-top: 0;border-left: 0;border-bottom: 0;background: none;}ordbokuibno td.vanlig {color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: 
#fff;padding: 6px 6px 6px 12px;}ordbokuibno td.ledetekst {color: #557FBD;border-left: 1px solid #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 0;background: #fff;}ordbokuibno .grunnord {font-size: 106%}ordbokuibno .head {font-weight: bold;}ordbokuibno .klInf {text-align: left;display: none;padding: 3px; margin: 3px;border: 1px solid #557FBD;max-width: 600px;}ordbokuibno .klInfHead {text-decoration: underline;margin-bottom: 5px;}ordbokuibno .klInfText {margin-bottom: 2px;}ordbokuibno grammar-x {color: #2579bc; text-decoration: none; font-weight: bold;}ordbokuibno grammar-x:hover {cursor: pointer; text-decoration: underline;}ordbokuibno .paradigmetabell td {text-align: center;}</style><dic style="display:inline"><a name="ordbokuibno-art28"></a><div class="artikkelinnhold"> <span class="oppslagsord b" id="32">-abel</span> <a class="oppsgramordklasse" vise_fullformer="32">a5</a> (fr. <span style="font-style: italic">-able</span>, lat. <span style="font-style: italic">-abilis</span>)<span class="utvidet"> suffiks : i stand til, som kan, i ord som <a class="henvisning" href="x-dictionary:d:diskutabel#ordbokuibno-art10293">diskutabel</a>, <a class="henvisning" href="x-dictionary:d:durabel#ordbokuibno-art11214">durabel</a>, <a class="henvisning" href="x-dictionary:d:kapabel#ordbokuibno-art28591">kapabel</a>, <a class="henvisning" href="x-dictionary:d:variabel#ordbokuibno-art66683">variabel (II)</a>; jamfør <a class="henvisning" href="x-dictionary:d:-lig#ordbokuibno-art35107">-lig (3)</a>, <a class="henvisning" href="x-dictionary:d:-bar#ordbokuibno-art4475">-bar (2)</a></span></div></dic><grammar-c style="display:none"> <div id="32"><table cellspacing="0" class="paradigmetabell" style="margin: 25px;"><tr><th class="nobgnola"><span class="grunnord">-abel</span></th><th class="nola" colspan="3">Entall</th><th class="nola">Flertall</th></tr><tr><th class="nobg">  </th><th>Hankjønn og hunkjønn</th><th>Intetkjønn</th><th>Bestemt 
form</th><th>  </th></tr><tr><td class="ledetekst">a5</td><td class="vanlig">-abel</td><td class="vanlig">-abelt</td><td class="vanlig">-able</td><td class="vanlig">-able</td></tr></table></div></grammar-c><grammar-t style="display:none"><table style="border:1px solid #dddddd;border-spacing:3px;background-color:#FFF;box-shadow: 0px 2px 2px rgba(0,0,0,0.1);"><tr><td align="left" style="background-color:#DFE8F1; color:#2579bc;padding:6px 6px 6px 12px"><span style="font-weight:bold;font-size: 106%">Bøying i samsvar med gjeldende rettskriving:</span></td><td align="center" style="background-color:#DFE8F1;padding:6px 6px 6px 6px" width="60px"><grammar-x>close</grammar-x></td></tr><tr><td colspan="2"><grammar-s></grammar-s></td></tr></table><p></p></grammar-t>(function () {var scr = document.getElementsByTagName('script');scr = scr[scr.length - 1];var art = scr.parentNode;var dic = art.getElementsByTagName('dic')[0];var gram = art.getElementsByTagName('grammar-c')[0];var tab = art.getElementsByTagName('grammar-t')[0];var show = art.getElementsByTagName('grammar-s')[0];var close = art.getElementsByTagName('grammar-x')[0];var opp = art.getElementsByClassName('oppsgramordklasse');for (var i = 0; i &lt; opp.length; i++){if (opp[i].tagName === 'A'){opp[i].addEventListener("click", function(){var id = this.getAttribute("vise_fullformer");var divs = gram.getElementsByTagName('div');for (var j = 0; j &lt; divs.length; j++){if (divs[j].getAttribute("id") === id){show.innerHTML = divs[j].outerHTML;dic.style.display = 'none';tab.style.display = 'inline';return true;}}});}}close.addEventListener("click", function(){tab.style.display = 'none';dic.style.display = 'inline';});})(); </ordbokuibno>

And here's the same fragment from output.xml when run with --remove-html=style:

<d:index d:value="-abel" d:title="-abel"/><h1>-abel</h1><ordbokuibno>ordbokuibno grammar-s {font-family: sans-serif;}ordbokuibno span.b {font-weight: bold;}ordbokuibno .artikkeloppslagsord {font-size: 100%;font-weight: bold;}ordbokuibno span.oppsgramordklasse {color: seagreen; text-decoration: none;}ordbokuibno a.oppsgramordklasse {color: seagreen;text-decoration: none; border-bottom: 1px dotted;}ordbokuibno .henvisning {color: #557FBD;text-decoration: none;}ordbokuibno .henvisning:hover {color: #0000FF; cursor: pointer;text-decoration: underline;}ordbokuibno .etymtilvising {color: #557FBD;text-decoration: none; font-style: italic;}ordbokuibno .etymtilvising:hover {color: #0000FF; cursor: pointer; text-decoration: underline;}ordbokuibno .tilvising {font-style: italic;}ordbokuibno table.paradigmetabell {background-color: #FFFFFF;}ordbokuibno tr.nospace {border: 0;}ordbokuibno th {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 1px solid #557FBD;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nola {font-weight: bold;color: #557FBD;border-right: 1px solid #557FBD;border-top: 1px solid #557FBD;border-bottom: 0px;letter-spacing: 2px;text-align: center;padding: 6px 6px 6px 12px;background: -webkit-gradient(linear, left top, left bottom, from(#DFE8F1), to(#DFE8F1));background: -moz-linear-gradient(top,  #DFE8F1,  #DFE8F1);}ordbokuibno th.nobg {color: #557FBD;border-top: 0px;border-left: 0px;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: none;}ordbokuibno th.nobgnola {color: #557FBD;border-right: 1px solid #557FBD;border-top: 0;border-left: 0;border-bottom: 0;background: none;}ordbokuibno td.vanlig {color: #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;background: #fff;padding: 6px 
6px 6px 12px;}ordbokuibno td.ledetekst {color: #557FBD;border-left: 1px solid #557FBD;border-right: 1px solid #557FBD;border-bottom: 1px solid #557FBD;border-top: 0;background: #fff;}ordbokuibno .grunnord {font-size: 106%}ordbokuibno .head {font-weight: bold;}ordbokuibno .klInf {text-align: left;display: none;padding: 3px; margin: 3px;border: 1px solid #557FBD;max-width: 600px;}ordbokuibno .klInfHead {text-decoration: underline;margin-bottom: 5px;}ordbokuibno .klInfText {margin-bottom: 2px;}ordbokuibno grammar-x {color: #2579bc; text-decoration: none; font-weight: bold;}ordbokuibno grammar-x:hover {cursor: pointer; text-decoration: underline;}ordbokuibno .paradigmetabell td {text-align: center;}<dic style="display:inline"><a name="ordbokuibno-art28"></a><div class="artikkelinnhold"> <span class="oppslagsord b" id="32">-abel</span> <a class="oppsgramordklasse" vise_fullformer="32">a5</a> (fr. <span style="font-style: italic">-able</span>, lat. <span style="font-style: italic">-abilis</span>)<span class="utvidet"> suffiks : i stand til, som kan, i ord som <a class="henvisning" href="x-dictionary:d:diskutabel#ordbokuibno-art10293">diskutabel</a>, <a class="henvisning" href="x-dictionary:d:durabel#ordbokuibno-art11214">durabel</a>, <a class="henvisning" href="x-dictionary:d:kapabel#ordbokuibno-art28591">kapabel</a>, <a class="henvisning" href="x-dictionary:d:variabel#ordbokuibno-art66683">variabel (II)</a>; jamfør <a class="henvisning" href="x-dictionary:d:-lig#ordbokuibno-art35107">-lig (3)</a>, <a class="henvisning" href="x-dictionary:d:-bar#ordbokuibno-art4475">-bar (2)</a></span></div></dic><grammar-c style="display:none"> <div id="32"><table cellspacing="0" class="paradigmetabell" style="margin: 25px;"><tr><th class="nobgnola"><span class="grunnord">-abel</span></th><th class="nola" colspan="3">Entall</th><th class="nola">Flertall</th></tr><tr><th class="nobg">  </th><th>Hankjønn og hunkjønn</th><th>Intetkjønn</th><th>Bestemt form</th><th>  </th></tr><tr><td 
class="ledetekst">a5</td><td class="vanlig">-abel</td><td class="vanlig">-abelt</td><td class="vanlig">-able</td><td class="vanlig">-able</td></tr></table></div></grammar-c><grammar-t style="display:none"><table style="border:1px solid #dddddd;border-spacing:3px;background-color:#FFF;box-shadow: 0px 2px 2px rgba(0,0,0,0.1);"><tr><td align="left" style="background-color:#DFE8F1; color:#2579bc;padding:6px 6px 6px 12px"><span style="font-weight:bold;font-size: 106%">Bøying i samsvar med gjeldende rettskriving:</span></td><td align="center" style="background-color:#DFE8F1;padding:6px 6px 6px 6px" width="60px"><grammar-x>close</grammar-x></td></tr><tr><td colspan="2"><grammar-s></grammar-s></td></tr></table><p></p></grammar-t><script>(function () {var scr = document.getElementsByTagName('script');scr = scr[scr.length - 1];var art = scr.parentNode;var dic = art.getElementsByTagName('dic')[0];var gram = art.getElementsByTagName('grammar-c')[0];var tab = art.getElementsByTagName('grammar-t')[0];var show = art.getElementsByTagName('grammar-s')[0];var close = art.getElementsByTagName('grammar-x')[0];var opp = art.getElementsByClassName('oppsgramordklasse');for (var i = 0; i < opp.length; i++){if (opp[i].tagName === 'A'){opp[i].addEventListener("click", function(){var id = this.getAttribute("vise_fullformer");var divs = gram.getElementsByTagName('div');for (var j = 0; j < divs.length; j++){if (divs[j].getAttribute("id") === id){show.innerHTML = divs[j].outerHTML;dic.style.display = 'none';tab.style.display = 'inline';return true;}}});}}close.addEventListener("click", function(){tab.style.display = 'none';dic.style.display = 'inline';});})(); </script></ordbokuibno>
</d:entry>

I think this is a bug: my understanding is that not only the tags but also their contents should be removed. For instance, here's what I would expect to get after removing both <script> and <style> tags (--remove-html=script,style):

<d:index d:value="-abel" d:title="-abel"/><h1>-abel</h1><ordbokuibno><dic style="display:inline"><a name="ordbokuibno-art28"></a><div class="artikkelinnhold"> <span class="oppslagsord b" id="32">-abel</span> <a class="oppsgramordklasse" vise_fullformer="32">a5</a> (fr. <span style="font-style: italic">-able</span>, lat. <span style="font-style: italic">-abilis</span>)<span class="utvidet"> suffiks : i stand til, som kan, i ord som <a class="henvisning" href="x-dictionary:d:diskutabel#ordbokuibno-art10293">diskutabel</a>, <a class="henvisning" href="x-dictionary:d:durabel#ordbokuibno-art11214">durabel</a>, <a class="henvisning" href="x-dictionary:d:kapabel#ordbokuibno-art28591">kapabel</a>, <a class="henvisning" href="x-dictionary:d:variabel#ordbokuibno-art66683">variabel (II)</a>; jamfør <a class="henvisning" href="x-dictionary:d:-lig#ordbokuibno-art35107">-lig (3)</a>, <a class="henvisning" href="x-dictionary:d:-bar#ordbokuibno-art4475">-bar (2)</a></span></div></dic><grammar-c style="display:none"> <div id="32"><table cellspacing="0" class="paradigmetabell" style="margin: 25px;"><tr><th class="nobgnola"><span class="grunnord">-abel</span></th><th class="nola" colspan="3">Entall</th><th class="nola">Flertall</th></tr><tr><th class="nobg">  </th><th>Hankjønn og hunkjønn</th><th>Intetkjønn</th><th>Bestemt form</th><th>  </th></tr><tr><td class="ledetekst">a5</td><td class="vanlig">-abel</td><td class="vanlig">-abelt</td><td class="vanlig">-able</td><td class="vanlig">-able</td></tr></table></div></grammar-c><grammar-t style="display:none"><table style="border:1px solid #dddddd;border-spacing:3px;background-color:#FFF;box-shadow: 0px 2px 2px rgba(0,0,0,0.1);"><tr><td align="left" style="background-color:#DFE8F1; color:#2579bc;padding:6px 6px 6px 12px"><span style="font-weight:bold;font-size: 106%">Bøying i samsvar med gjeldende rettskriving:</span></td><td align="center" style="background-color:#DFE8F1;padding:6px 6px 6px 6px" 
width="60px"><grammar-x>close</grammar-x></td></tr><tr><td colspan="2"><grammar-s></grammar-s></td></tr></table><p></p></grammar-t></ordbokuibno>
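The expected behavior can be sketched as follows. This is a hedged illustration, not a proposed patch: `remove_elements` is a hypothetical helper, and the regular expression assumes the listed elements are not nested inside themselves and that their contents never contain the literal closing tag. A real implementation would more likely use an HTML parser.

```python
import re


def remove_elements(html: str, tags: list[str]) -> str:
    """Drop each listed element together with everything inside it.

    Hypothetical helper for illustration; assumes non-nested elements.
    """
    for tag in tags:
        name = re.escape(tag)
        # Non-greedy match from the opening tag to the first closing tag.
        pattern = re.compile(
            rf"<{name}\b[^>]*>.*?</{name}\s*>",
            re.IGNORECASE | re.DOTALL,
        )
        html = pattern.sub("", html)
    return html


sample = (
    "<ordbokuibno><style>span.b {font-weight: bold;}</style>"
    "<dic>-abel</dic><script>var i = 0;</script></ordbokuibno>"
)
cleaned = remove_elements(sample, ["script", "style"])
print(cleaned)
```

Something like this could also serve as a post-processing workaround on the source text before conversion, since <script> and <style> contents are never meant to be displayed.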

My environment:

$ system_profiler SPSoftwareDataType
System Version: macOS 11.6.5 (20G527)
Kernel Version: Darwin 20.6.0

$ python --version
Python 3.9.10

$ pyglossary --version
PyGlossary 4.5.0

$ pip install lxml beautifulsoup4 html5lib
Requirement already satisfied: lxml in /usr/local/lib/python3.9/site-packages (4.8.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.9/site-packages (4.11.1)
Requirement already satisfied: html5lib in /usr/local/lib/python3.9/site-packages (1.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.9/site-packages (from beautifulsoup4) (2.3.2)
Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.9/site-packages (from html5lib) (1.16.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.9/site-packages (from html5lib) (0.5.1)

retifrav, Apr 14 '22 17:04