
Using LexborHTMLParser seems to remove some HTML tags

Open BarryThrill opened this issue 3 years ago • 3 comments

Hello there Mr.Selectolax :)

I have been using selectolax for a very long time, I really like it, and I will continue using it. I have found a small issue where the HTML I get back seems to have some tags removed:

from selectolax.lexbor import LexborHTMLParser

html_test = """
<tr class="clickable" data-price="1800">
   <td>
      <img width="80" src="https://media.restocks.net/products/DD1869-103/nike-dunk-high-black-white-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882303"/>
      <input class="baseproductid" type="hidden" value="12107"/>
      <input class="sizeid" type="hidden" value="1"/>
      <input class="price" type="hidden" value="1800"/>
      <span>Nike Dunk High Black White Panda (W)</span>
      <br/>
      EU: 36
      <br/>
      ID: 1882303
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">1.800 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882303')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882293"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="48"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 38 ½
      <br/>
      ID: 1882293
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882293')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882294"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="48"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 38 ½
      <br/>
      ID: 1882294
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882294')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882295"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="4"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 39
      <br/>
      ID: 1882295
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882295')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882296"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="4"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 39
      <br/>
      ID: 1882296
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882296')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="1630">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882297"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="5"/>
      <input class="price" type="hidden" value="1630"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 40
      <br/>
      ID: 1882297
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">1.630 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882297')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882288"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="1"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 36
      <br/>
      ID: 1882288
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882288')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882289"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="13"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 36 ½
      <br/>
      ID: 1882289
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882289')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882290"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="44"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 37 ½
      <br/>
      ID: 1882290
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882290')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882291"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="44"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 37 ½
      <br/>
      ID: 1882291
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882291')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
<tr class="clickable" data-price="4000">
   <td>
      <img width="80" src="https://media.restocks.net/products/DC0774-114/air-jordan-1-low-marina-blue-w-1-80.png"/>
   </td>
   <td>
      <input class="productid" type="hidden" value="1882292"/>
      <input class="baseproductid" type="hidden" value="13815"/>
      <input class="sizeid" type="hidden" value="3"/>
      <input class="price" type="hidden" value="4000"/>
      <span>Air Jordan 1 Low Marina Blue (W)</span>
      <br/>
      EU: 38
      <br/>
      ID: 1882292
      <br/>
      Ship before:
      13/05/22
   </td>
   <td>
      <span class="storeprice ">
      <span class="storeprice__value">4.000 kr</span>
      </span>
   </td>
   <td>
      <div onclick="window.open('https://restocks.net/en/account/sales/send-label/1882292')" class="download__send__label c-badge c-badge--small pull-right" style="background-color: #df9033">download shipping label</div>
   </td>
   <td>
      <i class="fas fa-pencil-alt listing__edit__icon"></i></span>
   </td>
</tr>
"""

doc = LexborHTMLParser(html_test)
print(doc.html)

When running the example, the <tr class="clickable"> tags no longer appear in the output. They have been removed, which shouldn't happen. Could you look into why this happens?

BarryThrill avatar May 06 '22 16:05 BarryThrill

I think that's because there is no <table>. https://github.com/rushter/selectolax/issues/2#issuecomment-355850317

rushter avatar May 07 '22 17:05 rushter

I think that's because there is no <table>. #2 (comment)

Oh I see! Any suggestions on how I can scrape the <tr> elements in that case? When I do a GET on the webpage I am using, the HTML I showed above is the entire response, so I would really like to know whether there is any way to make this work, or whether I should try another parser.

BarryThrill avatar May 07 '22 17:05 BarryThrill

If you know in advance when you will get such HTML, just wrap it in <table> content </table>.

rushter avatar May 07 '22 18:05 rushter
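
The wrapping suggested above could look roughly like the following. This is only a sketch: the helper names wrap_rows_in_table and extract_rows are made up for illustration and are not part of selectolax, and it assumes the response body consists solely of bare <tr> rows like the html_test string in the original post.

```python
def wrap_rows_in_table(fragment: str) -> str:
    """Wrap bare <tr> rows in <table> so an HTML5 parser does not drop them.

    Hypothetical helper, not a selectolax API.
    """
    return f"<table>{fragment}</table>"


def extract_rows(fragment: str) -> list[dict]:
    """Parse the wrapped fragment and pull a few fields out of each row."""
    # Imported here so wrap_rows_in_table stays usable without selectolax.
    from selectolax.lexbor import LexborHTMLParser

    doc = LexborHTMLParser(wrap_rows_in_table(fragment))
    rows = []
    for tr in doc.css("tr.clickable"):
        # Selectors follow the markup from the issue; adjust for other pages.
        product = tr.css_first("input.productid")
        rows.append(
            {
                "price": tr.attributes.get("data-price"),
                "product_id": product.attributes.get("value") if product else None,
            }
        )
    return rows
```

Calling extract_rows(html_test) with the string from the original post should then give one dict per <tr class="clickable"> row, since the rows now sit inside a table and the parser has no reason to discard them.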