
.extract() is unable to get data properly from sparse tables

Open shubham-MLwiz opened this issue 4 years ago • 5 comments

I created a manual table to reproduce the bug which I am facing

<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
   <thead>
      <tr>
        <th class="">Mar 2008</th>
        <th class="">Mar 2009</th>
        <th class="">Mar 2010</th>
      </tr>
   </thead>
   <tbody>
      <tr>
        <td class="">8,626</td>
        <td class="">8,427</td>
        <td class="">11,525</td>
      </tr>
      <tr>
        <td class="">16,408</td>
        <td class="">19,582</td>
        <td class=""></td>
      </tr>
      <tr>        
        <td class=""></td>
        <td class="">22,574</td>
        <td class="">21,755</td> 
      </tr>
   </tbody>
</table>

When I run the code below against the HTML above, this is the output I get:

>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']

As you can see, it does not return anything for the empty data cells. It skips all empty values, which looks like a bug.

Similarly, if you run the code below you will see some odd results. I am confused because I would not expect it to behave like this.

>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3

Both .getall() and .extract() give the same issue.

shubham-MLwiz (May 28 '20 13:05)

AFAICT, this is expected. "td::text" does not exist if there is no text, that's why it's not included in the results and why len(rows[2].css("td")) != len(rows[2].css("td::text")).

Were you expecting some other value, None for instance?

PS: to reproduce in parsel:

In [1]: html = """<!DOCTYPE html> 
   ...: <html lang="en"> 
   ...: <table class="manual_table"> 
   ...:    <thead> 
   ...:       <tr> 
   ...:         <th class="">Mar 2008</th> 
   ...:         <th class="">Mar 2009</th> 
   ...:         <th class="">Mar 2010</th> 
   ...:       </tr> 
   ...:    </thead> 
   ...:    <tbody> 
   ...:       <tr> 
   ...:         <td class="">8,626</td> 
   ...:         <td class="">8,427</td> 
   ...:         <td class="">11,525</td> 
   ...:       </tr> 
   ...:       <tr> 
   ...:         <td class="">16,408</td> 
   ...:         <td class="">19,582</td> 
   ...:         <td class=""></td> 
   ...:       </tr> 
   ...:       <tr>         
   ...:         <td class=""></td> 
   ...:         <td class="">22,574</td> 
   ...:         <td class="">21,755</td>  
   ...:       </tr> 
   ...:    </tbody> 
   ...: </table>"""

In [2]: from parsel import Selector

In [3]: s = Selector(text=html)

In [4]: rows = s.css(".manual_table tbody tr")

In [5]: rows[0].css("td::text").extract()
Out[5]: ['8,626', '8,427', '11,525']

In [6]: rows[1].css("td::text").extract()
Out[6]: ['16,408', '19,582']

In [7]: rows[2].css("td::text").extract()
Out[7]: ['22,574', '21,755']

elacuesta (May 28 '20 14:05)

Thanks for the clarification. But I still think that if I am scraping a table, I should be able to get all the td values, empty cells included. Currently I get them by looping over the cells and calling .get() with a default argument.

rows = response.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

Is there a better way to parse a sparse table other than the looping method?

My suggestion is that .getall() and .extract() should accept a similar default argument. Then, if a tag is present but has no corresponding "::text", we could assign it a default value rather than having it dropped from the results.

shubham-MLwiz (May 28 '20 17:05)

Is anyone looking into this?

shubham-MLwiz (May 31 '20 10:05)

> Is there a better way to parse a sparse table other than the looping method?

I believe that is the right way to do it with Parsel.

Gallaecio (Jun 01 '20 06:06)

@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string ('') for the empty data cells, unlike text(), which skips them.

>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

Full code

from parsel import Selector

html = """<!DOCTYPE html> 
<html lang="en"> 
<table class="manual_table"> 
  <thead> 
    <tr> 
      <th class="">Mar 2008</th> 
      <th class="">Mar 2009</th> 
      <th class="">Mar 2010</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td class="">8,626</td> 
      <td class="">8,427</td> 
      <td class="">11,525</td> 
    </tr> 
    <tr> 
      <td class="">16,408</td> 
      <td class="">19,582</td> 
      <td class=""></td> 
    </tr> 
    <tr>         
      <td class=""></td> 
      <td class="">22,574</td> 
      <td class="">21,755</td>  
    </tr> 
  </tbody> 
</table>
</html>"""

s = Selector(text=html)

rows = s.css(".manual_table tbody tr")

dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

print("Loop:", dt)

dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()

print("One-liner:", dt2)

Output

Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
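
Both approaches give a flat list, so the row boundaries are lost. If the rows are needed, one option is to regroup the values by the number of columns. This is only a sketch: `chunk` is a hypothetical helper, the width of 3 is taken from the three header cells, and it assumes every row has the same number of cells (no colspan or rowspan).

```python
def chunk(values, width):
    """Hypothetical helper: split a flat list into consecutive rows of `width` items."""
    return [values[i:i + width] for i in range(0, len(values), width)]


# Flat output copied from the one-liner above.
flat = ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

table = chunk(flat, 3)
for row in table:
    print(row)
```

The empty strings keep each value in its original column, which is the point of preserving empty cells in the first place.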

I'm commenting on this old issue because I've faced it today.

ilyazub (Feb 16 '22 20:02)