parsel
.extract() is unable to get data properly from sparse tables
I created a small table by hand to reproduce the bug I am facing:
<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
<thead>
<tr>
<th class="">Mar 2008</th>
<th class="">Mar 2009</th>
<th class="">Mar 2010</th>
</tr>
</thead>
<tbody>
<tr>
<td class="">8,626</td>
<td class="">8,427</td>
<td class="">11,525</td>
</tr>
<tr>
<td class="">16,408</td>
<td class="">19,582</td>
<td class=""></td>
</tr>
<tr>
<td class=""></td>
<td class="">22,574</td>
<td class="">21,755</td>
</tr>
</tbody>
</table>
When I run the code below against the HTML above, this is the output I get:
>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']
As you can see, the empty data cells are missing from the output. All empty values are ignored, which looks like a bug.
Similarly, the code below gives results that confuse me; I did not expect the lengths to differ:
>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3
Both .getall() and .extract() show the same issue.
AFAICT, this is expected. "td::text" does not exist if there is no text; that's why it's not included in the results and why len(rows[2].css("td")) != len(rows[2].css("td::text")).
Were you expecting some other value, None for instance?
PS: to reproduce in parsel:
In [1]: html = """<!DOCTYPE html>
...: <html lang="en">
...: <table class="manual_table">
...: <thead>
...: <tr>
...: <th class="">Mar 2008</th>
...: <th class="">Mar 2009</th>
...: <th class="">Mar 2010</th>
...: </tr>
...: </thead>
...: <tbody>
...: <tr>
...: <td class="">8,626</td>
...: <td class="">8,427</td>
...: <td class="">11,525</td>
...: </tr>
...: <tr>
...: <td class="">16,408</td>
...: <td class="">19,582</td>
...: <td class=""></td>
...: </tr>
...: <tr>
...: <td class=""></td>
...: <td class="">22,574</td>
...: <td class="">21,755</td>
...: </tr>
...: </tbody>
...: </table>"""
In [2]: from parsel import Selector
In [3]: s = Selector(text=html)
In [4]: rows = s.css(".manual_table tbody tr")
In [5]: rows[0].css("td::text").extract()
Out[5]: ['8,626', '8,427', '11,525']
In [6]: rows[1].css("td::text").extract()
Out[6]: ['16,408', '19,582']
In [7]: rows[2].css("td::text").extract()
Out[7]: ['22,574', '21,755']
Thanks for the clarification.
But I still think that if I am scraping a table, I should be able to get all the td values, empty cells included.
Currently I get them by looping over the cells and using .get() with a default argument:
rows = response.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        # .get(default='') yields '' for a cell with no text node
        dt.append(data.css("::text").get(default=''))
Is there a better way to parse a sparse table other than the looping method?
What I suggest is that .getall() and .extract() should accept a similar default argument. Then, if a tag is present but its "::text" is missing, we could assign it a default value rather than dropping it entirely.
Is anyone looking into this?
Is there a better way to parse a sparse table other than the looping method?
I believe that is the right way to do it with Parsel.
@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string for the empty data cells, unlike text(), which skips them.
>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
Full code
from parsel import Selector
html = """<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
<thead>
<tr>
<th class="">Mar 2008</th>
<th class="">Mar 2009</th>
<th class="">Mar 2010</th>
</tr>
</thead>
<tbody>
<tr>
<td class="">8,626</td>
<td class="">8,427</td>
<td class="">11,525</td>
</tr>
<tr>
<td class="">16,408</td>
<td class="">19,582</td>
<td class=""></td>
</tr>
<tr>
<td class=""></td>
<td class="">22,574</td>
<td class="">21,755</td>
</tr>
</tbody>
</table>
</html>"""
s = Selector(text=html)
rows = s.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        # '' is used for cells that have no text node
        dt.append(data.css("::text").get(default=''))
print("Loop:", dt)
dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
print("One-liner:", dt2)
Output
Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
I'm commenting on this old issue because I've faced it today.