html-table-extractor
cell extraction
Your package works great, but I had to modify it slightly. The current code does:
self._insert(row_ind, col_ind, row_span, col_span, self._transformer(cell.get_text()))
This is fine if the content is plain text, but it is problematic if the cell contains links you want to keep, because get_text() discards them.
I have modified it to:
class Extractor(object):
    def __init__(self, table, id_=None, cell_transformer=None):
        ...
        self._cell_transformer = cell_transformer if cell_transformer else lambda x: x.get_text()

    def parse(self):
        ...
        self._insert(row_ind, col_ind, row_span, col_span, self._cell_transformer(cell))
This allows the caller to implement the cell extraction if required.
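As a rough illustration of what such a caller-supplied hook could look like, here is a minimal sketch. `StubCell`, `keep_links` and `transform_row` are hypothetical names used only for this example; `StubCell` stands in for a BeautifulSoup cell so the snippet runs on its own. With a real tag, the hrefs would presumably come from something like `cell.find_all("a")`.

```python
class StubCell(object):
    """Hypothetical stand-in for a BeautifulSoup <td> tag (not library code)."""

    def __init__(self, text, hrefs=()):
        self._text = text
        self.hrefs = list(hrefs)

    def get_text(self):
        return self._text


def default_transformer(cell):
    # Mirrors the proposed default: keep only the cell's text.
    return cell.get_text()


def keep_links(cell):
    # Custom hook: return the text together with any link targets.
    return {"text": cell.get_text(), "links": cell.hrefs}


def transform_row(cells, cell_transformer=None):
    # Fall back to plain text extraction when no hook is supplied.
    transformer = cell_transformer if cell_transformer else default_transformer
    return [transformer(cell) for cell in cells]


row = [StubCell("docs", ["https://example.com/docs"]), StubCell("plain")]
plain = transform_row(row)
with_links = transform_row(row, cell_transformer=keep_links)
```

The point of the design is that the extractor never needs to know what the hook returns; it just stores whatever comes back for each cell.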
Also, having to write three lines:
ext = Extractor(html)
ext.parse()
print(ext.return_list())
It would be nicer to just do:
result = Extractor().parse(html)
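Until something like that exists in the package, a small wrapper can give the same one-call feel. This is only a sketch: `extract_table` is a hypothetical helper, and `StubExtractor` is a made-up stand-in that mimics the constructor / `parse()` / `return_list()` sequence shown above so the example runs without the package installed.

```python
def extract_table(source, extractor_cls, **kwargs):
    """Run construct -> parse -> return_list in a single call."""
    ext = extractor_cls(source, **kwargs)
    ext.parse()
    return ext.return_list()


class StubExtractor(object):
    """Hypothetical stand-in for the real Extractor (assumption, not library code)."""

    def __init__(self, source):
        self._source = source
        self._rows = []

    def parse(self):
        # Pretend each line is a row and each comma separates two cells.
        self._rows = [line.split(",") for line in self._source.splitlines()]

    def return_list(self):
        return self._rows


result = extract_table("a,b\nc,d", StubExtractor)
```

In real use you would pass the package's own `Extractor` class as `extractor_cls` instead of the stub.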
Thanks, this package is small but useful :)
That is a great suggestion; it gives people more flexibility in dealing with cell data.
However, one situation I am thinking of is people who simply want to convert the text to an int. They would need to define lambda cell: int(cell.get_text()), which is burdensome and error-prone for people who are not familiar with BeautifulSoup.
Let me think of a way to deal with this.
Well, they could just pass in a function:
def to_int(cell):
    return int(cell.get_text())

ext = Extractor(html, cell_transformer=to_int)
You could probably simplify it by detecting the argument:
ext = Extractor(html, cell_transformer=int)
...
if cell_transformer is int:
    # the int type itself was passed: apply it to the cell text
    cell_transformer = lambda cell: int(cell.get_text())
else:
    # a function was passed: call it with the cell object as-is
    pass
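The detection idea can be generalised a little beyond `int`. The following is a sketch under assumptions, not the package's code: `resolve_transformer`, `TEXT_TYPES` and `StubCell` are hypothetical names, and the choice of which built-in types count as "text converters" is arbitrary.

```python
# Built-in types that are applied to the cell text rather than the cell object.
TEXT_TYPES = (int, float, str)


def resolve_transformer(cell_transformer=None):
    """Turn the user's argument into a function that accepts a cell object."""
    if cell_transformer is None:
        # Default: plain text extraction.
        return lambda cell: cell.get_text()
    if cell_transformer in TEXT_TYPES:
        # A type such as int was passed: convert the cell's text with it.
        return lambda cell: cell_transformer(cell.get_text())
    if callable(cell_transformer):
        # A user function: hand it the cell object unchanged.
        return cell_transformer
    raise TypeError("cell_transformer must be a type or a callable")


class StubCell(object):
    """Hypothetical stand-in for a BeautifulSoup cell."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


as_int = resolve_transformer(int)(StubCell("42"))
as_text = resolve_transformer()(StubCell("hello"))
as_upper = resolve_transformer(lambda c: c.get_text().upper())(StubCell("abc"))
```

Note the check compares the argument against the type objects themselves; `isinstance(cell_transformer, int)` would test whether the argument is an integer value, which is never true when the `int` type is passed.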
@hampsterx Sorry for forgetting to reply. I have committed the changes to master. Thanks again for the great suggestion!
I also would benefit from getting the cell object rather than the cell text.
Perhaps, since self._transformer is only used in the parse() method, you could add an optional "cell_transformer" kwarg (to the parse() method) that is a function like your class init "transformer" argument.
In your init code you can set an instance flag indicating whether "transformer" was passed in. The logic of parse() could then check the flag: if no custom transformer was given, use the cell_transformer argument instead (if it was passed in) and pass it the cell object. If the flag indicates there is a custom transformer, use self._transformer and pass cell.get_text() to it.
All current code that uses your library should still work, but those who want the cell object can get it as well.
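The flag logic described above might look roughly like this. It is a sketch only: `SketchExtractor` and `StubCell` are hypothetical, the table handling is reduced to a flat list of cells, and the default of `str` for `transformer` is an assumption about the existing behaviour.

```python
class StubCell(object):
    """Hypothetical stand-in for a BeautifulSoup cell."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class SketchExtractor(object):
    def __init__(self, cells, transformer=None):
        self._cells = cells
        # Flag: remember whether the caller supplied a text transformer.
        self._has_custom_transformer = transformer is not None
        self._transformer = transformer if transformer is not None else str

    def parse(self, cell_transformer=None):
        if self._has_custom_transformer or cell_transformer is None:
            # Existing behaviour: the transformer receives the cell text.
            return [self._transformer(c.get_text()) for c in self._cells]
        # New behaviour: the hook receives the cell object itself.
        return [cell_transformer(c) for c in self._cells]


cells = [StubCell("1"), StubCell("2")]
old_default = SketchExtractor(cells).parse()
old_custom = SketchExtractor(cells, transformer=int).parse()
new_hook = SketchExtractor(cells).parse(cell_transformer=lambda c: c)
init_wins = SketchExtractor(cells, transformer=int).parse(cell_transformer=lambda c: c)
```

Calls that never pass `cell_transformer` take the first branch, so existing callers would see no change in this sketch.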