html-table-extractor cell extraction

Your package works great, but I had to modify it slightly. The current code is:

self._insert(row_ind, col_ind, row_span, col_span, self._transformer(cell.get_text()))

This is fine if the content is text, but if a cell contains links you want to keep, it's problematic, because get_text() discards the markup.

I have modified it to:

class Extractor(object):
    def __init__(self, table, id_=None, cell_transformer=None):
        ...
        self._cell_transformer = cell_transformer if cell_transformer else lambda x: x.get_text()

    def parse(self):
      ...
      self._insert(row_ind, col_ind, row_span, col_span, self._cell_transformer(cell))

This allows the caller to implement the cell extraction if required.
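
For example, a caller that wants to keep link targets could pass something like the following. This is only a rough sketch: FakeCell is a hypothetical stand-in for a BeautifulSoup cell tag so the snippet runs on its own, and transform_row and keep_link are illustrative names, not part of the package.

```python
class FakeCell:
    """Hypothetical stand-in for a BeautifulSoup <td> tag (not part of the package)."""

    def __init__(self, text, href=None):
        self._text = text
        self.href = href

    def get_text(self):
        return self._text


def default_transformer(cell):
    # Same behaviour as the proposed default: plain text only.
    return cell.get_text()


def keep_link(cell):
    # Caller-supplied transformer: return (text, href) so the link survives.
    return (cell.get_text(), cell.href)


def transform_row(cells, cell_transformer=None):
    # Mirrors the proposed fallback: use the caller's function if one is given.
    transformer = cell_transformer if cell_transformer else default_transformer
    return [transformer(cell) for cell in cells]


row = [FakeCell("docs", href="https://example.com/docs"), FakeCell("42")]
print(transform_row(row))
print(transform_row(row, cell_transformer=keep_link))
```

With the default, the link is lost; with keep_link, each cell comes back as a (text, href) pair.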

Also, having to write three lines:

ext = Extractor(html)
ext.parse()
print(ext.return_list())

It would be nicer to just do:

result = Extractor().parse(html)

Thanks, this package is small but useful :)

May 09 '17 01:05 hampsterx

That is a great suggestion, because it gives people more flexibility in dealing with cell data. However, one situation I am thinking of is this: people who simply want to convert the text to an int would need to define lambda cell: int(cell.get_text()), which is burdensome and error-prone for anyone not familiar with BeautifulSoup.

Let me think of a way to deal with this.

May 10 '17 13:05 yuanxu-li

Well, they could just pass in a function:

def to_int(cell):
  return int(cell.get_text())

ext = Extractor(html, cell_transformer=to_int)

Could probably simplify it by detecting the argument?

ext = Extractor(html, cell_transformer=int)
...
if cell_transformer is int:
    # the int type itself was passed, so wrap it to read the cell text first
    cell_transformer = lambda cell: int(cell.get_text())
else:
    # already a function that takes the cell
    pass
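
A slightly fuller sketch of that detection, in case it helps. Everything here is an assumption for illustration: resolve_transformer is a made-up helper name, FakeCell stands in for a BeautifulSoup cell tag, and treating float and str the same way as int is my own extension of the idea.

```python
class FakeCell:
    """Hypothetical stand-in for a BeautifulSoup cell tag."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


def resolve_transformer(cell_transformer=None):
    # No argument: fall back to plain text extraction.
    if cell_transformer is None:
        return lambda cell: cell.get_text()
    # A builtin type such as int was passed: apply it to the cell text.
    if cell_transformer in (int, float, str):
        return lambda cell: cell_transformer(cell.get_text())
    # Anything else is assumed to be a function that takes the cell itself.
    return cell_transformer


to_int = resolve_transformer(int)
upper = resolve_transformer(lambda cell: cell.get_text().upper())
print(to_int(FakeCell("7")))
print(upper(FakeCell("abc")))
```

Note that the check compares against the type itself; isinstance(int, int) is False, so an isinstance test would never match when the int type is passed.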

May 10 '17 23:05 hampsterx

@hampsterx Sorry for forgetting to reply. I have committed the changes to master. Thanks again for the great suggestion!

Jul 15 '17 00:07 yuanxu-li

I would also benefit from getting the cell object rather than the cell text.

Perhaps, since self._transformer is only used in the parse() method, you could add an optional "cell_transformer" kwarg (to the parse() method) that is a function like your class init "transformer" argument.

In your init code you can set a flag on the instance indicating whether "transformer" was passed in. The logic of parse() could then check the flag: if there is no custom transformer, use the cell_transformer argument instead (if it was passed in) and pass it the cell object. If the flag indicates there is a custom transformer, use self._transformer and pass cell.get_text() to it.

All current code that uses your library should still work, but those who want the cell object can get it as well.
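
Roughly what I have in mind, as a sketch only. SketchExtractor and FakeCell are hypothetical stand-ins so the snippet runs by itself, the table is simplified to a flat list of cells, and I am assuming the default text transformer is str.

```python
class FakeCell:
    """Hypothetical stand-in for a BeautifulSoup cell tag."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class SketchExtractor:
    """Illustrative only; not the package's Extractor class."""

    def __init__(self, cells, transformer=None):
        self._cells = cells
        # Flag recording whether the caller supplied a text transformer.
        self._has_custom_transformer = transformer is not None
        # Assumed default: leave the cell text as a string.
        self._transformer = transformer if transformer else str

    def parse(self, cell_transformer=None):
        result = []
        for cell in self._cells:
            if not self._has_custom_transformer and cell_transformer is not None:
                # New path: hand the whole cell object to the caller's function.
                result.append(cell_transformer(cell))
            else:
                # Existing path: the transformer only ever sees the text.
                result.append(self._transformer(cell.get_text()))
        return result


cells = [FakeCell("1"), FakeCell("2")]
print(SketchExtractor(cells).parse())
print(SketchExtractor(cells, transformer=int).parse())
```

Calling parse() with no arguments behaves as before, so existing callers are unaffected, while parse(cell_transformer=...) receives the cell objects.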

Apr 30 '23 16:04 samadhicsec
