Adrien Barbaresi
Thanks for the detailed example. This seems to be related to `` in tables. Nested table structures are difficult to process. I'll leave the issue open for now and see...
@naktinis Thanks for the PR, the code looks good, but a test fails in `test_table_processing()` (part of `unit_tests.py`). Could you please have a look, change the code (or the test if...
@naktinis Thanks, it's much better now! I have a few questions: - You're introducing a `span` attribute because otherwise it would be difficult to keep track of the total number...
Then I'd be in favor of removing the attribute from the output: - somewhere after `for subelement in table_elem.iterdescendants():` - `del newrow.attrib["span"]` (if it's present) Could you please implement the...
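A minimal sketch of the suggested cleanup, assuming the helper attribute is literally named `span` and using the standard library's `xml.etree.ElementTree` as a stand-in for the project's lxml tree. The function name `strip_span_attributes` and the sample markup are hypothetical, not taken from the PR:

```python
import xml.etree.ElementTree as ET


def strip_span_attributes(table_elem):
    """Remove the helper `span` attribute from the table and everything below it."""
    # ElementTree's iter() also visits the root element itself, whereas
    # lxml's iterdescendants() would skip it; harmless for this cleanup.
    for subelement in table_elem.iter():
        # pop() with a default does nothing when the attribute is absent,
        # so no explicit "if it's present" check is needed.
        subelement.attrib.pop("span", None)
    return table_elem


# Hypothetical converted table in which one row still carries the helper attribute.
table = ET.fromstring(
    '<table><row span="2"><cell>a</cell><cell>b</cell></row>'
    '<row><cell>c</cell></row></table>'
)
strip_span_attributes(table)
result = ET.tostring(table, encoding="unicode")
```

Using `pop` with a default instead of `del` avoids a `KeyError` on rows that never received the attribute.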
Thanks!
@Aprilistic Thank you for describing the bug and suggesting a solution. I just decided to use the commit you mentioned in a pull request. Do you have anything to add?
This enhancement is not implemented yet. If you set the `extensive_search` option to `False` you'll restrict the search to less error-prone patterns.
Hi @RadhiFadlillah, thanks for your feedback. I'll have a look.
@unsleepy22 Thanks for the suggestion, the heuristics for titles could be improved. The problem with your approach arises when the head title also includes the name of the website or...
It is indeed a bug, but as a side note, you can use Trafilatura's duplicate filter with a low threshold to prevent it.