Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

Table markdown syntax incorrect in some cases

Thanks for the detailed example. This seems to be related to `` in tables. Nested table structures are difficult to process. I'll leave the issue open for now and see...

Markdown table fixes

@naktinis Thanks for the PR. The code looks good, but a test fails in `test_table_processing()` (part of `unit_tests.py`). Could you please have a look, change the code (or the test if...

Markdown table fixes

@naktinis Thanks, it's much better now! I have a few questions:
- You're introducing a `span` attribute because otherwise it would be difficult to keep track of the total number...

Then I'd be in favor of removing the attribute from the output:
- somewhere after `for subelement in table_elem.iterdescendants():`
- `del newrow.attrib["span"]` (if it's present)
Could you please implement the...
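The removal suggested in this comment could be sketched roughly as follows. This is only an illustration, not the code from the PR: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies, and the function name `strip_span_attributes` and the sample table are hypothetical. With lxml, the loop would be `for subelement in table_elem.iterdescendants():` as quoted in the comment.

```python
import xml.etree.ElementTree as ET


def strip_span_attributes(table_elem):
    """Remove the helper `span` attribute wherever it is present.

    Hypothetical helper for illustration only. Note that ElementTree's
    iter() also visits table_elem itself, whereas lxml's iterdescendants()
    only visits the elements below it.
    """
    for subelement in table_elem.iter():
        # pop() with a default does nothing when the attribute is absent,
        # which covers the "if it's present" condition without a KeyError
        subelement.attrib.pop("span", None)
    return table_elem


# Made-up sample: only the first row carries the helper attribute
table = ET.fromstring(
    '<table><row span="2"><cell>a</cell><cell>b</cell></row>'
    "<row><cell>c</cell></row></table>"
)
strip_span_attributes(table)
print(ET.tostring(table, encoding="unicode"))
```

Using `attrib.pop("span", None)` instead of `del` avoids a separate membership check; either form should work once the presence of the attribute is handled.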

Markdown table fixes

Thanks!

fix error with float32 error with json.dumps in server mode

@Aprilistic Thank you for describing the bug and suggesting a solution. I just decided to use the commit you mentioned in a pull request. Do you have anything to add?

Keep track of where the date has been found

This enhancement is not implemented yet. If you set the `extensive_search` option to `False` you'll restrict the search to less error-prone patterns.

Possibly wrong Mediacloud test data?

Hi @RadhiFadlillah Thanks for your feedback, I'll have a look.

Question regarding title extraction

@unsleepy22 Thanks for the suggestion, the heuristics for titles could be improved. The problem with your approach arises when the head title also contains the name of the website or...

Duplicated lines when nested in <article> and <main>, with <br> in front

It's a bug indeed, but as a side note, it's possible to use Trafilatura's duplicate filter with a low threshold to prevent it.
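To illustrate the general idea of a duplicate filter with a low threshold, here is a minimal standalone sketch. It is not Trafilatura's implementation and does not use its API: the function `filter_repeated_lines` and the parameter `max_repetitions` are hypothetical names chosen for this example, and the sample lines are made up.

```python
from collections import Counter


def filter_repeated_lines(lines, max_repetitions=1):
    """Keep each distinct line at most `max_repetitions` times.

    Hypothetical helper for illustration; blank lines are always kept.
    """
    seen = Counter()
    kept = []
    for line in lines:
        key = line.strip()
        if not key:
            # Preserve blank lines so the paragraph structure survives
            kept.append(line)
            continue
        seen[key] += 1
        # Drop the line once it has appeared more often than allowed
        if seen[key] <= max_repetitions:
            kept.append(line)
    return kept


sample = [
    "Intro paragraph.",
    "Repeated line.",
    "Repeated line.",
    "Closing paragraph.",
]
deduplicated = filter_repeated_lines(sample, max_repetitions=1)
print(deduplicated)
```

A threshold of 1 removes every repeat, which is the kind of low setting that would hide the duplicated lines described in the issue; a higher value tolerates legitimate repetition.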