Adrien Barbaresi

Results: 412 comments by Adrien Barbaresi

The general output of code blocks should now be better; however, certain elements are not fully converted to markdown. A few lines could be added to this function if anyone...

I agree, it's uncommon to find bullet points in tables but it would indeed be a useful addition.

If the parent element is a table, it's a duplicate issue; otherwise, it's a problem with the extraction of nested elements.

I see that the first case now appears to be solved. The second one can be addressed by focusing on recall: `favor_recall=True` with Python, `--recall` on the command-line.
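As a rough illustration of the recall setting mentioned above, here is a minimal sketch. It assumes the `trafilatura` package is installed and guards the import so the snippet still runs without it; the sample HTML and the helper name `extract_favoring_recall` are made up for this example.

```python
# Sketch only: the import is guarded because trafilatura is a third-party package.
try:
    from trafilatura import extract
except ImportError:  # package not installed
    extract = None

# Hypothetical sample document, not taken from the original discussion.
SAMPLE_HTML = """<html><body><article>
<h1>Sample title</h1>
<p>First paragraph of the main text, long enough to be kept by the extractor.</p>
<ul><li>A list item that stricter settings might drop.</li></ul>
</article></body></html>"""


def extract_favoring_recall(html):
    """Return the extracted text with recall favored, or None if unavailable."""
    if extract is None:
        return None
    # favor_recall=True asks the extractor to keep more text when in doubt.
    return extract(html, favor_recall=True)


result = extract_favoring_recall(SAMPLE_HTML)
print(result)
```

On the command line, the `--recall` flag plays the same role.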

Hi @Amaimersion, I believe you can do it without adding a new feature. You can work on cleaner text and on nodes by using XML as output format: 1. `extract(your_document,...

The links could indeed be added as a metadata field, at the cost of a bit of rewiring in the code.

Hi @stdweird, extraction using regexes is brittle but I thought it's a reasonable way to tackle sitemaps which are fairly regular in their form. It'd be great if you could...

Thanks for your feedback. Neither the GUI nor the underlying package (Gooey) is actively maintained. As such, the possibility exists, but I really recommend using the command-line interface...

Hi @sampathmende, thanks for your feedback.
- iframes are tricky, they could be missing although I couldn't find an example in the webpage you mention
- images are a problem...

The library is geared towards text extraction; in the page you mention, all of the main text is extracted correctly. Keeping elements containing YouTube videos would require additional code.