Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

Codeblock Markdown formatting is missing

The general output of code blocks should now be better; however, certain elements are not fully converted to Markdown. A few lines could be added to this function if anyone...

Doesn't detect bullet points within tables

I agree, it's uncommon to find bullet points in tables but it would indeed be a useful addition.

Doesn't detect bullet points within tables

If the parent element is a table, it's a duplicate issue; otherwise, it's a problem with the extraction of nested elements.

Doesn't detect bullet points within tables

I see that the first case now appears to be solved. The second one can be addressed by focusing on recall: `favor_recall=True` in Python, `--recall` on the command line.
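
A minimal sketch of the Python side of this suggestion, assuming the `trafilatura` package is installed. The wrapper name `extract_favoring_recall` and the `RECALL_OPTIONS` dictionary are hypothetical helpers introduced for illustration; only `extract` and its `favor_recall` parameter come from the comment above.

```python
# Hypothetical options dictionary: biases extraction towards recall,
# i.e. keeping more content at the risk of including more noise.
RECALL_OPTIONS = {"favor_recall": True}


def extract_favoring_recall(html):
    """Run trafilatura's extract() with the recall-oriented setting."""
    # Imported lazily so this module still loads where trafilatura is absent.
    from trafilatura import extract

    return extract(html, **RECALL_OPTIONS)
```

On the command line, the counterpart is the `--recall` flag mentioned above.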

Collected links as metadata field?

Hi @Amaimersion, I believe you can do it without adding a new feature. You can work on cleaner text and on nodes by using XML as output format: 1. `extract(your_document,...
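
One way to act on the XML suggestion, sketched with the standard library only. It assumes the XML string was produced beforehand (for instance with `output_format="xml"` and `include_links=True`) and that links appear as `<ref target="...">` elements; both points are assumptions to check against the actual output. The function name `collect_links` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def collect_links(xml_output: str) -> List[str]:
    """Return the link targets found in an XML document string."""
    root = ET.fromstring(xml_output)
    # Assumption: each link is serialized as <ref target="...">text</ref>.
    return [ref.get("target") for ref in root.iter("ref") if ref.get("target")]


# Hand-written sample standing in for real extraction output.
sample = (
    "<doc><main><p>See "
    '<ref target="https://example.org">this page</ref>.'
    "</p></main></doc>"
)
links = collect_links(sample)
print(links)
```

The resulting list can then be stored next to the other metadata by the calling code, without changing the library itself.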

Collected links as metadata field?

The links could indeed be added as a metadata field, at the cost of a bit of rewiring in the code.

xml namespace support in sitemaps

Hi @stdweird, extraction using regexes is brittle, but I thought it was a reasonable way to tackle sitemaps, which are fairly regular in form. It'd be great if you could...
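
To illustrate the trade-off, here is a hedged sketch of regex-based link extraction that tolerates a namespace prefix on `<loc>` elements. The pattern `LOC_REGEX` and the function `sitemap_links` are hypothetical and are not the library's actual implementation; CDATA sections and attributes on `<loc>` are deliberately not handled, which shows where the brittleness comes from.

```python
import re

# Hypothetical pattern: matches <loc>...</loc> with an optional namespace
# prefix such as <ns:loc>, and strips surrounding whitespace from the URL.
LOC_REGEX = re.compile(r"<(?:\w+:)?loc>\s*(.*?)\s*</(?:\w+:)?loc>", re.DOTALL)


def sitemap_links(sitemap: str):
    """Return the URLs found in <loc> elements of a sitemap string."""
    return LOC_REGEX.findall(sitemap)


# Hand-written sample mixing prefixed and unprefixed elements.
sample = """<ns:urlset xmlns:ns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <ns:url><ns:loc>https://example.org/a</ns:loc></ns:url>
  <url><loc> https://example.org/b </loc></url>
</ns:urlset>"""
urls = sitemap_links(sample)
print(urls)
```

A namespace-aware XML parser would be more robust, at the cost of failing outright on malformed sitemaps that a regex can still partially read.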

colors and error in running gui

Thanks for your feedback. Neither the GUI nor the underlying package (Gooey) is actively maintained. As such, the possibility exists, but I really recommend using the command-line interface...

Extraction of Youtube iframes and img elements with links

Hi @sampathmende, thanks for your feedback.
- iframes are tricky, they could be missing although I couldn't find an example in the webpage you mention
- images are a problem...

Extraction of Youtube iframes and img elements with links

The library is geared towards text extraction; in the page you mention, all of the main text is extracted correctly. Keeping elements containing YouTube videos would require additional code.