Adrien Barbaresi comments

Results: 319 comments of Adrien Barbaresi

Issue with LXML on M1 / Apple arm64 platforms

@naftalibeder It doesn't appear to be going forward. Did you try building [LXML from source](https://lxml.de/build.html)?

`include_images` changes text extraction

Hi @carschno, I can reproduce the bug. Extraction with images isn't my priority but I'll try to look into it.

`include_images` changes text extraction

No, it isn't expected, but it looks quite convoluted. The backup algorithm (an internal fork of readability-lxml, but identical here) triggers the error: - No images, backup algorithm used, everything is...

`include_images` changes text extraction

I could be wrong but I don't see any line in the code which could be affected by that. The vertical bars are between quotation marks so they are part...

[Feature request] Add site configuration

Hi @phongtnit, thanks for the suggestion. It looks like an interesting additional functionality. Would you be interested in drafting a corresponding pull request?

anchor issue

@pieterhartel There was a small issue here which I fixed, the rest can be explained by the orphan text at the bottom. If you write `The quick brown fox jumps...

anchor issue

I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is ``.
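To illustrate the point about orphan text, here is a minimal sketch using only Python's standard-library `html.parser`, not the parser actually used in the project. The `LastTagTracker` class, the `attribute_text` helper, and the sample markup (including the choice of `h2` for the titles) are hypothetical and only meant to show how text without its own tag ends up associated with the last tag the parser opened.

```python
from html.parser import HTMLParser


class LastTagTracker(HTMLParser):
    """Pair each non-empty text chunk with the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.last_tag = None   # name of the last start tag seen so far
        self.chunks = []       # list of (last_tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.last_tag = tag

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.last_tag, text))


def attribute_text(html):
    """Return (last_tag, text) pairs for a piece of HTML."""
    tracker = LastTagTracker()
    tracker.feed(html)
    tracker.close()  # flush trailing text that is not followed by a tag
    return tracker.chunks


# Hypothetical sample: the last title is followed by text without a tag.
sample = (
    "<h2>First title</h2><p>Some paragraph.</p>"
    "<h2>Last title</h2>Orphan text at the bottom"
)
result = attribute_text(sample)
for tag, text in result:
    print(tag, "|", text)
```

In this sketch the orphan text is paired with `h2`, simply because that is the last start tag the parser saw before reaching it.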

Extract inline structured data from page <body>

Hi @Seirdy, it seems like an interesting idea but I don't quite see what is currently lacking in the software. Could you please provide a concrete example of what you...

Extract inline structured data from page <body>

Thanks for the info, I get your point. I don't know how rare it is but I assume it is uncommon for web pages to convey information in the HTML...

Add include_video parameter (iframe elements are missing)

Hi @fraseInc, I do tend to discard iframes by design, since embedded content is usually less relevant text-wise. Do you have examples of elements which should be included?