Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

feat(cli): Add --filename-template and --max-length options

Please also improve test coverage.

feat(cli): Add --filename-template and --max-length options

@AdamQuadmon Are you still working on the PR?

Empty Results When Using Spider Function with Category URL

Hi @felipehertzer, I cannot reproduce the issue: I get results for your example with the latest version of the code (from the GitHub repository). Did you make other changes?

Empty Results When Using Spider Function with Category URL

I still cannot reproduce it: `probe_alternative_homepage()` works as expected and returns the HTML code, `https://www.australiandefence.com.au/news/news` and `https://www.australiandefence.com.au`. Besides, the lines `if response.url not in homepage and response.url != "/":` you're...

Empty Results When Using Spider Function with Category URL

Thanks for the details. This is tricky; it may be a bug in urllib3. How do you think we can solve this?

Review HTML element list and ensure complete XML conversion coverage

Hi @eyupcanakman, the idea looks good, but as it stands your code isn't actually used during extraction, so it's hard to tell what the benefit would be.

Review HTML element list and ensure complete XML conversion coverage

@eyupcanakman Your PR doesn't change anything in the way documents are processed. I will close it if you don't integrate it into the actual code.

Review HTML element list and ensure complete XML conversion coverage

@eyupcanakman It works but it doesn't make much sense to keep both conversions active, or am I getting it wrong?
- for elem in tree.iter(CONVERSIONS.keys()):
- for elem in tree.iter(*_ALL_TAGS_TO_CONVERT):...

Review HTML element list and ensure complete XML conversion coverage

@eyupcanakman The last change looks good but I still need to think about the PR. There is a small negative impact on the benchmark. You get more coverage if you...

extract function runs indefinitely on large HTML body content

Hi @hitesh1997, there was such a timeout function, but the underlying `signal` library prevents use of the extract function in certain contexts; see #202 for details. You can write a...
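
To illustrate why a `signal`-based timeout can get in the way, here is a minimal sketch using only the Python standard library. It assumes that one of the problematic contexts is calling the code from a thread other than the main one, which may or may not be the exact case discussed in #202. The helper name `try_install_handler` is hypothetical and is not part of any library.

```python
import signal
import threading


def try_install_handler():
    """Try to register a signal handler from the current thread.

    Returns None on success, or the error message if Python refuses.
    (Hypothetical helper, for illustration only.)
    """
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        return None
    except ValueError as err:
        # Python only allows signal handlers to be set from the main thread.
        return str(err)


result = {}

# Run the attempt in a worker thread, as a web server or task queue might do.
worker = threading.Thread(
    target=lambda: result.update(error=try_install_handler())
)
worker.start()
worker.join()

print(result["error"])
```

Because the handler cannot be registered outside the main thread, any function that sets up a signal-based timeout internally would fail in the same way when called from a worker thread.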