Mukesh
@sebastian-nagel Thanks for your response. While I understand the reasoning behind reusing the existing logic, I believe it’s cleaner and more intuitive to ensure positive matches are sitemaps and do...
@sebastian-nagel I created a PR #67 based on your suggestion and my previous comment. Can you please review it?
Do we want to exclusively support pysimdjson, or should we consider implementing adapter classes to support multiple parsers? This would allow users to switch parsers at runtime, similar to what...
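To make the adapter idea concrete, here is a minimal sketch of what such a layer might look like. The `JsonParser` interface, `StdlibJsonParser` class, and `get_parser` registry are hypothetical names for illustration, not anything from the PR; a pysimdjson-backed adapter would simply implement the same interface.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class JsonParser(ABC):
    """Hypothetical adapter interface; each backend wraps one parser library."""

    @abstractmethod
    def loads(self, data) -> Any: ...


class StdlibJsonParser(JsonParser):
    """Adapter backed by the standard-library json module."""

    def loads(self, data) -> Any:
        return json.loads(data)


# A pysimdjson-backed adapter would subclass JsonParser the same way,
# delegating to that library internally.

_parsers: dict = {"stdlib": StdlibJsonParser()}


def get_parser(name: str = "stdlib") -> JsonParser:
    """Select a parser backend at runtime by name."""
    return _parsers[name]
```

With this shape, switching parsers at runtime is a one-line configuration change rather than a code change, at the cost of maintaining one thin adapter per supported library.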
@wumpus Thanks for the comment! I wanted to add a key point: `pysimdjson` provides a highly performant API, as detailed here: [pysimdjson Performance](https://pysimdjson.tkte.ch/performance.html). These APIs, such as `parse`, offer...
After trying out simdjson, there are a few pitfalls that make me hesitant to use simdjson as a direct alternative to the standard json module. 1. The API of the...
@sebastian-nagel I went forward with using pysimdjson in #49. I didn't use https://github.com/tktech/py_yyjson as its API is quite different from the standard Python `json` module, which might not be familiar...
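Since the appeal of pysimdjson here is that it mirrors the standard `json` module, the swap can be kept behind a guarded import. This is a sketch assuming only that pysimdjson exposes a `json`-compatible `loads`, as its documentation advertises; `parse_record` is a hypothetical helper, not code from the PR.

```python
# Prefer pysimdjson when it is installed; otherwise fall back to the
# standard library, since both expose the same loads() call.
try:
    import simdjson as _json  # pysimdjson's json-compatible facade (assumed)
except ImportError:
    import json as _json


def parse_record(raw: bytes):
    """Parse one JSON record with whichever backend was imported."""
    return _json.loads(raw)
```

This keeps the fast path optional: environments without the C extension still work, and the rest of the code never needs to know which backend is active.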
Thanks for the review @sebastian-nagel. > What about keeping traces in the status index for any sitemap from which robots.txt it was detected. A similar feature is already available in...
@sebastian-nagel I made some commits to check the path using [MetadataTransfer](https://github.com/apache/incubator-stormcrawler/wiki/MetadataTransfer) like you suggested. Can you see if that makes sense?
> cross-submits within the pay-level domain are definitely not safe for large hosting domains (blogspot.com, github.io, etc.) and would allow to inject spam links @sebastian-nagel I addressed most of your...
> > The [Apache Http Client](https://hc.apache.org/httpcomponents-client-5.4.x/current/apidocs/org/apache/hc/client5/http/psl/PublicSuffixList.html) which we use to get the root domain does not differentiate these private domains. We need to create a separate list from the [Suffix...