Mukesh
@sebastian-nagel Thanks for your response. While I understand the reasoning behind reusing the existing logic, I believe it’s cleaner and more intuitive to ensure positive matches are sitemaps and do...
@sebastian-nagel I created a PR #67 based on your suggestion and my previous comment. Can you please review it?
Do we want to exclusively support pysimdjson, or should we consider implementing adapter classes to support multiple parsers? This would allow users to switch parsers at runtime, similar to what...
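To make the adapter idea concrete, here is a minimal sketch of what such a layer might look like. The `JsonParser` interface, `StdlibJsonParser` class, and `get_parser` registry are hypothetical names for illustration, not anything from the PR; a pysimdjson-backed adapter would simply implement the same interface.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class JsonParser(ABC):
    """Hypothetical adapter interface; each backend wraps one parser library."""

    @abstractmethod
    def loads(self, data) -> Any: ...


class StdlibJsonParser(JsonParser):
    """Adapter backed by the standard-library json module."""

    def loads(self, data) -> Any:
        return json.loads(data)


# A pysimdjson-backed adapter would subclass JsonParser the same way,
# delegating to that library internally.

_parsers: dict = {"stdlib": StdlibJsonParser()}


def get_parser(name: str = "stdlib") -> JsonParser:
    """Select a parser backend at runtime by name."""
    return _parsers[name]
```

With this shape, switching parsers at runtime is a one-line configuration change rather than a code change, at the cost of maintaining one thin adapter per supported library.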
@wumpus Thanks for the comment! I wanted to add a key point: `pysimdjson` provides a highly performant API, as detailed here: [pysimdjson Performance](https://pysimdjson.tkte.ch/performance.html). These APIs, such as `parse`, offer...
After trying out simdjson, there are a few pitfalls that make me hesitant to use simdjson as a direct alternative to the standard json module. 1. The API of the...
@sebastian-nagel I went forward with using pysimdjson in #49. I didn't use https://github.com/tktech/py_yyjson as its API is quite different from the standard Python `json` module, which might not be familiar...
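Since the appeal of pysimdjson here is that it mirrors the standard `json` module, the swap can be kept behind a guarded import. This is a sketch assuming only that pysimdjson exposes a `json`-compatible `loads`, as its documentation advertises; `parse_record` is a hypothetical helper, not code from the PR.

```python
# Prefer pysimdjson when it is installed; otherwise fall back to the
# standard library, since both expose the same loads() call.
try:
    import simdjson as _json  # pysimdjson's json-compatible facade (assumed)
except ImportError:
    import json as _json


def parse_record(raw: bytes):
    """Parse one JSON record with whichever backend was imported."""
    return _json.loads(raw)
```

This keeps the fast path optional: environments without the C extension still work, and the rest of the code never needs to know which backend is active.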
Thanks for the review @sebastian-nagel. > What about keeping traces in the status index for any sitemap from which robots.txt it was detected. A similar feature is already available in...
@sebastian-nagel I made some commits to check the path using [MetadataTransfer](https://github.com/apache/incubator-stormcrawler/wiki/MetadataTransfer) like you suggested. Can you see if that makes sense?
> cross-submits within the pay-level domain are definitely not safe for large hosting domains (blogspot.com, github.io, etc.) and would allow to inject spam links @sebastian-nagel I addressed most of your...
> > The [Apache Http Client](https://hc.apache.org/httpcomponents-client-5.4.x/current/apidocs/org/apache/hc/client5/http/psl/PublicSuffixList.html) which we use to get the root domain does not differentiate these private domains. We need to create a separate list from the [Suffix...