
Why do allowed_by_robots and one_agent_allowed_by_robots parse robots.txt for each request?

Open let4be opened this issue 4 years ago • 2 comments

This API and example are really confusing... Why can't we simply parse once and then call methods to check whether a URL is allowed?
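For reference, here is roughly what usage looks like today (a sketch based on the crate's README, assuming the `DefaultMatcher` API): every call hands over the raw robots.txt body, and the matcher re-parses it internally before answering.

```rust
use robotstxt::DefaultMatcher;

fn main() {
    let robots_body = "user-agent: FooBot\n\
                       disallow: /\n";

    let mut matcher = DefaultMatcher::default();

    // Both checks pass the raw robots.txt text again, so the body is
    // re-parsed on every call even though it hasn't changed.
    assert!(!matcher.one_agent_allowed_by_robots(robots_body, "FooBot", "https://example.com/"));
    assert!(!matcher.one_agent_allowed_by_robots(robots_body, "FooBot", "https://example.com/page"));
}
```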

let4be avatar May 31 '21 10:05 let4be

Good point. This crate is simply a port of Google's original library to Rust, so we kept the original logic: parse -> emit for each request. Indeed, this could be optimized to one parse for multiple requests. (P.S. I don't know why Google never did this.) Of course, contributions are always welcome.
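To make the idea concrete, here is a rough sketch of the "parse once, check many" shape a crawler would want. The `HostRobots` type is hypothetical and not part of this crate; as written it still re-parses inside `allowed`, which is exactly the part the optimization would replace with a one-time parse.

```rust
use robotstxt::DefaultMatcher;

/// Hypothetical wrapper (not part of this crate): caches one host's
/// robots.txt body so callers stop passing the raw text around.
/// With the proposed optimization, `new` would do a single parse and
/// `allowed` would match against the stored rule groups instead.
pub struct HostRobots {
    body: String,
}

impl HostRobots {
    pub fn new(robots_body: &str) -> Self {
        Self { body: robots_body.to_owned() }
    }

    pub fn allowed(&self, user_agent: &str, url: &str) -> bool {
        // Today this still re-parses the body on every call; the goal
        // is to turn it into a lookup against pre-parsed rules.
        let mut matcher = DefaultMatcher::default();
        matcher.one_agent_allowed_by_robots(&self.body, user_agent, url)
    }
}

fn main() {
    let robots = HostRobots::new("user-agent: FooBot\ndisallow: /private\n");
    assert!(robots.allowed("FooBot", "https://example.com/public"));
    assert!(!robots.allowed("FooBot", "https://example.com/private/page"));
}
```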

Folyd avatar Jun 03 '21 15:06 Folyd

I was looking for a decent robots.txt library written in Rust to integrate into my broad web crawler (an open-source toy project), and so far this one seems like the best bet because of the Google lineage and the tests...

But I don't like the "parse for each request" approach; it seems like needless overhead from a performance standpoint. From a quick look at the source code, I think the change would be somewhere around here: https://github.com/Folyd/robotstxt/blob/d46c028d63f15c52ec5ebd321db7782b7c033e81/src/matcher.rs#L350

If I get some time in the next couple of weeks I might go for it :)

let4be avatar Jun 03 '21 18:06 let4be