Add option to automatically crawl up to any potential directory listing

Open dkl3 opened this issue 9 years ago • 8 comments

Hi, when I run grab-site, I get the feeling that it doesn't check whether the directories above each crawled URL are unprotected (i.e. serve a directory listing). Checking them is crucial to creating a complete site archive, since some content, like unlinked _vti_cnf directories, can only be found that way. Not all sites have "index of" directories, though.

In the past I've had to check manually, using Google/Bing to find a site's unprotected directories.

Adding this as either a wpull or a grab-site argument would mean a lot.

dkl3 avatar Mar 30 '16 16:03 dkl3

wpull probably needs a new hook/API for this. Ideally accept_url could just generate parent URLs and feed them into wpull to be queued.
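
To make the "generate parent URLs" step concrete, here is a minimal sketch using only the Python standard library. `parent_urls` is a hypothetical helper, not part of wpull or grab-site, and it does not show how the results would be queued. It assumes a trailing path segment without a slash is a file name, and it drops the query string and fragment.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the directory URLs above `url`, nearest first, ending at the root.

    Hypothetical helper for illustration; not a wpull or grab-site API.
    """
    parts = urlsplit(url)
    # Non-empty path segments; empty ones (from "//" or a trailing "/") are dropped.
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Walk upward: drop one more trailing segment on each iteration.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "".join(seg + "/" for seg in segments[:depth])
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


if __name__ == "__main__":
    for parent in parent_urls("http://example.com/docs/_vti_cnf/page.htm"):
        print(parent)
```

A URL that is already the site root yields an empty list, so nothing extra would be queued for it.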

(Actually, maybe I can navigate to the wpull object I need with some kind of wpull_hook.factory.get(...) call? I haven't explored this in detail.)

ivan avatar Mar 30 '16 16:03 ivan

Is there an actual function for "accept_url"? I don't think we have a way to use this yet.

dkl3 avatar Mar 30 '16 21:03 dkl3

Will you add the new hook for checking "index of" directories? I'd love that.

dkl3 avatar Apr 05 '16 19:04 dkl3

@chfoo is it safe to call wpull_hook.factory.get('URLTable').add_many(...) to feed in extra URLs to crawl?

ivan avatar Apr 05 '16 20:04 ivan

Seems to be working, in any case

ivan avatar Apr 05 '16 20:04 ivan

Implementing this in grab-site means duplicating some wpull logic (e.g. knowing not to go up above any of the start URLs; parsing and getting the parent URL; making up inline=0, referrer=url, ... values for add_many), so it might actually be better to implement in wpull instead.
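
As a rough illustration of the "don't go up above any of the start URLs" check, here is a standard-library sketch. Both function names are hypothetical, and wpull's own notion of crawl scope may differ; this version simply treats the directory containing each start URL as the upper bound. The extra values that add_many would need (inline, referrer, ...) are not shown.

```python
from urllib.parse import urlsplit


def start_directory(url):
    """Return (scheme, netloc, directory path) for the directory containing `url`."""
    parts = urlsplit(url)
    # Everything up to and including the last "/" in the path; "" becomes "/".
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.scheme, parts.netloc, directory


def is_within_start_urls(candidate, start_urls):
    """True if `candidate` is at or below the directory of any start URL.

    Hypothetical helper for illustration; not a wpull or grab-site API.
    """
    c = urlsplit(candidate)
    c_path = c.path or "/"
    for start in start_urls:
        scheme, netloc, directory = start_directory(start)
        same_site = (c.scheme, c.netloc) == (scheme, netloc)
        # `directory` ends with "/", so "/docs2/" does not match "/docs/".
        if same_site and c_path.startswith(directory):
            return True
    return False


if __name__ == "__main__":
    starts = ["http://example.com/site/docs/index.html"]
    for url in ("http://example.com/site/docs/", "http://example.com/site/"):
        print(url, is_within_start_urls(url, starts))
```

A parent URL would only be queued when this check passes, which keeps the crawl from climbing to the site root when the start URL is a subdirectory.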

Unfinished code is in https://github.com/ludios/grab-site/commits/find-parent-indexes

ivan avatar Apr 05 '16 21:04 ivan

Has there been any progress with this lately?

dkl3 avatar Apr 20 '16 23:04 dkl3

No immediate plans to do this in grab-site, partly for the reasons mentioned above. Maybe someone can try to do this in wpull or, if that doesn't work out, finish the unfinished code above.

ivan avatar Apr 20 '16 23:04 ivan