is it possible to output regular files instead of warc?
i only want files, not warc.
can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)
side question: has anyone here actually had good results with getting files back out of warc? this wouldn't be such a big deal if that were possible. i've never seen a util that can extract files from warcs with 100% success rate (and it's usually insanely slow).
i've tried:
- jwat-tools: seemed the best coded of the bunch but gave me nonsensical filenames like extracted.001, and idk how to get past that
- warcat: slow and fails on many warcs
- warc-extractor: the easiest to use of the bunch (it can hit a bunch of warcs in a single dir), but it's insanely slow, and it also fails on many warcs
- the unarchiver: fails on some warcs
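For anyone who wants to try scripting the extraction themselves, here is a minimal sketch in Python. It assumes the third-party warcio library (not one of the tools listed above, and not tested against the WARCs discussed here). The url_to_path helper and the out directory name are made up for illustration. It only writes HTTP 200 response bodies to disk under host/path and does not rewrite links to relative ones.

```python
from pathlib import Path
from urllib.parse import urlsplit, unquote


def url_to_path(url, root):
    """Map a URL to a file path under root (hypothetical naming scheme)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing is written outside root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    # Query strings are ignored, so URLs differing only by query will collide.
    return Path(root, parts.netloc, *segments)


def extract_warc(warc_file, root="out"):
    """Write every HTTP 200 response body in warc_file to a plain file."""
    # Imported here so the helper above works without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_file, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            target = url_to_path(url, root)
            target.parent.mkdir(parents=True, exist_ok=True)
            # content_stream() undoes chunked and content encoding.
            target.write_bytes(record.content_stream().read())
```

Calling extract_warc("example.warc.gz") would then populate out/ with one directory per host. A URL that is both a file and a directory prefix (e.g. /a and /a/b) will still clash with this naming scheme, so treat it as a starting point only.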
May be of interest to you: https://replayweb.page/ can load WARCs and allow you to browse them. It works best on websites that don't heavily rely on JavaScript.
I'd suggest using wpull on its own (grab-site is basically wpull but tuned for easier crawling) but the current state of wpull outside of wrappers like this is awful. :/
thanks. i'm familiar with replayweb, but warc is really not for me.
i want the option to be able to do things like:
- host the archive as static content on nginx
- iterate over files to scrape content with certain tools
it's just easier for me to work with files.
tbh, i would just use wget, but i'm having problems with it staying logged in even when using the various cookie options. sigh
i've tried:
- --load-cookies exported_from_firefox.txt --keep-session-cookies
- --load-cookies exported_from_firefox.txt --keep-session-cookies --save-cookies exported_from_firefox.txt
neither works. any tips?
it's very frustrating because i've had luck using curl with the same cookie file like this:
- --cookie exported_from_firefox.txt --cookie-jar exported_from_firefox.txt
but curl has no crawling functionality.
Does grab-site work with the cookie issue?
Go into the exported_from_firefox.txt file and check for any lines starting with #HttpOnly_. Those are a common problem with cookies.txt parsers as they aren't part of any official specification. I've had luck occasionally with removing the #HttpOnly_ prefix from the beginning of the line (don't remove the dot that follows it though, I don't think) but your mileage may vary.
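If editing the file by hand gets tedious, the same cleanup can be sketched in Python. This assumes the problem lines begin with the literal prefix #HttpOnly_ followed by the domain, as in a typical Netscape-format export. The function name and the cleaned_cookies.txt output name are made up for illustration.

```python
import sys

PREFIX = "#HttpOnly_"


def strip_httponly(text):
    """Remove the #HttpOnly_ prefix from cookie lines, keeping the rest intact."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(PREFIX):
            # Only the prefix goes; a leading dot on the domain is kept.
            line = line[len(PREFIX):]
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    # Usage: python strip_httponly.py exported_from_firefox.txt cleaned_cookies.txt
    src, dst = sys.argv[1], sys.argv[2]
    with open(src, encoding="utf-8") as f:
        cleaned = strip_httponly(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write(cleaned)
```

The cleaned file can then be passed to wget with --load-cookies in place of the original export. Writing to a separate file keeps the original around in case this makes things worse.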
i was so super frustrated with trying to extract files from old WARCs from another project that i didn't even bother trying grab-site without first determining that it could save plain files, haha. that's kind of a prerequisite for me now.
httrack is starting to look like one of the only candidates at this point.
i'll look into your cookie tips and see if i can get wget working first though since i'm already pretty familiar with wget.
at first glance, i think your #HttpOnly tip fixed it for me. i'll stick with wget for now until i need something more complex. many thanks.
@TheTechRobo Seconding the request for plain HTML files, but for the purpose of plugging them into AI document parsers like Khoj or GPT4All. Summarizing blogs and making personal assistants out of them is kinda lit.