Scraping from archives feature

Open catfromplan9 opened this issue 2 years ago • 0 comments

Add a flag for scraping from archive sites. When set, it would detect archive.today pages (there are a few backup domains people use, so don't hardcode the domain) and, if it finds one, edit the HTML to remove the divs that contain the archive site's own wrapper, leaving behind just the original site contents. I did this manually and I'm sure it could be automated. For archive.org, you can parse out an HTML field on the page that contains a link to the un-archive.org-ified webpage, just as it was originally.

Also, add another flag to disable converting links on the page when this archive option is on. Converting links can work by looking for a second https:// or http:// after the start of the link and keeping everything from that point on.
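To make the link conversion idea concrete, here is a rough sketch in Python (not monolith's actual code, just an illustration). The function name `unwrap_archived_link` is made up for this example. It searches for an `http://` or `https://` that starts anywhere after the first character of the link; that also covers root-relative archive links, which is an assumption on my part rather than something I verified against every archive site.

```python
import re

# Matches the scheme prefix of an absolute http(s) URL.
_SCHEME = re.compile(r"https?://", re.IGNORECASE)


def unwrap_archived_link(href: str) -> str:
    """Return the original URL embedded in an archive link.

    Hypothetical helper: if an http(s) scheme appears anywhere after the
    first character of the link, everything from that scheme onward is
    treated as the original URL. Otherwise the link is returned unchanged.
    """
    # Start at index 1 so a scheme at the very start of the link is skipped.
    embedded = _SCHEME.search(href, 1)
    if embedded is None:
        return href  # plain link, nothing to unwrap
    return href[embedded.start():]


if __name__ == "__main__":
    wrapped = "https://web.archive.org/web/20230507170500/https://example.com/page"
    print(unwrap_archived_link(wrapped))
    print(unwrap_archived_link("https://example.com/page"))
```

One known weakness: a normal link that carries another URL in its query string (for example a redirect parameter) would also be unwrapped, which is one reason a flag to turn the conversion off is useful.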

You could support other archive sites with this feature, but I only know of these two. I did this manually with a site I archived using monolith, and I haven't seen any tool for parsing archive.org or archive.today pages back into their original format.

May 07 '23 17:05 catfromplan9