
Save a copy of the linked content in case of link rot

Open ihavenogithub opened this issue 9 years ago • 9 comments

After years of using del.icio.us, then Yahoo's Delicious, then self-hosted Scuttle and Semantic Scuttle, I really miss the ability to save a local copy of the linked content, so that a copy remains when link rot eventually occurs.

ihavenogithub · Oct 19 '14 16:10

@ihavenogithub You're right, this was proposed a long time ago (https://github.com/sebsauvage/Shaarli/issues/58). We have been triaging bugs and fixing issues at https://github.com/shaarli/Shaarli/, and concluded that Shaarli should not include complex features like web scraping (or should keep them as plugins, but we don't have a plugin system yet).

I'm working on a python script that:

  • Downloads HTML exports from Shaarli
  • Saves the linked pages, with the ability to filter by tag, download audio/video media, and more.

The script can be run from a client machine (laptop, whatever) or placed on the server itself and run periodically (if the host supports Python and cron jobs); a rough sketch of the approach is below. At the moment the script works perfectly for me, but it needs some cleanup. Would this solve your problem?
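Not the actual script, but a minimal sketch of the idea in Python, assuming Shaarli's Netscape-format HTML export; the file names and tag filter are illustrative:

```python
#!/usr/bin/env python3
"""Sketch: save local copies of links from a Shaarli HTML export."""
import pathlib
import re
import urllib.request

EXPORT_FILE = "shaarli-export.html"  # illustrative path to the export
ARCHIVE_DIR = pathlib.Path("archive")
TAG_FILTER = "music"                 # only save links carrying this tag

ARCHIVE_DIR.mkdir(exist_ok=True)
export = pathlib.Path(EXPORT_FILE).read_text(encoding="utf-8")

# In the Netscape format each bookmark is an <A> element with HREF and TAGS.
for href, tags in re.findall(r'<A HREF="([^"]+)"[^>]*TAGS="([^"]*)"', export, re.I):
    if TAG_FILTER and TAG_FILTER not in tags.split(","):
        continue
    # Derive a crude but filesystem-safe file name from the URL.
    name = re.sub(r"[^\w.-]+", "_", href)[:100] + ".html"
    try:
        with urllib.request.urlopen(href, timeout=30) as resp:
            (ARCHIVE_DIR / name).write_bytes(resp.read())
    except OSError as err:
        print(f"skipped {href}: {err}")
```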

nodiscc · Oct 20 '14 12:10

Probably for a while but I'd rather have this process be done automatically. Would you mind giving me a link to your script? I'd like to give it a try.

ihavenogithub · Oct 20 '14 16:10

It will be automatic if you add it as a scheduled task (cron job). I'm now formatting the script so that it's usable/readable for everyone and will keep this thread updated.
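For instance, a hypothetical crontab entry (script name and paths are illustrative) that runs the archiver nightly at 03:00:

```
# m h dom mon dow  command
0 3 * * * /usr/bin/python3 /home/user/archiver.py >> /home/user/archiver.log 2>&1
```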

nodiscc · Oct 20 '14 17:10

Hey @ihavenogithub, I've started rewriting my script from scratch (it was too damn ugly); check https://github.com/nodiscc/shaarchiver

For now it only downloads HTML exports and audio/video media (with tag filtering), not full pages. Rather limited, but it's a clean start and more is planned (see the issues). Contributions welcome.

nodiscc · Nov 05 '14 14:11

Hi, your archiver script could use Wallabag's scraper; you'd be able to scrape many websites without reinventing the wheel. Wallabag does what you need, but it needs Shaarli integration and automation, I think.

Epy · May 26 '15 12:05

@Epy

  • This tool is written in Python
  • It's a command line tool
  • This tool is for local offline archiving, not for running on a remote server
  • This tool leverages youtube-dl for media downloads (more than 500 supported websites); see the sketch below
  • Once the page download features are in, it will download exact copies of pages, not "readable" versions (except that ads will be removed).

So I don't think Wallabag could be useful for me.
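The sketch referenced above: a hedged illustration of driving youtube-dl from Python through its embedding API; the options and URL are made up for the example, not taken from shaarchiver's actual code:

```python
# Sketch: download a media link via youtube-dl's Python API.
# Requires: pip install youtube-dl
import youtube_dl

opts = {
    "outtmpl": "archive/%(title)s.%(ext)s",  # save files under archive/
    "format": "best",                        # pick the best single format
}
with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://example.com/some-video"])  # illustrative URL
```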

However I agree that wallabag should be able to automatically archive pages from RSS feeds. Did you report a bug for this on the Wallabag issue tracker?
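To make that concrete, a hedged sketch of the RSS idea in Python, assuming a plain RSS 2.0 feed (the feed URL is illustrative): pull the article links so an archiver, or wallabag, could then fetch them.

```python
# Sketch: extract article links from an RSS 2.0 feed.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/feed.xml"  # illustrative feed

with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
    root = ET.fromstring(resp.read())

# RSS 2.0 places each article in <channel><item><link>.
links = [item.findtext("link") for item in root.iter("item")]
print(links)
```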

nodiscc · May 26 '15 15:05

The Wallabag remote server could be at your home ^_^ but I was only suggesting using some components as a library, if possible. Maybe re-use patterns only: https://github.com/wallabag/wallabag/tree/master/inc/3rdparty/site_config

It would be great to have a standard library for downloading webpages, available to all open-source and free software.

I understand that can't be done if you're developing in Python and Wallabag is in PHP.

Thank you for your tool BTW :]

Epy · May 27 '15 08:05

Thanks for the feedback @Epy. I guess the script could also be run automatically by your home server if set up with a cron job.

> Maybe re-use patterns only: https://github.com/wallabag/wallabag/tree/master/inc/3rdparty/site_config

The patterns are very interesting, as they contain what to strip/extract to obtain "readable" versions of pages (example for bbc.co.uk). This feature could be added in the long run (in another script, or as a command-line option).
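For illustration, a hedged sketch of applying one of those rules from Python with lxml; the XPath below is a made-up stand-in for a site_config "body:" entry, not an actual pattern:

```python
# Sketch: extract the "readable" body of a page using an XPath rule
# in the style of wallabag's site_config patterns.
# Requires: pip install lxml
from lxml import html

def extract_readable(page_html, body_xpath="//div[@id='main-content']"):
    # body_xpath stands in for a site_config "body:" rule (illustrative)
    tree = html.fromstring(page_html)
    matches = tree.xpath(body_xpath)
    return html.tostring(matches[0], encoding="unicode") if matches else None
```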

For now I want to concentrate on keeping exact copies of the pages, then removing just the ads (I don't know where I saved it, but I have a draft for this: basically, download ad-blocking lists and fetch pages through a proxy that removes them).
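A hedged sketch of that draft idea: fetch a hosts-format blocklist and skip requests whose host appears in it (the blocklist URL and helper names are illustrative):

```python
# Sketch: load a hosts-format ad-blocking list and test URLs against it.
import urllib.request
from urllib.parse import urlparse

BLOCKLIST_URL = "https://example.com/hosts.txt"  # illustrative source

def load_blocked_hosts(url=BLOCKLIST_URL):
    """Parse a hosts-format blocklist into a set of blocked hostnames."""
    blocked = set()
    with urllib.request.urlopen(url, timeout=30) as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            parts = line.split()
            # hosts files map an IP (often 0.0.0.0) to a blocked domain
            if len(parts) >= 2 and not line.startswith("#"):
                blocked.add(parts[1])
    return blocked

def is_blocked(url, blocked_hosts):
    """True if the URL's host is on the blocklist."""
    return urlparse(url).hostname in blocked_hosts
```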

I'm rewriting it (again...) as the script was getting overcomplicated. The next version should be able to download video and audio (already implemented) and webpages, and generate a markdown and HTML index of the archive; a sketch of the index idea is below. The version after that should make the HTML index filterable/searchable (by text/tags), and the one after that should support ad blocking.
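A hypothetical sketch of the index-generation part: list the archived files in a markdown table, from which an HTML index could later be rendered (names are illustrative):

```python
# Sketch: generate a markdown index of the archive directory.
import pathlib

ARCHIVE_DIR = pathlib.Path("archive")  # illustrative archive location
lines = ["| File | Size (bytes) |", "| --- | --- |"]
for f in sorted(ARCHIVE_DIR.glob("*.html")):
    lines.append(f"| [{f.name}]({f.name}) | {f.stat().st_size} |")
(ARCHIVE_DIR / "index.md").write_text("\n".join(lines), encoding="utf-8")
```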

Feel free to report a feature request so that I won't forget your ideas.

I also think wallabag should really support auto-downloading articles from RSS feeds...

nodiscc · May 28 '15 21:05

Okay, I just made the feature request in your GitHub repo :)

FreshRSS is an RSS feed reader and can export to Wallabag (as it can do with Shaarli, for links only): http://freshrss.org/ With three self-hostable, open-source (and KISS) apps connected, we should be able to have a nice system, no? :)

Epy · May 29 '15 10:05