Support Obelisk archiving

Open fmartingr opened this issue 2 years ago • 12 comments

It seems that shiori depends on warc which is currently archived. We need to find a replacement for warc. Maybe obelisk?

Acceptance criteria

  • Add a migration that records which archiver type each bookmark's content uses (warc for already existing rows, obelisk as the default)
  • Add logic to allow multiple archivers to be used; do not remove the Warc logic, just refactor it.
  • Allow the /bookmark/:id/archive handler to load multiple archive types (to load old and new)
  • Allow the POST /api/v1/bookmarks, POST /api/v1/bookmarks/cache and POST /api/v1/bookmarks/:id/cache endpoints to select which archiver to use (but hardcode/default it to obelisk).
  • Add a documentation page describing the archivers, their available options, and their pros and cons.
  • Determine if different extensions should be used from now on (leave current filename expectations intact)
  • All code logic should be properly tested
  • Swagger documentation should be updated

fmartingr avatar Feb 10 '22 20:02 fmartingr

obelisk is great, I have just tested the latest release on a few examples and it does a good job at preserving the original layout and content.

efrecon avatar Feb 18 '22 15:02 efrecon

I still haven't tested/checked it yet, but the other day I stumbled upon https://github.com/gildas-lormeau/SingleFile and it also seemed quite good (and since its output is just a single HTML file, it's quite useful as well).

fmartingr avatar Feb 18 '22 19:02 fmartingr

I'm in the process of packaging shiori for the AUR, and I strongly recommend staying within the Go ecosystem (obelisk can be imported as a go module!) as relying on external tools (e.g. SingleFile) defeats one of shiori's major selling points.

EDIT: Additionally, SingleFile requires a browser binary to be present, which is a Pandora's box in itself.

grawlinson avatar Feb 19 '22 02:02 grawlinson

> I'm in the process of packaging shiori for the AUR, and I strongly recommend staying within the Go ecosystem (obelisk can be imported as a go module!) as relying on external tools (e.g. SingleFile) defeats one of shiori's major selling points.
>
> EDIT: Additionally, SingleFile requires a browser binary to be present, which is a Pandora's box in itself.

Just to clarify (because I didn't express myself very well): I like how SingleFile works (the single HTML file output) but I do not plan to replace warc with it. The plan still is to go for Obelisk. :)

Edit: Yeah, when I made my first comment I didn't realise that Obelisk's output is also a Single HTML file :sweat_smile:

fmartingr avatar Feb 19 '22 07:02 fmartingr

Thanks for clarifying that!

A package is now available on the AUR, so if there are any bug reports relating to Arch Linux, tag me and I'll attempt to help out.

grawlinson avatar Feb 19 '22 07:02 grawlinson

> EDIT: Additionally, SingleFile requires a browser binary to be present, which is a Pandora's box in itself.

For the record, this statement is false. SingleFile can work with JSDOM. Anyway, good luck!

gildas-lormeau avatar Oct 13 '22 18:10 gildas-lormeau

> > EDIT: Additionally, SingleFile requires a browser binary to be present, which is a Pandora's box in itself.
>
> For the record, this statement is false. SingleFile can work with JSDOM. Anyway, good luck!

Thanks for the clarification, and even though I love SingleFile (it has helped me a ton while moving out to a new flat!), it would add unnecessary complexity for us. So far, obelisk seems to provide the expected results, and we could use this migration to move that project further in the Go world :)

fmartingr avatar Oct 14 '22 10:10 fmartingr

Thanks for the feedback! Personally, I think that in 2022, you have to use a web browser for this kind of task. It's also becoming essential for determining what really needs to be saved. This is where SingleFile generally stands out. A very large part of its code is devoted to optimizing the size of the saved page. To do this, a browser is unfortunately required.

gildas-lormeau avatar Oct 14 '22 12:10 gildas-lormeau

What's the status on this? One of the reasons why we choose to run software like Shiori is for archiving purposes, to prevent link-rot and preserve information/knowledge. Having our bookmarks stored in a binary data format as opposed to plain text hurts data preservation. Do you need any help with the transition to Obelisk? Is anyone working on this at the moment?

ivanrg99 avatar Jan 18 '24 12:01 ivanrg99

Personally, I am trying to get it ready so it can be used later. Currently I am working on https://github.com/go-shiori/obelisk/pull/96 and https://github.com/go-shiori/obelisk/pull/98, and we have some open issues there too. You can work on any aspect that you like.

Monirzadeh avatar Jan 18 '24 12:01 Monirzadeh

I need to sit down and pave the way for people to start implementing these features. I started a draft under #481 some time ago but haven't sat down with it again since there were other things that had priority, like the API. I guess the API migration will get faster over time while we refactor the logic in different components, but that's still the main priority now.

For this to work, we will need to isolate the archiving logic in its own domain and provide backwards compatibility, which will require a migration adding a new column specifying which archive format a bookmark is currently in.

What I'm trying to say is that it can be done and is on my radar, but it is not trivial. Once 1.6 is released I need to sit down and work on the roadmap again, defining the issues we need to work on and probably making some PRs to prepare for that to happen.

fmartingr avatar Feb 04 '24 09:02 fmartingr

Hey,

I am eagerly awaiting the work on this issue. I would like to migrate my catalog of bookmarks saved in Instapaper to shiori and self-host it on my local network. However, what is holding me back is that the current implementation stores the archived bookmark in a bolt database. I am now wondering whether I should wait for obelisk support in shiori or if it makes sense to migrate right away. I do not want to import all my bookmarks again once obelisk is added, and I am wondering how likely it is that there will be a migration path to convert previously archived bookmarks from the bolt db to the HTML output created by obelisk.

As I understand it, this is definitely on your radar, but you just haven't found the time to look at it yet. My comment isn't meant to pressure you in any way; it's more of a +1 for this feature and a way to subscribe to the ongoing discussion. Whenever you have new information regarding this issue, I am very keen to hear it :)

dehlen avatar Feb 26 '24 15:02 dehlen