Allow the option to archive with a headless browser

Open hellodword opened this issue 2 years ago • 16 comments

Just like ArchiveBox. I think ArchiveBox is very nice, but it has two issues:

  1. slow, not a big deal;
  2. custom automation for special pages (lazy loading for example), this issue is working on it.

I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?

hellodword avatar Feb 24 '22 03:02 hellodword

That is a fantastic idea. Given the original requirement, we implemented similar features in screenshot, but it is still not what you expected.

Perhaps we can take things further and develop a piecemeal approach here.

waybackarchiver avatar Feb 24 '22 11:02 waybackarchiver

The biggest challenge for me is developing or choosing a script and its interpreter.

I have no prior experience with this, but rod has a good API. :smile:

I will try to implement this. I really prefer this mode, because dealing with every element individually is too hard.

hellodword avatar Feb 24 '22 13:02 hellodword

If this happens, I'd like the feature to be optional if it requires a lot of complex external dependencies. I'm trying to keep shiori as simple as possible and we just got rid of CGO by switching the SQLite driver, and since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :)

fmartingr avatar Feb 24 '22 18:02 fmartingr

> If this happens, I'd like the feature to be optional if it requires a lot of complex external dependencies. I'm trying to keep shiori as simple as possible and we just got rid of CGO by switching the SQLite driver, and since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :)

I'm also a big fan of staying CGo-free with fewer dependencies. chromedp and rod are both based on the Chrome DevTools Protocol, without CGo or tons of dependencies. :smiley:
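
To make the "no CGo" point concrete: the Chrome DevTools Protocol is JSON exchanged over a WebSocket, so a client can be written in pure Go. The following is only a minimal sketch of building one request with the standard library; the `Command` type and `NavigateCommand` helper are hypothetical names for illustration, not part of rod or chromedp.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Command is the JSON envelope for a Chrome DevTools Protocol request:
// a caller-chosen id, a method name, and optional parameters.
// The type name is an assumption made for this sketch.
type Command struct {
	ID     int            `json:"id"`
	Method string         `json:"method"`
	Params map[string]any `json:"params,omitempty"`
}

// NavigateCommand builds a Page.navigate request for the given URL.
func NavigateCommand(id int, url string) ([]byte, error) {
	return json.Marshal(Command{
		ID:     id,
		Method: "Page.navigate",
		Params: map[string]any{"url": url},
	})
}

func main() {
	msg, err := NavigateCommand(1, "https://example.com")
	if err != nil {
		panic(err)
	}
	// A real client would write msg to the browser's WebSocket endpoint.
	fmt.Println(string(msg))
}
```

Libraries such as rod and chromedp wrap this exchange (plus the WebSocket transport and event handling) behind a typed API.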

hellodword avatar Feb 25 '22 01:02 hellodword

@fmartingr Please don't worry about complex external dependencies. Perhaps we can wait and see how the work turns out.

Anyway, PRs are welcome.

waybackarchiver avatar Feb 25 '22 11:02 waybackarchiver

Hey, I created a simple demo.

https://github.com/hellodword/web-archiving-with-headless-chromium-demo

env rod=show,bin=/path/to/chrome go run .

It's very simple, but it supports a custom post.js for hooking and a custom pre.js for scroll/click/...

It uses SingleFile for saving.
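
The general flow (load the page, run user-supplied scripts, then save) can be sketched roughly as follows. This is not the demo's code: the `Page` interface, `Archive` function, and `fakePage` type are hypothetical, and the sketch does not say at which point the demo runs pre.js or post.js. A fake page stands in for the browser so the ordering can be checked without Chromium.

```go
package main

import (
	"errors"
	"fmt"
)

// Page is a hypothetical, minimal view of a browser tab. A real
// implementation would wrap a CDP client; this interface is an
// assumption for illustration and is not the demo's API.
type Page interface {
	Navigate(url string) error
	Eval(script string) error
	Save() (string, error)
}

// Archive loads url, runs each user-supplied script in order (for example
// one that scrolls to trigger lazy loading), and then saves the page.
func Archive(p Page, url string, scripts ...string) (string, error) {
	if err := p.Navigate(url); err != nil {
		return "", fmt.Errorf("navigate: %w", err)
	}
	for i, s := range scripts {
		if err := p.Eval(s); err != nil {
			return "", fmt.Errorf("script %d: %w", i, err)
		}
	}
	return p.Save()
}

// fakePage records calls so the flow can be exercised without a browser.
type fakePage struct{ calls []string }

func (f *fakePage) Navigate(url string) error {
	f.calls = append(f.calls, "navigate "+url)
	return nil
}

func (f *fakePage) Eval(script string) error {
	if script == "" {
		return errors.New("empty script")
	}
	f.calls = append(f.calls, "eval "+script)
	return nil
}

func (f *fakePage) Save() (string, error) {
	f.calls = append(f.calls, "save")
	return "<html>saved</html>", nil
}

func main() {
	p := &fakePage{}
	out, err := Archive(p, "https://example.com",
		"window.scrollTo(0, document.body.scrollHeight)")
	fmt.Println(out, err, p.calls)
}
```

Keeping the browser behind a small interface like this would also let the saving step be swapped out, so SingleFile is one option rather than a hard dependency.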

hellodword avatar Feb 25 '22 15:02 hellodword

This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.

@fmartingr What do you think?

waybackarchiver avatar Feb 25 '22 16:02 waybackarchiver

> it is heavily dependent on SingleFile

Right, and it's buggy in this demo. 😂

But it's optional, just like in ArchiveBox: ArchiveBox has multiple saving modes, and SingleFile is only one of them.

What I want to show is the ability to run custom scripts, and a highly recommended CDP library for Go, which I think is much better than chromedp.

hellodword avatar Feb 25 '22 16:02 hellodword

Appreciate the time and effort. Personally, I prefer the option of trying to inject the script in headless over the one implemented in the screenshot project.

It appears that making it an option would be reasonable, so if SingleFile is added as a browser extension, I would prefer to put the gitmodule in the .github/thridparty directory.

An example of archiving results using screenshot:

[image: archiving result produced by screenshot]

waybackarchiver avatar Feb 26 '22 04:02 waybackarchiver

> This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.
>
> @fmartingr What do you think?

I still haven't started migrating to obelisk just yet... it will be an interesting amount of work to perform and I do not have much time to spare these weeks (and most of it is invested in replying to issues and PRs, yay FOSS! :joy:).

My comment was more about the current state of shiori and some comments by our packagers regarding external dependencies or ecosystems. For me the ideal solution is to import obelisk without much trouble and not lose the ability to cross compile or require external software for the archive to work. If you want to add that to obelisk, I'd ask for it to be optional for users (you can either build it with --tags XXXX or require anything else).
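
One way to keep such a feature optional is a small registry that extra backends join only when compiled in. The sketch below is an assumption about how that could look, not obelisk's design: `Archiver`, `Register`, `Available`, and `Pick` are hypothetical names, and `headless` is a made-up tag standing in for the XXXX above (the Go toolchain spells the flag `-tags`).

```go
package main

import (
	"fmt"
	"sort"
)

// Archiver is a hypothetical interface that every backend implements.
type Archiver interface {
	Archive(url string) (string, error)
}

// registry maps backend names to implementations.
var registry = map[string]Archiver{}

// Register adds a backend; optional backends call it from init.
func Register(name string, a Archiver) { registry[name] = a }

// Available lists the backends compiled into this binary, sorted by name.
func Available() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// httpArchiver is the always-present default with no external software.
type httpArchiver struct{}

func (httpArchiver) Archive(url string) (string, error) {
	return "fetched " + url + " over plain HTTP", nil
}

func init() { Register("http", httpArchiver{}) }

// In a real project, a second file starting with the build constraint
//   //go:build headless
// would call Register("headless", ...) from its own init function, so the
// browser dependency is compiled in only with `go build -tags headless`.

// Pick returns the preferred backend if present, else the default.
func Pick(preferred string) (Archiver, string) {
	if a, ok := registry[preferred]; ok {
		return a, preferred
	}
	return registry["http"], "http"
}

func main() {
	a, name := Pick("headless")
	out, err := a.Archive("https://example.com")
	fmt.Println(Available(), name, out, err)
}
```

With this shape, a default build still cross compiles without the browser backend, and callers fall back to the plain backend when the optional one is absent.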

That said, I don't want my comments/vision to halt obelisk's progress! I'm just expressing my fears from a user's perspective, not imposing anything. I haven't used any library like this in a while (and not in the Go world, anyway) so I just wanted to make sure I don't create future problems for shiori. You folks are the experts here :)

fmartingr avatar Feb 26 '22 11:02 fmartingr

Right, it was a demo, so I used SingleFile directly as an embedded dependency. It could, or should, act as a plugin.

I think archiving tools nowadays do not necessarily need Chromium, but they do need to be extensible through scripting. One reason is that there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.

hellodword avatar Feb 26 '22 12:02 hellodword

> there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.

I'm interested in this somehow, so let's do it.

Related to wabarc/wayback#92

waybackarchiver avatar Feb 26 '22 12:02 waybackarchiver

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Jul 02 '22 03:07 github-actions[bot]

It seems to me that the ideal solution would be the ability to prepare the page for saving not on the server, but on the client, and then send it to the server.

Many sites now load images dynamically, check captchas, load comments only once you scroll down to them (and comments are sometimes more interesting than the article itself), or don't load all comments (discussion threads stay hidden until you force them open). There is a lot of dynamic behavior. Therefore, it is better to save the page only after checking with your own eyes that everything needed has loaded and is displayed. There is no universal solution here, so it is preferable to inspect the page yourself.

I just looked into my Pocket archive and it made me very sad: many domains are already parked and the sites are gone. And the pages themselves (on a premium plan) are far from completely saved; sometimes they don't even have text. So now I'm looking for a solution to this problem. I have started saving pages through SingleFile, but if you tie it to shiori, it will be just the perfect bookmark manager.

At the same time, I would like shiori not to save the text to its database (except perhaps for quick search), but always to retrieve it again from the saved page. Text content recognition algorithms will keep improving, so content stored in the database may have been recognized incorrectly and may no longer be relevant under a new version of the application.

Katarn avatar Aug 26 '22 10:08 Katarn

@Katarn Thank you for your suggestion, it's a fantastic idea. As intended, obelisk should support both headless and non-headless modes for archiving webpages.

> if you tie it to shiori, it will be just the perfect bookmark manager.

Making shiori work with obelisk is related to https://github.com/go-shiori/shiori/issues/353

waybackarchiver avatar Aug 26 '22 13:08 waybackarchiver