node-warc
Two captured URLs are not being read properly by Webrecorder Player?
I successfully (I think) captured and generated a warc file using https://electronjs.org/docs/api/debugger.
I tried a simple site: www.drupal.org
If I capture only the first load, it works nicely and Webrecorder Player shows it perfectly.
However, if I navigate to "Developers" and then store both the homepage and that page in the WARC file, it doesn't seem to work, even though I can see the data in the WARC file.
I guess something is missing on the Warc file or I am missing something, any ideas?
Other than that, I am super happy to see this working. It might even be worth contributing this WARC generator to this package.
node-warc welcomes all contributions!
My guess is that you are not writing a single warc info record, using writeWebrecorderBookmarksInfoRecord, that contains all the URLs of the pages you wish to be viewable via WR player.
To fix that, you can wait to append that record until the very end, after all the pages have been captured, or view them using pywb, which has no such restriction. Ultimately WR player and WR itself use pywb as the replay system.
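One way to defer the info record is to collect page URLs while capturing and hand the full list to the writer only once, at the end. The following is a minimal sketch, not node-warc code: `PageList` is a hypothetical helper, and it assumes the writer exposes `writeWebrecorderBookmarksInfoRecord(pages)` accepting an array of URLs, which should be checked against the node-warc docs.

```javascript
// Hypothetical helper: collects page URLs during a capture session so that
// a single info record listing all of them can be written once, at the end.
class PageList {
  constructor () {
    this._seen = new Set()
  }

  // Call once per navigated page (for example from a navigation handler).
  add (url) {
    this._seen.add(url)
  }

  // URLs in first-seen order, without duplicates.
  urls () {
    return Array.from(this._seen)
  }

  // ASSUMPTION: `writer` exposes writeWebrecorderBookmarksInfoRecord(pages)
  // and accepts an array of URLs; verify against the node-warc docs.
  writeInfoRecord (writer) {
    return writer.writeWebrecorderBookmarksInfoRecord(this.urls())
  }
}

const pages = new PageList()
pages.add('https://www.drupal.org/')
pages.add('https://www.drupal.org/developers')
pages.add('https://www.drupal.org/') // revisits are ignored
console.log(pages.urls())
```

Calling `pages.writeInfoRecord(writer)` a single time after the last page has been captured keeps all URLs in one info record.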
Hmm, oddly I also tried pywb, but it didn't display anything. Will look. I am basically just capturing everything from a webview tag, and I navigated just to one URL, and then store all packages to a warcfile almost the same as with the remoteChromeGenerator
Have you tried using puppeteer rather than electron? I have found that a full browser, either Chrome or Chromium (brought in via puppeteer), controlled via puppeteer or chrome-remote-interface, produces better results and is easier to use.
Ultimately the best advice I can give without seeing how you are doing the capturing (either src or minimal working example) is to treat each page as a standalone WARC that is either appended to a single WARC or written to its own WARC with concatenation done afterwards.
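The "one WARC per page, concatenate afterwards" option can be sketched with Node's `fs` module alone, since WARC records are self-delimiting and files can be joined byte for byte. `concatWarcs` and the file names in the usage note are hypothetical, chosen only for illustration.

```javascript
const fs = require('fs')

// Hypothetical helper: append each per-page WARC to `outPath`, in order.
// WARC records are self-delimiting, so plain byte concatenation of
// well-formed WARC files yields a well-formed multi-record WARC file.
function concatWarcs (pagePaths, outPath) {
  fs.writeFileSync(outPath, '') // start from an empty output file
  for (const pagePath of pagePaths) {
    fs.appendFileSync(outPath, fs.readFileSync(pagePath))
  }
}
```

For example, `concatWarcs(['home.warc', 'developers.warc'], 'myWARC.warc')` would produce one file holding both captures. Note that a player expecting a single info record listing every page would still need that record written separately.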
If you can, here's my source:
https://pastebin.com/61bBUiyg
I'll eventually wrap this better; for now it is a PoC.
I took your RemoteChromeWARCGenerator and RemoteChromeRequestCapturer and swapped the network interface for Electron's Debugger, which gave me access to the same events. So it should be basically the same.
The writing of the warc file is as per your example for chrome on the project's page.
I only tried puppeteer for a quick test and might do a better one next week, but I would have expected this to work.
😱 I didn't see them or know they were there! Sorry. From a quick look at the code, it looks like I ended up doing something very similar.
Will try it anyway to see if I get the same Warc.
I am not capturing maybeNetworkMessage though.
This is the warc file I got: warc.zip
It should have both https://www.drupal.org/ and https://www.drupal.org/developers
I see them on the warc file
Will try yours anyway and see what I do. Thanks, might get back properly next week.
maybeNetworkMessage is a utility function that saves you from having to add an additional message listener to the debugger :smile:
As for the source you shared, I cannot tell from it when you are writing to the WARC, and from the discussion here, the timing of that write is likely the reason for your issues.
I am doing that manually on a context menu, so basically I just wait a reasonable while and trigger it:
    const menuItem2 = new MenuItem({
      label: 'Warc it yo!',
      click: (menuItem, browserWindow, event) => {
        const warcGen = new DebuggerWARCGenerator()
        console.log(cap)
        warcGen.generateWARC(cap, debug, {
          warcOpts: {
            warcPath: 'myWARC.warc'
          },
          winfo: {
            description: 'I created a warc!',
            isPartOf: 'My awesome electron1 collection'
          }
        })
      }
    })
Did the electron request capturer and writer not work for you?
I just tried this and got the exact same behavior. Maybe I am missing something about the WARC file that is currently beyond me, but I will probably get to it soon. This was mainly about making sure this is a workable solution, which it definitely is.
If there is anything you would suggest following up on here, or if I can help debug this and get to the root of it, let me know; I would rather have this small thing working.
You are not adding the pages array, and the warc is not being written to in appending mode.
See the electron generator docs for more details.
Correcting those issues should help you get your desired results :relaxed:
See also https://github.com/N0taN3rd/Squidwarc/blob/next/lib/crawler/chrome.js for an example of warc generation
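The two corrections above (supplying the pages array and writing in appending mode) might look like the options object below. This is a sketch under loud assumptions: `buildGenOpts` is a hypothetical helper, and the placement of the `pages` key at the top level and the `appending` key under `warcOpts` is assumed rather than confirmed here, so both should be checked against the electron generator docs.

```javascript
// Hypothetical helper that builds generateWARC options for a multi-page capture.
// ASSUMPTION: `pages` sits at the top level of the options and `appending`
// sits under `warcOpts`; confirm both key names in the electron generator docs.
function buildGenOpts (warcPath, pageUrls) {
  return {
    warcOpts: {
      warcPath,
      appending: true // add to an existing WARC instead of overwriting it
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome electron1 collection'
    },
    pages: pageUrls // every page that should be viewable in the player
  }
}

const genOpts = buildGenOpts('myWARC.warc', [
  'https://www.drupal.org/',
  'https://www.drupal.org/developers'
])
console.log(JSON.stringify(genOpts, null, 2))
```

The resulting object would be passed as the third argument to `generateWARC`, in place of the literal used in the context menu handler.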