puppeteer-extra
[Proposal] Session persistence using cookies, local/session storage and indexedDB stores
Plugin Name: Session Persist
I want to add a new plugin that persists the session in cases where a plain cookie get/set fails. I recently stumbled on a website that uses Redux for session persistence, and the only way to keep the session active for later use was to save all the cookies via CDP (getting cookies from all domains, though it might work with just a simple cookie get/set), plus the local/session storage and the IndexedDB stores.
For this to be effective, those need to be saved on each frameNavigated (URL change). It doesn't seem to be a big overhead, but if you have any other suggestions I'm listening.
To set the cookies, what I did is navigate to the domain first, clear everything, set our previously saved values, then navigate to the target URL.
I'm ready to work on it and follow the coding guidelines.
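To make the flow concrete, here is a rough sketch of the restore steps described above. The function name and the shape of the `session` object are illustrative, not an existing API; `page` is a Puppeteer Page:

```js
// Hypothetical sketch of the restore flow (illustrative names, not a real API).
async function restoreSession(page, session, targetUrl) {
  // 1. Navigate to the session's origin first, so the storage APIs touched
  //    below are scoped to the right domain.
  await page.goto(session.origin);

  // 2. Clear everything, then write back the previously saved values in-page.
  await page.evaluate((saved) => {
    localStorage.clear();
    sessionStorage.clear();
    for (const [key, value] of Object.entries(saved.localStorage)) {
      localStorage.setItem(key, value);
    }
    for (const [key, value] of Object.entries(saved.sessionStorage)) {
      sessionStorage.setItem(key, value);
    }
  }, session);

  // 3. Restore the cookies (page.setCookie is a standard Puppeteer API).
  await page.setCookie(...session.cookies);

  // 4. Only then navigate to the actual target URL.
  await page.goto(targetUrl);
}
```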
I like the idea of a plugin like this 😄
Looking at Devtools/Application:
So the plugin would be able to save/restore everything under Storage right?
> To set the cookies, what I did is navigate to the domain first, clear everything, set our previously saved values, then navigate to the target URL.
Without having looked closer at the CDP methods, I'm wondering if there's a way to avoid that page navigation.
In terms of user-facing API, would this be the most flexible?

```js
const sessionJSON = await session.save() // Run that at the end of a scrape
await session.restore(sessionJSON) // Run that at the beginning of a scrape
```
For reference: https://chromedevtools.github.io/devtools-protocol/tot/Storage/#method-getCookies
or: https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-getAllCookies
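For illustration, here is roughly how either endpoint could be driven from a raw CDP session. This sketch assumes a Puppeteer version where `page.createCDPSession()` is available (older versions used the private `page._client`); `Network.setCookies` is the matching CDP restore call:

```js
// Sketch: save/restore cookies over a raw CDP session (assumed helper names).
async function saveAllCookies(page) {
  const client = await page.createCDPSession();
  // Unlike page.cookies(), Network.getAllCookies spans every domain.
  const { cookies } = await client.send('Network.getAllCookies');
  await client.detach();
  return cookies;
}

async function restoreAllCookies(page, cookies) {
  const client = await page.createCDPSession();
  await client.send('Network.setCookies', { cookies });
  await client.detach();
}
```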
I didn't look into Web SQL yet because it wasn't needed for my particular use case, but yes, that's the idea: saving most of what's under Storage.
> In terms of user-facing API this could be most flexible?
>
> ```js
> const sessionJSON = await session.save() // Run that at the end of a scrape
> await session.restore(sessionJSON) // Run that at the beginning of a scrape
> ```
I tried that already; it only works if nothing has changed after the session.save(). If, for example, there was a navigation or even just a quick URL change (React SPA style) right after saving, the Redux store under IndexedDB would have changed (without being saved) and the session we saved earlier would be useless.
What I did is this:

```js
// Save session on each url change
page.on('framenavigated', async () => await session.save().catch((e) => console.error(e.message)));
```
This solution works most of the time; the rare cases where I've seen it fail are when the bot crashes while the session is in the middle of saving and it didn't have time to finish.
> Without having looked closer at the CDP methods, I'm wondering if there's a way to avoid that page navigation.
Yeah, the first page navigation was unavoidable for me. I tried a solution where we load a page from the domain but intercept the request and send a dummy body response instead, then set the cookies, storage, etc. (reference: https://github.com/puppeteer/puppeteer/issues/3692#issuecomment-453186180). That didn't work, but I was in a hurry and didn't spend much time on it, so it could still be a good approach.
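In case someone wants to retry that route, a minimal sketch of the interception trick could look like this (untested; `setRequestInterception` and `request.respond` are standard Puppeteer APIs, the helper name is made up):

```js
// Sketch: serve a dummy document for the origin so we can set storage for it
// without a real network round-trip (per the linked puppeteer issue).
async function openDummyOrigin(page, origin) {
  await page.setRequestInterception(true);
  const onRequest = (request) => {
    if (request.url().startsWith(origin)) {
      // Respond with an empty document instead of hitting the network.
      request.respond({ status: 200, contentType: 'text/html', body: '<html></html>' });
    } else {
      request.continue();
    }
  };
  page.on('request', onRequest);
  await page.goto(origin);
  // Clean up so later navigations behave normally.
  page.off('request', onRequest);
  await page.setRequestInterception(false);
}
```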
> or: https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-getAllCookies
Yeah, this is what I used:

```js
// Get all cookies from all domains
const { cookies } = await this.page._client.send('Network.getAllCookies');
```
I was also planning on creating that kind of plugin!
About IndexedDB persistence
On some targets you need to recreate the database completely and edit the schema, which implies saving the database version. Reference: https://stackoverflow.com/questions/46541435/check-and-increment-the-version-number-of-an-indexeddb
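A rough sketch of dumping a database together with its version, so it could later be recreated with the same schema (illustrative helper name, minimal error handling; the in-page part runs through `page.evaluate`):

```js
// Sketch: dump an IndexedDB database *including its version*.
function dumpIndexedDB(page, dbName) {
  return page.evaluate(async (name) => {
    // Open without a version number to get whatever is currently on disk.
    const db = await new Promise((resolve, reject) => {
      const req = indexedDB.open(name);
      req.onsuccess = () => resolve(req.result);
      req.onerror = () => reject(req.error);
    });
    const dump = { name, version: db.version, stores: {} };
    for (const storeName of Array.from(db.objectStoreNames)) {
      // Read every record of every object store.
      dump.stores[storeName] = await new Promise((resolve, reject) => {
        const req = db.transaction(storeName, 'readonly').objectStore(storeName).getAll();
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }
    db.close();
    return dump;
  }, dbName);
}
```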
Nice catch! If you have time on your hands, start on the plugin and I'll help whenever I can.
I uploaded a first draft yesterday. The code is there; it only needs a few modifications and to be wired up into an easy-to-use plugin.
I would like the API to look like this:

```js
puppeteer.use(SessionPlugin());
// [...]
await page.goto('https://github.com')
// [...]
const session = await page.dumpSession({ securityOrigin: 'https://github.com' }); // indexedDB security origin
// session is an editable and serializable object
// [...]
await page.restoreSession(session); // easy session restoration
```
What do you think of it?