puppeteer-extra

[Proposal] Session persistence using cookies, local/session storage and indexedDB stores

Open Shisuki opened this issue 3 years ago • 6 comments

Plugin Name: Session Persist

I'd like to add a new plugin that persists the session when getting/setting cookies alone isn't enough. I recently ran into a website that uses redux for session persistence, and the only way to keep the session active for later use was to save all the cookies using CDP (I got all cookies from all domains, but a simple cookie get/set might work too), plus local storage, session storage and the indexedDB stores.

For this to be effective, those need to be saved on each framenavigated event (url change). It doesn't seem to be a big overhead, but if you have any other suggestions I'm listening.

To set the cookies, what I did is navigate to the domain first, clear everything, set our previously saved values, then navigate to the target url.
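
A minimal sketch of that restore order, assuming a Puppeteer `page` and a saved `session` object shaped like `{ cookies, localStorage, sessionStorage }`. The helper name `restoreSession`, the session shape and the `originUrl`/`targetUrl` parameters are assumptions made for illustration, not part of any existing plugin, and indexedDB is left out to keep it short.

```javascript
// Hypothetical helper: restores cookies and web storage in the order described above.
// `session` is assumed to look like:
//   { cookies: [...], localStorage: { key: value }, sessionStorage: { key: value } }
async function restoreSession(page, session, originUrl, targetUrl) {
  // 1. Navigate to the domain first so its cookies and storage can be written.
  await page.goto(originUrl);

  // 2. Clear the cookies this fresh visit may have created.
  const existing = await page.cookies();
  if (existing.length > 0) {
    await page.deleteCookie(...existing);
  }

  // 3. Set the previously saved values.
  if (session.cookies.length > 0) {
    await page.setCookie(...session.cookies);
  }
  await page.evaluate(
    (localEntries, sessionEntries) => {
      // This callback runs inside the page, not in Node.
      localStorage.clear();
      sessionStorage.clear();
      for (const [key, value] of Object.entries(localEntries)) {
        localStorage.setItem(key, value);
      }
      for (const [key, value] of Object.entries(sessionEntries)) {
        sessionStorage.setItem(key, value);
      }
    },
    session.localStorage,
    session.sessionStorage
  );

  // 4. Only now navigate to the page we actually want.
  await page.goto(targetUrl);
}
```

The second `goto` matters because the site reads its session on load; writing the values after the target page has already booted may be too late.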

I'm ready to work on it and follow the coding guidelines.

Shisuki, Mar 22 '21 13:03

I like the idea of a plugin like this 😄

Looking at DevTools/Application: [image]

So the plugin would be able to save/restore everything under Storage, right?

To set the cookies, what i did is navigate to the domain first, clear everything, set our previously saved values, then navigate to the target url.

Without having looked closer at the CDP methods, I'm wondering if there's a way to avoid that page navigation.

In terms of a user-facing API, this could be the most flexible?

const sessionJSON = await session.save() // Run that at the end of a scrape
await session.restore(sessionJSON) // Run that at the beginning of a scrape
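
As a rough idea of what the local/session storage part of such a `save()`/`restore()` pair could do, here is a small sketch built only on the standard Web Storage interface (`length`, `key`, `getItem`, `setItem`, `clear`). The function names `snapshotStorage` and `applyStorage` are hypothetical, and in practice they would have to run in the page context (for example through `page.evaluate`).

```javascript
// Copy every key/value pair out of a Web Storage object
// (localStorage or sessionStorage) into a plain, JSON-serializable object.
function snapshotStorage(storage) {
  const entries = {};
  for (let i = 0; i < storage.length; i += 1) {
    const key = storage.key(i);
    entries[key] = storage.getItem(key);
  }
  return entries;
}

// Replace the contents of a Web Storage object with a saved snapshot.
function applyStorage(storage, entries) {
  storage.clear();
  for (const [key, value] of Object.entries(entries)) {
    storage.setItem(key, value);
  }
}
```

Because the snapshot is a plain object, it can be merged with cookies and indexedDB data into one JSON document.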

berstend, Mar 24 '21 10:03

For reference: https://chromedevtools.github.io/devtools-protocol/tot/Storage/#method-getCookies

or: https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-getAllCookies

berstend, Mar 24 '21 10:03

I haven't looked into Web SQL yet because it wasn't needed for my particular use case, but yes, that's the idea: saving most of what's under Storage.


In terms of user-facing API this could be most flexible?

const sessionJSON = await session.save() // Run that at the end of a scrape
await session.restore(sessionJSON) // Run that at the beginning of a scrape

I tried that already. It only works if nothing has changed after session.save(). If, for example, there was a navigation or even just a quick url change (React SPA style) right after saving, the redux store in indexedDB would have changed without being saved, and the session we saved earlier would be useless.

What I did is this:

// Save session on each url change
page.on('framenavigated', () => session.save().catch((e) => console.error(e.message)));

This solution works most of the time. The rare cases where I've seen it fail are when the bot crashes while the session is in the middle of saving, before the save has had time to finish.
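
One possible mitigation for the crash-while-saving case, sketched under the assumption that the session is persisted as a JSON file: write to a temporary file and rename it over the real one, so a crash leaves the previous file in place, and run saves one at a time so overlapping framenavigated events never write concurrently. `writeSessionAtomically` and `createSaveQueue` are hypothetical names, not something from the thread.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Write the session to a temp file next to the target, then rename it into place.
// A crash during the write leaves the previous session file untouched.
function writeSessionAtomically(filePath, session) {
  const tmpPath = path.join(
    path.dirname(filePath),
    `.${path.basename(filePath)}.${process.pid}.tmp`
  );
  fs.writeFileSync(tmpPath, JSON.stringify(session));
  fs.renameSync(tmpPath, filePath);
}

// Returns a function that queues `saveFn` calls so they run strictly one after another.
function createSaveQueue(saveFn) {
  let tail = Promise.resolve();
  return function enqueue() {
    tail = tail
      .then(() => saveFn())
      .catch((e) => console.error(e.message)); // keep the queue alive after a failed save
    return tail;
  };
}

// Possible wiring (assumes the `session.save()` API discussed above):
// const saveLater = createSaveQueue(async () =>
//   writeSessionAtomically('session.json', await session.save()));
// page.on('framenavigated', saveLater);
```

This does not make the snapshot itself consistent if the page changes while it is being collected; it only protects the file on disk.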


Without having looked closer at the CDP methods, I'm wondering if there's a way to avoid that page navigation.

Yeah, the first page navigation was unavoidable for me. I tried a solution where we load a page from the domain but intercept the request and send a dummy response body instead, then set the cookies, storage, etc. (reference: https://github.com/puppeteer/puppeteer/issues/3692#issuecomment-453186180). That didn't work, but I was in a hurry and didn't spend much time on it, so this could still be a good solution.


or: https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-getAllCookies

Yeah, this is what I used:

// Get all cookies from all domains
const { cookies } = await this.page._client.send('Network.getAllCookies');
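
Since `_client` is a private Puppeteer property, the same call can also be sketched through a dedicated CDP session. This is only an illustration: `saveAllCookies` and `restoreAllCookies` are hypothetical names, and passing the objects returned by `Network.getAllCookies` straight back to `Network.setCookies` is an assumption that may need per-field cleanup on some targets.

```javascript
// Read all cookies from all domains through a temporary CDP session.
async function saveAllCookies(page) {
  const client = await page.target().createCDPSession();
  try {
    const { cookies } = await client.send('Network.getAllCookies');
    return cookies;
  } finally {
    await client.detach();
  }
}

// Write previously saved cookies back in a single CDP call.
async function restoreAllCookies(page, cookies) {
  const client = await page.target().createCDPSession();
  try {
    await client.send('Network.setCookies', { cookies });
  } finally {
    await client.detach();
  }
}
```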

Shisuki, Mar 24 '21 12:03

I was also planning on creating that kind of plugin!

About IndexedDB persistence

On some targets you need to recreate the database completely and edit the schema. This implies saving the version of the database. Reference: https://stackoverflow.com/questions/46541435/check-and-increment-the-version-number-of-an-indexeddb
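
To illustrate what "saving the version and schema" could involve, here is a sketch that describes an already open `IDBDatabase` as a plain object, using only standard IndexedDB properties (`version`, `objectStoreNames`, `keyPath`, `autoIncrement`, `indexNames`). The function name `describeDatabase` and the shape of the returned object are assumptions; on restore, the saved `version` would presumably be passed to `indexedDB.open(name, version)` and the stores recreated in its upgrade handler.

```javascript
// Hypothetical helper, meant to run in the page context against an open IDBDatabase.
// Returns a JSON-serializable description of the database version and schema.
function describeDatabase(db) {
  const storeNames = Array.from(db.objectStoreNames);
  const stores = [];
  // transaction() rejects an empty list of store names, so skip it for empty databases.
  if (storeNames.length > 0) {
    const tx = db.transaction(storeNames, 'readonly');
    for (const name of storeNames) {
      const store = tx.objectStore(name);
      stores.push({
        name,
        keyPath: store.keyPath,
        autoIncrement: store.autoIncrement,
        indexes: Array.from(store.indexNames).map((indexName) => {
          const index = store.index(indexName);
          return {
            name: indexName,
            keyPath: index.keyPath,
            unique: index.unique,
            multiEntry: index.multiEntry,
          };
        }),
      });
    }
  }
  return { name: db.name, version: db.version, stores };
}
```

The records themselves would still need to be read separately; this only captures what is needed to rebuild the empty database at the right version.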

clouedoc, Apr 07 '21 18:04

Nice catch! If you have time on your hands, start on the plugin and I'll help whenever I can.

Shisuki, Apr 12 '21 09:04

I uploaded a first draft yesterday. The code is there; it only needs a few modifications and to be wired up into an easy-to-use plugin.

I would like the API to look like this:

puppeteer.use(SessionPlugin()); 
// [...]
await page.goto('https://github.com')
// [...]
const session = await page.dumpSession({ securityOrigin: "https://github.com"}); // indexedDB security origin 
// session is an editable and serializable object

// [...]
await page.restoreSession(session); // easy session restoration

What do you think of it?

Here is a link to the TODO-list

clouedoc, Apr 12 '21 12:04