CloudProxy
CloudProxy copied to clipboard
Session related suggestions
Hi mate, I see you converted to TS. Looks really great now and works nicely for me, especially using custom session profile, which seems to help avoiding trigger of Captcha after a couple of retries of the request.get. I have a few suggestions:
-
when using session param without custom User-Agent, a random browser profile folder is created instead of the custom named folder. Is there a reason for this? Why not create the named folder always when the session param is used? I believe that way the session and cookies would be reused even after a server or docker app restart.
-
I also noticed that named profile folders get deleted after server restart, I guess due to the non persistent nature of the Docker image data storage. Maybe we could add a conf param for a custom temp folder so that a persistent volume could be mounted to it?
-
lastly, when using request.get with a session param and the session does not exist it returns a notice to create a session first. Why not call sesssion.create like you call it if no session param is provided?
Anyway, thanks for the great app!
Jx-
-
Well, the idea of the sessions is to speed up the requests after passing the challenge for the first time. When you want to reset your cookies you can just create a new session. I might be able to add something like
"destroy": false"to thesessions.destroycommand, but to be effective, we'd have to place the profiles somewhere other than a temp directory. Maybe add an ENV variablePATH_TO_PROFILES=<path-to-profiles> || os.tmpdir()?NOTE: The original author edited the profile to change the UA, but I'm not sure if that approach is working properly right now. I need to do some testing.
-
If we implement a
PATH_TO_PROFILESENV variable this would be really easy just by editing the Dockerfile. We could add instructions in the README.md. -
I added this only to make it "stupid-proof." If people accidentally reference the wrong session then they get a message telling them to create one. This keeps people from misusing the
sessionsparameter and thinking something is broken when they have to keep waiting for 5 seconds to solving a captcha by accident. Plus, it allows you to get the heavy lifting of starting up the browser before you start sending requests so therequest.getspeeds will always be top-notch. However, if you think it would better to revert to the original functionality I'll consider it.
Any thoughts?
Hey again. Thanks for going through my suggestions.
- Yes, I did it that way in my dev version:
dockerfile:
ENV TMPDIR=/home/node/flaresolverr/tmpsessions.js:const tmpDir = process.env.TMPDIR || os.tmpdir()
This works fine and profiles are created in the new folder. I am just an absolute newbie to docker, so I still have to sort out how to mount a persistent volume. Also, if for ex. running on some free hoster space, most of the time volume mounting is not allowed, so not usable there.
-
Yep
-
For the named session creation in request.get I tried this:
const forceCreate = sessionId === params.session && !sessions.get(sessionId)
const browser = oneTimeSession || forceCreate ? await sessions.create(sessionId, params) : sessions.get(sessionId)
which works but caused some timeouts. Maybe its better to call sessions.create explicitly before request.get. It worked though in your original version, so I have to test that a bit more. Maybe a forceCreate param would be better?
Cheers,
Jx-
One more thing: I would like to add a headers parameter, so that we can pass these to https://github.com/puppeteer/puppeteer/blob/v3.3.0/docs/api.md#pagesetextrahttpheadersheaders . Some pages need correct headers to return proper responses.
In regards to explicit session creation, I think I'm going to keep it that way since it wouldn't work any other way with the enhancements I've added via #10.
Next on my list of things to do is a variable userDataDir parent directory.
Yeah I think this is great like it is now.
In regards to saving sessions, I've been playing around with setting a custom parent to the userDataDir (0f528b4436a918ba3ad510d6db5cab948ed3d1b0). However, it seems that creating a session and then loading the session again doesn't preserve cookies like I'd have expected...
I used this to test it:
const P = require('puppeteer')
const showCookies = async page => {
const pre = await page.$('pre')
const cookies = await pre.evaluate(e => e.innerText)
console.log(JSON.parse(cookies))
}
const setCookies = async path => {
const browser = await P.launch({
headless: true,
userDataDir: path
})
const page = await browser.newPage()
await page.goto('https://httpbin.org/cookies/set/test123/123')
await showCookies(page)
await browser.close()
}
const checkCookies = async path => {
const browser = await P.launch({
headless: true,
userDataDir: path
})
const page = await browser.newPage()
await page.goto('https://httpbin.org/cookies')
await showCookies(page)
await browser.close()
}
(async () => {
const path = 'sessions/test'
await setCookies(path)
await checkCookies(path)
})().catch(console.error)
While I'd expect the output to be:
{ cookies: { test123: '123' } }
{ cookies: { test123: '123' } }
The output is actually:
{ cookies: { test123: '123' } }
{ cookies: {} }
I've done a little Googling and it doesn't seem like you can reload profiles? Maybe I'm doing something wrong? Any ideas?
Hmmm, I haven't done any concrete tests, I only notice that the CF cookies stored in the session/profile folder SQLite DB are being used, because wile the first solving can take several retries and quite a bit of time, subsequent calls to the site opens very fast, obviously without needing to solve again. So as long as the container is running all seems to be OK with the cookies but since many container hosts use volatile storage, once they go idle, the profile folder is lost. My workaround now is to store the cookies locally and send them back to the proxy if not expired yet. This seems to work more or less, though. I found one issue with the documentation, that says mandatory fields for cookies are name and value but I had to send domain, too.
I can play a bit with your code. Wasn't aware of that httpbin until now :)
PS: what I see already is that using above code, for reasons I don't understand, the Default profile folder is not created at all. This contains the cookies DB. But maybe that's a Windows issue?
PPS: using Windows temp dir as parent folder works then with Default and cookies being created. Problem is that the call to httpbin doesn't seem to store the cookies correctly, the DB file stays empty. If you call another page, such as github, cookies are stored correctly. Interestingly, in Fiddler you see the the requests from Puppeteer are working and cookies returned as expected:
GET https://httpbin.org/cookies/set/test123/123 HTTP/1.1
HTTP/1.1 302 FOUND
Date: Thu, 23 Jul 2020 11:38:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 223
Connection: keep-alive
Server: gunicorn/19.9.0
Location: /cookies
Set-Cookie: test123=123; Path=/
which redirects to
GET https://httpbin.org/cookies HTTP/1.1
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2020 11:38:41 GMT
Content-Type: application/json
Content-Length: 44
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
{
"cookies": {
"test123": "123"
}
}
PPPS: I fiddled the proxy while calling a page after the challenge is solved and it definitely sends back the cookies stored in the session.
Cheers,
Jx-
Very interesting. Sorry it took forever to get back to you. Are you in the Discord server? https://discord.gg/pVT8rNy
I'll continue to look into this but I've been busier lately.