crawlee
Add a new `dontCreateNewSessions` option to SessionPool
or `useExistingSessions`, `readOnly`, etc.
Might sound counter-intuitive, but I'm trying to generate sessions elsewhere, then use only the available sessions in another actor. This setting would make `getSession()` only pick sessions, and it would not try to mutate the sessions array.
This would be better than the current workaround of setting `maxPoolSize` to `0`, because that workaround still tries to call `createSessionFunction` if there are no usable sessions available, and it ends up removing existing sessions for which `isUsable()` returns false.
this could be solved by #635
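To make the request concrete, here is a minimal sketch of what a "pick only" pool could look like. It is not the real `SessionPool`: `StoredSession`, `PickOnlyPool`, and the `usable` flag (standing in for `isUsable()`) are hypothetical names used only for illustration.

```typescript
// Hypothetical stand-in types; this is NOT the real SessionPool API.
interface StoredSession {
  id: string;
  usable: boolean; // stands in for session.isUsable()
  cookies: { [name: string]: string };
}

class PickOnlyPool {
  // Frozen copy, so the list cannot be changed after construction.
  private readonly sessions: readonly StoredSession[];

  constructor(sessions: StoredSession[]) {
    this.sessions = Object.freeze(sessions.slice());
  }

  // Picks a random usable session. Never creates or removes sessions.
  getSession(random: () => number = Math.random): StoredSession | undefined {
    const usable = this.sessions.filter((s) => s.usable);
    if (usable.length === 0) return undefined;
    return usable[Math.floor(random() * usable.length)];
  }

  get size(): number {
    return this.sessions.length;
  }
}

// Usage: the unusable session is skipped but stays in the pool.
const examplePool = new PickOnlyPool([
  { id: 'a', usable: true, cookies: { auth: 'token-a' } },
  { id: 'b', usable: false, cookies: { auth: 'token-b' } },
]);
const picked = examplePool.getSession();
```

In this sketch an exhausted pool simply yields `undefined`; whether the real option should do that or something else is left open.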
Isn't random picking without new session creation already enabled after reaching the `maxPoolSize`? Why would you want new sessions not to be created when there are fewer than that number after some were retired? Wouldn't that eventually leave an empty pool?
@cybairfly To be able to unpair session creation from session consumption. With this option, you could have an actor that creates the sessions and periodically persists them into KVS and then consumer actors that use those sessions.
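A rough sketch of that split, under loud assumptions: the plain object below stands in for a named key-value store, and `SessionRecord`, `STATE_KEY`, `persistSessions`, and `loadSessions` are hypothetical names, not part of the SDK. The producer only appends; the consumer only reads.

```typescript
// Hypothetical record shape and store; a plain object stands in for a named KVS.
interface SessionRecord {
  id: string;
  cookies: { [name: string]: string };
}

type KeyValueStore = { [key: string]: string };

const STATE_KEY = 'SESSION_POOL_STATE'; // assumed key name, for illustration only

// Consumer side: read-only access to whatever the producer persisted.
function loadSessions(store: KeyValueStore): SessionRecord[] {
  const raw = store[STATE_KEY];
  return raw === undefined ? [] : (JSON.parse(raw) as SessionRecord[]);
}

// Producer side: append sessions whose id is not stored yet, keep the rest untouched.
function persistSessions(store: KeyValueStore, fresh: SessionRecord[]): void {
  const merged = loadSessions(store);
  const knownIds: { [id: string]: boolean } = {};
  for (const session of merged) knownIds[session.id] = true;
  for (const session of fresh) {
    if (!knownIds[session.id]) {
      merged.push(session);
      knownIds[session.id] = true;
    }
  }
  store[STATE_KEY] = JSON.stringify(merged);
}

// Usage: two producer runs, then a consumer read.
const sharedStore: KeyValueStore = {};
persistSessions(sharedStore, [{ id: 'a', cookies: { auth: 'first' } }]);
persistSessions(sharedStore, [
  { id: 'a', cookies: { auth: 'second' } }, // already stored, so it is skipped
  { id: 'b', cookies: { auth: 'third' } },
]);
const available = loadSessions(sharedStore);
```

Keeping the consumer read-only means it never needs write access to the shared store, at the cost of the producer not learning about sessions that went bad.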
Hey guys (mostly @pocesar at this point). Could you please elaborate on your use-case a bit?
The question is: how should the session pool act when sessions are generated elsewhere? Should it just be initialized with these 'remote' sessions and, once they have expired, return undefined (since retired sessions will be cleaned up and there will be nothing to pick)? Alternatively, it could start generating sessions only once none of the initial sessions are usable anymore. This is simple to implement; we just need to agree on what to do once the initial sessions are retired.
Another option could be the following: one actor generates sessions in a certain specific way, and the session pool in another actor just gets these sessions. In this case the problem is that the two actors have to be kept in sync (the first should know about retired sessions, the second should get the new sessions). This is more complicated.
So it would be nice to understand what behavior is expected. Also CC @VaclavRut (you also mentioned it would be useful for one of your projects, so the question is the same: what's expected here?).
The idea is that the sessions generated there are "sticky" ones; they shouldn't disappear after closing the page, for example, since they contain authentication tokens that will be reused across a lot of URLs. One example is Twitter: you have one cookie that is responsible for the "logged in" state, and even though it comes with an expiration, I'm setting it as 'new' every time a page is opened. The only way to do this at the moment is through `page.setCookies()`.
When opening a "remote" SessionPool, using `getSession` does a lot of things beyond just getting a session, so I need to rely on receiving those "sticky cookies" attached to the session (even without expiration, or with expiration dates far in the future). The workaround is to access the `sessions` property directly.
Ideally, I'd like the pool to be in a "pick only" mode, aka a "do not mutate" mode. I have https://apify.com/pocesar/login-session, which expects the sessions from the pool to always be available, using a named KV. In theory, it should only append new sessions there on every run of this actor, and the actors that are the consumers of this "remote session pool" would pick from this pool. If I want to retire a session and/or its cookies, I'd need to do so explicitly (for example on a 403 from `response.status()` or on a redirect to the login page).
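The explicit retirement rule described above (a 403, or ending up on the login page) could be sketched roughly as follows. `ResponseInfo`, `shouldRetireSession`, and the `/login` path are assumptions made for illustration; in a real crawler the status and final URL would come from the response object.

```typescript
// Hypothetical helper; the response shape and the login path are assumptions.
interface ResponseInfo {
  status: number;
  finalUrl: string; // URL after any redirects were followed
}

// Extracts the path from an absolute URL without relying on runtime globals.
function pathOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/[^\/?#]*([^?#]*)/i.exec(url);
  return match && match[1] ? match[1] : '/';
}

// True when the session should be retired explicitly by the consumer.
function shouldRetireSession(response: ResponseInfo, loginPath: string = '/login'): boolean {
  if (response.status === 403) return true;
  const path = pathOf(response.finalUrl);
  return path === loginPath || path.indexOf(loginPath + '/') === 0;
}
```

The consumer would call this after each request and retire the session itself, so nothing is retired behind its back.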
> Ideally, I'd like the pool to be a in a "pick only" mode, aka, a "do not mutate".
So you want the KV pool to be left untouched even if some of the sessions used by the consumer become unusable?
Or, since you're saying that you need to "explicitly retire", what you're actually looking for is the ability to turn off the automatic retirement of sessions that's in the crawlers?
Regarding `page.setCookies()`: I guess it's about that inconsistency between the puppeteer and tough-cookie formats, which should be resolved now.
Also:
- `.getSession()` will now support getting a session by id, so either the specific session is returned, or nothing.
- There's now a separate `.addSession()` (so you could create a session with an id, add some cookies to it, and then add it to the pool).

So in theory it should work even now, if you initialize the session pool from some KV store, get sessions by ID, set cookies manually, retire sessions manually and so on.
But still, if there is going to be an option to 'not mutate' the pool and thus use it 'automatically' (without manual picking of sessions, etc.), what should happen to bad sessions, and what should happen once there aren't any usable sessions anymore?