Add get_run_state method to BasicCrawler
BasicCrawler has several internal state flags, and its internal component _autoscaled_pool has its own specific states. Add a public method that analyzes the internal state flags and the state of _autoscaled_pool and returns a public assessment of the crawler's state.
Discussed here: https://github.com/apify/crawlee-python/pull/921#discussion_r1923274335 (Update the test discussed there using the newly added get_run_state method.)
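A minimal sketch of what such a method could look like, using a stand-in class: the CrawlerRunState enum, the flag names, and the way the pool state is tracked are all assumptions for illustration, not the actual crawlee-python internals.

```python
from __future__ import annotations

from enum import Enum, auto


class CrawlerRunState(Enum):
    """Hypothetical public states derived from the crawler's internal flags."""

    NOT_STARTED = auto()
    RUNNING = auto()
    PAUSED = auto()
    FINISHED = auto()


class BasicCrawler:
    """Stripped-down stand-in for crawlee's BasicCrawler; only state tracking shown."""

    def __init__(self) -> None:
        self._started = False  # assumed internal flag
        self._finished = False  # assumed internal flag
        self._autoscaled_pool_running = False  # stand-in for the _autoscaled_pool state

    def get_run_state(self) -> CrawlerRunState:
        """Combine the internal flags into a single public assessment."""
        if not self._started:
            return CrawlerRunState.NOT_STARTED
        if self._finished:
            return CrawlerRunState.FINISHED
        if self._autoscaled_pool_running:
            return CrawlerRunState.RUNNING
        return CrawlerRunState.PAUSED
```

Returning an enum rather than a bare string would keep the public assessment type-safe and easy to extend.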
The name feels a bit too similar to crawler.useState(), which I can imagine we are also missing in Python?
https://github.com/apify/crawlee/blob/02a598c2a501957f04ca3a2362bcee289ef861c0/packages/basic-crawler/src/internals/basic-crawler.ts#L989
I admit that crawler.useState and crawler.getState could make users think the two are related functionality. Sadly, useState is a somewhat strange name, and I would be in favor of making it more explicit about what it actually does, something along the lines of getGlobalState, getSharedState, getSharedValues, getAutosavedState, or useSharedValues ...
crawler.getState -> returning for example "Running" seems pretty obvious, on the other hand. So I am not much in favor of choosing a worse name just to avoid colliding with useState.
Maybe crawler.getStatus or crawler.getCrawlingStatus would be explicit enough and not collide with useState?
I see that in JS there is some code related to StatusMessage, which actually seems to be related to the crawler's internal state as well.
Even if we rename it, it sounds like a bad idea to introduce something completely different with such a similar name. And I would personally not rename it: it's a very common thing to use, unlike the new method you are proposing here. I would rather pick a longer name for that one. getCrawlingStatus sounds fine-ish.
How does get_run_state sound?
I did an initial draft, but while doing it I started questioning how useful it is. So I would stop here until we see an actual need for this kind of functionality.
Hi, one use case is controlling the crawler from another process.
Not sure if there is a better way, but I am doing it because I do heavy processing on the web content, so it does not "fit well" inside the default async request handler.
For example, mixing an LLM and crawlee to navigate: another process controls crawlee and checks whether it is OK, is running, or has failed. A rough sketch of the setup is below.
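Roughly, it looks like this (a minimal sketch; the shared status buffer and the run_crawler stand-in are illustrative, not a crawlee API):

```python
import multiprocessing as mp
import time


def run_crawler(state) -> None:
    """Child process: stand-in for the actual crawlee run, reporting its state."""
    state.value = b"running"
    try:
        time.sleep(2)  # placeholder for asyncio.run(crawler.run(...))
        state.value = b"finished"
    except Exception:
        state.value = b"failed"


if __name__ == "__main__":
    # Fixed-size shared byte buffer the controlling process can poll.
    state = mp.Array("c", 16)
    state.value = b"not_started"

    proc = mp.Process(target=run_crawler, args=(state,))
    proc.start()

    while proc.is_alive():
        print("crawler state:", state.value.decode())
        time.sleep(0.5)

    proc.join()
    print("final state:", state.value.decode())
```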
Hi, I am thinking about your use case, and I am not sure that my draft implementation would help there.
If you run some long-running sync blocking code in request_handler, then other async functions (in the same process) do not get any chance to run until that blocking sync code finishes. So even the get_run_state async function would not run until that request_handler is finished. So how could another process tell that the "crawler" process is currently blocked by the request_handler?
It is definitely something to think about, thanks for the feedback and use case example!
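To illustrate the blocking problem (the function names here are made up for the example): sync work called directly in an async handler starves every other coroutine in the loop, while offloading it with asyncio.to_thread keeps the loop responsive:

```python
import asyncio
import time


def heavy_processing(page_text: str) -> int:
    """Stand-in for CPU-heavy sync work, e.g. feeding page content to an LLM pipeline."""
    time.sleep(3)  # would freeze the whole event loop if called directly
    return len(page_text)


async def request_handler(page_text: str) -> None:
    # BAD: calling heavy_processing(page_text) here would block every other
    # coroutine, including any hypothetical get_run_state() polling in this loop.
    # Offloading to a worker thread keeps the loop responsive:
    result = await asyncio.to_thread(heavy_processing, page_text)
    print(f"processed {result} characters")


async def watchdog() -> None:
    """Only gets a chance to run while the handler's sync work is off-thread."""
    for _ in range(4):
        print("event loop is alive")
        await asyncio.sleep(1)


async def main() -> None:
    await asyncio.gather(request_handler("<html>...</html>"), watchdog())


asyncio.run(main())
```

Note that asyncio.to_thread only helps if the heavy work releases the GIL (I/O, many C extensions); for pure-Python CPU-bound work, a separate process, as you describe, remains the more robust option.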
Yes, agreed, that's my point ... so I have to run crawlee in its own process and "query" the crawler's state from another process.
Thanks!
There are certain limitations due to the current design of crawlee; somewhat related, for example, is https://github.com/apify/crawlee-python/issues/908
It is good to collect real-world use cases so that any future potential redesign can take them into consideration. But to be honest with you, we currently do not have any such redesign in the roadmap for the near future.