apify-client-js

`.actor().call()` method to set the correct timeout, show the progress in status message, and stream logs

Open mtrunkat opened this issue 11 months ago • 1 comments

I was trying out the https://apify.com/jakub.kopecky/llmstxt-generator Actor, and the experience was not great for the following reasons:

Timeout

The Actor above was started with a timeout of 18,000 seconds, but the Website Content Crawler (WCC) it calls is triggered with a default timeout of 360,000. So it may happen that the original Actor times out while the WCC keeps running. IMHO, in this case, we should set the called Actor's timeout to the time remaining for the original Actor.

There might be cases when this is not appropriate, so this behavior could be opt-in or out.

Logs

The Actor calls WCC underneath, which may take a long time to finish for a large website. This means that the Actor seems to get stuck on the following log:

2025-01-23T13:56:06.535Z ACTOR: Pulling Docker image of build OQWIcf5rmeLt4icyd from repository.
2025-01-23T13:56:08.308Z ACTOR: Creating Docker container.
2025-01-23T13:56:08.850Z ACTOR: Starting Docker container.
2025-01-23T13:56:11.052Z [apify] INFO  Initializing Actor...
2025-01-23T13:56:11.054Z [apify] INFO  System info ({"apify_sdk_version": "2.1.0", "apify_client_version": "1.8.1", "crawlee_version": "0.4.5", "python_version": "3.12.8", "os": "linux"})
2025-01-23T13:56:11.119Z [apify] INFO  Starting the "apify/website-content-crawler" actor for URL: https://docs.apify.com/

So, I am thinking about improving the .actor().call() method in the SDK/client so that it lets developers optionally stream the log from the Actor called via .call() to provide progress/context info.
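A minimal sketch of what the log redirection could look like, assuming the called run's log arrives as arbitrary text chunks. `createLogRedirector` and the prefix format are hypothetical names used for illustration, not part of apify-client, and obtaining the actual log stream is left out:

```typescript
// Hypothetical helper (not part of apify-client): splits streamed log chunks
// into complete lines and re-emits each line with a prefix identifying the
// called Actor, so it can be told apart from the caller's own log.
function createLogRedirector(prefix: string, emit: (line: string) => void) {
    let buffer = '';
    return {
        // Feed a raw chunk as it arrives; chunks may end in the middle of a line.
        push(chunk: string): void {
            buffer += chunk;
            const lines = buffer.split('\n');
            // The last element is an incomplete line (or an empty string); keep it buffered.
            buffer = lines.pop() ?? '';
            for (const line of lines) emit(`${prefix} ${line}`);
        },
        // Flush whatever is left once the stream ends.
        end(): void {
            if (buffer.length > 0) emit(`${prefix} ${buffer}`);
            buffer = '';
        },
    };
}

// Usage with made-up chunks that split a line in the middle.
const redirected: string[] = [];
const redirector = createLogRedirector('[called-actor]', (line) => redirected.push(line));
redirector.push('first li');
redirector.push('ne\nsecond line\nthird');
redirector.end();
```

Buffering by line keeps each redirected line intact even when the stream delivers chunks that end in the middle of a line.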

Status message

Finally, it displays a dummy status message that does not communicate progress. The call could automatically update the status message, for example, here, with:

Running Website Content Crawler: processed 235/7876
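A rough sketch of how such a message could be built and deduplicated. `CrawlProgress`, `formatStatusMessage`, and `createStatusReporter` are hypothetical names; where the counters come from and how the message is actually written to the run are assumptions left to the caller:

```typescript
// Hypothetical progress shape; the counters are assumed to be read from the called run.
interface CrawlProgress {
    handled: number;
    total: number;
}

// Builds the message in the format suggested above.
function formatStatusMessage(actorTitle: string, progress: CrawlProgress): string {
    return `Running ${actorTitle}: processed ${progress.handled}/${progress.total}`;
}

// Wraps a setter so that an unchanged message is not sent again.
// `setStatus` stands in for whatever call really updates the status message.
function createStatusReporter(setStatus: (message: string) => void) {
    let last: string | undefined;
    return (actorTitle: string, progress: CrawlProgress): void => {
        const message = formatStatusMessage(actorTitle, progress);
        if (message === last) return;
        last = message;
        setStatus(message);
    };
}

// Usage: the second call repeats the same progress and is skipped.
const sentMessages: string[] = [];
const report = createStatusReporter((message) => sentMessages.push(message));
report('Website Content Crawler', { handled: 235, total: 7876 });
report('Website Content Crawler', { handled: 235, total: 7876 });
report('Website Content Crawler', { handled: 236, total: 7876 });
```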

mtrunkat avatar Jan 28 '25 13:01 mtrunkat

You can see @MQ37 improving this on the Actor side: https://github.com/apify/actor-llmstxt-generator/pull/10

mtrunkat avatar Jan 28 '25 13:01 mtrunkat

Regarding the timeouts: currently there is no public way of getting the runtime of the actor run, specifically the runtime of the last actor run segment, which in the context of this issue is the actor run. (A resurrected actor run consists of several run segments, and from the point of view of the API, the actor run is the sum of all its run segments.)

The actor-run-get endpoint gives overall statistics that include resurrected runs. So if the actor was resurrected, the runTimeSecs available through the API is the sum of the runtimes of all run segments, and started_at is the start time of the first run segment.

With the current state, the best we can do is probably to calculate the runtime of the current run if the actor was never resurrected, subtract it from the current actor's timeout, and set the result as the timeout for the actor newly started through actor.call. If the actor was already resurrected, we can at least limit the actor.call timeout to the top-level actor run's timeout, but we can't reduce it by the already used time in this case.

This might still cover quite a lot of use-cases so it might be sufficient improvement.
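The rule described above could be sketched roughly as follows. `ParentRunInfo` and `childTimeoutSecs` are hypothetical names, and the `wasResurrected` flag is an assumption: how to detect resurrection is not settled in this thread.

```typescript
// Hypothetical summary of what is known about the calling (top-level) run.
interface ParentRunInfo {
    timeoutSecs: number;     // timeout the calling run was started with
    runTimeSecs: number;     // runtime reported by the API (summed over all run segments)
    wasResurrected: boolean; // assumption: the caller can somehow tell this
}

// Picks the timeout to pass to the Actor started via actor.call.
function childTimeoutSecs(parent: ParentRunInfo): number {
    if (parent.wasResurrected) {
        // runTimeSecs covers earlier segments too, so it cannot be subtracted;
        // the best available bound is the caller's full timeout.
        return parent.timeoutSecs;
    }
    // Never resurrected: runTimeSecs is the runtime of the only segment.
    return Math.max(0, parent.timeoutSecs - parent.runTimeSecs);
}

// Usage with the 18,000 second timeout from the original report and made-up runtimes.
const freshRunTimeout = childTimeoutSecs({ timeoutSecs: 18000, runTimeSecs: 600, wasResurrected: false });
const resurrectedRunTimeout = childTimeoutSecs({ timeoutSecs: 18000, runTimeSecs: 25000, wasResurrected: true });
```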

(Further details in: https://apify.slack.com/archives/C010Q0FBYG3/p1747384294383579)

Pijukatel avatar May 16 '25 09:05 Pijukatel

Well, there seems to be information that can be used after all. configuration.timeout_at actually stores only the timeout of the last run segment, so it should be possible to use it.
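Assuming the timeout timestamp of the current run segment is available as a date, the remaining time could be computed like this. `remainingTimeoutSecs` is a hypothetical helper, and how the JS side would obtain the timestamp is not shown:

```typescript
// Hypothetical helper: seconds left until the current run segment times out.
// `timeoutAt` is assumed to be the segment's timeout timestamp.
function remainingTimeoutSecs(timeoutAt: Date, now: Date = new Date()): number {
    const remainingMs = timeoutAt.getTime() - now.getTime();
    // Round down so the called Actor never outlives the caller; clamp at zero
    // in case the timestamp is already in the past.
    return Math.max(0, Math.floor(remainingMs / 1000));
}

// Usage with fixed, made-up timestamps so the result is deterministic.
const timeoutAt = new Date('2025-01-23T18:56:11.000Z');
const now = new Date('2025-01-23T13:56:11.500Z');
const remaining = remainingTimeoutSecs(timeoutAt, now);
```

The result would then be passed as the timeout of the Actor started via .call(), which keeps the called run from outliving the caller.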

Pijukatel avatar May 16 '25 09:05 Pijukatel

> Regarding the timeouts. Currently there is no public way of getting the runtime of the actor run. Specifically runtime of last actor run segment, which in the context of this issue is the actor run. (Resurrected actor run consists of several run segments and from the point of view of API, the actor run is sum of all its run segments)

@jirimoravcik, this is something we should solve at the worker, perhaps with an env-var noting the timeout or startup time if we don't want to expand the API with this.

mtrunkat avatar May 19 '25 08:05 mtrunkat

> > Regarding the timeouts. Currently there is no public way of getting the runtime of the actor run. Specifically runtime of last actor run segment, which in the context of this issue is the actor run. (Resurrected actor run consists of several run segments and from the point of view of API, the actor run is sum of all its run segments)
>
> @jirimoravcik, this is something we should solve at the worker, perhaps with an env-var noting the timeout or startup time if we don't want to expand the API with this.

We set the runtime.timeoutAt when the Actor run is created, but it seems we don't publish that field in the API.

Also the runtime.timeoutAt doesn't work for standby runs, because the timeout is dynamic there based on incoming and in-flight requests...

jirimoravcik avatar May 20 '25 12:05 jirimoravcik

> Also the runtime.timeoutAt doesn't work for standby runs, because the timeout is dynamic there based on incoming and in-flight requests...

What value does it actually have in such a case?

Pijukatel avatar May 20 '25 13:05 Pijukatel

This is now fully implemented in the Python version of crawlee and has to be migrated to the JS version. For reference, see the linked issues mentioned in this conversation.

Pijukatel avatar Jun 19 '25 12:06 Pijukatel