apify-sdk-js
apify-sdk-js copied to clipboard
Apify.call should "survive" migration
Right now, Apify.call doesn't save any state upon migration so if the user is not careful, he will loose reference to the called runs and they will be called again.
The current solution is to save the run ID into KV store and do some waiting loop manually.
Ideally Apify.call would do this automatically. Not sure if this can be done without any extra parameters automatically or we the user would have to specify it with a string param.
I can't much imagine how this would work. When you restart the actor, there is no way to make the execution jump right to Apify.call()
. The only way would be to let the actor repeat everything else, and then when it performs Apify.call()
with same parameters (???) as before, the function would just re-use an older call and return that instead? But what if the parameters are a bit different? And would the users understand this?
Sure, that is why we discuss it here.
Yes, the actor would repeat everything else.
It doesn't have be to necessarily default behavior (if we find it confusing) of Apify.call. But in 99.9% cases, you would like to continue waiting for the current call (without changing the input or settings). For example,the queue and list are also persisted so if you have a crawler and then Apify.call, after restart it will most likely just jump directly into the waiting for Apify.call to return.
Technically, we could add an idempotencyKey
parameter as we have with webhooks. Apify.call()
could then check if such key is present in KVS and if so, load the info from KVS, otherwise start a new call.
(yet another use for hidden fields in KVS)
Good idea, we can even generate the idempotencyKey
from actor run ID, so that we don't even need to store it.
@jancurn But then we could call
only once for a given target actor. If I Apify.call
multiple times, I need to be able to store those calls separately to be able to rerun them. Or am I missing something?
You can construct the idempotencyKey
from the call input params and run ID
and we need this more than ever now 😬 was going o post about call idempotencyKey, but here it is... so calling call
with the same key will either call another actor if it's not running or return the running instance when it was already invoked? how it would behave with calls using { waitSecs: 0 }
?
and thinking further, we can have an Apify.utils.waitCall
(s) function that waits for runIds, so people can use it to wait manually or it can be used for this idempotencyKey approach internally for Apify.call/callTask
https://github.com/pocesar/actor-spawn-workers/blob/15e3efe2c4969de5af61f967315d0611f1ab5a5b/src/functions.js#L11-L25
I agree to start with Apify.utils.waitCall
(maybe a bit different name - awaitRun?, waitForRunFinish?). We are copy/pasting this over again.
Then we can think further if/how to integrate this into Apify.call.
Guys, you know there is already an internal function called waitForRunToFinish ()
that does exactly that, in a slightly smarter way than active polling? See https://github.com/apifytech/apify-js/blob/f2d6e5c1f7c72be9b39d4c05912dd50b89228433/src/actor.js#L34
We can start by documenting it making it public.
exporting that right away would be nice.
right now we have 2 actors take literally takes a day to complete, and they wait on each others completion.
Apify.call()
inside the main actor, that is currently calling the actor again on each migration, leading to many instances running unnecessarily. so we need to start it once, save the call run id to the KV, use { waitSecs: 0 }
and actively poll it
Wouldn't that case be better served by a webhook?
we could, would need some refactoring and moving code outside
I think there is no harm making waitForRunToFinish()
publicly available, we just need to document it well. The taskId
param is not very nice, it's only used for the error message, so perhaps we could get rid of it in the public function and catch and rethrow better error if needed.
https://github.com/apifytech/apify-js/pull/675
I think this could be tackled similarly to the statistic class, but instead of an autoincrement id, quickly hash the input to make it unique between calls, this way allows more than 1 call/callTask:
const run = await Apify.call('apify/send-mail', {
to: 'email@example',
});
// if the run is done, it returns immediately, otherwise, waits.
the hash can be a simple CRC32 of the JSON.stringify
input, for efficiency and unlikely to have collisions inside the same run. hashable string would be like ${nameOfActor}${JSON.stringify(actorInput)}${idempotencyKey}
(add the other options as well?). to make it distinct using the same input (who knows), the user need to pass in the new idempotencyKey
to options, otherwise it's the current actor run ID, so this would require 0 changes to the platform (and works properly locally)
Calls are then saved to SDK_CALLS
KV, with the return metadata from Apify.client.tasks.runTask
{
"hexHashOfInput": {
"id": "HG7ML7M8z78YcAPEB", // the runId
"actorTaskId": "KJHSKHausidyaJKHs", // ditto
// save the bits we care about to be able to retrieve it later
}
}
the lookup for the input should be straightforward. querying the state of the run and issuing a waitForRunToFinish
in case { waitSecs: 0 }
.
Good part of using hashes: doesn't leak information about the call/callTask on KV, hash is space efficient and stable
Bad part: unintelligible and can't know before hand which call is what key in the object, harder to debug (unless it attaches the id to the call
function output)
@pocesar I think this is an ideal candidate for the apify-utils repo :)
BTW: I think your solution might be quite bloated on the storage. You would need to clean-up the records in the call as well. On the other hand, a single JSON might suffer when you call thousands of times and grow too large. Maybe some compromise :) Let's discuss it on apify-utils :)
@metalwarrior665 the thing is I plan to do a couple of utility PAs that other people can call from their actors (like a microservice). having this working at SDK level instead of having the person to require an external dependency on their code would be ideal
Guys, great points. Just letting you know that call
will be a part of apify-client
soon. But the persistence could still be part of SDK I guess. Dunno yet. Step by step. Obviously the client can do the persistence too.
example: had to write all the boilerplate in the README https://apify.com/pocesar/sitemap-to-request-queue otherwise it will do the same things again on migration and waste a couple of minutes to add 0 requests to the queue
This will be resolved by the new apify-extra package, will close this once it is released