apify-sdk-js icon indicating copy to clipboard operation
apify-sdk-js copied to clipboard

Apify.call should "survive" migration

Open metalwarrior665 opened this issue 5 years ago • 21 comments

Right now, Apify.call doesn't save any state upon migration so if the user is not careful, he will loose reference to the called runs and they will be called again.

The current solution is to save the run ID into KV store and do some waiting loop manually.

Ideally Apify.call would do this automatically. Not sure if this can be done without any extra parameters automatically or we the user would have to specify it with a string param.

metalwarrior665 avatar Feb 13 '20 12:02 metalwarrior665

I can't much imagine how this would work. When you restart the actor, there is no way to make the execution jump right to Apify.call(). The only way would be to let the actor repeat everything else, and then when it performs Apify.call() with same parameters (???) as before, the function would just re-use an older call and return that instead? But what if the parameters are a bit different? And would the users understand this?

jancurn avatar Feb 13 '20 12:02 jancurn

Sure, that is why we discuss it here.

Yes, the actor would repeat everything else.

It doesn't have be to necessarily default behavior (if we find it confusing) of Apify.call. But in 99.9% cases, you would like to continue waiting for the current call (without changing the input or settings). For example,the queue and list are also persisted so if you have a crawler and then Apify.call, after restart it will most likely just jump directly into the waiting for Apify.call to return.

metalwarrior665 avatar Feb 13 '20 13:02 metalwarrior665

Technically, we could add an idempotencyKey parameter as we have with webhooks. Apify.call() could then check if such key is present in KVS and if so, load the info from KVS, otherwise start a new call.

mnmkng avatar Feb 13 '20 14:02 mnmkng

(yet another use for hidden fields in KVS)

mnmkng avatar Feb 13 '20 14:02 mnmkng

Good idea, we can even generate the idempotencyKey from actor run ID, so that we don't even need to store it.

jancurn avatar Feb 13 '20 15:02 jancurn

@jancurn But then we could call only once for a given target actor. If I Apify.call multiple times, I need to be able to store those calls separately to be able to rerun them. Or am I missing something?

mnmkng avatar Feb 13 '20 15:02 mnmkng

You can construct the idempotencyKey from the call input params and run ID

jancurn avatar Feb 13 '20 15:02 jancurn

and we need this more than ever now 😬 was going o post about call idempotencyKey, but here it is... so calling call with the same key will either call another actor if it's not running or return the running instance when it was already invoked? how it would behave with calls using { waitSecs: 0 }?

and thinking further, we can have an Apify.utils.waitCall(s) function that waits for runIds, so people can use it to wait manually or it can be used for this idempotencyKey approach internally for Apify.call/callTask

https://github.com/pocesar/actor-spawn-workers/blob/15e3efe2c4969de5af61f967315d0611f1ab5a5b/src/functions.js#L11-L25

pocesar avatar May 04 '20 18:05 pocesar

I agree to start with Apify.utils.waitCall (maybe a bit different name - awaitRun?, waitForRunFinish?). We are copy/pasting this over again.

Then we can think further if/how to integrate this into Apify.call.

metalwarrior665 avatar May 06 '20 08:05 metalwarrior665

Guys, you know there is already an internal function called waitForRunToFinish () that does exactly that, in a slightly smarter way than active polling? See https://github.com/apifytech/apify-js/blob/f2d6e5c1f7c72be9b39d4c05912dd50b89228433/src/actor.js#L34

We can start by documenting it making it public.

jancurn avatar May 06 '20 08:05 jancurn

exporting that right away would be nice.

right now we have 2 actors take literally takes a day to complete, and they wait on each others completion. Apify.call() inside the main actor, that is currently calling the actor again on each migration, leading to many instances running unnecessarily. so we need to start it once, save the call run id to the KV, use { waitSecs: 0 } and actively poll it

pocesar avatar May 06 '20 16:05 pocesar

Wouldn't that case be better served by a webhook?

mnmkng avatar May 06 '20 16:05 mnmkng

we could, would need some refactoring and moving code outside

pocesar avatar May 06 '20 18:05 pocesar

I think there is no harm making waitForRunToFinish() publicly available, we just need to document it well. The taskId param is not very nice, it's only used for the error message, so perhaps we could get rid of it in the public function and catch and rethrow better error if needed.

jancurn avatar May 06 '20 19:05 jancurn

https://github.com/apifytech/apify-js/pull/675

mnmkng avatar May 07 '20 10:05 mnmkng

I think this could be tackled similarly to the statistic class, but instead of an autoincrement id, quickly hash the input to make it unique between calls, this way allows more than 1 call/callTask:

const run = await Apify.call('apify/send-mail', {
  to: 'email@example',
});
// if the run is done, it returns immediately, otherwise, waits.

the hash can be a simple CRC32 of the JSON.stringify input, for efficiency and unlikely to have collisions inside the same run. hashable string would be like ${nameOfActor}${JSON.stringify(actorInput)}${idempotencyKey} (add the other options as well?). to make it distinct using the same input (who knows), the user need to pass in the new idempotencyKey to options, otherwise it's the current actor run ID, so this would require 0 changes to the platform (and works properly locally)

Calls are then saved to SDK_CALLS KV, with the return metadata from Apify.client.tasks.runTask

{
   "hexHashOfInput": {
      "id": "HG7ML7M8z78YcAPEB", // the runId
      "actorTaskId": "KJHSKHausidyaJKHs", // ditto
      // save the bits we care about to be able to retrieve it later
   }
}

the lookup for the input should be straightforward. querying the state of the run and issuing a waitForRunToFinish in case { waitSecs: 0 }.

Good part of using hashes: doesn't leak information about the call/callTask on KV, hash is space efficient and stable Bad part: unintelligible and can't know before hand which call is what key in the object, harder to debug (unless it attaches the id to the call function output)

pocesar avatar Sep 30 '20 19:09 pocesar

@pocesar I think this is an ideal candidate for the apify-utils repo :)

BTW: I think your solution might be quite bloated on the storage. You would need to clean-up the records in the call as well. On the other hand, a single JSON might suffer when you call thousands of times and grow too large. Maybe some compromise :) Let's discuss it on apify-utils :)

metalwarrior665 avatar Sep 30 '20 19:09 metalwarrior665

@metalwarrior665 the thing is I plan to do a couple of utility PAs that other people can call from their actors (like a microservice). having this working at SDK level instead of having the person to require an external dependency on their code would be ideal

pocesar avatar Sep 30 '20 19:09 pocesar

Guys, great points. Just letting you know that call will be a part of apify-client soon. But the persistence could still be part of SDK I guess. Dunno yet. Step by step. Obviously the client can do the persistence too.

mnmkng avatar Sep 30 '20 19:09 mnmkng

example: had to write all the boilerplate in the README https://apify.com/pocesar/sitemap-to-request-queue otherwise it will do the same things again on migration and waste a couple of minutes to add 0 requests to the queue

pocesar avatar Oct 02 '20 17:10 pocesar

This will be resolved by the new apify-extra package, will close this once it is released

metalwarrior665 avatar Dec 14 '22 11:12 metalwarrior665