zenoh-python icon indicating copy to clipboard operation
zenoh-python copied to clipboard

[Bug] Session.get handler only calls done callback in v0.11.0-rc.2

Open fsteff opened this issue 9 months ago • 9 comments

Describe the bug

Session.get behaves differently in 0.11.0-rc.2 compared to 0.10.1rc0. When using the Closure reply handler (reply and done callbacks), the reply callback is not called.

To reproduce

Considering the following example:

import time
import zenoh

session1 = zenoh.open()
session2 = zenoh.open()

qable = session1.declare_queryable('test', lambda req: req.reply(zenoh.Sample('test', 'ok')))


session2.get('test', 
             (lambda reply: print(f'reply "{reply.ok.payload.decode()}"'),
                      lambda: print('done')))

time.sleep(1)
qable.undeclare()
session1.close()
session2.close()

With version 0.10.1rc0 the console output is as expected:

reply "ok"
done

With version 0.11.0-rc.2 only the done callback is called:

done

System info

  • Platform Win11
  • Python version 3.10.9
  • eclipse-zenoh version 0.11.0-rc.2

fsteff avatar May 06 '24 09:05 fsteff

~~Could this be related to https://github.com/eclipse-zenoh/zenoh/pull/816?~~

fsteff avatar May 06 '24 11:05 fsteff

@fsteff the PR you linked is not merge into main yet os it can't be related. @wyfo can you look into it?

Mallets avatar May 06 '24 14:05 Mallets

I can reproduce the issue, I'm on it.

wyfo avatar May 06 '24 15:05 wyfo

It seems that Session.declare_queryable changed in 0.11; the operation is no more taken in account instantly. In fact, adding time.sleep(1) after session1.declare_queryable(...) fix the issue. @Mallets Do you remind about declare_queryable behavior modification?

wyfo avatar May 06 '24 19:05 wyfo

I can confirm that a short sleep resolves the issue, also in more complex test cases. Actually even time.sleep(.000000001) is enough. Sounds like some sort of concurrency problem (OS context switch is enough?).

fsteff avatar May 07 '24 06:05 fsteff

Looks like this won't be a problem in most cases. But it breaks most of our unit tests...

fsteff avatar May 07 '24 06:05 fsteff

Hello @fsteff! The example you provided has never been guaranteed to work. It always has been a race condition between the propagation of the queryable declaration to session2 through the net stack on one side and the execution of the main thread on the second side. In 0.10.0-rc the scheduling was so that it almost always worked in python. In 0.11.0-rcX we ported zenoh from async-std to tokio which greatly impacted the scheduling.

OlivierHecart avatar May 07 '24 10:05 OlivierHecart

@OlivierHecart I suspected as much. Already fixed most unit tests and upgraded our stack :) It was convenient that this "just worked" - is there a good way to verify that queryables have been propagated? Calling sleep(...) works most of the time, but that's not really a clean solution. The only method I know of utilizes a router's admin space, but out tests work purely p2p.

fsteff avatar May 07 '24 10:05 fsteff

Arguably a new entity, let's say it a querier, should be introduce to serve a similar to a publisher. In that way, it may become easier to know when there is at least one matching queryable. But my comment applies more generic to Zenoh as whole and not only to the Python API.

Mallets avatar May 07 '24 12:05 Mallets