dora icon indicating copy to clipboard operation
dora copied to clipboard

Tracking issue for distributed dora dataflow

Open haixuanTao opened this issue 1 year ago • 17 comments

Describe the bug Dora remote connection has been long awaited. We however need to fix a couple of things to make it work:

Tracking issue

  • [x] Make hard-coded localhost address configurable in the daemon and the coordinator
  • [ ] Make working_dir configurable dependent on the machine on multi daemon situation.

Update

Additional Tracking issue

  • [x] Make CLI coordinator connection configurable.

Currently the CLI can only connect to the local preconfigured coordinator. We should make it configurable such as:

dora start dataflow.yaml --coordinator-addr IP:PORT

https://github.com/dora-rs/dora/blob/ea47a556a77f6c64f29ab07939d747edc8573671/binaries/cli/src/main.rs#L469-L472

  • [ ] Make working_dir configurable dependent on the machine on multi daemon situation.

haixuanTao avatar Apr 07 '24 09:04 haixuanTao

[ ] Make hard-coded localhost address configurable in the daemon and the coordinator

Which hardcoded addresses do you mean? This should be already configurable by passing a --coordinator-addr argument when spawning the daemon. You should also pass a --machine-id in this case.

phil-opp avatar Apr 08 '24 15:04 phil-opp

Yeah sorry could have been more precise:

  • https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/daemon/src/inter_daemon.rs#L66-L88
  • https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/coordinator/src/listener.rs#L9-L20

haixuanTao avatar Apr 08 '24 15:04 haixuanTao

Ah yes, we should use 0.0.0.0 in the coordinator to listen for incoming connections on all interfaces. For the daemon, the 127.0.0.1 should be fine since only local nodes should initiate connections to the daemon.

phil-opp avatar Apr 08 '24 16:04 phil-opp

I have just looked rapidly but isn't this: - https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/daemon/src/inter_daemon.rs#L66-L88 for inter_daemon connection?

haixuanTao avatar Apr 08 '24 18:04 haixuanTao

Ah yes, then we should listen on 0.0.0.0 in the deamon as well.

phil-opp avatar Apr 08 '24 18:04 phil-opp

Make working_dir configurable dependent on the machine on multi daemon situation.

Should dora attempt to create the working_dir if it doesn't exist or fail out with an error?

Ah yes, then we should listen on 0.0.0.0 in the deamon as well.

Should 0.0.0.0 be the default, or should it not be configurable? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common - cargo run -- --bind <vpn-interface-ip>:<port>)

Michael-J-Ward avatar Apr 11 '24 11:04 Michael-J-Ward

Make working_dir configurable dependent on the machine on multi daemon situation.

Should dora attempt to create the working_dir if it doesn't exist or fail out with an error?

So there is couple of things that we consider: https://github.com/dora-rs/dora/pull/256#issuecomment-1514362986

I think that we could maybe as a first iteration expect the working_dir to exist with the right files in it, and error out on missing files. We can in a follow up PR handle creating missing folder.

For more, see: https://github.com/dora-rs/dora/pull/256#issuecomment-1514362986

Ah yes, then we should listen on 0.0.0.0 in the deamon as well.

Should 0.0.0.0 be the default, or should it not be configurable? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common - cargo run -- --bind <vpn-interface-ip>:<port>)

It does seem like being configurable would be a plus. I think that by revisiting the port argument it should be fairly straightforward to make addr configurable.

haixuanTao avatar Apr 11 '24 12:04 haixuanTao

As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.

phil-opp avatar Apr 11 '24 12:04 phil-opp

As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.

Correct- binding to 0.0.0.0 will listen all interfaces available to the process.

Binding to <vpn-net-ip> commonly restricts connections to only allow other machines on the VPN.

Michael-J-Ward avatar Apr 11 '24 13:04 Michael-J-Ward

Ah, that makes sense, thanks for clarifying! I agree that making this configurable is a good idea then.

phil-opp avatar Apr 11 '24 13:04 phil-opp

Many of the cli commands end up using the connect_to_coordinator function.

Should all of these be made configurable as well?

https://github.com/dora-rs/dora/blob/1c2dc46ac272ce607200f089e3c43cdf76daf2a8/binaries/cli/src/main.rs#L435-L437

Michael-J-Ward avatar Apr 17 '24 00:04 Michael-J-Ward

Reupping the above after taking a closer look at the daemon CLI command, which explicitly ignores the coordination addr when startingn a dataflow and advises

Please use the start command for remote coordinator

But start uses the default connect_to_coordinator from above, not using any passed arguments.

https://github.com/dora-rs/dora/blob/1c2dc46ac272ce607200f089e3c43cdf76daf2a8/binaries/cli/src/main.rs#L291-L313

I'm asking for clarity because I know dora is going through rapid iteration, so I'm uncertain what the current "expected" user routes are.

Michael-J-Ward avatar Apr 17 '24 17:04 Michael-J-Ward

I understand, just want to say thanks for all the effort you put in debugging this.

Truth be told, the existing daemon <-> coordinator <-> daemon was merged as an experimental block (#256) to make sure that what we build would essentially support distributed communication but never really got stabilized. This is why you may see scaffolding still on the current codebase.

We really want to get the distributed dataflow working but we have been very busy with both LLMs and ROS2-bridges.

The daemon run feature is kind of a scaffolding for running dataflow in the CI/CD, we don't really want to expose this feature as is and may also be reworked.

The exact cli pattern to connect to the coordinator and the daemon in a distributed manner is IMO not yet structured, but, i'm fully open to draft a first implementation and then adjust depending on the faced issues.

How does that sound?

haixuanTao avatar Apr 18 '24 11:04 haixuanTao

Yes, thanks a lot for all your work on this!

The correct solution for the CLI would be to add a coordinator_addr argument to the connect_to_coordinator function. We would also need some way to set this argument when running a dora CLI command. I think we can start with a new command-line argument for this and look for more convenient ways in the future. What do you think?

phil-opp avatar Apr 18 '24 12:04 phil-opp

I received feedback from two serious customers that Zenoh support across networks is a key differentiator feature.

bobd988 avatar May 16 '24 06:05 bobd988

Then we can prioritise it after https://github.com/dora-rs/dora/pull/478 and https://github.com/dora-rs/dora/pull/497 and https://github.com/dora-rs/dora/pull/495

haixuanTao avatar May 16 '24 08:05 haixuanTao

I opened an issue with the next required steps for distributed dataflows: https://github.com/dora-rs/dora/issues/535. Discussion and pull requests are welcome!

phil-opp avatar Jun 05 '24 17:06 phil-opp