Tracking issue for distributed dora dataflow
Describe the bug Dora remote connection has been long awaited. We however need to fix a couple of things to make it work:
Tracking issue
- [x] Make hard-coded
localhostaddress configurable in the daemon and the coordinator - [ ] Make
working_dirconfigurable dependent on the machine on multi daemon situation.
Update
Additional Tracking issue
- [x] Make CLI coordinator connection configurable.
Currently the CLI can only connect to the local preconfigured coordinator. We should make it configurable such as:
dora start dataflow.yaml --coordinator-addr IP:PORT
https://github.com/dora-rs/dora/blob/ea47a556a77f6c64f29ab07939d747edc8573671/binaries/cli/src/main.rs#L469-L472
- [ ] Make
working_dirconfigurable dependent on the machine on multi daemon situation.
[ ] Make hard-coded
localhostaddress configurable in the daemon and the coordinator
Which hardcoded addresses do you mean? This should be already configurable by passing a --coordinator-addr argument when spawning the daemon. You should also pass a --machine-id in this case.
Yeah sorry could have been more precise:
- https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/daemon/src/inter_daemon.rs#L66-L88
- https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/coordinator/src/listener.rs#L9-L20
Ah yes, we should use 0.0.0.0 in the coordinator to listen for incoming connections on all interfaces. For the daemon, the 127.0.0.1 should be fine since only local nodes should initiate connections to the daemon.
I have just looked rapidly but isn't this: - https://github.com/dora-rs/dora/blob/9508fbb399ed6722a13497cfddc7d0441d64cdd0/binaries/daemon/src/inter_daemon.rs#L66-L88 for inter_daemon connection?
Ah yes, then we should listen on 0.0.0.0 in the deamon as well.
Make working_dir configurable dependent on the machine on multi daemon situation.
Should dora attempt to create the working_dir if it doesn't exist or fail out with an error?
Ah yes, then we should listen on 0.0.0.0 in the deamon as well.
Should 0.0.0.0 be the default, or should it not be configurable? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common - cargo run -- --bind <vpn-interface-ip>:<port>)
Make working_dir configurable dependent on the machine on multi daemon situation.
Should
doraattempt to create theworking_dirif it doesn't exist or fail out with an error?
So there is couple of things that we consider: https://github.com/dora-rs/dora/pull/256#issuecomment-1514362986
I think that we could maybe as a first iteration expect the working_dir to exist with the right files in it, and error out on missing files. We can in a follow up PR handle creating missing folder.
For more, see: https://github.com/dora-rs/dora/pull/256#issuecomment-1514362986
Ah yes, then we should listen on 0.0.0.0 in the deamon as well.
Should
0.0.0.0be the default, or should it not be configurable? (I don't have experience networking robots, but I'd assume connecting them through a VPN would be common -cargo run -- --bind <vpn-interface-ip>:<port>)
It does seem like being configurable would be a plus. I think that by revisiting the port argument it should be fairly straightforward to make addr configurable.
As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.
As far as I know, 0.0.0.0 acts as a kind of wildcard when listening. So it should listen for incoming connections on all available interfaces.
Correct- binding to 0.0.0.0 will listen all interfaces available to the process.
Binding to <vpn-net-ip> commonly restricts connections to only allow other machines on the VPN.
Ah, that makes sense, thanks for clarifying! I agree that making this configurable is a good idea then.
Many of the cli commands end up using the connect_to_coordinator function.
Should all of these be made configurable as well?
https://github.com/dora-rs/dora/blob/1c2dc46ac272ce607200f089e3c43cdf76daf2a8/binaries/cli/src/main.rs#L435-L437
Reupping the above after taking a closer look at the daemon CLI command, which explicitly ignores the coordination addr when startingn a dataflow and advises
Please use the
startcommand for remote coordinator
But start uses the default connect_to_coordinator from above, not using any passed arguments.
https://github.com/dora-rs/dora/blob/1c2dc46ac272ce607200f089e3c43cdf76daf2a8/binaries/cli/src/main.rs#L291-L313
I'm asking for clarity because I know dora is going through rapid iteration, so I'm uncertain what the current "expected" user routes are.
I understand, just want to say thanks for all the effort you put in debugging this.
Truth be told, the existing daemon <-> coordinator <-> daemon was merged as an experimental block (#256) to make sure that what we build would essentially support distributed communication but never really got stabilized. This is why you may see scaffolding still on the current codebase.
We really want to get the distributed dataflow working but we have been very busy with both LLMs and ROS2-bridges.
The daemon run feature is kind of a scaffolding for running dataflow in the CI/CD, we don't really want to expose this feature as is and may also be reworked.
The exact cli pattern to connect to the coordinator and the daemon in a distributed manner is IMO not yet structured, but, i'm fully open to draft a first implementation and then adjust depending on the faced issues.
How does that sound?
Yes, thanks a lot for all your work on this!
The correct solution for the CLI would be to add a coordinator_addr argument to the connect_to_coordinator function. We would also need some way to set this argument when running a dora CLI command. I think we can start with a new command-line argument for this and look for more convenient ways in the future. What do you think?
I received feedback from two serious customers that Zenoh support across networks is a key differentiator feature.
Then we can prioritise it after https://github.com/dora-rs/dora/pull/478 and https://github.com/dora-rs/dora/pull/497 and https://github.com/dora-rs/dora/pull/495
I opened an issue with the next required steps for distributed dataflows: https://github.com/dora-rs/dora/issues/535. Discussion and pull requests are welcome!