Next steps for distributed deployments
This issue outlines the steps that we need to take to make dora dataflows work across multiple machines:
- [ ] update `dora check` to skip checking paths on remote machines
- [ ] use the same logic when checking the dataflow in `dora start`
  - open question: How do we know which machine ID is local and which remote?
  - alternative: skip path checks completely when the dataflow specifies multiple machine IDs
- [ ] figure out a way to handle relative node paths on remote machines
- For local dataflows, we use the folder containing the dataflow YAML file as working directory. This does not work for remote machines since the YAML file is not available there.
- Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)
- This would be a breaking change.
- Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
- Option 3: Configure the working directory for each machine in the YAML file.
- Other ideas?
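To make Option 3 concrete, a per-machine working directory might look roughly like this in the dataflow YAML. This is a purely hypothetical sketch: neither the `machines` section nor the `working_dir` key exists in dora's current format.

```yaml
# Hypothetical sketch of Option 3: configure a working directory per
# machine in the dataflow YAML (this syntax does not exist yet).
machines:
  a:
    working_dir: /home/dora/project
  b:
    working_dir: C:\dora\project

nodes:
  - id: camera
    machine: a
    path: nodes/camera  # would be resolved against machine a's working_dir
```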
Some relevant places in the code:
- https://github.com/dora-rs/dora/blob/60e4d7dd414e7ac8b962fd98159ba19f92a71343/libraries/core/src/topics.rs#L19-L21
  - The CLI sets this directory on `dora start` when it sends the start command to the coordinator. The coordinator forwards it to the daemon on the target machine, which uses it when spawning the dataflow nodes: https://github.com/dora-rs/dora/blob/60e4d7dd414e7ac8b962fd98159ba19f92a71343/binaries/daemon/src/lib.rs#L535
  - Remote machines might have a different directory structure, so it does not make sense to use the same working directory there.
- https://github.com/dora-rs/dora/blob/d4ff5868c56f5070d54b8cda67b02fe18193ac46/libraries/core/src/descriptor/validate.rs#L25-L30
  - We check whether a node exists here.
  - For nodes running on remote machines, this check doesn't make sense.
- https://github.com/dora-rs/dora/blob/d4ff5868c56f5070d54b8cda67b02fe18193ac46/binaries/daemon/src/spawn.rs#L81-L90
  - We do support URLs as node sources. This could be useful for distributed deployments.
> Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)

I think that if this is the specification, it should be the same across local and remote nodes, which would be a breaking change compared to the current implementation.
> Option 2: Only allow absolute paths for remote machines (this is probably too limiting)

I would expect this option to be available all the time, as it might not be easy to specify a file that is far from the daemon's spawning path.
If we pass a `working_dir` parameter when we start the daemon, then we can manage `working_dir` just like `machine_id`. When a node is started, the coordinator passes the `working_dir` of the corresponding daemon, so we don't have to skip checks. However, I'm not sure that's possible.
> If we pass parameter working_dir when we start daemon then we can manage working_dir just like machine_id, When a node is started, the coordinator passes the working_dir of the corresponding daemon so we don't have to skip checks. However, I'm not sure that's possible.
This works when both daemons are running on the same machine. However, if a daemon runs on a remote machine, we have no access to its file system, so we cannot check the paths.
> > Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)
>
> I think that if this is the specification it should be the same specification across local and remote node, thus breaking changes compared to current implementation.

Good point, I added this drawback to the list above.
> > Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
>
> I would expect this to be an available option all the time as it might not be easy to specify a specific file very far from the daemon spawning path.

Yes, it's always available as an option. What I meant is that we don't allow relative paths for remote machines.
In that case, can we maybe try an implementation using Option 2 before implementing Option 1?
What do you think @XxChang @Gege-Wang?
> In that case can we maybe try an implementation using option 2 before making Option 1.
> What do you think @XxChang @Gege-Wang ?
I think it is good, let me do it.
I opened a draft PR that implements option 2 a few days ago, maybe that's useful: https://github.com/dora-rs/dora/pull/534
One challenge is the multiple-daemons test, which simulates multiple machines that all resolve to the same local machine. Using absolute paths in its config is not ideal because we want to commit the test to git and run it on different machines.
I see.
Maybe we can use some environment variable to fix CI?
Otherwise, I guess it's fine to hard-code the GitHub CI path for now.
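The environment-variable idea could be sketched like this: keep a placeholder in a committed template and substitute the CI workspace path before running the test. The template file name and placeholder are hypothetical, not part of the dora repo.

```shell
# Sketch: keep machine-specific paths out of the committed test config by
# substituting an environment variable at CI time (hypothetical template).
printf 'working_dir: __DORA_TEST_ROOT__\n' > dataflow.template.yml
export DORA_TEST_ROOT="${GITHUB_WORKSPACE:-$PWD}"
sed "s|__DORA_TEST_ROOT__|$DORA_TEST_ROOT|g" dataflow.template.yml > dataflow.yml
cat dataflow.yml
```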
There are some issues with making `dora start` and `dora check` skip path checks on remote machines:
- If the CLI and coordinator are local and some daemons are on remote machines, then on `dora start` the CLI's `check(&working_dir)` will reach this branch and call `resolve_path`. Here `resolve_path` will fail, because the path-existence check for the remote daemon should be skipped: https://github.com/dora-rs/dora/blob/eda09cb9c8f4428f908d43d19e257e1528e7433a/libraries/core/src/descriptor/validate.rs#L54-L57
  As long as the CLI checks the dataflow, this problem will always be there, because the CLI never knows whether the daemons are local or remote.
- If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, a Windows absolute path will be classified as relative. The check fails even though we wrote the right absolute path:
  ```rust
  let path = PathBuf::from("C:\\dora\\tmp\\test.log");
  if path.is_absolute() {
      println!("Path is absolute");
  } else {
      println!("Path is relative"); // printed on Linux, even though the path is absolute on Windows
  }
  ```
- If the CLI and coordinator are local and some daemons are on remote machines, then theoretically we can start the dataflow like this:
  ```sh
  # cli
  dora coordinator
  dora daemon --machine-id A
  # remote
  dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
  ```
  However, it doesn't work, because the IP of machine A is 127.0.0.1, so we must start the dataflow like this:
  ```sh
  # cli
  dora coordinator
  dora daemon --machine-id A --coordinator-addr <local-ip>:<port>
  # remote
  dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
  ```
- I don't understand why we use one `working_dir` per daemon. Shouldn't we theoretically use one `working_dir` per dataflow? And why do we have to check the dataflow in the CLI? It is complex with multiple daemons.
Thanks a lot for testing and reporting these issues! This is very useful!
> if cli and coordinator are local, some daemons are on remote machines, when dora start, the cli check(&working_dir) will go to this branch and resolve_path. Here the resolve_path will fail, because the remote daemon path exist check should be skip here.
I think we can fix this in the following way:
- For `dora check`, only print a warning if the path doesn't exist, instead of failing.
- For `dora start`, we should query the list of `remote_machine_ids` from the coordinator.
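This proposed behavior could be sketched as follows. The function name, `is_check` flag, and `remote_machine_ids` parameter are illustrative only; dora's actual validation code is structured differently.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Hypothetical sketch: skip path checks for remote machines, and for
/// `dora check` (is_check == true) warn instead of failing on a missing path.
fn validate_node_path(
    machine_id: &str,
    path: &Path,
    remote_machine_ids: &HashSet<String>,
    is_check: bool,
) -> Result<(), String> {
    if remote_machine_ids.contains(machine_id) {
        // We cannot access the remote file system, so skip the check.
        return Ok(());
    }
    if path.exists() {
        Ok(())
    } else if is_check {
        // `dora check`: print a warning instead of failing.
        eprintln!("warning: path {} does not exist", path.display());
        Ok(())
    } else {
        Err(format!("path {} does not exist", path.display()))
    }
}

fn main() {
    let remote: HashSet<String> = ["B".to_string()].into_iter().collect();
    // A path on remote machine B is never checked locally.
    assert!(validate_node_path("B", Path::new("/nonexistent"), &remote, false).is_ok());
    // A missing local path fails `dora start`...
    assert!(validate_node_path("A", Path::new("/nonexistent"), &remote, false).is_err());
    // ...but only warns on `dora check`.
    assert!(validate_node_path("A", Path::new("/nonexistent"), &remote, true).is_ok());
}
```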
> if cli and coordinator are local, some daemons are on remote machines, theoretically, we can start the dataflow like this:
>
> ```sh
> # cli
> dora coordinator
> dora daemon --machine-id A
> # remote
> dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
> ```
>
> however, it doesn't work, because the ip of machine A is 127.0.0.1, so we must start the dataflow like this:
>
> ```sh
> # cli
> dora coordinator
> dora daemon --machine-id A --coordinator-addr <local-ip>:<port>
> # remote
> dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
> ```
The issue is around these lines:
https://github.com/dora-rs/dora/blob/9d2ee36cd3681cd7a31cb8757718769a68f46267/binaries/coordinator/src/lib.rs#L181-L184
If the peer_ip is the loopback address, we know that the coordinator and the daemon run on the same machine. So other daemons should be able to reach the registered daemon through the same IP address as the coordinator. A (hacky) fix could be to replace the 127.0.0.1 with the coordinator's listen IP.
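The hacky fix could be as small as this. The function name and parameters are invented for illustration; the real coordinator code around the linked lines looks different.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Sketch of the fix: if a daemon registers from the loopback address,
/// advertise the coordinator's own listen IP instead, so daemons on other
/// machines can reach it.
fn advertised_daemon_ip(peer_ip: IpAddr, coordinator_listen_ip: IpAddr) -> IpAddr {
    if peer_ip.is_loopback() {
        coordinator_listen_ip
    } else {
        peer_ip
    }
}

fn main() {
    let coordinator = IpAddr::V4(Ipv4Addr::new(192, 168, 1, 10));
    let local_daemon = IpAddr::V4(Ipv4Addr::LOCALHOST);
    let remote_daemon = IpAddr::V4(Ipv4Addr::new(192, 168, 1, 20));
    // The local daemon is advertised under the coordinator's IP...
    assert_eq!(advertised_daemon_ip(local_daemon, coordinator), coordinator);
    // ...while remote daemons keep their own IP.
    assert_eq!(advertised_daemon_ip(remote_daemon, coordinator), remote_daemon);
}
```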
> if the cli and coordinator is ubuntu, and some remote daemons are windows, the windows absolute path will be checked into relative. Here the check fails, even though we write the right absolute path.
>
> ```rust
> let path = PathBuf::from("C:\\dora\\tmp\\test.log");
> if path.is_absolute() {
>     println!("Path is absolute");
> } else {
>     println!("Path is relative");
> }
> ```
Good catch! So we need a way to check whether a path is an absolute Windows path on Linux systems (and the other way around). Maybe there are some crates that allow this? Alternatively, we could copy the libstd implementations and provide them as platform-independent functions.
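Such a platform-independent check might start out like this. This is only a sketch: the copied libstd logic or a dedicated crate would also need to handle edge cases such as verbatim `\\?\` paths.

```rust
/// Sketch of a platform-independent absolute-path check: recognizes both
/// Unix and Windows absolute paths regardless of the host OS.
fn is_absolute_any_platform(path: &str) -> bool {
    // Unix-style absolute path, e.g. /home/dora
    if path.starts_with('/') {
        return true;
    }
    // Windows UNC path, e.g. \\server\share
    if path.starts_with("\\\\") {
        return true;
    }
    // Windows drive path, e.g. C:\dora or C:/dora
    let b = path.as_bytes();
    b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && (b[2] == b'\\' || b[2] == b'/')
}

fn main() {
    // The motivating example: this is absolute even when checked on Linux.
    assert!(is_absolute_any_platform("C:\\dora\\tmp\\test.log"));
    assert!(is_absolute_any_platform("/home/dora/test.log"));
    assert!(is_absolute_any_platform("\\\\server\\share\\file"));
    assert!(!is_absolute_any_platform("relative/path"));
}
```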