dora Next steps for distributed deployments

This issue outlines the steps that we need to take to make dora dataflows work across multiple machines:

[ ] update dora check to skip checking paths on remote machines
- [ ] use the same logic when checking the dataflow in dora start
- open question: How do we know which machine ID is local and which remote?
- alternative: skip path checks completely when dataflow specifies multiple machine IDs
[ ] figure out a way to handle relative node paths on remote machines
- For local dataflows, we use the folder containing the dataflow YAML file as working directory. This does not work for remote machines since the YAML file is not available there.
- Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)
  - This would be a breaking change.
- Option 2: Only allow absolute paths for remote machines (this is probably too limiting)
- Option 3: Configure the working directory for each machine in the YAML file.
- Other ideas?
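To make the options easier to compare, here is a minimal sketch of how path resolution could look if Option 3 is combined with Option 1 as a fallback. The function `resolve_node_path` and its parameters are hypothetical and not part of the current dora code base; `machine_working_dir` stands in for a per-machine working directory configured in the YAML file.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolves a node source path on the daemon side.
///
/// - Absolute paths are used as-is (this is all that Option 2 would allow).
/// - Relative paths are joined onto the working directory configured for
///   this machine (Option 3), falling back to the directory in which the
///   daemon was started (Option 1).
pub fn resolve_node_path(
    source: &Path,
    machine_working_dir: Option<&Path>,
    daemon_cwd: &Path,
) -> PathBuf {
    if source.is_absolute() {
        source.to_path_buf()
    } else {
        machine_working_dir.unwrap_or(daemon_cwd).join(source)
    }
}
```

Because the resolution happens on the daemon, `is_absolute` is evaluated with the path rules of the machine that actually spawns the node.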

Jun 05 '24 17:06 phil-opp

Some relevant places in the code:

https://github.com/dora-rs/dora/blob/60e4d7dd414e7ac8b962fd98159ba19f92a71343/libraries/core/src/topics.rs#L19-L21
- The CLI sets this directory on dora start when it sends the start command to the coordinator. The coordinator forwards it to the daemon on the target machine, which uses it when spawning the dataflow nodes: https://github.com/dora-rs/dora/blob/60e4d7dd414e7ac8b962fd98159ba19f92a71343/binaries/daemon/src/lib.rs#L535
- Remote machines might have a different directory structure, so it does not make sense to use the same working directory there.
https://github.com/dora-rs/dora/blob/d4ff5868c56f5070d54b8cda67b02fe18193ac46/libraries/core/src/descriptor/validate.rs#L25-L30
- We check whether a node exists here.
- For nodes running on remote machines, this check doesn't make sense.
https://github.com/dora-rs/dora/blob/d4ff5868c56f5070d54b8cda67b02fe18193ac46/binaries/daemon/src/spawn.rs#L81-L90
- We do support URLs as node sources. This could be useful for distributed deployments.

Jun 05 '24 17:06 phil-opp

Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)

I think that if this is the specification, it should apply to both local and remote nodes, which makes it a breaking change compared to the current implementation.

Option 2: Only allow absolute paths for remote machines (this is probably too limiting)

I would expect absolute paths to always be available as an option, since it might not be easy to specify a file that is far away from the directory where the daemon was spawned.

Jun 06 '24 08:06 haixuanTao

If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id. When a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure that's possible.

Jun 06 '24 08:06 Gege-Wang

If we pass a working_dir parameter when we start the daemon, then we can manage working_dir just like machine_id. When a node is started, the coordinator passes the working_dir of the corresponding daemon, so we don't have to skip checks. However, I'm not sure that's possible.

This works when both daemons are running on the same machine. However, if a daemon runs on a remote machine, we have no access to its file system, so we cannot check the paths.

Jun 06 '24 10:06 phil-opp

Option 1: Use the working directory of the daemon by default (i.e. the directory where the daemon was started in)

I think that if this is the specification, it should apply to both local and remote nodes, which makes it a breaking change compared to the current implementation.

Good point, I added this drawback to the list above.

Option 2: Only allow absolute paths for remote machines (this is probably too limiting)

I would expect absolute paths to always be available as an option, since it might not be easy to specify a file that is far away from the directory where the daemon was spawned.

Yes, it's always available as an option. What I meant is that we don't allow relative paths for remote machines.

Jun 06 '24 10:06 phil-opp

In that case, can we maybe try an implementation using Option 2 before moving to Option 1?

What do you think @XxChang @Gege-Wang ?

Jun 06 '24 12:06 haixuanTao

In that case, can we maybe try an implementation using Option 2 before moving to Option 1?

What do you think @XxChang @Gege-Wang ?

I think it is good, let me do it.

Jun 06 '24 12:06 XxChang

I opened a draft PR that implements option 2 a few days ago, maybe that's useful: https://github.com/dora-rs/dora/pull/534

One challenge is the multiple-daemons test, which uses multiple machine IDs that all resolve to the same local machine. Using absolute paths in its config is not ideal because we want to commit the test to git and run it on different machines.

Jun 06 '24 14:06 phil-opp

I see.

Maybe we can use some environment variable to fix CI?

Otherwise, I guess it's fine to hard-code the GitHub CI path for now.

Jun 06 '24 16:06 haixuanTao

There are some issues with making dora start and dora check skip path checks on remote machines:

If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. resolve_path fails here, because the path existence check should be skipped for remote daemons. https://github.com/dora-rs/dora/blob/eda09cb9c8f4428f908d43d19e257e1528e7433a/libraries/core/src/descriptor/validate.rs#L54-L57

If the CLI checks the dataflow, this problem will always occur, because the CLI never knows whether the daemons are local or remote.

If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, the Windows absolute path is treated as relative. The check then fails, even though we wrote a correct absolute path.

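use std::path::PathBuf;

fn main() {
    // On Linux, this prints "Path is relative", although the path is absolute on Windows.
    let path = PathBuf::from("C:\\dora\\tmp\\test.log");
    if path.is_absolute() {
        println!("Path is absolute");
    } else {
        println!("Path is relative");
    }
}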

If the CLI and coordinator are local and some daemons are on remote machines, theoretically we can start the dataflow like this:

# cli
dora coordinator
dora daemon --machine-id A 

# remote
dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

However, it doesn't work, because the IP of machine A is 127.0.0.1, so we must start the dataflow like this:

# cli
dora coordinator
dora daemon --machine-id A --coordinator-addr <local-ip>:<port>

# remote
dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

I don't understand why we use one work_dir per daemon. In theory, shouldn't we use one work_dir per dataflow? And why do we have to check the dataflow in the CLI? That is complex with multiple daemons.

Jun 13 '24 07:06 Gege-Wang

Thanks a lot for testing and reporting these issues! This is very useful!

If the CLI and coordinator are local and some daemons are on remote machines, then on dora start the CLI's check(&working_dir) goes to this branch and calls resolve_path. resolve_path fails here, because the path existence check should be skipped for remote daemons.

I think we can fix this in the following way:

- For dora check, only print a warning if the path doesn't exist, instead of failing.
- For dora start, we should query the list of remote_machine_ids from the coordinator.
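A rough sketch of how both commands could share one check is shown below. The names `PathCheck` and `check_node_path` are made up for illustration, and the sketch assumes that the CLI already received the set of remote machine IDs from the coordinator; how that query works is not shown here.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Hypothetical result of checking one node source path.
#[derive(Debug, PartialEq)]
pub enum PathCheck {
    /// The path exists on the local file system.
    Ok,
    /// The node runs on a remote machine, so the path was not checked.
    SkippedRemote,
    /// The node runs locally, but the path does not exist.
    Missing,
}

/// Hypothetical helper: checks a node path only if its machine is local.
/// `remote_machine_ids` is assumed to come from the coordinator.
pub fn check_node_path(
    machine_id: &str,
    source: &Path,
    remote_machine_ids: &HashSet<String>,
) -> PathCheck {
    if remote_machine_ids.contains(machine_id) {
        PathCheck::SkippedRemote
    } else if source.exists() {
        PathCheck::Ok
    } else {
        PathCheck::Missing
    }
}
```

With this split, dora check could print a warning for `Missing`, while dora start could treat `Missing` as an error and ignore `SkippedRemote`.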

If the CLI and coordinator are local and some daemons are on remote machines, theoretically we can start the dataflow like this:
# cli
dora coordinator
dora daemon --machine-id A 

# remote
dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>
However, it doesn't work, because the IP of machine A is 127.0.0.1, so we must start the dataflow like this:
# cli
dora coordinator
dora daemon --machine-id A --coordinator-addr <local-ip>:<port>

# remote
dora daemon --machine-id B --coordinator-addr <remote-ip>:<port>

The issue is around these lines:

https://github.com/dora-rs/dora/blob/9d2ee36cd3681cd7a31cb8757718769a68f46267/binaries/coordinator/src/lib.rs#L181-L184

If the peer_ip is the loopback address, we know that the coordinator and the daemon run on the same machine. Other daemons should therefore be able to reach the registered daemon through the same IP address as the coordinator. So a (hacky) fix could be to replace the 127.0.0.1 with the coordinator's listen IP.
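As a sketch, the replacement could look like the following. The function name `reachable_daemon_ip` and the `coordinator_listen_ip` parameter are assumptions for illustration and do not reflect the actual coordinator code.

```rust
use std::net::IpAddr;

/// Hypothetical helper: picks the IP address under which other daemons
/// should try to reach a newly registered daemon.
///
/// If the daemon connected via loopback, it runs on the same machine as
/// the coordinator, so the coordinator's listen IP is used instead.
pub fn reachable_daemon_ip(peer_ip: IpAddr, coordinator_listen_ip: IpAddr) -> IpAddr {
    if peer_ip.is_loopback() {
        coordinator_listen_ip
    } else {
        peer_ip
    }
}
```

This only helps if the coordinator's listen IP is a concrete address that the other machines can reach. If the coordinator listens on an unspecified address such as 0.0.0.0, that address is not usable as a target either.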

If the CLI and coordinator run on Ubuntu and some remote daemons run on Windows, the Windows absolute path is treated as relative. The check then fails, even though we wrote a correct absolute path.
use std::path::PathBuf;

fn main() {
    // On Linux, this prints "Path is relative", although the path is absolute on Windows.
    let path = PathBuf::from("C:\\dora\\tmp\\test.log");
    if path.is_absolute() {
        println!("Path is absolute");
    } else {
        println!("Path is relative");
    }
}

Good catch! So we need a way to check whether a path is an absolute Windows path on Linux systems (and the other way around). Maybe there are some crates that allow this? Alternatively, we could copy the libstd implementations and provide them in architecture-independent functions.
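A minimal sketch of such a platform-independent check, working on the raw string instead of `std::path::Path`, could look like this. The function names are hypothetical, and the rules are a simplification of what libstd does: only leading-slash paths, drive-letter paths, and paths starting with two backslashes are recognized.

```rust
/// Hypothetical helper: returns true if `path` looks like an absolute path
/// on either Unix or Windows, independent of the host operating system.
pub fn is_absolute_any_platform(path: &str) -> bool {
    is_unix_absolute(path) || is_windows_absolute(path)
}

/// Unix-style absolute paths start with a forward slash.
fn is_unix_absolute(path: &str) -> bool {
    path.starts_with('/')
}

/// Simplified Windows rules: a drive letter followed by `:` and a
/// separator (e.g. `C:\dora`), or a leading `\\` (UNC-style paths).
fn is_windows_absolute(path: &str) -> bool {
    let bytes = path.as_bytes();
    let has_drive_root = bytes.len() >= 3
        && bytes[0].is_ascii_alphabetic()
        && bytes[1] == b':'
        && (bytes[2] == b'\\' || bytes[2] == b'/');
    let has_unc_prefix = path.starts_with("\\\\");
    has_drive_root || has_unc_prefix
}
```

Note that this accepts a path if it is absolute on any of the two platforms, so it does not tell us whether the path is valid on the specific target machine. A drive-relative path such as `C:relative` is still treated as relative.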

Jun 20 '24 12:06 phil-opp