
Add support for git repository sources for nodes

Open phil-opp opened this issue 11 months ago • 21 comments

Adds support for fetching nodes from git repositories through new git, branch, tag, and rev keys in the dataflow YAML file.
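For illustration, a node using a git source in the dataflow YAML could look like this (a sketch only: the repository URL is made up, and the surrounding `id`, `build`, and `path` keys follow typical dora dataflow files):

```yaml
nodes:
  - id: example-node
    # new: fetch the node source from a git repository instead of a local path
    git: https://github.com/example-org/example-node.git
    branch: main          # alternatively: tag: v0.1.0   or   rev: <commit hash>
    build: cargo build --release
    path: target/release/example-node
```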

TODO:

  • [x] run build commands in dora run too (for consistency)
  • [x] delay node spawn until all nodes have been built
  • [x] Issue: dora build builds all nodes, independent of their target machine
    • [x] Make dora build behave like a dora start without spawning the process -> done in edcb7e3
    • [x] send stderr (+ stdout?) of build commands back to CLI
    • [x] proper cleanup of running_dataflows
  • [x] fetch when reusing

phil-opp avatar Mar 20 '25 10:03 phil-opp

@haixuanTao The pip-release CI job fails on armv7 and x86 with a linker error:

error: unable to find dynamic system library 'atomic' using strategy 'no_fallback'

This seems to be related to https://github.com/rust-cross/cargo-zigbuild/issues/252 , which is used by maturin. According to the issue thread, the problem should be fixed already, so I'm not sure why it's still failing. Do you have any idea?

phil-opp avatar Apr 02 '25 15:04 phil-opp

~Looks like maturin is still locked to an older version of mimalloc: https://github.com/PyO3/maturin/blob/1331f2512fcc3ebada7f6b4639b0df8810b01231/Cargo.lock#L1299~

Edit: I misunderstood their versioning scheme. Looks like the change was already included in the locked version.

phil-opp avatar Apr 02 '25 15:04 phil-opp

I tried some things to fix the linker errors, but nothing worked. I'm out of ideas for now, so any help would be appreciated!

phil-opp avatar Apr 03 '25 14:04 phil-opp

The underlying issue is that libatomic is not well supported on 32-bit platforms, right?

Like this: https://github.com/rust-lang/cargo/issues/13546 ?

And that we depend on a system installation of libatomic to support OpenSSL, right?

Because git requires SSL to work, right? You can only pull over HTTPS or SSH, both of which require SSL.

In that case, it seems to me that if libatomic is not well supported on 32-bit platforms, we could gate this feature behind a feature flag and only make it available on 64-bit machines. How does that sound?
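The gating idea can be sketched in Rust using conditional compilation on the pointer width (a sketch only; the function name is made up, and a real implementation would likely combine this with a Cargo feature flag):

```rust
// Sketch: compile git-source support out on 32-bit targets, where linking
// libatomic has proven problematic. In practice this predicate would be
// combined with a Cargo feature, e.g. cfg(all(feature = "git-sources", ...)).

#[cfg(target_pointer_width = "64")]
fn git_sources_supported() -> bool {
    true
}

#[cfg(not(target_pointer_width = "64"))]
fn git_sources_supported() -> bool {
    false
}

fn main() {
    if git_sources_supported() {
        println!("git sources enabled");
    } else {
        println!("git sources unavailable on this 32-bit target");
    }
}
```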

haixuanTao avatar Apr 04 '25 09:04 haixuanTao

So FYI, I was able to link some system dependencies with zig by copying packages from /lib/x86_64-linux-gnu/ to $HOME/.rustup/toolchains/1.84-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib

This should make it possible to link dynamic libraries with zig and finish compilation.

haixuanTao avatar Apr 10 '25 09:04 haixuanTao

See: https://github.com/ziglang/zig/issues/16733

haixuanTao avatar Apr 10 '25 10:04 haixuanTao

So FYI, I was able to link some system dependencies with zig by copying packages from /lib/x86_64-linux-gnu/ to $HOME/.rustup/toolchains/1.84-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib

This should make it possible to link dynamic libraries with zig and finish compilation.

Good idea! I tried to do this with libatomic.so, but the one from lib/x86_64-linux-gnu fails with libatomic.so is incompatible with elf_i386:

ld.lld: error: /home/runner/.rustup/toolchains/1.86-x86_64-unknown-linux-gnu/lib/rustlib/i686-unknown-linux-gnu/lib/libatomic.so is incompatible with elf_i386

I also tried linking/copying /lib/i386-linux-gnu/libatomic.so.1 instead, but this fails again with unable to find dynamic system library 'atomic' using strategy 'no_fallback'.

I don't really know what to do with this information. The directory is clearly in the search path (otherwise copying the x86_64 libatomic wouldn't produce a different error message), but the i386 libatomic seems to be ignored for some reason.

phil-opp avatar Apr 10 '25 14:04 phil-opp

Ah, trying to copy the file instead of symlinking it gave the answer: /lib/i386-linux-gnu/libatomic.so.1 doesn't exist. I assumed that ln -s would throw an error in this case, but it happily creates a symlink to a non-existent file. Thus, the symlink shows up in the test ls call I did, but it doesn't resolve to anything, so the linker probably ignores it.
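The dangling-symlink pitfall is easy to reproduce; a minimal Python sketch:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
link = os.path.join(tmp, "libatomic.so")

# Creating a symlink to a file that does not exist succeeds silently,
# just like `ln -s` on the command line.
os.symlink(os.path.join(tmp, "does-not-exist.so.1"), link)

print(os.path.lexists(link))  # True: the link itself shows up (e.g. in `ls`)
print(os.path.exists(link))   # False: it resolves to nothing, so the linker ignores it
```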

I now pushed a commit to install libatomic1-i386-cross, let's see how it goes.

phil-opp avatar Apr 10 '25 14:04 phil-opp

New error message :tada:

💥 maturin failed
  Caused by: Cannot repair wheel, because required library libatomic.so.1 could not be located.

phil-opp avatar Apr 10 '25 15:04 phil-opp

@haixuanTao I finally got it working on both 32-bit x86 and armv7. Thanks again for the symlink tip!

I also fixed a Windows issue with the build commands of our dataflow examples in c0ac912.

I rebased away my debug commits, so I think this should be ready for merging as soon as CI passes. Are you fine with merging?

phil-opp avatar Apr 11 '25 14:04 phil-opp

The musllinux failures seem unrelated to this PR. They didn't happen with the last commit, but they do happen on main:

  • https://github.com/dora-rs/dora/issues/967

phil-opp avatar Apr 15 '25 11:04 phil-opp

The node-hub failure seems to be unrelated too, as it also happens on main. See https://github.com/dora-rs/dora/pull/958#issuecomment-2804760479

phil-opp avatar Apr 15 '25 11:04 phil-opp

Converting this back to a draft because I want to do some redesigning of the spawning process. Now that we run the build as part of the spawn, we don't want the daemon main loop to wait for it to finish because it would become unresponsive. So the plan is to do it fully asynchronously instead.
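The pattern of keeping the main loop responsive by running builds as background tasks can be sketched in Python's asyncio (the daemon itself is Rust/tokio, but the idea is the same; all names here are made up):

```python
import asyncio

async def build_node(name: str) -> str:
    # Stand-in for running a node's build command as a subprocess.
    await asyncio.sleep(0.01)
    return f"{name}: built"

async def daemon_main_loop() -> list:
    # Spawn builds as background tasks instead of awaiting them inline,
    # so the loop stays free to handle other daemon events meanwhile.
    tasks = [asyncio.create_task(build_node(n)) for n in ("node-a", "node-b")]
    # ... handle other daemon events here ...
    return await asyncio.gather(*tasks)

results = asyncio.run(daemon_main_loop())
```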

phil-opp avatar Apr 15 '25 16:04 phil-opp

So the plan is to do it fully asynchronously instead.

Done, should be ready to merge again.

phil-opp avatar Apr 16 '25 09:04 phil-opp

Edit: Oops, sorry wasn't running latest version!

haixuanTao avatar Apr 17 '25 12:04 haixuanTao

OK, so it seems that when using the run command and there is a problem with the build command, dora run gets stuck in place.

Could it be possible to fail gracefully if the build command fails?

Thanks in advance

haixuanTao avatar Apr 17 '25 12:04 haixuanTao

It seems that the run command starts before the build commands have finished for all nodes. I think this might be confusing. Could we wait for all nodes to finish building before starting the dataflow?

haixuanTao avatar Apr 17 '25 21:04 haixuanTao

As we're now running the build step on every daemon, it could be good to filter the build steps to only those whose dataflow machine id (if it exists) matches the local daemon's machine id.

Otherwise it's very hard to manage the build steps, especially on multi-OS dataflows.

haixuanTao avatar Apr 18 '25 13:04 haixuanTao

FYI, this makes the daemon not work with older versions of the coordinator; it fails with this error:


Caused by:
    unknown variant `dora-microphone`, expected `Local` or `GitBranch` at line 1 column 434

Location:
    /Users/xaviertao/Documents/work/dora/binaries/daemon/src/coordinator.rs:78:29

haixuanTao avatar Apr 18 '25 14:04 haixuanTao

FYI, this makes the daemon not work with older versions of the coordinator; it fails with this error:

Yes, I don't think that we can avoid this unfortunately. We should bump the message crate to v0.5.0 when we do the next release. Then the error message should be that the dora versions are not compatible. See https://github.com/dora-rs/dora/blob/20a1ade6add4d22ea7c9324aca87ea96651f9f91/libraries/message/src/daemon_to_coordinator.rs#L31-L44

phil-opp avatar Apr 23 '25 15:04 phil-opp

As we're now running the build step on every daemon, it could be good to filter the build steps to only those whose dataflow machine id (if it exists) matches the local daemon's machine id.

I don't understand what you mean with this. Each node should be built only once, on the machine where it's going to be spawned.

phil-opp avatar Apr 23 '25 15:04 phil-opp

Seems like we can't build a dataflow without a coordinator running anymore:

[ERROR]
failed to connect to dora coordinator

Caused by:
    Connection refused (os error 111)

Location:

haixuanTao avatar May 02 '25 14:05 haixuanTao

I have to say that from a user perspective it is quite confusing that paths are now resolved relative to the git checkout and not the dataflow working directory. It means I have to make a couple of changes and revert them whenever I want to switch to a git source, which I feel is going to be the source of many errors.

We should resolve everything either from the git workspace or from the root workspace; otherwise it's impossible to understand what to do.

haixuanTao avatar May 02 '25 14:05 haixuanTao

I would prefer it if start did not try to build the dataflow. It can happen that we make some environment or code fixes between build and start, and then those changes are not persisted because start triggers a rebuild.

Can we split build and start/run for now to avoid the risk of overwriting such in-between fixes?

It's also a bit slow, which might be frustrating for people who just want to start the process.

haixuanTao avatar May 02 '25 15:05 haixuanTao

I'm also concerned about what happens if there is no internet connection when we try to start the dataflow. Would the build and start commands then stop working?

haixuanTao avatar May 02 '25 15:05 haixuanTao

Thanks for the review!

Seems like we can't build a dataflow without a coordinator running anymore:

Yes, we decided in our last meeting that dora build should behave like dora start without the spawning. So now the build command is sent to the coordinator and forwarded to the daemons, which then do the actual build.

We could bring back the old behavior of dora build if the dataflow does not define any _unstable_deploy keys. What do you think? Do we want to do a git checkout in these cases too?

the path depending on git and not the dataflow working directory, as it means that I have to make couple of changes and revert them if I want to use git which I feel is going to be the source of many errors.

I think it makes sense to have the git checkout dir as working directory. This way, the build and path commands are started in your repository directory.

Which directory would you set as working directory instead? If we choose some "root workspace", how would the build and path commands refer to the checked out git directory?

You mention a "git workspace", but each repo is checked out in a different folder, so there is no single directory that we can choose as default.

I would prefer if start does not try to build the dataflow. It can happen that we make some environment fixes or some code fix in between start and then because of the build those changes are not persisted.

This is for local environments only, right? Because for distributed environments you would need to ssh into all machines, so fixing the build command would be easier.

It's also a bit slow and might be a bit frustrating for some people just looking for starting the process.

That's probably Python-related, right? Is there a way of making the pip install etc. faster if there is nothing to do?

Can we split build and start/run for now to avoid the risk of breaking some in between fixes?

We try to reuse existing git checkouts to avoid cloning the repos again. On the other hand, we want to pull the latest changes to run the latest code. This might lead to conflicts if you run multiple dataflows at the same time that use the same git repository and branch. We don't want to change the source code while a dataflow is running as this might lead to unintended behavior and random bugs. So we do a new clone into a different directory when the second dataflow is started.

If we don't checkout+build on dora start, the following can happen:

  1. Build dataflow 1 with git branch source
    • this clones the repo
  2. Start dataflow 1
  3. Push a new commit to git branch
  4. Build dataflow 2 with same git branch source
    • there is already a clone of the repo, but we cannot use it as the other dataflow is still running
    • so we choose a new directory and create a new clone there
  5. dataflow 1 finishes
  6. Start dataflow 2
    • there are two checkouts of the repository at different commits
    • which one is the correct one?

The problem is that the coordinator doesn't know which start command corresponds to which build command. But the time of the build command is important to pick the correct git hash.

One potential solution for this could be that we assign a dataflow UUID+name on dora build already and require this token on dora start. We could make this optional for dataflows without git sources or if there is only one candidate (i.e. there was only one dora build before the dora start). What do you think about this approach? Do you have another idea?
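The token idea can be sketched from the coordinator's point of view (a hypothetical model, not existing dora behavior; all names are made up): dora build would register a token mapped to the commit it built, and dora start would resolve that token, falling back to the single candidate when only one build exists.

```python
import uuid
from typing import Optional

builds: dict = {}  # build token -> git commit hash that was checked out and built

def record_build(commit_hash: str) -> str:
    """Called on `dora build`: remember which commit this build used."""
    token = str(uuid.uuid4())
    builds[token] = commit_hash
    return token

def resolve_start(token: Optional[str]) -> str:
    """Called on `dora start`: pick the checkout matching the build."""
    if token is not None:
        return builds[token]
    if len(builds) == 1:  # only one prior build -> token may be omitted
        return next(iter(builds.values()))
    raise ValueError("ambiguous: multiple builds exist, a token is required")
```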

I'm also concerned about what if there is no internet connection when we try to start the dataflow? and then build and start command does not work?

If you specify a node source on the internet, there is not much we can do without an internet connection, no?

phil-opp avatar May 03 '25 17:05 phil-opp

Maybe we add some kind of "run-local" feature:

  • Make dora run ignore the _unstable_deploy key -> all nodes are spawned locally
  • Add a local_git_checkout config option that is used by dora run.
    • If a node has a local_git_checkout key, ignore the git, branch, etc keys.
    • Instead, assume that the correct git branch is checked out at local_git_checkout
    • Skip the build process too if local_git_checkout is given
  • Add a dora build --local command that runs all the build commands on the local machine.

This way, users can quickly try out things, fix bugs, and use manual build commands while developing, using dora run. Once ready, they can set up their coordinator and daemons, and use dora start to do a distributed checkout+build+run.
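In the dataflow YAML, the proposed local_git_checkout option might look like this (a sketch of the proposal above, not an implemented feature; the paths and URL are made up):

```yaml
nodes:
  - id: example-node
    git: https://github.com/example-org/example-node.git
    branch: main
    # proposed: for `dora run`, use this existing local checkout instead of
    # cloning and building (the git/branch keys above would then be ignored)
    local_git_checkout: /home/user/dev/example-node
    path: target/release/example-node
```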

phil-opp avatar May 03 '25 17:05 phil-opp

Yes, we decided in our last meeting that dora build should behave like dora start without the spawning. So now the build command is sent to the coordinator and forwarded to the daemons, which then do the actual build.

We could bring back the old behavior of dora build if the dataflow does not define any _unstable_deploy keys. What do you think? Do we want to do a git checkout in these cases too?

I think that we should probably check if the coordinator is present.

If yes, then spawn the build on each daemon.

If there is no coordinator:

We could error out if there is any _unstable_deploy key, and build everything locally if there is none. How does that sound?

And yes, I think we can do a git checkout in that case too.

haixuanTao avatar May 05 '25 09:05 haixuanTao

I think it makes sense to have the git checkout dir as working directory. This way, the build and path commands are started in your repository directory.

Which directory would you set as working directory instead? If we choose some "root workspace", how would the build and path commands refer to the checked out git directory?

I think what we can do is specify the exact git path in the dataflow, like:

   id: abc
   git: gitrepo
   rev: 123
   path: build/gitrepo/123/my_node.py

This way it is a bit clearer, and people can find the checkout if needed. We would then also have only one path-resolution logic for the whole dataflow.

haixuanTao avatar May 05 '25 09:05 haixuanTao

The problem is that the coordinator doesn't know which start command corresponds to which build command. But the time of the build command is important to pick the correct git hash.

I understand. I think that specifying the git hash within the path should make this conflict-free, right?

The thing with build+start is that I'm very afraid of a breaking dependency update that breaks build or start indefinitely.

Such as the chrono compatibility issue, or just our own breaking changes.

haixuanTao avatar May 05 '25 09:05 haixuanTao