planemo icon indicating copy to clipboard operation
planemo copied to clipboard

planemo 0.70 traverses complete subtree even if tools are listed

Open bernt-matthias opened this issue 4 years ago • 12 comments

I like to run planemo test A.xml B.xml ... because sometimes the tool directory is not clean, i.e. there is a lot of other things besides tools, test-data, etc.

Used to work until 0.62.1.

Now planemo needs a large amount of time until anything happens (strace indicates that it tries to copy to tmp) and finally stumbles over a broken symlink in my case.

Traceback (most recent call last):
  File "/home/berntm/.venv3/bin/planemo", line 8, in <module>
    sys.exit(planemo())
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/cli.py", line 190, in handle_blended_options
    return f(*args, **kwds)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/commands/cmd_test.py", line 80, in cli
    runnables = for_paths(paths, temp_path=temp_path)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 149, in for_paths
    return [for_path(path, temp_path=temp_path) for path in paths]
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 149, in <listcomp>
    return [for_path(path, temp_path=temp_path) for path in paths]
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 142, in for_path
    path = _copy_runnable_tree(path, runnable_type, temp_path)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 115, in _copy_runnable_tree
    copy_tree(dir_to_copy, temp_path)
  File "/usr/lib/python3.6/distutils/dir_util.py", line 172, in copy_tree
    verbose=verbose, dry_run=dry_run))
  File "/usr/lib/python3.6/distutils/dir_util.py", line 172, in copy_tree
    verbose=verbose, dry_run=dry_run))
  File "/usr/lib/python3.6/distutils/dir_util.py", line 176, in copy_tree
    dry_run=dry_run)
  File "/usr/lib/python3.6/distutils/file_util.py", line 105, in copy_file
    "can't copy '%s': doesn't exist or not a regular file" % src)
distutils.errors.DistutilsFileError: can't copy '/home/berntm/projects/tools-galaxyp/tools/openms/OpenMS2.4-env/lib/libstdc++.so.6.0.21': doesn't exist or not a regular file

In this case I just need additional software for the automatic creation of some tools. Certainly I could put them somewhere else .. but I don't like to :)

bernt-matthias avatar Feb 04 '20 09:02 bernt-matthias

It's a tradeoff. In return this means you can symlink scripts and test data files from / to anywhere ... I'm not sure we can have both, so do you think we should revert https://github.com/galaxyproject/planemo/pull/988 ?

mvdbeek avatar Feb 04 '20 09:02 mvdbeek

OK I see. Maybe I need to restructure my project then.

bernt-matthias avatar Feb 04 '20 09:02 bernt-matthias

We could also put the new behavior under a switch, so that if you're using a tools-iuc like layout you can use symlinks, while the default would be what it used to be ?

mvdbeek avatar Feb 04 '20 09:02 mvdbeek

I wasn't particularly happy with #988 because I feared all the copying would be a performance hit, so if anyone has ideas how to solve the symlinks issue without copying everything...

nsoranzo avatar Feb 04 '20 10:02 nsoranzo

Well, sure, copying the tool tree isn't free, but if you attempt uploading a structure such as that used by @bernt-matthias to the tool shed it would fail too because it is too large. For a normal structure where perhaps you got a couple of MB in test-data you can surely measure a performance hit, but that's nothing compared to starting up Galaxy.

One could replace the symlinks in place of course (prefix the syminks with something and then move them back after the run), but that seems even more likely to cause issues (what if python dies abruptly, or the prefixed file already exists?).

Staging the tool directory on the Galaxy side is another option, but ultimately you'd have the same issue in another place, so that doesn't seem like a good option either.

We could limit the tool tree copying to the first level of the runnable (so tool.xml, macro.xml, some scripts shipped with the tool if they are at the base on the runnable directory, plus the test-data folder recursively). I think I like that option the most, but we can't know if somebody perhaps keeps their scripts in $__tool_directory__/scripts/some_script.py.

mvdbeek avatar Feb 04 '20 11:02 mvdbeek

How about just copying files and symlinks located in the root and dirs that are needed (test-data and datamanager rated stuff).

Limiting copying depth is also not a good idea for people that like to add structure to test-data.

Anyway, don't worry to much. I can restructure my project.

bernt-matthias avatar Feb 04 '20 11:02 bernt-matthias

Now I restructured my project. Problem is now that starting up planemo t needs more than 30min to startup on my project (admittedly there is a lot of test-data .. 750MB).

bernt-matthias avatar Mar 02 '20 18:03 bernt-matthias

Yeah, we copy the directory for every runnable. That's an easy fix.

mvdbeek avatar Mar 02 '20 21:03 mvdbeek

We should at very least make the behavior in #988 contingent on 1) the engine type being Galaxy, 2) the test data having a symlink outside the tool directory and 3) containers being used for tests - right?

Ugh - this is because we're calling run_tests.sh inside the container right and the tool-util client is coming from Galaxy then even though we've got it available to Planemo as a package. We could call it from outside the container like we do with ephemeris or something right? There is already a ton of progress in that direction...

I can work on some of thing - just want to make sure I understand the problem.

jmchilton avatar Jul 24 '20 15:07 jmchilton

Ugh - this is because we're calling run_tests.sh inside the container right and the tool-util client is coming from Galaxy then even though we've got it available to Planemo as a package.

I don't think that is the issue, run_tests.sh runs on the host, i've written this for the --biocontainers flag.

planemo test A.xml B.xml ...

will trigger the copying once for each tool, let me fix that and then we can see how much of a performance hit that is ? That should be fixed in any case and I'm sorry I didn't do this yet.

  1. the test data having a symlink outside the tool directory and 3) containers being used for tests - right?

I guess, but is it worth breaking the isolation we get with https://github.com/galaxyproject/planemo/pull/988 ? When I looked at this I thought the only viable alternative to https://github.com/galaxyproject/planemo/pull/988 is staging with pulsar, but that seems just as expensive ?

mvdbeek avatar Jul 24 '20 16:07 mvdbeek

I don't think that is the issue, run_tests.sh runs on the host, i've written this for the --biocontainers flag.

I see - I was wrong clearly. But Planemo can see the data - it could just upload it as files instead of test data paths. This is how it used to work - isn't there just some flag we can use to not assume the data is next to Galaxy 😆. I feel like we parameterized it that way - I'll try to figure that out.

I guess, but is it worth breaking the isolation we get with #988 ?

I don't get what you mean by isolation - why is isolation good in this context? I guess I'm worrying about like correctness issues but I can't come up with a counter example.

We will just keep iterating on the concept as it is - thanks for #1037

jmchilton avatar Jul 24 '20 18:07 jmchilton

it could just upload it as files instead of test data paths. This is how it used to work - isn't there just some flag we can use to not assume the data is next to Galaxy 😆. I feel like we parameterized it that way - I'll try to figure that out.

That would work! But now that I try to remember the context, I think this was data referenced from a tool_data_table_conf.xml.test file ... ?

I don't get what you mean by isolation - why is isolation good in this context? I guess I'm worrying about like correctness issues but I can't come up with a counter example.

I was thinking of --update_test_data and data managers writing directly to the test data dir ... which is arguably a bug.

mvdbeek avatar Jul 24 '20 18:07 mvdbeek