signac-flow icon indicating copy to clipboard operation
signac-flow copied to clipboard

Separating `operation.cmd` when using `@with_job` and `@cmd`

Open justinGilmer opened this issue 6 years ago • 20 comments
trafficstars

Feature description

When using the with_job and cmd decorators together, it would be great to separate the changing of the directories from the cmd itself.

Proposed solution

I have been using the titan.sh template for my research. I have quite a few operations that require changing into the job.workspace() and then using a flow.cmd. These are also parallelized, so I am using the mpiexec variable from the jinja2 template. This is where the issue occurs. When with_job writes the command to change directories out to a batch script, it is placed within the mpiexec call, which causes it to fail since it thinks that call is a file.

There are definitely work arounds, but it would be nice to have the ability to split those two in the jinja templates.

Additional context

Current implementation

{{ mpiexec}} {{ operation.cmd }}...

What would be nice to see

{{operation.with_job}}{{mpiexec}}{{operation.cmd}}

justinGilmer avatar Mar 05 '19 04:03 justinGilmer

Do we think this be the default behavior? When I added with_job I didn't event think about how this would mess up the command prefix stuff.

mikemhenry avatar Mar 06 '19 19:03 mikemhenry

Personally, it seems logical that changing into the directory would be separate from the command. @cmd would contain the actual shell script, and the with_job would also be separate.

Especially since the current implementation requires with_job to appear before the cmd decorator.

justinGilmer avatar Mar 07 '19 16:03 justinGilmer

It definitely makes the overall logic and code in my project.py much cleaner as well. Since I am currently working around this by manually changing into the directory and then manually adding the proper mpi call and ranks needed. I am still using a template.sh file, but I am losing out on a lot of the convience that could be had.

justinGilmer avatar Mar 07 '19 16:03 justinGilmer

I agree with @justinGilmer that the most natural way is probably to extend the template. I believe this will also give us the best chance to not break existing scripts.

I suggest that we make this slightly more general and extend the relevant template line in the following way:

{{ changedir }}{{ mpi_prefix }}{{ cmd_prefix }}{{ operation.cmd }}{{ cmd_suffix }}

The changedir field would be empty by default and populated with 'trap "cd $(pwd)" EXIT && cd {directory} && '.format(directory=directory) in combination with the following decorator:

@Project.operation
@chdir('mypath')
def my_op(job):
    pass

where the argument to chdir is either an absolute path, a path relative to the project root directory, or a function of job(s). The @with_job decorator is then just an alias for @chdir(lambda job: job.workspace()).

What do you think?

csadorf avatar Mar 07 '19 18:03 csadorf

I like it, might as well future proof things a bit

mikemhenry avatar Mar 07 '19 18:03 mikemhenry

This is related to issue #119 .

csadorf avatar Jul 05 '19 17:07 csadorf

@justinGilmer could you give us some example code where this would be used. We need to understand the behavior a little better before we can decide on implementation details.

Assuming this is still relevant to you.

b-butler avatar Jul 12 '19 16:07 b-butler

Due to the changes made in #184, any logic that we apply here will now probably live in the Environment classes. We may need to use the cmd_suffix as well. @justinGilmer would you still be able to provide us with an example snippet? @csadorf any thoughts on what the implementation would like like now, given that the prefix logic has moved into the environment classes?

vyasr avatar Feb 26 '20 17:02 vyasr

Oops sorry about that @b-butler ! I missed that request entirely! Ill try to get an example later this week!

justinGilmer avatar Feb 26 '20 17:02 justinGilmer

Thanks Justin!

vyasr avatar Feb 26 '20 18:02 vyasr

@csadorf any thoughts on what the implementation would like like now, given that the prefix logic has moved into the environment classes?

I assume it would be handled in: https://github.com/glotzerlab/signac-flow/blob/f40b7ebe6fb2f172b9fb36f4458c1ef90748267a/flow/environment.py#L230

csadorf avatar Feb 27 '20 17:02 csadorf

I think the open question is how should this behave with respect to the suffix? For consistency with the current behavior I would expect the return to the original directory to happen before the suffix, but now is the time to consider that if we're already making a change like this. In particular, we should consider the following questions: 1) What are practical uses of the cmd_suffix? 2) Are they things that should still happen in the same directory?

vyasr avatar Feb 28 '20 01:02 vyasr

  1. What are practical uses of the cmd_suffix?

One could use it to generally pipe output into some file or to integrate the command into some kind of subprocess.

  1. Are they things that should still happen in the same directory?

That would be my inclination.

csadorf avatar Mar 02 '20 09:03 csadorf

In that case the logic for this issue wouldn't just go in get_prefix, right? You need it to wrap the entire command, including the cmd_suffix.

vyasr avatar Mar 04 '20 14:03 vyasr

It appears so.

csadorf avatar Mar 04 '20 16:03 csadorf

@klywang Can you give an update on this issue? Has this potentially been "automatically" resolved with some of the other patches that have been merged in the meantime?

csadorf avatar Sep 28 '20 16:09 csadorf

Could someone provide an example of where/how this is used, and what the undesirable behavior looks like? I think I need to see the code and output to get a better understanding of the issue. Maybe @justinGilmer, if this is still relevant to you?

klywang avatar Nov 10 '20 20:11 klywang

Sure, sorry about the super tardy response. This was an older project for the now-decommissioned OLCF Titan, but there were some command line calls that were necessary for each statepoint (an example from some of that study, which this issue was originally based on is here). Here is an excerpt:

@Project.operation
@flow.directives(nranks=16)
@flow.directives(ngpu=1)
@Project.pre.after(is_nvt_grompp)
@Project.pre.after(nvt_equil_grompp)
@Project.post.isfile('nvt.gro')
@flow.cmd
def mdrun_nvt(job):
    msg = _mdrun_str('nvt')
    return "cd {}; aprun -n 16 {}".format(job.workspace(), msg)

In this case, for each statepoint that fit in this criteria, we needed to change into the respective statepoint directory and then run a simulation code parallelized.

I originally tried this using the with_job decorator, since that would handle the directory switching. However, this was not successful since the breakdown of this in the templates put the with_job script within the aprun command, which was not intended. I had to manually change into these directories, as shown by the excerpt above. This was only an issue with the parallelized commands.

What i wanted to happen was the return statement in the excerpt above. However, the resulting line generated was more like this:

'aprun -n 16 cd path/to/statepoint simulation_call'

Hopefully this helps!

justinGilmer avatar Dec 01 '20 20:12 justinGilmer

@klywang Are you still working on this?

csadorf avatar Mar 25 '21 15:03 csadorf

Hey sorry about the delayed reply. I'm having a hard time reconciling @justinGilmer's example and the issue itself. I tried to replicate the error in the following ways:

(a) When I try project.py submit --pretend on this:

@Project.operation
@flow.with_job
@flow.cmd
def foo(job):
     return "aprun -n 16 something"

The command looks like this:

trap "cd $(pwd)" EXIT && cd /path/to/job && aprun -n 16 something

(b) With directives, e.g.:

@Project.operation.with_directives({"nranks": 16})
@flow.with_job
@flow.cmd
def foo(job):
     return "aprun -n 16 something"

For the machine I tested on, I get this:

mpiexec -n 16  trap "cd $(pwd)" EXIT && cd /path/to/job && aprun -n 16 something

Where, in this example, my understanding is that:

  • {{ changedir }} = cd /path/to/job
  • {{ mpi_prefix }} = mpiexec -n 16 (is this supposed to be aprun -n 16 in this example?)
  • {{ cmd_prefix }} = trap "cd $(pwd)" EXIT
  • {{ operation.cmd }} = aprun -n 16 something
  • {{ cmd_suffix }}

To the best of my understanding, case (a)'s output looks more like the example provided, but we want to address case (b) with this issue. Is my understand correct?

klywang avatar Sep 14 '22 17:09 klywang