subsequent `setup_transfer -> stdinit` produces different results

Open alexsavulescu opened this issue 3 years ago • 12 comments

It seems that after the #1051 refactoring we are hitting the following error in the BBP simulation stack. Failing test: neocortex-scx-v5-gapjunctions:

62 special: ParallelContext.setup_transfer() needs to be called.

62  near line 0
62  {Cell[2].soma TPointList[20794].append(0.5)}                       ^
        62 finitialize(-65)
      62 init()
    62 stdinit()

MPT ERROR: Rank 62(g:62) is aborting with error code -1.
	Process ID: 44289, Host: r1i7n10, Program: /jenkins/06/workspace/hpc.SimulationStack/BUILD_HOME/spack/opt/spack/linux-rhel7-x86_64/intel-19.1.2.254/neurodamus-neocortex-develop-5dntmn/bin/special
	MPT Version: HPE HMPT 2.22  03/31/20 16:17:35

Experiencing this issue with

  • master
  • 8.0-cherries (going into neuron future 8.0 branch)

The simulation passes if I revert #1051 in branch:

  • 8.0-bitter-cherries

alexsavulescu avatar Apr 14 '21 12:04 alexsavulescu

Does the test actually call setup_transfer after the gap information (source_var, target_var) has been set up?

nrnhines avatar Apr 14 '21 17:04 nrnhines

> Does the test actually call setup_transfer after the gap information (source_var, target_var) has been set up?

setup_transfer is called, but a bit later in the Python version of neurodamus, i.e. before stdinit. If I call it right after source_var and target_var have been set up, then it works.

For my understanding: does setup_transfer() need to be called right after the gap info is set up? What are the requirements?

pramodk avatar Apr 14 '21 19:04 pramodk

The intention is for it to be required after the last source_var/target_var call and before finitialize. If there is an inadvertent extra dependence on it, I'd like to know.

nrnhines avatar Apr 14 '21 20:04 nrnhines

@ferdonline can confirm tomorrow whether there is a bug on our side in neurodamus, but at least the behaviour has changed after #1051.

pramodk avatar Apr 14 '21 21:04 pramodk

This is speculative. The error message was generated because is_setup_ was checked in void thread_transfer and was false. On launch is_setup_ is initialized to false, and it is set to false again on every call to source_var or target_var. It is set to true on a call to setup_transfer. What is new in this pull request is the activation of

`nrnthread_v_transfer_ = thread_transfer; // otherwise can't check is_setup_`

whenever source_var or target_var is called. So my speculation is that neurodamus called something (finitialize) that called nrnthread_v_transfer_ before setup_transfer. Prior to the merge of this pull request it got away with that, and it didn't matter because it called finitialize again after setup_transfer.

nrnhines avatar Apr 15 '21 12:04 nrnhines
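
The flag logic speculated about in the comment above can be sketched as a small toy model. This is not NEURON code: the class `TransferState` and its attribute names are hypothetical, and only the behaviour of the flag (false at launch, false after source_var/target_var, true after setup_transfer, checked during finitialize) is taken from the comment.

```python
class TransferState:
    """Toy stand-in for the is_setup_ bookkeeping (hypothetical, not NEURON)."""

    def __init__(self):
        self.is_setup = False      # initialized to false on launch
        self.transfer_hook = None  # models nrnthread_v_transfer_

    def _invalidate(self):
        self.is_setup = False
        # Behaviour described as new in the pull request: the hook becomes
        # active as soon as a source or target is registered, so a missing
        # setup_transfer can be detected.
        self.transfer_hook = self._thread_transfer

    def source_var(self):
        self._invalidate()

    def target_var(self):
        self._invalidate()

    def setup_transfer(self):
        self.is_setup = True

    def _thread_transfer(self):
        if not self.is_setup:
            raise RuntimeError(
                "ParallelContext.setup_transfer() needs to be called."
            )

    def finitialize(self):
        # Initialization runs the transfer hook if one is installed.
        if self.transfer_hook is not None:
            self.transfer_hook()
        return "initialized"


if __name__ == "__main__":
    state = TransferState()
    state.source_var()
    state.target_var()
    try:
        state.finitialize()       # too early: setup_transfer not called yet
    except RuntimeError as err:
        print("error:", err)
    state.setup_transfer()
    print(state.finitialize())    # accepted once setup_transfer has run
```

In this model, an early finitialize only fails once the hook is installed by source_var or target_var, which matches the speculation that the same call order went unnoticed before the refactoring.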

I think I understood the issue with neurodamus-py. Indeed we didn't have a setup_transfer before stdinit when dumping gid info. However, and this is a bit worrying, if we do setup_transfer -> stdinit followed by another setup_transfer -> stdinit results are different. Any hint on that/way to fix?

ferdonline avatar Apr 15 '21 15:04 ferdonline

Too ambiguous for me. What results (of a run or of the dumped gid info?)

nrnhines avatar Apr 15 '21 15:04 nrnhines

The run results (reports) are different. I haven't looked deeply yet; I am just raising awareness and asking for initial thoughts.

ferdonline avatar Apr 15 '21 15:04 ferdonline

Any extra calls to setup_transfer should not affect results.

nrnhines avatar Apr 15 '21 15:04 nrnhines

Anyway, that shouldn't hit us in neurodamus since we changed it to have a single set of calls to setup_transfer -> stdinit. However, it seems setup_transfer -> stdinit -> dump info -> setup_transfer -> stdinit -> run sim produced different results than setup_transfer -> stdinit -> run sim. I will try to isolate the case then.

ferdonline avatar Apr 15 '21 15:04 ferdonline

I'd suspect that a single stdinit() is not sufficient to reach a fixed point for initialization. The setup_transfer may be a red herring. First check whether one call and two calls to stdinit(), each followed by a psolve, give identical results.

nrnhines avatar Apr 15 '21 16:04 nrnhines
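
To illustrate what "a single init does not reach a fixed point" could mean, here is a toy model in plain Python. It makes no claim about NEURON's actual initialization order: `toy_init`, the cell dictionaries, and the coupling constant `g` are all hypothetical. The assumed mechanism is that the transferred partner voltage is copied before the voltages are reset, so the first init works with stale values.

```python
V_INIT = -65.0


def make_cells():
    # Two coupled cells; "vgap" holds the partner voltage as last transferred.
    return [{"v": 0.0, "vgap": 0.0, "i_gap": 0.0} for _ in range(2)]


def toy_init(cells, v_init=V_INIT, g=0.1):
    """Hypothetical init that transfers voltages before resetting them."""
    # 1. Transfer: each cell receives its partner's current (maybe stale) v.
    cells[0]["vgap"], cells[1]["vgap"] = cells[1]["v"], cells[0]["v"]
    # 2. Reset all voltages to the initial value.
    for cell in cells:
        cell["v"] = v_init
    # 3. Derive the gap current from the transferred value.
    for cell in cells:
        cell["i_gap"] = g * (cell["vgap"] - cell["v"])


def snapshot(cells):
    return [(c["v"], c["vgap"], c["i_gap"]) for c in cells]


def inits_until_fixed_point(cells, max_calls=5):
    """Return how many init calls it takes before the state stops changing."""
    previous = None
    for n in range(1, max_calls + 1):
        toy_init(cells)
        current = snapshot(cells)
        if current == previous:
            return n - 1
        previous = current
    return None


if __name__ == "__main__":
    print(inits_until_fixed_point(make_cells()))
```

In this toy, the state after one init differs from the state after two, and further calls change nothing. The analogous check on the real model would compare the reports from one stdinit followed by psolve against two stdinit calls followed by psolve, with no extra setup_transfer in between.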

Discussed offline with @ferdonline: this is not blocking for 8.0.

alexsavulescu avatar Apr 19 '21 07:04 alexsavulescu