nimble Allow a compiled nimbleFunction object (with setup code) in models.

Do not merge.

This is something I sketched out a few months ago and am throwing into testing now to see what breaks.

These changes allow a user-defined distribution or function to be an object of a nimbleFunction class with setup code. This would allow embedding input data into that object in cases where it doesn't really need to be in the model. This may need clean-up and completion. I am just moving it aside for now and want to see testing.

Oct 12 '23 22:10 perrydv

@paciorek I think this is working and the only test failure is the car issue that propagated from devel to other branches. This wasn't a release priority but it could go in. Let's discuss how conservative we should be given everything else going into the imminent release. I haven't added tests to this new feature yet but that should be easy.

Jan 27 '24 22:01 perrydv

Also this is a feature that we turn off by default and not document. We know some users who would want to start using it.

Jan 27 '24 22:01 perrydv

@perrydv I'm happy to have this included. I'll let you do any additional work you want on it. Then just let me know it is ready to merge in. It might not matter, but probably best if you let me merge it in.

Jan 27 '24 23:01 paciorek

@perrydv This seems like hugely useful functionality, which I'm quite excited to see.

Do you have any code which demonstrates use? It doesn't have to do anything useful, but only be a proof of concept - including the name of the nimbleOption which protects this.

Jan 28 '24 17:01 danielturek

Thanks @danielturek. The functionality is rudimentary. Here is an example:

library(nimble)

myDist <- nimbleFunction(
  setup = function() {mean <- 1},
  run = function(x = double(0), sd = double(), log = integer(0, default = 0)) {
    return(dnorm(x, mean = mean, sd = sd, log = log))
    returnType(double())
  }
)

myDist1 <- myDist()

mc <- nimbleCode({
  x ~ myDist1$run(2)
  y ~ dpois(exp(x))
})

registerDistributions(list('myDist1$run' = list(BUGSdist = 'myDist1$run(sd)',
                                               types = c('value = double()',
                                                         'sd = double()'))),
                      check = FALSE)

m <- nimbleModel(mc, debug = FALSE, data = list(y = 3), inits = list(x = .5))
cm <- compileNimble(m)
mcmc <- buildMCMC(m)
cmcmc <- compileNimble(mcmc, project = m)
samples <- runMCMC(cmcmc, niter = 5000)

Notice:

One must register the distribution manually, and with check=FALSE. At least checkDistributionFunctions would need updating to support automatic registration.
The distribution function name does not need to start with "d". However, do we want to impose that requirement so that automatic determination of the "r" function can work? I think these would be in the method names. I guess a potential use case is different nimbleFunction objects for the "d" and "r" functions; not sure if we should worry about that.
My statement above that a user-defined function is also supported does not seem to be correct, so that would take more work.

Given that this is so rudimentary, but we are aware of use cases waiting for this, and we can't delay the release, here are some options:

We could hold this out of the release entirely and point early adopters of this feature to a new branch that will go in the next next release.
We could include it in the release with allowNFinModels=FALSE and tell early adopts how to turn it on.

The first option would have the advantage that we could more dynamically update it for early adopters.

Jan 29 '24 02:01 perrydv

@perrydv If this is included in the release, my initial reaction is to keep the default as allowNFinModels=FALSE for the time being.

I think there are some reasonable questions that should be addressed before turning this on by default:

Automatic registration?
Naming requirements? (d versus r functions)
Also providing support for user-defined functions (with setup code) in models?

I also think it's a fair question whether the argument name allowNFinModels is the final choice. Is this too implementation specific?

Jan 29 '24 13:01 danielturek

I'm on the fence about this. I guess if we know about use cases, it makes sense to put in, behind an option, and labelled experimental. We can always point initial users to branch if we want to make changes to it.

But particularly given release timeline, leaving it out also makes good sense.

Jan 29 '24 17:01 paciorek

@perrydv and I agreed that we not put this in 1.1.0.

Jan 29 '24 21:01 paciorek

I tend to agree, sounds good.

Jan 30 '24 01:01 danielturek

@perrydv @paciorek I was excited about this being added in the release, but I understand the hesitancy. I am happy to help test it and write up some use cases because I personally have some example code that I think this would make a lot cleaner by being able to cache some values and not repeat computation without it being a bunch of constants defined in the model. If that would be helpful let me know.

Jan 30 '24 17:01 paul-vdb

Hi @danielturek @paciorek @paul-vdb I have pushed updates to this functionality. Here is a summary (note that some changes affect our current system for distribution registration):

Use of "d" and "r" names is now supported & enforced for methods in a nimbleFunction. E.g. myNF$dcustom or even simply myNF$d. If an r function is provided, it must be a method whose name corresponds to that of the d function. If an r function is not provided, one will be automatically generated (as currently done). Note however that it is not possible to generate a new method into an already-defined nimbleFunction. Therefore the generated r function is a stand-alone (no setup code) nimbleFunction. In this example it would be rcustom_myNF_dummy.
All generated r functions (including for existing functionality for user-defined dists) now have the suffix _dummy or __dummy added to their name. If we don't like having this change for existing functionality, I can revert it easily. For the new functionality, the name is more clearly artificial anyway, so the new suffix seems reasonable.
I refactored and extended the steps done inside of registerDistributions. This was a moderately substantial change.
User-defined functions that are methods of nimbleFunction objects are now also supported. The same nimbleFunction object can provide methods as user-defined functions and (possibly multiple) user-defined distributions.
Existing tests in test-user could not be run twice in the same R session due to a missing deregisterDistributions, which resulting in a piece of litter. I cleaned that up.
The changes are largely but not entirely gated by checking the nimbleOption allowNFinModel. Some changes will affect processing even if that option is FALSE. A common need was some version of changing as.character(code) to safeDeparse(code) where code is the foo of foo(x), which could now be myNF$foo(x), necessitating a full deparse of myNF$foo since it is not guaranteed to be just a name.
See the lower half of test-user (clearly marked) to walk through the new functionality.
Happy to hear better name ideas for the new option than allowNFinModel.

Feb 22 '24 02:02 perrydv

This is now passing tests. Some of the test failures arose because existing tests needed updating. Some were from the distribution name swapping that comes up (e.g. dbin and dbinom) that I've ironed out. Also I made it work with T() when the p and q functions are provided in the nimbleFunction. I think is ready for any code review.

Feb 24 '24 22:02 perrydv

Nice!

My review so far is just looking at the test suite.

When we generate the dummy r, our message (here specifically for one of the test cases) says "NIMBLE is generating a placeholder function, r2", when the name of the actual dummy in this case is r2_myDist1_dummy.
If we start naming dummy r functions with _dummy we should think about whether to do that for cases not involving nimbleFunctions in models. Given we haven't been doing it, my slight preference is not to do it for this new functionality.
So we don't want users to use the run method as the d function, right? Folks might be inclined to do: x ~ dmynorm$run().

Relatedly we are erroring out in an unclear way if a user doesn't name their d method starting with "d". E.g., if in the test starting on line 605, we name the method foo instead of dmynorm, this is the error:

 [Note] Registering 'myDist1$foo' as a distribution based on its use in BUGS code. If you make changes to the nimbleFunctions for the distribution, you must call 'deregisterDistributions' before using the distribution in BUGS code for those changes to take effect.
 Error in FUN(X[[i]], ...) :  checkDistributionFunctions: expecting 'n' as the first argument for the simulation function for foo.

The tests that use deregisterDistributions don't check that the distribution has been removed from our set of user-defined dists. We may want to add an expectation for that.
Line 841 has a (presumably forgotten) comment about an error that no longer seems to error.
The last test (about "p" and "q") needs an expectation or testthat will ignore it.

This doesn't feel like the sort of functionality that will silently give wrong results, but if there are parts of the code you want us to look at, let us know.

Mar 02 '24 00:03 paciorek

For the manual, my thought is it will suffice to show a basic example with just a "d" function and then a fairly full example like that in test-user.R line 807, but perhaps adding an rbar method and illustrating the idea of caching big objects via setup code rather than having them in the model).

Mar 02 '24 00:03 paciorek

Hi @paciorek I'm catching up on this and realizing I didn't catch your comments in March. Numbers will match your points:

Good point. I will hold off on changing this depending on final decision about the dummy suffix.
What are the relevant cases not involving nimbleFunctions in models? I do get the inclination to not make the change if it's not necessary.
I thought about this and concluded that sticking with "d" and "r" seemed like the most internally coherent approach. To decide whether an "r" function has been provided, it works well to automatically use the "d" name to determine what the "r" name should be. Without that, we will need the user to tell us the "r" name (or choose some required name), which reduces the "it just works" behavior. And also note that a single nimbleFunction object could potentially provide multiple distributions (some with and some without "r" functions). And a lot of our processing steps with distributions rely on the "d" and "r" conventions, so it is really helpful to be able to stay consistent.
Good point. I will try to fix this.
IIRC, I got a bit stuck with the weirdness of where to find things when evaluating in a testthat environment and I gave up at that moment but might be able to work it out.
Thanks.
IIRC I also burned out a bit on this one and may have had a hard time setting up a test expectation but could likely do so with a fresh look.

So I think the dummy is the one design question, and I guess I should just not do that. And then it looks like I didn't add make user manual changes, per your last comment.

Apr 27 '24 00:04 perrydv

Regarding #2 I just am referring to the fact that with standard user-defined distributions we often generate "dummy" "r" functions in the global env't, but they don't have "dummy" in the name. In any event it sounds like I convinced you to remove "dummy".

For #7, something simple like using expect_silent or doing something basic and checking the result may be sufficient.

Apr 28 '24 19:04 paciorek

@perrydv I spent yesterday updating some code that has a lot of repeated computation in the distribution function. Before I was doing some messy saving of values within model code and passing these matrices to the distribution function. Now with the setup code I can do that all internally and it's way cleaner so huge bonus. The thing I had to think about a lot while doing it was the implication on the model graph when I have some functions that are caching a value internally that might get used elsewhere. Here is a very small example expanded from yours above of where things can go terribly wrong (if someone did something very silly). I don't necessarily think this is a flaw, but something that we should warn that Nimble doesn't detect additional dependencies within the nimble function code with setup. Not sure if there is documentation already written for how to use this.

library(nimble)

myDist <- nimbleFunction(
  setup = function() {mean <- 1},
  run = function(){},
  methods = list(
    dfn = function(x = double(0), sd = double(), log = integer(0, default = 0)) {
      return(dnorm(x, mean = mean, sd = sd, log = log))
      returnType(double())
    },
    rfn = function(n = integer(), sd = double()){
      return(rnorm(1,mean, sd))
      returnType(double())
    },
    cacheNewMean = function(mu = double()){
      mean <<- mu
    }
  )
)

myDist1 <- myDist()

mc <- nimbleCode({
  mu ~ dnorm(0, 1)
  myDist1$cacheNewMean(mu)
  x ~ myDist1$dfn(2)
  y ~ dpois(exp(x))
})

registerDistributions(list('myDist1$dfn' = list(BUGSdist = 'myDist1$dfn(sd)',
                                               types = c('value = double()',
                                                         'sd = double()'))))

m <- nimbleModel(mc, debug = FALSE, data = list(y = 3), inits = list(x = .5))
m$getDependencies("mu")

May 03 '24 17:05 paul-vdb

@perrydv if you could finish off this PR soon, that would be helpful, and allow us to merge in #1414 too.

May 18 '24 17:05 paciorek

@paciorek I will do this. @paul-vdb I am not following your example. myDist1$cacheNewMean(mu) is not a declaration and so shouldn't be expected to do anything. It should be error-trapped but it is possible that an unrelated issue prevented that from happening (and should be fixed now or soon). But the overarching point of drawing attention to implications for the model graph and dependencies makes sense and I will try to do that.

May 21 '24 17:05 perrydv

@paciorek @danielturek @paul-vdb I've made the following changes:

Revert naming of auto-generated "r" function for non-object case to previous behavior, i.e. without "_dummy" suffix.
For auto-generated "r" function for NF object case, "_dummy" is still used. e.g. "obj$ddist" yields "rdist_obj_dummy"
Added an error-trap for e.g. "obj$mydist" where "mydist" does not begin with the required prefix "d".
Changed the nimbleOption name to "allowNFobjInModel" (was "allowNFinModel") and made it default to TRUE.
Added many expect_* statements to tests (and some new tests), especially for results of deregisterDistributions.

I think this should be now very close or ready to merge.

Are we ok with enabling this feature by default now?
Are we ok with the "_dummy" suffix in the case of a dist from an NF object?

May 22 '24 18:05 perrydv

@perrydv

Yes, I'd be happy to see this be enabled by default.
I'm of the reasonably strong opinion to not include any _dummy suffixes. Respectfully, I don't see the point, and it feels like it infringes on a grey area. obj$ddist can just as well auto-generate rdist_obj without any problems, right?

May 22 '24 18:05 danielturek

I agree that we should remove _dummy.

Happy to see this merged in regardless.

May 23 '24 17:05 paciorek

Ok I've just pushed a version with "_dummy" entirely removed from auto-generated r function names.

May 27 '24 04:05 perrydv

@perrydv I see a merge conflict here. Just flagging it for you.

May 28 '24 23:05 paciorek

@paciorek The resolution was to remove the devel parts of the merge conflict and accept this branch's changes. I'm not clear why git got confused (unless I'm confused). If this passes testing (but it doesn't have the dmvt AD test fixed), merge if you are ready and then I can bring the changes from devel into #1414 and make the final tweak we need there.

May 29 '24 16:05 perrydv

@perrydv this is now merged. Back to you for #1414 .

May 29 '24 17:05 paciorek