mipfp icon indicating copy to clipboard operation
mipfp copied to clipboard

Fast ipfp

Open okrebs opened this issue 6 years ago • 7 comments

Rcpp implementation including parallelization of the Ipfp algorithm. This greatly increases speed for large problems and fixes #4 and #5.

Apart from this the obtained (multidimensional) matrices are equal to the previous implementation up to machine (double) precision.

okrebs avatar Sep 23 '19 09:09 okrebs

Is this pull request under active consideration? Because it looks like it would be extremely useful.

ellisp avatar Jul 11 '21 00:07 ellisp

I am trialling this branch now. It is indeed much faster. I noticed that code that worked under the current CRAN version breaks under this version with a couple of small things:

  • I needed to add an as.matrix(...) around the data passed to the seed argument to make it consistent with the target.data. Previously Ipfp was more relaxed about this being a data frame.
  • I think the output is different - some code that we used to have to melt, now comes out in the shape we would melt it into (still checking exactly what is happening here).

These aren't big deals but do count as breaking changes that would need to be managed if this is merged in.

ellisp avatar Jul 11 '21 01:07 ellisp

Regarding your first point what is actually expected is an array (as 'seed' can be multidimensional, matrix should work as it has subclass 'array'). There probably should be a warning if a "data.frame" or similar is passed. However, I can not replicate your problem with my branch as passing a data.frame works for me:

library(mipfp)
Lade nötiges Paket: cmm
Lade nötiges Paket: Rsolnp
Lade nötiges Paket: numDeriv
mymat <- matrix(1, nrow = 2, ncol = 2)
mydat <- data.frame(a = c(1,1), b = c(1,1))
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5

Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
    a   b
1 1.5 1.5
2 1.5 1.5

Can you provide a MWE?

The same for your second point.

okrebs avatar Jul 15 '21 10:07 okrebs

Hi Okrebs,

Happy to merge your work into this repo. Thanks for your good work! And sorry for the extremely long delay...

Before doing the merge,

  • let's make sure the change does not break anything, ie let's wait for ellips to come back with an example
  • can you resolve the conflict in mipfp/DESCRIPTION first?

Cheers!

Johan

jojo- avatar Jul 16 '21 03:07 jojo-

Sorry for the long delay in responding, and not including a reprex in my first comments.

As it happens @okrebs your own example produces the error I was finding:

> mymat <- matrix(1, nrow = 2, ncol = 2)
> mydat <- data.frame(a = c(1,1), b = c(1,1))
> Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5
> Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3,3)))
 Error in .IpfpCoreC(seed, target.list, target.data, print, iter, tol, : 
Not compatible with requested type: [type=list; target=double]. 
> Ipfp(seed = as.matrix(mydat), target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = as.matrix(mydat), target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5

There must be something else differing with our systems. The example above works fine with the CRAN version, but when I build from the fast-ipfp branch I get the above result - fails with a data.frame, but is ok with as.matrix() around the data.frame.

Here is my session info:

> devtools::session_info()
- Session info ------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       Australia/Sydney            
 date     2021-10-12                  

- Packages ----------------------------------------------------------------------------------------
 package     * version    date       lib source        
 cachem        1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
 callr         3.7.0      2021-04-20 [1] CRAN (R 4.1.1)
 cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.1)
 cmm         * 0.12       2018-03-11 [1] CRAN (R 4.1.1)
 crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.1)
 desc          1.4.0      2021-09-28 [1] CRAN (R 4.1.1)
 devtools      2.4.2      2021-06-07 [1] CRAN (R 4.1.1)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.1)
 fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.1)
 fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.1)
 glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.1)
 lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
 magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.1)
 memoise       2.0.0      2021-01-26 [1] CRAN (R 4.1.1)
 mipfp       * 3.1        2021-10-11 [1] local         
 numDeriv    * 2016.8-1.1 2019-06-06 [1] CRAN (R 4.1.1)
 pkgbuild      1.2.0      2020-12-15 [1] CRAN (R 4.1.1)
 pkgload       1.2.2      2021-09-11 [1] CRAN (R 4.1.1)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.1)
 processx      3.5.2      2021-04-30 [1] CRAN (R 4.1.1)
 ps            1.6.0      2021-02-28 [1] CRAN (R 4.1.1)
 purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.1)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
 Rcpp          1.0.7      2021-07-07 [1] CRAN (R 4.1.1)
 remotes       2.4.1      2021-09-29 [1] CRAN (R 4.1.1)
 rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.1)
 rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.1.1)
 Rsolnp      * 1.16       2015-12-28 [1] CRAN (R 4.1.1)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.1)
 testthat      3.1.0      2021-10-04 [1] CRAN (R 4.1.1)
 truncnorm     1.0-8      2018-02-27 [1] CRAN (R 4.1.1)
 usethis       2.0.1      2021-02-10 [1] CRAN (R 4.1.1)
 withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.1)

[1] D:/Peter/Documents/R/win-library/4.1
[2] C:/Program Files/R/R-4.1.1/library

ellisp avatar Oct 11 '21 22:10 ellisp

I should add that the second issue I noted, where in the CRAN version I had to 'melt' the results to get the desired output but with your fast-ipfp version I didn't have to, I can not reproduce, and I think the problem was probably entirely at my end. For the record, the following is my test case and both the CRAN version and the fast-ipfp version work ok with it:


seed_df <- data.frame(
  lions = 1:2,
  tigers = 4:5,
  bears = 7:8
)

row.names(seed_df) <- c("Africa", "India")

# marginal populations for Africa, India; lions, tigers, bears:
population_df <- list(c(100, 100), c(60, 110, 30))



fit <- Ipfp(
  seed = as.matrix(seed_df),
  target.list = list(1,2),
  target.data = population_df
)

raked <- fit[[1]]

dim(raked) # should be 2, 3

ellisp avatar Oct 11 '21 23:10 ellisp

Finally, here is another problem. When building the package, I see this note:

Note: break used in wrong context: no loop is visible at ipfp_multi_dim.R:107 

I can then force an error when I have multi dimensional fitting with inconsistent marginal totals:

library(mipfp)

seed_df <- data.frame(
  lions = 1:2,
  tigers = 4:5,
  bears = 7:8
)

row.names(seed_df) <- c("Africa", "India")

# marginal populations for Africa, India; lions, tigers, bears:
population_df <- list(c(100, 100), c(60, 120, 30)) # note inconsistent totals


fit <- Ipfp(
  seed = as.matrix(seed_df),
  target.list = list(1,2),
  target.data = population_df
)

produces this error in the fast-ipfp branch:

Error in Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) : 
  no loop for break/next, jumping to top level
In addition: Warning message:
In Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) :
  Target not consistent - shifting to probabilities!
                Check input data!

But if I reload and use the CRAN version it works as expected (noting that it returns probabilities, not marginal totals, because they were inconsistent - but at least it isn't an error)

> library(mipfp)
Loading required package: cmm
Loading required package: Rsolnp
Loading required package: numDeriv
> 
> seed_df <- data.frame(
+   lions = 1:2,
+   tigers = 4:5,
+   bears = 7:8
+ )
> 
> row.names(seed_df) <- c("Africa", "India")
> 
> # marginal populations for Africa, India; lions, tigers, bears:
> population_df <- list(c(100, 100), c(60, 120, 30)) # note inconsistent totals
> 
> 
> 
> fit <- Ipfp(
+   seed = as.matrix(seed_df),
+   target.list = list(1,2),
+   target.data = population_df
+ )
Warning message:
In Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) :
  Target not consistents - shifting to probabilities!
                  Check input data!

> 
> fit[[1]]
           lions    tigers      bears
Africa 0.1181559 0.3029328 0.07891131
India  0.1675584 0.2684957 0.06394583

ellisp avatar Oct 11 '21 23:10 ellisp