philentropy icon indicating copy to clipboard operation
philentropy copied to clipboard

JSD() stack overflow with many rows

Open wkc1986 opened this issue 7 years ago • 13 comments

I'm trying to figure this out myself, but no luck yet.

# test.R

library(philentropy)

m <- matrix(runif(1500 * 10), nrow = 1500)
JSD(m)
$ /usr/local/bin/Rscript test.R
Jensen-Shannon Divergence using unit 'log2'.
Metric: 'jensen-shannon' using unit: 'log2'.
Error: segfault from C stack overflow
Execution halted

wkc1986 avatar Jun 06 '18 18:06 wkc1986

Dear @wkc1986,

Thank you very much for finding this issue.

I will have a look, it might be a problem with the internal Rcpp loop e.g. DistMatrixWithUnitDF() which I implemented to pairwise compare all rows in a matrix. My prediction would be that if you write a loop in R and run pairwise comparisons JSD(x_i,y_j) it will work.

But I will keep you posted.

Many thanks and kind regards, Hajk

HajkD avatar Jun 07 '18 10:06 HajkD

I'm seeing something that I suspect is related. I run JSD in a loop, each time generating a 500x500 matrix of distances. After the loop completes, within about 30 seconds I get the C stack overflow error. I'm seeing this on both OS X and Ubuntu 16.04.

elbamos avatar Jun 11 '18 17:06 elbamos

Dear @elbamos

Thank you very very much for making me aware of this. Yes, I also think that it is related and I think I can fix it :)

I am very sorry for the inconvenience it might have caused.

Kind regards, Hajk

HajkD avatar Jun 11 '18 18:06 HajkD

No problem :)

BTW - you know in C you can refer to a function by a pointer. If you did that with custom_log, it would probably simplify a lot of your code. You'd have a single function that takes the log argument and returns a pointer to the C function, that you'd then pass to the respective distance functions.

elbamos avatar Jun 11 '18 21:06 elbamos

Dear all,

as I predicted the issue was in the Rcpp functions DistMatrixWithUnitMAT() and DistMatrixWithoutUnitMAT() in file src/dist_matrix.cpp.

The problem was that Rcpp doesn't allow efficiently passing functions as an argument of another function. You can find a detailed discussion here:

https://stackoverflow.com/questions/27391472/passing-r-function-as-parameter-to-rcpp-function

@elbamos: You are absolutely correct and I am aware that it is possible to define C functions as pointers. But when writing philentropy I wanted to use the Rcpp notation throughout my code. Thus, since most functions are now implemented using the syntactic sugar of Rcpp I would have to re-write all functions using the base C/C++ notation to be able to pass functions as pointers. Since I don't have the time to do so in the near future and since after playing a while with the option to still pass function pointers to Rcpp functions (which simply wasn't possible) I decided to write new optimized R functions named DistMatrix(), DistMatrixNoUnit(), and DistMatrixMinkowski() to iterate through the pairwise comparisons of an input matrix.

@wkc1986 When I now run your example code:

m <- matrix(runif(1500 * 10), nrow = 1500)
dist_mat <- JSD(m)
str(dist_mat)

I don't get a segfault anymore.

# >Metric: 'jensen-shannon' using unit: 'log2'; comparing: 1500 vectors.
num [1:1500, 1:1500] 0 0.58 0.347 0.516 0.327 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:1500] "v1" "v2" "v3" "v4" ...
  ..$ : chr [1:1500] "v1" "v2" "v3" "v4" ...

@elbamos Could you please let me know if your example also works now with the newest developer version of philentropy?

In the future, I will also add the option to run matrix computations in parallel (multicore computations) to further speed up the process.

Thank you both once again very very much for making me aware of this issue and I hope that you won't have problems now.

Kind regards, Hajk

HajkD avatar Jun 13 '18 11:06 HajkD

Confirm dev version working for my (actual) application. Thanks so much as always for your responsiveness @HajkD !

wkc1986 avatar Jun 13 '18 18:06 wkc1986

@HajkD I can take a look this weekend. I can also tell you that, to solve my issue, I cut-and-pasted the JS function and DistMatrixWithUnitMAT() into another file and took out the function pointer, and that solved it. (That was my guess about what was causing it.)

Regarding function pointers, I actually wasn't referring to the passing of Rcpp functions, which always seemed like voodoo to me. I was referring to structures like this: https://github.com/HajkD/philentropy/blob/9ca1b4c503ae61ab223d892840a13adeb92cc414/src/distances.h#L1288, which could be made more maintainable and simpler, and probably considerably faster, by passing the relevant custom_log function as a pointer. Similarly, when you have structures like this: https://github.com/HajkD/philentropy/blob/9ca1b4c503ae61ab223d892840a13adeb92cc414/src/distances.h#L514 what you probably want to do is just have the check for NAs inside the if() clause. My guess is that 2000 line file could be 500-800 lines of code, which would be much simpler for you to maintain.

elbamos avatar Jun 13 '18 18:06 elbamos

Hi @elbamos,

This is a great suggestion.

I would truly appreciate it if you could look into this and whatever makes it easier for me to maintain the code the happier I am of course :)

Many thanks for helping me on this one!

Cheers, Hajk

HajkD avatar Jun 23 '18 11:06 HajkD

@HajkD

Sorry I haven't had a chance to test the new code yet; I had extracted the relevant code into a separate file and I've been using that.

In terms of using a function pointer:

# Necessary because log is an overloaded function
double mylog(const double& x) {
  return log(x);
} 

# Switched to use armadillo for openmp safety...
double jensen_shannon(const vec& P,  const vec& Q, const Rcpp::String unit){
  
...
  # function pointer def
  double (*get_log)(const double&);
  
  if (unit == "log") {
    get_log = mylog; 
  } else if (unit == "log2") {
    get_log = custom_log2; 
  } else if (unit == "log10") {
    get_log = custom_log10; 
  } else {
    Rcpp::stop("Please choose from units: log, log2, or log10.");
  }
  
 ...
    
  for(int i = 0; i < P_len; i++){
    PQsum =   P[i] + Q[i];
    if (P[i] == 0.0 || PQsum == 0.0) {
      sum1  += 0.0;
    } else {
      sum1  +=  P[i] * get_log((2.0 * P[i]) / PQsum);
    }
    if (Q[i] == 0.0 || PQsum == 0.0) {
      sum2  += 0.0;
    } else {
      sum2  +=  Q[i] * get_log((2.0 * Q[i]) / PQsum);
    }
  }
  return 0.5 * (sum1 + sum2);
}

Not fully tested, but you get the idea.

elbamos avatar Jun 25 '18 18:06 elbamos

Also, you can buy a performance improvement in the loops with:

  for (int i = 0; i < ncols; i++){
    const vec& vec_i = dists.col(i); 
     # You don't need to consider j < i; actually, you don't need to consider j == i or i == ncols either
     # but that's for another time.
    for (int j = i; j < ncols; j++){
        dist_value = jensen_shannon(vec_i, dists.col(j), unit);
        dist_matrix(i,j) = dist_matrix(j,i) = dist_value;
    }
  } 

elbamos avatar Jun 25 '18 18:06 elbamos

The function pointer idea works even better if you do the if..then in the exported function and pass the function pointer as a parameter to all of your distance functions.

Also you don't need to check that the columns are the same size each time the distance function is called -- they come from a matrix, so they have to be the same size.

(Actually it would be better to map2 them using C++11, but that's another story.)

elbamos avatar Jun 26 '18 01:06 elbamos

Dear @elbamos,

These are absolutely amazing suggestions!!! Thank you so so much for taking the time to help me improve the speed of computations.

I will start working on incorporating your approaches and am more than happy for any further improvements :)

Many many thanks, Hajk

HajkD avatar Jun 26 '18 13:06 HajkD

BTW - it shouldn't be terribly hard to get this working with OpenMP. I put an hour or two into it but kept getting seg faults and don't have the time to track them down. But its "embarassingly parallel", as they say.

elbamos avatar Jun 26 '18 22:06 elbamos