optmatch icon indicating copy to clipboard operation
optmatch copied to clipboard

coerce other packages' sparse distances to {B}ISM format

Open benthestatistician opened this issue 4 years ago • 6 comments

I've heard from a few package users who'd like to convert sparse distances in some other format into ours. Here's the wish list at present:

  • [ ] bigmatch package's distances --> InfinitySparseMatrix

For a start on code for doing this, see existing "setAs" s4 method for matrix to ISM conversion and associated tests .

benthestatistician avatar Apr 30 '20 18:04 benthestatistician

@vkartha and @renyc432, I have the two of you in mind with this issue. I don't see starting on this soon myself, but we (package developers) would be happy to receive pull requests with this functionality. (@vkartha : if the matrices you'd like to convert are blocked as well as sparse, then you'd be looking to convert to optmatch's BlockedInfinitySparseMatrix, a subclass of InfinitySparseMatrix.)

benthestatistician avatar Apr 30 '20 19:04 benthestatistician

Thanks again to @benthestatistician for assisting with this. While I didn't come up with a way to coerce sparseMatrix objects (available through the Matrix R package) directly to the InfinitySparseMatrix class supported by optmatch, I was able to use an intermediate dense distance matrix to do this conversion, while still constraining the search problem in the process.

I basically made my own euclidean matrix distMat (dgCMatrix class), and assigned Inf values at specific indices in the matrix based on my own predetermined pair constraints. This matrix does indeed have row and column names, to identify the observation labels after pairing. I suppose the reason for me even wanting to use this class (as opposed to just building a regular matrix), is that for very large dimensions, it is a lot faster to store and manipulate sparse elements than it is a dense object.

I then converted this to an InfinitySparseMatrix using the function, as you suggested, and directly ran pairmatch as follows:

pairmatch(as.InfinitySparseMatrix(as.matrix(distMat)))

As Ben suggested, as long as the memory overhead isn't too much (when doing the sparse to dense conversion using as.matrix above), this seems to be a feasible option (I was able to do this for a 20K x 20K distance matrix). At the end of the day, solving for optimal pairs was immensely accelerated by considering only the pre-defined restricted pair search space, and I did not have to use match_on in this way, which would have taken time.

Hope this is helpful for those with similar needs

vkartha avatar May 04 '20 17:05 vkartha

Thanks for this post, @vkartha ! It'll also be helpful for testing direct conversion routines that maybe written in the future.

benthestatistician avatar May 04 '20 18:05 benthestatistician

For the Matrix package, if those are meaningfully sparse, this means that many observations have zero distance. I just wanted to check that we understand this situation.

For the bigmatch package, I seem to recall it was an edge list implementation, more or less what ISMs are now. Should be reasonable to make the change.

markmfredrickson avatar May 04 '20 19:05 markmfredrickson

@markmfredrickson Yeah sorry I should have clarified. You are correct that it doesn't make much sense to have a sparseMatrix object of distances since the computed distances are almost never sparse. Since I have very large dimensions, I started with this construct so that I could keep all entries as 0, and only updated the ones I determined to be in the pool of valid pairs (takes up less memory). I then make all the 0 entries Inf so that I can coerce it into an InfinitySparseMatrix and use that with to solve the matching.

I guess I brought this up in the first place because I was having trouble using match_on to this effect (was taking a very long time, making me suspect that it was computing all distances) and wanted to know if I could still directly use optmatch to determine pairs from a pre-existing sparse distance matrix (hence the need to first convert to the ISM class).

Sorry for that rather roundabout response.

vkartha avatar May 05 '20 00:05 vkartha

@vkartha , how does the distMat of your example (a dgcMatrix) mark the distinction between forbidden pairings (infinite distances) versus perfect pairings (zero distances)?

benthestatistician avatar May 06 '20 03:05 benthestatistician