matrix verification

Open brgew opened this issue 2 years ago • 2 comments

Hi Ben,

I am thinking about verifying a BPCells directory-based matrix. I would like to verify that the BPCells::write_matrix_dir() function succeeded and I would like to verify that a matrix stored in a Monocle3 object is the same as one stored in a copy of the Monocle3 object; that is, make a distinct ID for a matrix.

I am concerned about users running out of disk space, or lacking permission, while writing a BPCells matrix directory without noticing the problem. I am guessing that you will tell me that BPCells will throw an error in this scenario and the user needs to be alert to the possibility of an error occurring. That seems reasonable to me. But perhaps you have a better suggestion?

Regarding a matrix ID, I can do this for an R sparse matrix using the digest::digest() function. A possibility is to run digest::digest() on BPCells rowsums + colsums, which is not ideal but it should be fast and use relatively little space. I wonder if you have a better suggestion?

I appreciate your consideration and guidance.

Thank you.

Ever grateful, Brent

brgew avatar Nov 28 '23 19:11 brgew

Hi Brent, there is a very minimal safeguard in the BPCells::write_matrix_dir() function: it immediately opens the matrix it just wrote to disk so that it can provide it as the function's return value. This mostly just checks that the right file names are present, but it will throw an error if something catastrophic went wrong during writing, before the final metadata files were written.

If you want something more robust that actually verifies the contents of the matrix, that's probably best done at the C++ level. I'd recommend running a hash such as MD5 or CRC-32 over the (row, col, val) tuples, and maybe hashing in the row/col names too. This would probably be easiest to add straight into BPCells, but it is probably also possible in plain Rcpp if you copy a few BPCells header files.
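As a rough illustration of the hashing idea above, here is a minimal standalone C++ sketch that runs CRC-32 over (row, col, val) tuples and then over the row/col names. It does not use any BPCells headers: `Entry`, `crc32_update`, and `matrix_checksum` are hypothetical names made up for this example, and the entries are assumed to already sit in a plain `std::vector` in a fixed order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one stored matrix entry; not a BPCells type.
struct Entry {
    uint32_t row;
    uint32_t col;
    double val;
};

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, the variant used by zlib).
// Pass the previous return value as `crc` to continue a running checksum.
uint32_t crc32_update(uint32_t crc, const void *data, size_t len) {
    const unsigned char *p = static_cast<const unsigned char *>(data);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++) {
            // Shift right by one bit, applying the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Checksum over all entries, then the row names, then the column names.
// Fields are hashed one at a time so struct padding bytes never enter the hash.
uint32_t matrix_checksum(const std::vector<Entry> &entries,
                         const std::vector<std::string> &row_names,
                         const std::vector<std::string> &col_names) {
    uint32_t crc = 0;
    for (const Entry &e : entries) {
        crc = crc32_update(crc, &e.row, sizeof e.row);
        crc = crc32_update(crc, &e.col, sizeof e.col);
        crc = crc32_update(crc, &e.val, sizeof e.val);
    }
    // Include each name's terminating NUL so that {"ab", "c"} and {"a", "bc"} differ.
    for (const std::string &s : row_names) {
        crc = crc32_update(crc, s.c_str(), s.size() + 1);
    }
    for (const std::string &s : col_names) {
        crc = crc32_update(crc, s.c_str(), s.size() + 1);
    }
    return crc;
}
```

Two caveats with this sketch: the result depends on the order in which entries are visited, so both copies of a matrix would need to be traversed the same way, and it hashes raw in-memory bytes, so checksums are only comparable between machines with the same byte order.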

An example function that shows how to iterate over the values in a matrix is here: https://github.com/bnprks/BPCells/blob/main/src/matrix_utils.cpp#L451-L461. Incidentally, that link points to a non-exported method that checks whether two integer BPCells matrices are identical. Exporting that function and adding versions for float and double types might be an alternative if you will always have a known-good copy of the data to compare against.
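To make the "compare against a known-good copy" alternative concrete, here is a minimal sketch of an entry-by-entry identity check, templated so the same code covers integer, float, and double values. This is not the BPCells function linked above; `Triplet` and `entries_identical` are hypothetical names, and both matrices are assumed to have been read out into vectors in the same traversal order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical (row, col, val) record; T can be int, float, or double.
template <typename T>
struct Triplet {
    uint32_t row;
    uint32_t col;
    T val;
};

// Returns true only if both sequences hold the same entries in the same order.
// Values are compared with ==, so a NaN entry never compares equal to anything.
template <typename T>
bool entries_identical(const std::vector<Triplet<T>> &a,
                       const std::vector<Triplet<T>> &b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (size_t i = 0; i < a.size(); i++) {
        if (a[i].row != b[i].row || a[i].col != b[i].col || a[i].val != b[i].val) {
            return false;
        }
    }
    return true;
}
```

Unlike a checksum, this needs both copies available at the same time, but it cannot produce a false match.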

bnprks avatar Nov 29 '23 07:11 bnprks

Hi Ben,

Thank you for the insights. I appreciate your guidance, as always.

Thank you!

Ever grateful, Brent

brgew avatar Nov 29 '23 17:11 brgew