
Possible issue with uint8 in bmat

dawe opened this issue 6 years ago · 3 comments

I'm trying to parse snap files into Python sparse matrices. This is what I'm doing:

import numpy as np
import h5py
import snaptools.utilities
import scipy.sparse as sp

f = h5py.File(myfile, 'r')
n_cells = len(f['BD/name'])
bin_dict = snaptools.utilities.getBinsFromGenomeSize(genome_dict, bin_size)  # from snaptools code
n_bins = len(bin_dict)
idy = f['AM/5000/idy'][:]
idx = np.arange(n_cells + 1)
data = f['AM/5000/count'][:]

X = sp.csc_matrix((data, idy, idx), shape=(n_bins, n_cells))

Everything seems to work, but I've noticed two things:

  • my data are capped at 255
  • there are many more zeros than I previously found with another method (outside snaptools)

As for the second point, I thought maybe I was counting wrong, but reading the snap file internals I realized that counts are saved as uint8, which explains the capping at 255. The problem is that at line 55 of add_bmat.py the counter is a generic Python integer

            bins = collections.defaultdict(lambda : 0);

which is then cast to uint8 at write time (line 79).

        f.create_dataset("AM/"+str(bin_size)+"/count", data=countList[bin_size], dtype="uint8", compression="gzip", compression_opts=9);    

This causes the stored values to wrap around modulo 256 (X % 256). I don't know whether standard scATAC experiments expect per-bin read counts to stay below 255, but that is not the case for my data.
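
A minimal reproduction of the wraparound, with plain numpy and no snap file needed:

    import numpy as np

    # Counts accumulated as plain Python ints can exceed 255;
    # the uint8 cast at write time then wraps them modulo 256.
    counts = np.array([10, 255, 256, 300, 512])
    print(counts.astype(np.uint8))  # [ 10 255   0  44   0]

So a bin with 256 reads is stored as 0, and 300 reads become 44, which also explains the extra zeros I was seeing.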

dawe avatar Sep 26 '19 11:09 dawe

Hi,

Thanks for reporting this. I was intentionally capping the max value at 255. The reason is twofold:

  1. We found that an extremely small portion (less than 0.1%) of the items in the matrix have a count larger than 10 or 50. Given that a normal cell carries only two copies of the genome (setting aside copy number variation and the chrM sequence), items with a value larger than 100 are very likely due to alignment errors from repetitive sequences.

  2. The downstream analysis converts the count matrix to a binary matrix anyway, so the absolute count value is never used downstream (see the sketch below).
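
For context, that binarization step is conceptually something like this (a sketch, not the exact SnapATAC code; assumes a scipy sparse matrix):

    import numpy as np

    def binarize(X):
        # Sketch: any positive stored count becomes 1; the absolute
        # count value is never used after this point.
        X = X.copy()
        X.data = (X.data > 0).astype(np.uint8)
        X.eliminate_zeros()  # drop any explicitly stored zeros
        return X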

-- Rongxin Fang, Ph.D. Student, Ren Lab, Ludwig Institute for Cancer Research, University of California, San Diego

r3fang avatar Sep 30 '19 03:09 r3fang

I see. Still, if you have a bin with a count of 256, it would be set to 0, even after binarization. If 255 is meant to be the max value, counts should be capped before writing the snap object.
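
Something like this at write time would saturate instead of wrapping (a sketch against the create_dataset call quoted above; f, countList and bin_size as in add_bmat.py):

    import numpy as np

    # Saturate at 255 instead of wrapping: 256 -> 255, not 0.
    capped = np.clip(countList[bin_size], 0, 255).astype(np.uint8)
    f.create_dataset("AM/" + str(bin_size) + "/count", data=capped,
                     dtype="uint8", compression="gzip", compression_opts=9)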

dawe avatar Oct 01 '19 16:10 dawe

I see your point. I agree counts should be capped before binarizing, and I will try to change it. Meanwhile, this won't affect the standard downstream analysis (unless you are using the matrix for another purpose), because items in the matrix with a value larger than about 100 are usually removed and set to 0 (roughly the sketch below). But again, I agree this is an issue that needs to be fixed. Thanks for reporting it.
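
A rough sketch of that filtering (not the exact SnapATAC code; the threshold depends on the analysis; assumes a scipy sparse matrix):

    def remove_high_counts(X, max_count=100):
        # Sketch: treat implausibly high counts as alignment artifacts
        # and zero them out before binarization.
        X = X.copy()
        X.data[X.data > max_count] = 0
        X.eliminate_zeros()
        return X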

Best,

Rongxin Fang, Ph.D. Student, Ren Lab, Ludwig Institute for Cancer Research, University of California, San Diego

r3fang avatar Oct 02 '19 13:10 r3fang