Possible issue with uint8 in bmat
I'm trying to parse snap files into Python sparse matrices. This is what I'm doing:

```python
import numpy as np
import h5py
import snaptools.utilities
import scipy.sparse as sp

# myfile, genome_dict and bin_size are defined elsewhere
f = h5py.File(myfile, 'r')
n_cells = len(f['BD/name'])
bin_dict = snaptools.utilities.getBinsFromGenomeSize(genome_dict, bin_size)  # from snaptools code
n_bins = len(bin_dict)
idy = f['AM/5000/idy'][:]
idx = np.arange(n_cells + 1)
data = f['AM/5000/count'][:]
X = sp.csc_matrix((data, idy, idx), shape=(n_bins, n_cells))
```
Everything seems to work, but I've noticed two things:
- my data are capped at 255
- there are many more zeros than I previously found with another method (outside snaptools)
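
Both observations are easy to check directly on `X` (assuming the snippet above has run):

```python
# Sanity checks on the matrix built above (assumes X exists as constructed earlier)
print(X.data.max())          # prints 255 -> no stored value exceeds the uint8 maximum
print((X.data == 0).sum())   # explicitly stored zeros, suspicious for count data
```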
As for the second point, I initially thought I was counting wrong, but reading the snap file internals I realized that counts are saved as uint8, which explains the capping at 255. The problem is that at line 55 of add_bmat.py the counter is a generic Python integer:

```python
bins = collections.defaultdict(lambda : 0);
```

which is then cast to uint8 at write time (line 79):

```python
f.create_dataset("AM/"+str(bin_size)+"/count", data=countList[bin_size], dtype="uint8", compression="gzip", compression_opts=9);
```

This causes each value v to be written as v % 256. I don't know whether standard scATAC-seq experiments expect read counts per bin to stay below 255, but that is not my case.
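
A minimal demonstration of the wraparound, using only numpy (h5py applies an equivalent cast when writing to a uint8 dataset):

```python
import numpy as np

counts = np.array([100, 255, 256, 300, 512])
print(counts.astype(np.uint8))
# -> [100 255   0  44   0]; each value v becomes v % 256, so a count of 256 wraps to 0
```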
Hi,
Thanks for reporting this. I was intentionally capping the max value at 255. The reason is twofold:

- we found that an extremely small portion (less than 0.1%) of the items in the matrix have a count larger than 10 or 50. Given that there are only two copies of the genome in a normal cell (not considering copy number variation or chrM sequences), items with values larger than 100 are very likely due to alignment errors from repetitive sequences.
- the downstream analysis converts the count matrix to a binary matrix anyway, so the absolute count value is not considered downstream (see the sketch after this list).
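
In pseudocode, that downstream step amounts to something like this (a minimal sketch with scipy, not the exact SnapATAC code):

```python
import numpy as np
import scipy.sparse as sp

# Minimal sketch of the downstream binarization (not the exact SnapATAC code)
X = sp.csc_matrix(np.array([[0, 3], [200, 1], [0, 120]]))  # toy count matrix
X.data[X.data > 100] = 0                 # treat suspiciously high counts as artifacts
X.data = (X.data > 0).astype(np.uint8)   # binarize: any remaining count becomes 1
X.eliminate_zeros()                      # purge the entries that were zeroed out
print(X.toarray())                       # [[0 1] [0 1] [0 0]]
```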
--
Rongxin Fang
Ph.D. Student, Ren Lab
Ludwig Institute for Cancer Research
University of California, San Diego
I see. Still, if a bin has a count of 256, it is stored as 0 and stays 0 even after binarization. If 255 is meant to be the maximum, counts should be capped before the snap object is written.
I see your point, and I agree the cap should be applied before binarizing; I will try to change it. Meanwhile, this won't affect the standard downstream analysis (unless you are using the matrix for other purposes), because items with values larger than about 100 are usually removed and set to 0 anyway. But again, I agree this is an issue that needs to be fixed. Thanks for reporting it.
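
For example, the write in add_bmat.py could clip before casting; a minimal sketch reusing the names from the snippet quoted above (hypothetical, not the actual patch):

```python
import numpy as np

# Hypothetical fix sketch: cap counts at the uint8 maximum before writing,
# instead of letting the uint8 cast wrap values modulo 256.
capped = np.minimum(countList[bin_size], 255)
f.create_dataset("AM/"+str(bin_size)+"/count", data=capped,
                 dtype="uint8", compression="gzip", compression_opts=9)
```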
Best,
Rongxin Fang