PlasFlow
Possible incompatibility with underlying sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2
A colleague is using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering with filter_sequences_by_length.pl, she has a total of 2,964,210 contigs. We are using plasflow-1.1, python-3.5, and sklearn-0.18.1 on CentOS 6.9. PlasFlow was installed via Anaconda.
Running:
PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7
yields the output below.

Stdout:
Importing sequences
Imported 2964210 sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies
Stderr:
/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
vote_proba = vote_class.predict_proba(inputfile)
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
self.calculate_freq(data)
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
test_tfidf = transformer.fit_transform(kmer_count)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
X = normalize(X, norm=self.norm, copy=False)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
inplace_csr_row_normalize_l2(X)
File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
This error leads me to think that we are passing the underlying C function, sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2, too large a matrix. Following a trail of links led me to this commit, which makes me think this may be fixed in a more recent version of scikit-learn. The input data, all.contigs.1000.fasta, is 12 GB in size.
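To illustrate the mechanism I suspect (this is a toy sketch, not PlasFlow's actual matrix sizes): scipy.sparse uses 32-bit index arrays by default and only promotes them to int64 when the data no longer fits, and a Cython routine compiled against 32-bit indices, like the one in sklearn 0.18, then raises exactly this "Buffer dtype mismatch" error. The promotion below is forced by hand for demonstration purposes:

```python
# Sketch of the suspected failure mode: CSR index arrays promoted to
# int64 vs. a Cython helper that expects 32-bit ('int') indices.
import numpy as np
import scipy.sparse as sp

# A small CSR matrix: scipy keeps int32 indices because everything fits.
small = sp.random(1000, 1024, density=0.01, format="csr", random_state=0)
print(small.indptr.dtype)  # int32 on a default build

# Force the 64-bit index case that a very large kmer-count matrix
# (~3M rows x 1024 5-mer columns) can reach organically.
big = small.copy()
big.indices = big.indices.astype(np.int64)
big.indptr = big.indptr.astype(np.int64)
print(big.indptr.dtype)  # int64 -- this is what the old Cython
                         # normalize helper rejects as 'long'
```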
Questions:
- Is my assessment of this issue correct?
- Is there a workaround for this issue?
- Is the input data too big?
Thanks.
Hi, thanks for submitting this issue. I will take a closer look and think about a fix. However, I think the answer to the 3rd question is yes, and limiting the number of input sequences (for example, splitting the input in half) should help for now.
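For anyone who needs the workaround right away, here is a minimal dependency-free sketch of splitting a FASTA file into N contiguous chunks so each PlasFlow run sees fewer sequences (the function name and chunking scheme are my own, not part of PlasFlow):

```python
# split_fasta: split a (possibly multi-line) FASTA file into n_chunks
# roughly equal pieces, writing chunk_0.fasta, chunk_1.fasta, ...
# Assumes every record starts with a '>' header line.

def split_fasta(path, n_chunks, prefix="chunk"):
    # Parse records into (header, sequence) pairs.
    records = []
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line.rstrip("\n"), []
            else:
                seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))

    # Write contiguous chunks of ceil(len/n_chunks) records each.
    per_chunk = (len(records) + n_chunks - 1) // n_chunks
    out_paths = []
    for i in range(n_chunks):
        chunk = records[i * per_chunk:(i + 1) * per_chunk]
        if not chunk:
            break
        out = "{}_{}.fasta".format(prefix, i)
        with open(out, "w") as fh:
            for h, s in chunk:
                fh.write(h + "\n" + s + "\n")
        out_paths.append(out)
    return out_paths
```

Each resulting chunk can then be passed to PlasFlow.py separately and the output CSV files concatenated afterwards.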