PlasFlow
Possible incompatibility with underlying sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2
A colleague is using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering with filter_sequences_by_length.pl, she has a total of 2,964,210 contigs. We are using plasflow-1.1, python-3.5, and sklearn-0.18.1 on CentOS 6.9. PlasFlow was installed via Anaconda.
Running:
PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7
yields the output below.

Stdout:
Importing sequences
Imported 2964210 sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies
Stderr:
/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
vote_proba = vote_class.predict_proba(inputfile)
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
self.calculate_freq(data)
File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
test_tfidf = transformer.fit_transform(kmer_count)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
X = normalize(X, norm=self.norm, copy=False)
File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
inplace_csr_row_normalize_l2(X)
File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
This error leads me to think that we are passing the underlying C function, sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2, too large a matrix. Following a trail of links led me to this commit, which makes me think this may be fixed in a more recent version of scikit-learn. The input data, all.contigs.1000.fasta, is 12 GB in size.
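To illustrate the mechanism I suspect (this is a toy sketch, not PlasFlow's actual matrix sizes): scipy.sparse uses 32-bit index arrays by default and only promotes them to int64 when the data no longer fits, and a Cython routine compiled against 32-bit indices, like the one in sklearn 0.18, then raises exactly this "Buffer dtype mismatch" error. The promotion below is forced by hand for demonstration purposes:

```python
# Sketch of the suspected failure mode: CSR index arrays promoted to
# int64 vs. a Cython helper that expects 32-bit ('int') indices.
import numpy as np
import scipy.sparse as sp

# A small CSR matrix: scipy keeps int32 indices because everything fits.
small = sp.random(1000, 1024, density=0.01, format="csr", random_state=0)
print(small.indptr.dtype)  # int32 on a default build

# Force the 64-bit index case that a very large kmer-count matrix
# (~3M rows x 1024 5-mer columns) can reach organically.
big = small.copy()
big.indices = big.indices.astype(np.int64)
big.indptr = big.indptr.astype(np.int64)
print(big.indptr.dtype)  # int64 -- this is what the old Cython
                         # normalize helper rejects as 'long'
```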
Questions:
- Is my assessment of this issue correct?
- Is there a workaround for this issue?
- Is the input data too big?
Thanks.
Hi, thanks for submitting this issue. I will take a closer look and think about a fix. However, I think the answer to the 3rd question is yes, and limiting the number of input sequences (for example, splitting the input in half) should help for now.
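For anyone who needs the workaround right away, here is a minimal dependency-free sketch of splitting a FASTA file into N contiguous chunks so each PlasFlow run sees fewer sequences (the function name and chunking scheme are my own, not part of PlasFlow):

```python
# split_fasta: split a (possibly multi-line) FASTA file into n_chunks
# roughly equal pieces, writing chunk_0.fasta, chunk_1.fasta, ...
# Assumes every record starts with a '>' header line.

def split_fasta(path, n_chunks, prefix="chunk"):
    # Parse records into (header, sequence) pairs.
    records = []
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line.rstrip("\n"), []
            else:
                seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))

    # Write contiguous chunks of ceil(len/n_chunks) records each.
    per_chunk = (len(records) + n_chunks - 1) // n_chunks
    out_paths = []
    for i in range(n_chunks):
        chunk = records[i * per_chunk:(i + 1) * per_chunk]
        if not chunk:
            break
        out = "{}_{}.fasta".format(prefix, i)
        with open(out, "w") as fh:
            for h, s in chunk:
                fh.write(h + "\n" + s + "\n")
        out_paths.append(out)
    return out_paths
```

Each resulting chunk can then be passed to PlasFlow.py separately and the output CSV files concatenated afterwards.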