Update H. flu database to new sklearn version

conmeehan opened this issue 1 year ago • 2 comments

Versions
poppunk 2.6.0
zsh: command not found: poppunk_sketch
poppunk_assign 2.6.0

Conda list:

packages in environment at /Users/cmeehan/opt/miniconda3/envs/poppunk:

Name Version Build Channel

Command used and output returned
poppunk_assign --db Haemophilus_influenzae_v1_refs --query input.txt --output poppunk_clusters --threads 7

Input.txt:
AP022846 AP022846.1.fa
SRR11108932 SRR11108932_1.fastq.gz SRR11108932_1.fastq.gz

Describe the bug
I get the following error when running on an Apple M1 (macOS 13.4.1, 16 GB memory):

PopPUNK: assign
	(with backend: sketchlib v2.1.1
	 sketchlib: /Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/pp_sketchlib.cpython-310-darwin.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 7 threads
Sketching 1 genomes using 1 thread(s)
Progress (CPU): 1 / 1
Writing sketches to file
Traceback (most recent call last):
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/PopPUNK/assign.py", line 357, in assign_query_hdf5
    from .models import loadClusterFit
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/PopPUNK/models.py", line 19, in <module>
    import hdbscan
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/Users/cmeehan/opt/miniconda3/envs/poppunk/lib/python3.10/site-packages/hdbscan/hdbscan_.py", line 40, in <module>
    FAST_METRICS = KDTree.valid_metrics + BallTree.valid_metrics + ["cosine", "arccos"]
TypeError: unsupported operand type(s) for +: 'builtin_function_or_method' and 'builtin_function_or_method'

Note: I ran the same command on an Ubuntu server and did not get this error.

Sorry about this. I think it's due to scikit-learn changing its API, which I couldn't make backwards compatible; see: https://github.com/bacpop/PopPUNK#2022-08-04

The change in scikit-learn's API in v1.0.0 and above means that HDBSCAN models fitted with sklearn <=v0.24 will give an error when loaded. If you run into this, the solution is one of:

  • Downgrade sklearn to v0.24.
  • Run model refinement to turn your model into a boundary model instead (this will change clusters).
  • Refit your model in an environment with sklearn >=v1.0.

If this is a common problem let us know, as we could write a script to 'upgrade' HDBSCAN models. See issue https://github.com/bacpop/PopPUNK/issues/213 for more details.
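To see which side of the v0.24 / v1.0 boundary a given environment sits on (for example, when comparing the macOS and Ubuntu installs), a minimal sketch like the one below may help. It uses only the Python standard library. The helper names `major_minor`, `sklearn_era` and `installed_version` are hypothetical and not part of PopPUNK, and the package names passed to `importlib.metadata` are assumed to match what is installed in the environment.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string):
    """Extract (major, minor) from a version string such as '1.3.0' or '0.24.2'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def sklearn_era(sklearn_version):
    """Classify a scikit-learn version against the boundary named in the README."""
    return "1.0+" if major_minor(sklearn_version) >= (1, 0) else "pre-1.0"


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names are assumptions; adjust them to match your environment.
    for package in ("scikit-learn", "hdbscan", "poppunk"):
        print(package, installed_version(package) or "not installed")

    sklearn_version = installed_version("scikit-learn")
    if sklearn_version is not None:
        print("scikit-learn era:", sklearn_era(sklearn_version))
```

Running this in each environment and comparing the output should show whether the two machines have different scikit-learn or hdbscan versions installed.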

Was the Haemophilus_influenzae_v1_refs database from our website? If so, I should update it to fix this.

Ah sorry, I didn't see that bit in the README. I'm surprised that it worked on the Ubuntu box; it must have used a different scikit-learn version and I just didn't notice.

The database was from your website, yes. I didn't try any other ones, just that one.


I'm really sorry @conmeehan but I totally forgot about this!!

I remembered as I'd seen this published: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001281. I've just made a compatible PopPUNK scheme (without the error reported here) and uploaded it as v2.

