'hdbscan' module not found; maybe use installed sklearn.cluster.HDBSCAN?

Open atomiechen opened this issue 1 year ago • 13 comments

❓ Questions and Help

What is your question?

Starting from a fresh container environment with pytorch and funasr installed (via pip install funasr), I encountered ModuleNotFoundError: No module named 'hdbscan' when I instantiated an AutoModel with a spk model. The error originates from the import hdbscan in UmapHdbscan() <- ClusterBackend() <- AutoModel(...).

  1. Must I install hdbscan manually? Are there any other packages that I also need to install in advance?
  • I am building my own container image, and it is frustrating to find that I have to rebuild it. I see no hint in the output or the docs.
  2. There is a sklearn.cluster.HDBSCAN, and sklearn is already present once funasr is installed. Can we use the sklearn one instead of installing the standalone hdbscan package?
  • The two versions seem to come from the same authors and differ in some minor ways (see https://github.com/scikit-learn/scikit-learn/issues/27829).
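
To make the second question concrete, here is a minimal sketch of an import fallback: try the standalone hdbscan package first and fall back to sklearn.cluster.HDBSCAN (added in scikit-learn 1.3). The helper names first_available and load_hdbscan_class are hypothetical and not part of FunASR. Because the two implementations differ in some details, the clustering code that consumes the class may still need adjusting.

```python
import importlib


def first_available(candidates):
    """Return the first importable attribute from (module_name, attr_name) pairs."""
    errors = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Module missing, or present but too old to provide the attribute.
            errors.append(f"{module_name}.{attr_name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


def load_hdbscan_class():
    """Hypothetical helper: prefer standalone hdbscan, else scikit-learn's HDBSCAN."""
    return first_available([
        ("hdbscan", "HDBSCAN"),
        ("sklearn.cluster", "HDBSCAN"),
    ])
```

Calling load_hdbscan_class() raises ImportError with the collected reasons only when neither package provides the class, so the failure message names both options.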

Code

from funasr import AutoModel
model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch", model_revision="v2.0.4",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch", vad_model_revision="v2.0.4",
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", punc_model_revision="v2.0.4",
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", spk_model_revision="v2.0.2",
)

What have you tried?

In a pytorch docker container, run pip install funasr and then the script above.

What's your environment?

  • OS (e.g., Linux):
  • FunASR Version (e.g., 1.0.0): 1.0.19
  • ModelScope Version (e.g., 1.11.0): None (do not need it)
  • PyTorch Version (e.g., 2.0.0): 2.2.2
  • How you installed funasr (pip, source): pip
  • Python version: 3.10.14
  • GPU (e.g., V100M32): NVIDIA GeForce RTX 4090
  • CUDA/cuDNN version (e.g., cuda11.7): cuda11.8
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1): pytorch/pytorch:2.2.2-cuda11.8-cudnn8-runtime
  • Any other relevant information:

atomiechen avatar Mar 31 '24 02:03 atomiechen

Delete all *model_revision arguments and try again. All requirements will then be installed automatically.

LauraGPT avatar Mar 31 '24 14:03 LauraGPT

Yes, thank you. But what I want to do is build an image with all packages installed before running any scripts. I should not have to figure this out through trial and error.

atomiechen avatar Mar 31 '24 15:03 atomiechen

If any errors remain after you delete all *model_revision arguments, please let me know.

LauraGPT avatar Apr 01 '24 12:04 LauraGPT

Sadly yes.

I removed all *model_revision:

from funasr import AutoModel

model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch", 
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch", 
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", 
)

And I still got:

ckpt: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
ckpt: iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
ckpt: iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/model.pt
ckpt: iic/speech_campplus_sv_zh-cn_16k-common/campplus_cn_common.bin
Traceback (most recent call last):
  File "/shared/test-funasr/tmp_test.py", line 10, in <module>
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", 
  File "/home/user/.local/lib/python3.10/site-packages/funasr/auto/auto_model.py", line 135, in __init__
    self.cb_model = ClusterBackend().to(kwargs["device"])
  File "/home/user/.local/lib/python3.10/site-packages/funasr/models/campplus/cluster_backend.py", line 149, in __init__
    self.umap_hdbscan_cluster = UmapHdbscan()
  File "/home/user/.local/lib/python3.10/site-packages/funasr/models/campplus/cluster_backend.py", line 118, in __init__
    import hdbscan
ModuleNotFoundError: No module named 'hdbscan'

FunASR Version: 1.0.19

And I cannot even import funasr using the latest commit (702b9b540c3c1524748cd975a10ce33f0fa53912) on main branch:

>>> import funasr
/.../FunASR/funasr/datasets/large_datasets/utils/tokenize.py:93: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if vad is not -2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../FunASR/funasr/__init__.py", line 36, in <module>
    import_submodules(__name__)
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 25, in import_submodules
    for loader, name, is_pkg in pkgutil.walk_packages(package.__path__, package.__name__ + '.'):
AttributeError: 'str' object has no attribute '__path__'. Did you mean: '__hash__'?
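
The AttributeError above suggests that a string reached package.__path__ during the recursive call. The following is a minimal sketch of a submodule importer that accepts either a package name or a module object. It is an assumption about the general pattern only, not the code in funasr/__init__.py and not necessarily the fix that was applied upstream.

```python
import importlib
import pkgutil


def import_submodules(package):
    """Import every submodule of a package.

    Accepts a package name or an already imported package object and returns
    a dict mapping each submodule name to the module, or to the exception
    raised while importing it.
    """
    if isinstance(package, str):
        # Resolve the name first so that __path__ is read from a module object.
        package = importlib.import_module(package)
    results = {}
    # walk_packages already descends into subpackages, so no manual recursion.
    for _finder, name, _is_pkg in pkgutil.walk_packages(
        package.__path__, package.__name__ + ".", onerror=lambda _name: None
    ):
        try:
            results[name] = importlib.import_module(name)
        except Exception as exc:
            # Record the failure instead of aborting the whole walk.
            results[name] = exc
    return results
```

Recording exceptions in the result makes it possible to print which submodule failed and why, which is the kind of log requested later in this thread.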

atomiechen avatar Apr 01 '24 16:04 atomiechen

Plus: all my models are already present in a folder literally named iic in the current directory, so nothing extra is downloaded. The environment running the script above does not have modelscope installed.

Still worth mentioning: during the image build phase, one should not run a test script like this to 'trigger' the automatic installation of extra dependencies; that is an anti-pattern. Preparing the environment needs explicit commands, something like pip install funasr[spk].
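
As an illustration of an explicit build-time check, here is a minimal sketch that fails fast when a required module is not importable, without loading any model. The function names are hypothetical. Of the module names mentioned in the usage note, only hdbscan is confirmed by the traceback; umap is an assumption inferred from the UmapHdbscan class name.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def main(required):
    """Print any missing modules to stderr and return a process exit code."""
    missing = missing_modules(required)
    if missing:
        print("missing modules: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0
```

In an image build, a step could call sys.exit(main(["funasr", "hdbscan", "umap"])) so that the build stops when a dependency is absent.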

atomiechen avatar Apr 01 '24 16:04 atomiechen

FunASR Version: 1.0.19. You should run pip install -e . from the source checkout.

LauraGPT avatar Apr 02 '24 02:04 LauraGPT

I mean I tried both ways:

  1. pip install funasr to install the latest pypi version (1.0.19)
  2. pip install -e . after pulling the latest commit of the main branch, which results in the error above.

atomiechen avatar Apr 02 '24 02:04 atomiechen

First run pip install -e ., then uncomment the line here and print the error log: https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/init.py#L21

LauraGPT avatar Apr 02 '24 02:04 LauraGPT

The bug has been fixed in https://github.com/alibaba-damo-academy/FunASR/pull/1580. Please update funasr:

git pull
pip install -e .

LauraGPT avatar Apr 02 '24 03:04 LauraGPT

I pulled the latest commit, ran pip install -e ., and uncommented the print (see screenshot), but still got the same output as before.

So there is no error reported here.

atomiechen avatar Apr 02 '24 03:04 atomiechen

Requirements are installed here: https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/download/download_from_hub.py#L76

Maybe you could debug it and show the log.

LauraGPT avatar Apr 02 '24 04:04 LauraGPT

The problem is that models from a previous revision (instead of master) were already downloaded in the iic folder. The code does not check the revision and does not re-download the latest master revision, so there is no requirements.txt file in the campplus model folder.

atomiechen avatar Apr 05 '24 17:04 atomiechen

I now understand that requirements.txt comes from the model directory. Maybe some mechanism that automatically re-downloads the specified revision is needed?
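
A quick way to detect this situation is to check whether a locally cached model folder ships a requirements.txt at all. The sketch below assumes the file sits at the top level of the model directory. The helper names are hypothetical and this is not FunASR's own download logic.

```python
from pathlib import Path


def find_requirements(model_dir):
    """Return the path to requirements.txt inside model_dir, or None if absent."""
    path = Path(model_dir) / "requirements.txt"
    return path if path.is_file() else None


def read_requirements(model_dir):
    """Return the non-empty, non-comment lines of the model's requirements.txt."""
    path = find_requirements(model_dir)
    if path is None:
        return []
    lines = (line.strip() for line in path.read_text(encoding="utf-8").splitlines())
    return [line for line in lines if line and not line.startswith("#")]
```

For example, find_requirements("iic/speech_campplus_sv_zh-cn_16k-common") returning None would indicate that the cached copy has no dependency list to install from.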

❓ And also I wonder if this is possible:

2. There is a sklearn.cluster.HDBSCAN, and I find sklearn is already there with funasr installed. Can we just use that sklearn one instead of installing the standalone version hdbscan?

atomiechen avatar Apr 05 '24 18:04 atomiechen