
Add Twitter/twhin-bert-large

Open do-me opened this issue 1 year ago • 5 comments

Model description

https://huggingface.co/Twitter/twhin-bert-large/

trained on 7 billion Tweets from over 100 distinct languages

It's the best (and maybe even the only real) multilingual model for social media posts I could find. I tried running the conversion script, but it fails due to ONNX opset version 11. I tried to hard-code the parameter in the script, but without success so far.

$ python -m scripts.convert --quantize --model_id Twitter/twhin-bert-large
/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 3.85MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 373/373 [00:00<00:00, 644kB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 17.1M/17.1M [00:04<00:00, 3.65MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 497kB/s]
Framework not specified. Using pt to export to ONNX.
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2.25G/2.25G [02:11<00:00, 17.1MB/s]
Automatic task detection to fill-mask (possible synonyms are: masked-lm).
Using the export variant default. Available variants are:
        - default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
        - use_cache -> False
Traceback (most recent call last):
  File "/home/dome/mambaforge/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dome/mambaforge/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/dome/transformers.js/scripts/convert.py", line 519, in <module>
    main()
  File "/mnt/c/Users/dome/transformers.js/scripts/convert.py", line 422, in main
    main_export(**export_kwargs)
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/optimum/exporters/onnx/__main__.py", line 486, in main_export
    _, onnx_outputs = export_models(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/optimum/exporters/onnx/convert.py", line 752, in export_models
    export(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/optimum/exporters/onnx/convert.py", line 855, in export
    export_output = export_pytorch(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/optimum/exporters/onnx/convert.py", line 572, in export_pytorch
    onnx_export(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/torch/onnx/utils.py", line 516, in export
    _export(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/torch/onnx/utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/torch/onnx/utils.py", line 1139, in _model_to_graph
    graph = _optimize_graph(
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/torch/onnx/utils.py", line 677, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
  File "/mnt/c/Users/dome/transformers.js/onnxconversion/lib/python3.10/site-packages/torch/onnx/utils.py", line 1967, in _run_symbolic_function
    raise errors.UnsupportedOperatorError(
torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::einsum' to ONNX opset version 11 is not supported. Support for this operator was added in version 12, try exporting with this version.

Prerequisites

  • [X] The model is supported in Transformers (i.e., listed here)
  • [X] The model can be exported to ONNX with Optimum (i.e., listed here)

Additional information

As it's BERT-based, I'd guess it's generally possible to convert it. Not sure whether the TwHIN part complicates things somehow.

Your contribution

I can try to rerun the conversion, but I would need a clue how to deal with the opset parameter.

do-me avatar Feb 28 '24 16:02 do-me

You can increase the opset value by adding --opset 12 to the conversion command :)

So,

python -m scripts.convert --quantize --model_id Twitter/twhin-bert-large --opset 12

should work

xenova avatar Feb 28 '24 18:02 xenova

Worked, thanks! :) I uploaded the model for testing purposes here, but the embeddings don't seem to be correct. I added the model to SemanticFinder and checked the similarity scores, but which chunk comes out as most similar is just random. Also, the similarity scores seem suspiciously high, always around 0.97 or 0.96.

[screenshot: similarity scores in SemanticFinder]

My guess would be that they changed something in the architecture or that there is something special about it (e.g. with CLS token or similar). Unfortunately, the HF repo seems kind of dead.
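One thing worth double-checking in that case is how the single sentence embedding is pooled from the per-token hidden states. As a toy sketch (plain Python, made-up values, not the model's actual code), mean pooling averages the non-padding token embeddings instead of taking only the [CLS] vector:

```python
# Toy mean pooling over token embeddings with an attention mask:
# the common way BERT-style per-token hidden states are collapsed
# into one sentence embedding (an alternative to using only [CLS]).

def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of tokens whose mask is 1 (non-padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / count for t in totals]

# Three tokens of dimension 2; the last token is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

If the model was evaluated with one pooling strategy but the embeddings are extracted with another, similarity rankings can degrade noticeably.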

do-me avatar Feb 29 '24 20:02 do-me

If you are only using the model for embeddings (i.e., not masked language modelling, which the model was trained for), you should append --task feature-extraction to the conversion command. This will remove the final language modelling head from the model.
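Putting that together with the earlier fix, the full conversion command would presumably look like this (assuming the `--opset 12` flag is still needed for this model):

```shell
python -m scripts.convert --quantize --model_id Twitter/twhin-bert-large --task feature-extraction --opset 12
```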

[screenshot]

This is probably the reason for the high values: the model is producing a probability distribution over the vocabulary (values between 0 and 1), with most values similar to each other, resulting in high cosine similarity.

xenova avatar Feb 29 '24 21:02 xenova

That's really interesting, I wasn't aware of this flag so far. I'll give it a shot tomorrow and update the model! Thank you so much for all your immediate help on these issues :)

do-me avatar Feb 29 '24 21:02 do-me

I added the flag and reran the script. The values changed and seem normal to me. However, the actual results for ordinary semantic similarity are still poor. Is it possible that this model is not suitable for feature extraction at all?

do-me avatar Mar 01 '24 12:03 do-me