ESMFold for multimer fails when using HuggingFace installation
When attempting to use HuggingFace's ESMFold implementation for multimers with the ':' separator between chain sequences (as suggested in the README), I get a ValueError when passing the sequence to the tokenizer. I am okay with using the artificial glycine linker suggested by HuggingFace's tutorial, but would like clarification on whether the ':' separator approach suggested by this repo's README is valid. Thanks!
Reproduction steps
Install the HuggingFace transformers package via conda (version 4.24.0), then run:
from transformers import EsmTokenizer, AutoTokenizer, EsmForProteinFolding
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1") # Download tokenizer
# OR
tokenizer = EsmTokenizer.from_pretrained("facebook/esmfold_v1") # Download alternative tokenizer
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1") # Download model
seq = chain1_seq + ":" + chain2_seq # Concatenate sequences with suggested delimiter
inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False) # Tokenize seq
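For reference, the ':' usage I am comparing against is (if I understand correctly) the one from this repo's README, which goes through the fair-esm package's own ESMFold API rather than the HuggingFace tokenizer. A hedged sketch of that route, assuming a working fair-esm ESMFold install and a GPU; the sequences below are toy placeholders, not real targets:

# Sketch of the ':' separator route via the fair-esm ESMFold API (README usage),
# not the HuggingFace tokenizer path shown above. Assumes fair-esm is installed
# with its ESMFold extras and that a CUDA device is available.
import torch
import esm

fold_model = esm.pretrained.esmfold_v1()
fold_model = fold_model.eval().cuda()

chain1_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder chain A
chain2_seq = "GSHMSLYDDLGVEALDIQSRLAQ"            # placeholder chain B

with torch.no_grad():
    # Chains are joined with ':' and handed to the fair-esm model directly.
    pdb_str = fold_model.infer_pdb(chain1_seq + ":" + chain2_seq)

with open("multimer.pdb", "w") as handle:
    handle.write(pdb_str)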
Expected behavior
I expect the tokenizer to be able to handle the ':' delimiter and not raise an error. I tried both tokenizers (AutoTokenizer and EsmTokenizer) and both yielded the same error.
As suggested by the ValueError, I also tried turning on padding and truncation, but the error was the same.
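For reference, the variant with the error message's suggestion applied (reconstructed from the description above; it still raised the same ValueError) was:

# Same call as before, with padding and truncation enabled per the error message.
inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False,
                   padding=True, truncation=True)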
Logs
Filepaths in the output are truncated for privacy reasons.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File ~/.../site-packages/transformers/tokenization_utils_base.py:715, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714 if not is_tensor(value):
--> 715     tensor = as_tensor(value)
    717 # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    718 # # at-least2d
    719 # if tensor.ndim > 2:
    720 #     tensor = tensor.squeeze(0)
    721 # elif tensor.ndim < 2:
    722 #     tensor = tensor[None, :]
RuntimeError: Could not infer dtype of NoneType
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
.../src/embed.ipynb Cell 139 line 1
----> 1 inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)
File .../site-packages/transformers/tokenization_utils_base.py:2488, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2486 if not self._in_target_context_manager:
   2487     self._switch_to_input_mode()
-> 2488 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2489 if text_target is not None:
...
    735 " expected)."
    736 )
    738 return self
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Additional context
python version = 3.8
transformers version = 4.24.0
https://github.com/tonyreina/antibody-affinity/blob/main/esmfold_multimer.ipynb
The HuggingFace model doesn't use the ':'. I've linked my notebook above, which shows how to do multimer predictions. The hack is to include a glycine (G) linker sequence between all chains, so a single sequence joined by Gs is passed to the model. The linker region is then masked out when producing the PDB file. I suspect the ':' does the same thing under the hood.
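For completeness, here is a minimal sketch of that glycine-linker hack using the HuggingFace classes from the reproduction above. The toy sequences, the 25-residue poly-G linker, and the +512 residue-index offset between chains are assumptions following the pattern in HuggingFace's ESMFold examples, not values taken from this issue.

# Hedged sketch of the glycine-linker multimer hack described above.
# Assumptions (not from this issue): placeholder sequences, a 25-residue
# poly-G linker, and a +512 residue-index jump so the model treats the second
# chain as a separate chain rather than a continuation of the first.
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1").eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

chain1_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder chain A
chain2_seq = "GSHMSLYDDLGVEALDIQSRLAQ"            # placeholder chain B
linker = "G" * 25                                 # artificial glycine linker

multimer = chain1_seq + linker + chain2_seq
inputs = tokenizer([multimer], return_tensors="pt", add_special_tokens=False)

# Offset the residue indices after the first chain so the two chains are not
# modelled as one continuous polypeptide.
position_ids = torch.arange(len(multimer), dtype=torch.long)
position_ids[len(chain1_seq) + len(linker):] += 512
inputs["position_ids"] = position_ids.unsqueeze(0)

inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    output = model(**inputs)

# Mask out the linker residues so they are dropped when the outputs are
# converted to a PDB file, as described above.
linker_mask = torch.tensor(
    [1] * len(chain1_seq) + [0] * len(linker) + [1] * len(chain2_seq)
)[None, :, None].to(device)
output = dict(output)  # plain dict so the entry can be overwritten
output["atom37_atom_exists"] = output["atom37_atom_exists"] * linker_mask
# The masked `output` can then be written out with the PDB-conversion helper
# used in the linked notebook / HuggingFace tutorial.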