layoutlmv3-base-chinese tokenizer could not be loaded.
The tokenizer resource shipped with the Chinese version of LayoutLMv3 is sentencepiece.bpe.model, but it seems we need vocab.json and merges.txt to load the LayoutLMv3Tokenizer. So @Dod-o, could you provide a function to convert between them, or confirm whether there is a difference between these two tokenizers?
layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer. Adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
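For reference, a minimal sketch of what loading looks like once that config.json edit is in place (the checkpoint name comes from the link above; the rest is standard transformers usage, so treat it as illustrative):

```python
from transformers import AutoTokenizer

# With "tokenizer_class": "XLMRobertaTokenizer" in config.json, AutoTokenizer
# resolves the SentencePiece-based tokenizer instead of LayoutLMv3Tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base-chinese")
print(tokenizer.tokenize("发票号码"))
```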
I have the same problem. How can I solve it?
> layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer. Adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
What is your transformers version? The newest transformers does not work.
@HYPJUDY It seems the XLMRobertaTokenizer defined in the config file only accepts text as input, whereas LayoutLMv3Tokenizer accepts both text and bboxes, so it can add the bbox info to the tokenized result automatically. That would be more convenient.
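A small sketch of that difference (the checkpoint names are the ones discussed in this thread; both calls are standard transformers APIs, but this is illustrative rather than verified against every version):

```python
from transformers import LayoutLMv3Tokenizer, XLMRobertaTokenizer

words = ["Invoice", "No."]
boxes = [[48, 84, 120, 100], [130, 84, 160, 100]]  # one box per word

# English checkpoint: the tokenizer takes boxes and aligns them to subwords.
lv3 = LayoutLMv3Tokenizer.from_pretrained("microsoft/layoutlmv3-base")
enc = lv3(words, boxes=boxes)
print(enc["bbox"])  # boxes expanded to match input_ids

# Chinese checkpoint: XLMRobertaTokenizer takes text only; boxes would have
# to be aligned to subword tokens by hand afterwards.
xlmr = XLMRobertaTokenizer.from_pretrained("microsoft/layoutlmv3-base-chinese")
enc = xlmr(words, is_split_into_words=True)
```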
> layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer. Adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
>
> What is your transformers version? The newest transformers does not work.
4.5.1
Hi @wlhgtc, as mentioned above, the Chinese version of LayoutLMv3 uses XLMRobertaTokenizer; this is the difference between layoutlmv3-zh and layoutlmv3-en. This code provides an example of how to process the bbox ourselves.
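Since that code is not inlined here, the following is a minimal sketch of what manual bbox handling can look like, assuming a fast tokenizer so that `word_ids()` is available (the checkpoint name and sample words are illustrative):

```python
from transformers import XLMRobertaTokenizerFast

# The fast tokenizer can be converted from the repo's sentencepiece.bpe.model
# if no tokenizer.json is present (requires the sentencepiece package).
tok = XLMRobertaTokenizerFast.from_pretrained("microsoft/layoutlmv3-base-chinese")

words = ["发票", "号码"]
boxes = [[10, 10, 50, 30], [60, 10, 110, 30]]  # one box per word

enc = tok(words, is_split_into_words=True)
bbox = []
for word_id in enc.word_ids():
    if word_id is None:              # special tokens (<s>, </s>) get a dummy box
        bbox.append([0, 0, 0, 0])
    else:                            # each subword inherits its word's box
        bbox.append(boxes[word_id])
enc["bbox"] = bbox
```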
Actually, I created a new tokenizer from LayoutLMv3Tokenizer that uses the spm model to tokenize text, rather than the two files:
```python
import json
from typing import Any, Dict, List, Optional

import sentencepiece as spm
from transformers import LayoutLMv3Tokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
from transformers.tokenization_utils_base import AddedToken

SPIECE_UNDERLINE = "▁"


class LayoutLMv3ChineseTokenizer(LayoutLMv3Tokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        errors="replace",
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        add_prefix_space=True,
        cls_token_box=[0, 0, 0, 0],
        sep_token_box=[0, 0, 0, 0],
        pad_token_box=[0, 0, 0, 0],
        pad_token_label=-100,
        only_label_first_subword=True,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        **kwargs
    ):
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
        # Mask token behaves like a normal word, i.e. includes the space before it
        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        super().__init__(
            vocab_file=vocab_file,
            merges_file=merges_file,
            errors=errors,
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            add_prefix_space=add_prefix_space,
            cls_token_box=cls_token_box,
            sep_token_box=sep_token_box,
            pad_token_box=pad_token_box,
            pad_token_label=pad_token_label,
            only_label_first_subword=only_label_first_subword,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )
        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.errors = errors  # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        # The BPE merges are not used; tokenization goes through the spm model instead.
        # with open(merges_file, encoding="utf-8") as merges_handle:
        #     bpe_merges = merges_handle.read().split("\n")[1:-1]
        #     bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
        #     self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        sp_model_file = kwargs["name_or_path"] + "/sentencepiece.bpe.model"
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(str(sp_model_file))
        self.cache = {}
        self.add_prefix_space = add_prefix_space
        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
        # self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
        # additional properties
        self.cls_token_box = cls_token_box
        self.sep_token_box = sep_token_box
        self.pad_token_box = pad_token_box
        self.pad_token_label = pad_token_label
        self.only_label_first_subword = only_label_first_subword

    def _tokenize(self, text: str) -> List[str]:
        # Delegate tokenization to the SentencePiece model instead of byte-level BPE.
        return self.sp_model.encode(text, out_type=str)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (strings for sub-words) into a single string."""
        out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
        return out_string
```
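A hypothetical usage sketch (all paths are placeholders; note the class as written still reads vocab.json and, via the parent class, merges.txt, and requires name_or_path so it can find sentencepiece.bpe.model):

```python
# Placeholder local directory expected to contain vocab.json, merges.txt,
# and sentencepiece.bpe.model; actual tokenization uses only the spm model.
model_dir = "layoutlmv3-base-chinese"
tokenizer = LayoutLMv3ChineseTokenizer(
    vocab_file=f"{model_dir}/vocab.json",
    merges_file=f"{model_dir}/merges.txt",
    name_or_path=model_dir,
)
encoding = tokenizer(["发票", "号码"], boxes=[[10, 10, 50, 30], [60, 10, 110, 30]])
```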
Maybe you could check whether this works, or just make some minor changes to LayoutLMv3Tokenizer so that it fits both languages?
@Dod-o
> layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer. Adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
>
> What is your transformers version? The newest transformers does not work.
Have you solved this problem?
> layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer. Adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
>
> What is your transformers version? The newest transformers does not work.
>
> Have you solved this problem?

I tried changing the code, but it didn't work. In the end I used unilm directly and it works: https://github.com/microsoft/unilm/tree/master/layoutlmv3
Put this tokenizer.json into the model directory. After AutoProcessor finishes loading, remember to load sentencepiece.bpe.model with sentencepiece and verify that what the processor loaded is correct.
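A rough sketch of that consistency check (the local path is a placeholder, and the apply_ocr flag is an assumption about how the processor is configured):

```python
import sentencepiece as spm
from transformers import AutoProcessor

# Placeholder local path containing tokenizer.json and sentencepiece.bpe.model.
model_dir = "layoutlmv3-base-chinese"

processor = AutoProcessor.from_pretrained(model_dir, apply_ocr=False)
sp = spm.SentencePieceProcessor()
sp.Load(f"{model_dir}/sentencepiece.bpe.model")

sample = "发票号码"
print(processor.tokenizer.tokenize(sample))   # tokens from the loaded processor
print(sp.encode(sample, out_type=str))        # tokens from the raw spm model
```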
Use LayoutLMv3TokenizerFast() instead of LayoutLMv3Tokenizer()
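For example (assuming a tokenizer.json is present in the checkpoint directory, e.g. the one mentioned above; the fast tokenizer is backed by tokenizer.json and does not need vocab.json/merges.txt):

```python
from transformers import LayoutLMv3TokenizerFast

tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base-chinese")
encoding = tokenizer(["发票", "号码"], boxes=[[10, 10, 50, 30], [60, 10, 110, 30]])
```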