
layoutlmv3-base-chinese tokenizer could not be loaded.

Open wlhgtc opened this issue 2 years ago • 7 comments

The Chinese version of LayoutLMv3 only provides sentencepiece.bpe.model, but it seems we need vocab.json and merges.txt to load the LayoutLMv3Tokenizer. So @Dod-o, could you provide a function to convert them, or confirm whether there is a difference between these two tokenizers?

wlhgtc avatar Jul 13 '22 10:07 wlhgtc

layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer; adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.
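
For reference, a minimal sketch of this workaround (the local path and example string are placeholders), assuming config.json now contains the "tokenizer_class": "XLMRobertaTokenizer" entry:

from transformers import AutoTokenizer

# With "tokenizer_class": "XLMRobertaTokenizer" in config.json, AutoTokenizer resolves to
# XLMRobertaTokenizer, which can load sentencepiece.bpe.model directly.
tokenizer = AutoTokenizer.from_pretrained("path/to/layoutlmv3-base-chinese")
print(tokenizer.tokenize("发票号码"))  # SentencePiece subwords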

Sanster avatar Jul 15 '22 03:07 Sanster

Same problem. How to solve it?

pogevip avatar Jul 26 '22 11:07 pogevip

layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer; adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.

What is your transformers version? The newest transformers does not work for me.

pogevip avatar Jul 26 '22 17:07 pogevip

@HYPJUDY It seems the XLMRobertaTokenizer defined in the config file only accepts text as input, while the LayoutLMv3Tokenizer accepts both text and bboxes as input, so it can add the bbox info to the tokenized result automatically. That would be more convenient.
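
For comparison, a minimal sketch of that convenience with the English checkpoint (the words and boxes here are made-up examples):

from transformers import LayoutLMv3Tokenizer

# The English checkpoint ships vocab.json/merges.txt, so LayoutLMv3Tokenizer loads directly
tokenizer = LayoutLMv3Tokenizer.from_pretrained("microsoft/layoutlmv3-base")
encoding = tokenizer(["hello", "world"], boxes=[[1, 2, 3, 4], [5, 6, 7, 8]])
print(encoding["bbox"])  # the word boxes are propagated to every subword automatically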

wlhgtc avatar Aug 01 '22 14:08 wlhgtc

layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer; adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.

What is your transformers version? The newest transformers does not work for me.

4.5.1

Sanster avatar Aug 01 '22 14:08 Sanster

Hi @wlhgtc, as mentioned above, the Chinese version of LayoutLMv3 uses XLMRobertaTokenizer; that is the difference between layoutlmv3-zh and layoutlmv3-en. This code provides an example of how to process the bbox ourselves.
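
For illustration only (this is not the linked code), a rough sketch of pairing OCR word boxes with XLMRobertaTokenizer subwords by hand; the words and boxes below are made up:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/layoutlmv3-base-chinese")

words = ["发票", "2022"]                             # OCR words
word_boxes = [[10, 10, 50, 30], [60, 10, 120, 30]]  # one box per word

input_ids, bboxes = [], []
for word, box in zip(words, word_boxes):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids.extend(ids)
    bboxes.extend([box] * len(ids))  # repeat the word box for each of its subwords

# wrap with special tokens and dummy boxes, as LayoutLMv3 expects
input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
bboxes = [[0, 0, 0, 0]] + bboxes + [[0, 0, 0, 0]]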

Dod-o avatar Aug 02 '22 06:08 Dod-o

Hi @wlhgtc, as mentioned above, the Chinese version of LayoutLMv3 uses XLMRobertaTokenizer; that is the difference between layoutlmv3-zh and layoutlmv3-en. This code provides an example of how to process the bbox ourselves.

Actually, I created a new tokenizer from LayoutLMv3Tokenizer that uses the spm model to tokenize text, rather than the two files (vocab.json and merges.txt).


import json
from typing import Any, Dict, List, Optional

import sentencepiece as spm
from transformers import LayoutLMv3Tokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
from transformers.tokenization_utils import AddedToken

SPIECE_UNDERLINE = "▁"

class LayoutLMv3ChineseTokenizer(LayoutLMv3Tokenizer):
    def __init__(
            self,
            vocab_file,
            merges_file,
            errors="replace",
            bos_token="<s>",
            eos_token="</s>",
            sep_token="</s>",
            cls_token="<s>",
            unk_token="<unk>",
            pad_token="<pad>",
            mask_token="<mask>",
            add_prefix_space=True,
            cls_token_box=[0, 0, 0, 0],
            sep_token_box=[0, 0, 0, 0],
            pad_token_box=[0, 0, 0, 0],
            pad_token_label=-100,
            only_label_first_subword=True,
            sp_model_kwargs: Optional[Dict[str, Any]] = None,
            **kwargs
    ):
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token

        # Mask token behave like a normal word, i.e. include the space before it
        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token

        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

        super().__init__(
            vocab_file=vocab_file,
            merges_file=merges_file,
            errors=errors,
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            add_prefix_space=add_prefix_space,
            cls_token_box=cls_token_box,
            sep_token_box=sep_token_box,
            pad_token_box=pad_token_box,
            pad_token_label=pad_token_label,
            only_label_first_subword=only_label_first_subword,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.errors = errors  # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}

        # with open(merges_file, encoding="utf-8") as merges_handle:
        #     bpe_merges = merges_handle.read().split("\n")[1:-1]
        # bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
        # self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

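        # load the SentencePiece model from the checkpoint directory (from_pretrained passes it as name_or_path)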
        sp_model_file = kwargs["name_or_path"] + "/sentencepiece.bpe.model"

        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(str(sp_model_file))

        self.cache = {}
        self.add_prefix_space = add_prefix_space

        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
        # self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

        # additional properties
        self.cls_token_box = cls_token_box
        self.sep_token_box = sep_token_box
        self.pad_token_box = pad_token_box
        self.pad_token_label = pad_token_label
        self.only_label_first_subword = only_label_first_subword

    def _tokenize(self, text: str) -> List[str]:
        return self.sp_model.encode(text, out_type=str)


    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (strings for sub-words) in a single string."""
        out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
        return out_string

Maybe you could check whether it works, or just make some minor changes to LayoutLMv3Tokenizer so that it fits both languages?
@Dod-o

wlhgtc avatar Aug 02 '22 08:08 wlhgtc

layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer; adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.

What is your transformers version? The newest transformers does not work for me.

Have you solved this problem?

YBAgg avatar Aug 27 '22 06:08 YBAgg

layoutlmv3-base-chinese uses XLMRobertaTokenizer as its tokenizer; adding "tokenizer_class": "XLMRobertaTokenizer" to https://huggingface.co/microsoft/layoutlmv3-base-chinese/blob/main/config.json works for me.

What is your transformers version? The newest transformers does not work for me.

Have you solved this problem? I tried changing the code, but it didn't work. In the end I used unilm directly and it works: https://github.com/microsoft/unilm/tree/master/layoutlmv3

pogevip avatar Aug 31 '22 15:08 pogevip

Put this tokenizer.json in the model directory. After loading with AutoProcessor, remember to also load sentencepiece.bpe.model with sentencepiece and check whether what the processor loaded is correct.
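
A minimal sketch of that check (the local directory path and the test string are placeholders), assuming tokenizer.json has been placed in the model directory so the processor loads:

import sentencepiece as spm
from transformers import AutoProcessor

model_dir = "path/to/layoutlmv3-base-chinese"  # local dir that also contains tokenizer.json
processor = AutoProcessor.from_pretrained(model_dir)

sp = spm.SentencePieceProcessor()
sp.Load(f"{model_dir}/sentencepiece.bpe.model")

word = "发票号码"
# LayoutLMv3 tokenizers require boxes, so pass a dummy one just to get the subwords
encoding = processor.tokenizer([word], boxes=[[0, 0, 0, 0]], add_special_tokens=False)
hf_tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
sp_tokens = sp.encode(word, out_type=str)
print(hf_tokens, sp_tokens)  # the two lists should match if tokenizer.json is consistent with the spm model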

aixuedegege avatar Apr 25 '23 09:04 aixuedegege

Use LayoutLMv3TokenizerFast() instead of LayoutLMv3Tokenizer()
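
For example, a minimal sketch (the path, words, and boxes are placeholders; this assumes a tokenizer.json is available, e.g. the one shared above placed in the model directory):

from transformers import LayoutLMv3TokenizerFast

tokenizer = LayoutLMv3TokenizerFast.from_pretrained("path/to/layoutlmv3-base-chinese")
encoding = tokenizer(["发票", "2022"], boxes=[[10, 10, 50, 30], [60, 10, 120, 30]])
print(encoding["bbox"])  # word boxes expanded to the subword level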

BarryRun avatar Aug 13 '23 14:08 BarryRun