paulpaul91
Hey friend, I'm running into the same problem. Have you solved it? Thank you.
> Thanks for the attention. Actually, we used some of the same optimization techniques as LayoutLMv2. You can refer to the paper. At the same time, based on the StructuralLM model,...
> Continued pre-training on the DocVQA set (train set and validation set) brings about 2.0+ ANLS. QG brings about 2.4+ ANLS. In addition, merging the train set and...
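For context, ANLS (Average Normalized Levenshtein Similarity) is the standard DocVQA metric that the gains above are quoted in; a "2.0+ ANLS" improvement means roughly two points on the ×100 leaderboard scale. Here is a minimal sketch of the standard definition with the usual 0.5 threshold (the sample data is made up):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic single-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """predictions: list[str]; gold_answers: list[list[str]], several
    acceptable answers per question. Similarities below tau score 0."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = max(
            1 - levenshtein(pred.lower(), g.lower()) / max(len(pred), len(g), 1)
            for g in golds
        )
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)

# Exact match scores 1.0; a one-character typo still scores 1 - 1/7.
print(anls(["$12.00", "invoce"], [["$12.00"], ["invoice"]]))  # ~0.929
```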
> ### Chinese character tokens vs. sentence-level tokenization
> A question about the preprocessing of the DocVQA-ZH dataset in model_zoo/ernie-layout/utils.py/Precessor.py/preprocess_mrc: the text of DocVQA-ZH consists of single Chinese characters (sentences are not tokenized as a whole), and the MRC preprocessing preprocess_mrc mentioned above also does not merge the characters into sentences before tokenizing. I'd like to know why. In other examples, such as the intelligent-document demo under application/, the OCR results are first split into lines, then split into characters, then merged back into complete sentences, and finally tokenized. Do these two approaches (tokenizing sentences vs. using characters directly as tokens) give the same results? Why use character tokens rather than word tokens?

The results are the same; no need to worry.
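The two preprocessing routes end up equivalent because BERT-style WordPiece tokenizers insert splits around every CJK character anyway. A minimal sketch of this, using Hugging Face's bert-base-chinese purely for illustration (not the actual ERNIE-Layout pipeline):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

ocr_chars = ["发", "票", "金", "额"]   # OCR output: one character per box
sentence = "".join(ocr_chars)          # merged back into a sentence

tokens_from_chars = [t for ch in ocr_chars for t in tokenizer.tokenize(ch)]
tokens_from_sentence = tokenizer.tokenize(sentence)

# Both yield ['发', '票', '金', '额']: CJK characters are split individually
# either way. Only non-CJK spans (digits, Latin words) can tokenize
# differently between the two strategies.
assert tokens_from_chars == tokens_from_sentence
```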
> I found the solution to this question: CopyNet.
> The code is [here](https://github.com/lspvic/CopyNet).
>
> But I wonder: its vocabulary size is fixed, and there's no array...
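For what it's worth, the fixed-vocabulary concern is usually handled with a per-batch extended vocabulary, as in the pointer-generator formulation (closely related to CopyNet's copy mode). A minimal NumPy sketch with made-up numbers, not code from the linked repo:

```python
import numpy as np

vocab_size = 6                       # fixed generation vocabulary
src_ext_ids = np.array([2, 6, 3])    # source token ids; 6 is a temporary
n_oov = 1                            # slot for this batch's single OOV word

gen_dist = np.full(vocab_size, 1.0 / vocab_size)  # generator softmax output
copy_attn = np.array([0.2, 0.7, 0.1])             # attention over source tokens
p_copy = 0.5                                      # copy/generate mixing gate

# The final distribution lives over vocab_size + n_oov entries.
extended = np.zeros(vocab_size + n_oov)
extended[:vocab_size] = (1 - p_copy) * gen_dist
np.add.at(extended, src_ext_ids, p_copy * copy_attn)  # scatter copy mass

print(extended[6])  # > 0: the OOV source word is now predictable by copying
```

The extended slots exist only for the current batch, so the model's output layer stays fixed-size while still being able to emit words it has never seen.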
> Cheers, thank you.
> Therefore, no text information is used for this dataset?