Minghao Li
1. We use the label of the whole word as the label of its first token. The remaining tokens are labeled with "CrossEntropyLoss().ignore_index", which will be ignored when computing the...
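As a rough illustration of this labeling scheme, here is a minimal sketch, not our actual training code. `toy_tokenize` is a hypothetical stand-in for a real WordPiece tokenizer, `align_labels` is an illustrative name, and the integer labels are made up. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
# -100 is the default value of torch.nn.CrossEntropyLoss().ignore_index.
IGNORE_INDEX = -100


def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer: splits every 3 characters."""
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    # Mark continuation pieces with "##", as WordPiece does.
    return pieces[:1] + ["##" + p for p in pieces[1:]]


def align_labels(words, word_labels, tokenize=toy_tokenize):
    """Give each word's label to its first token; mask the rest with IGNORE_INDEX."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # skip empty words so tokens and labels stay aligned
        tokens.extend(pieces)
        labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, labels


tokens, labels = align_labels(["DocBank", "is"], [2, 0])
print(tokens)  # ['Doc', '##Ban', '##k', 'is']
print(labels)  # [2, -100, -100, 0]
```

Because the masked positions carry the ignore index, only the first token of each word contributes to the loss.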
Our training code is also based on LayoutLM, so there is no plan to provide it.
We use the first method and pad the incomplete sequence with the padding tokens.
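A minimal sketch of the padding step, assuming a long token sequence is split into fixed-length windows and only the last, incomplete window is padded. The function name `chunk_and_pad`, the window length, and the pad id of 0 are illustrative assumptions, not values taken from our code.

```python
def chunk_and_pad(token_ids, max_len, pad_id=0):
    """Split token_ids into windows of max_len and pad the last one.

    Returns a list of (input_ids, attention_mask) pairs; the mask is 1 for
    real tokens and 0 for padding.
    """
    chunks = []
    for start in range(0, len(token_ids), max_len):
        chunk = token_ids[start:start + max_len]
        n_pad = max_len - len(chunk)
        mask = [1] * len(chunk) + [0] * n_pad
        chunks.append((chunk + [pad_id] * n_pad, mask))
    return chunks


for input_ids, attention_mask in chunk_and_pad([5, 6, 7, 8, 9], max_len=3):
    print(input_ids, attention_mask)
```

The attention mask lets the model ignore padded positions; their labels would normally be set to the ignore index as well.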
We do mark the semantic structure "date" during the data construction process, but because the proportion of "date" is too small, we do not include it in the paper.
Yes. We treat '##LTLine##' as a special token during training and prediction.
In fact, we did not add them to the vocabulary. They will also be tokenized into tokens and labeled in the way I mentioned at #25.
> The link https://doc-analysis.github.io/docbank-page/index.html cannot be downloaded. How can I fix this? I want to use the dataset for layout analysis.

Is it that you cannot access the web page, or that you cannot download the file? If it's the latter, please provide the name of the file.
Please check your network; this link has been tested and is accessible.
@taosong2019 We have moved the data and the models to Azure blob just now, try to download them again.