zijunsun


Results: 24 comments of zijunsun

Code for processing raw data

Thanks for sharing the code. However, we ran into the same problem: without the bio preprocessing code for the raw data, we are not quite sure how to generate node id, edge attr and so...

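Pre-training details

For the very large pre-training data, we used MMap. We first convert the data to a binary format and store it on disk via MMap; during training, we use MMap to look up the location of the required data on disk and load it dynamically. Only the index of all the data is kept in memory: we shuffle the index, select the indices for each batch, and load the corresponding data from disk on the fly during training. For details, see the fairseq implementation: [fairseq/fairseq/data/indexed_dataset.py](https://github.com/pytorch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/fairseq/data/indexed_dataset.py)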
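The idea can be sketched with Python's standard `mmap` module. This is a minimal illustration, not the fairseq or ChineseBERT code: the names `write_dataset`, `MMapDataset` and `batches`, and the int32 on-disk layout, are assumptions made for the example.

```python
import mmap
import os
import random
import struct
import tempfile

INT32_SIZE = struct.calcsize("<i")  # each token id is stored as a little-endian int32


def write_dataset(path, samples):
    """Write samples as raw int32 values; return one (offset, length) pair per sample."""
    index = []
    offset = 0
    with open(path, "wb") as f:
        for sample in samples:
            f.write(struct.pack(f"<{len(sample)}i", *sample))
            index.append((offset, len(sample)))
            offset += len(sample) * INT32_SIZE
    return index


class MMapDataset:
    """Keeps only the index in memory; sample bytes are read from disk on access."""

    def __init__(self, path, index):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._index = index

    def __len__(self):
        return len(self._index)

    def __getitem__(self, i):
        offset, length = self._index[i]
        return list(struct.unpack_from(f"<{length}i", self._mm, offset))

    def close(self):
        self._mm.close()
        self._file.close()


def batches(dataset, batch_size, seed=0):
    """Shuffle the sample indices, then load each batch from disk lazily."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    demo_index = write_dataset(demo_path, [[1, 2, 3], [4, 5], [6]])
    demo = MMapDataset(demo_path, demo_index)
    for batch in batches(demo, batch_size=2):
        print(batch)
    demo.close()
```

Only the `(offset, length)` pairs are held in memory, so memory use grows with the number of samples rather than with the total size of the data.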

run ChnSetiCorp_trainer.py

Please post a screenshot of the specific error.

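About CPU

LCQMC_trainer.py uses the [pytorch_lightning](https://pytorch-lightning.readthedocs.io/en/latest/) framework, so its arguments follow that framework's requirements. Lines 52 and 149 may need corresponding changes. We recommend using a GPU; training on CPU will be very slow.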

About CPU

Yes, that works, but the code also needs some corresponding changes.

About CPU

On CPU, try setting num_workers to 0.

The version conflict of pytorch-lightning

Please check requirement.txt to make sure you have installed the right version.

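About the tokenizer

This is actually fairly simple. Previously each token mapped to one input id; now each token maps to both an input id and a pinyin id. Nothing is trained here; it is just a dictionary lookup. The mapping dictionaries are here: [link](https://huggingface.co/ShannonAI/ChineseBERT-base/tree/main/config)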
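As a rough sketch of that lookup: the two dictionaries below are toy values invented for illustration, and `encode` is a hypothetical helper, not the ChineseBERT tokenizer API. The real mappings are in the config files linked above.

```python
# Toy mappings, invented for this example; the real ones are in the linked config files.
TOKEN_TO_INPUT_ID = {"[UNK]": 0, "我": 1, "喜": 2, "欢": 3, "猫": 4}
TOKEN_TO_PINYIN_ID = {"[UNK]": 0, "我": 11, "喜": 12, "欢": 13, "猫": 14}


def encode(tokens):
    """Look up an input id and a pinyin id for every token; no model is involved."""
    unk_input = TOKEN_TO_INPUT_ID["[UNK]"]
    unk_pinyin = TOKEN_TO_PINYIN_ID["[UNK]"]
    input_ids = [TOKEN_TO_INPUT_ID.get(token, unk_input) for token in tokens]
    pinyin_ids = [TOKEN_TO_PINYIN_ID.get(token, unk_pinyin) for token in tokens]
    return input_ids, pinyin_ids


if __name__ == "__main__":
    print(encode(["我", "喜", "欢", "猫"]))
```

Both output lists have the same length as the token list, so each position carries one input id and one pinyin id.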

How to configure parameters for a single GPU

Please post a screenshot of the error.

How to configure parameters for a single GPU

Single-GPU parameter: "--gpus=0,". Note that the 0 must be followed by a comma.
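The comma matters because, in pytorch_lightning versions that accept a `gpus` argument, a bare number is read as a count of GPUs while a comma-separated string is read as a list of device indices. The function below is a simplified illustration of that distinction, assuming this behaviour applies to the version pinned by the repository; `parse_gpus` is a hypothetical name, not pytorch_lightning's actual parser.

```python
def parse_gpus(value):
    """Simplified illustration only; not pytorch_lightning's real parsing code."""
    text = str(value).strip()
    if "," in text:
        # "0," or "0,1": explicit device indices; empty pieces are ignored.
        return [int(part) for part in text.split(",") if part.strip()]
    # "0" or "2": a count of GPUs, so 0 means no GPU at all.
    return int(text)


if __name__ == "__main__":
    print(parse_gpus("0"))   # a count of zero GPUs
    print(parse_gpus("0,"))  # the single device with index 0
```

Under this reading, "--gpus=0" requests zero GPUs, whereas "--gpus=0," selects the device with index 0.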
