yuxin.wang comments

Results: 78 comments of yuxin.wang

Unexpected result when parsing authentication in get_gzh_by_search()

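With the previous implementation, this field now always comes back as "\n", even though it does have a value. The xpath query returns ["\n", "some value about authentication "], but we only take the item at index 0, which produces the unexpected result. I suggest changing it to the form below, which is more robust. It would also be better not to use the get_first_of_element API when extracting this field, since it is too rigid. `authentication = get_first_of_element(li, './/i[@class="identify"]/parent::dd/text()[2]')`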
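One way to avoid depending on a fixed index is to skip whitespace-only text nodes and take the first one that has content. The sketch below works on a plain list of strings, as an xpath `text()` query would return; `first_non_blank` is a hypothetical helper name for illustration and is not part of the library.

```python
def first_non_blank(nodes, default=""):
    """Return the first text node with non-whitespace content, stripped.

    `nodes` is assumed to be a list of strings, e.g. the result of an
    xpath `text()` query. Hypothetical helper, not an existing API.
    """
    for node in nodes:
        text = node.strip()
        if text:
            return text
    return default


# The values reported above: index 0 is only a newline.
authentication = first_non_blank(["\n", "some value about authentication "])
```

This keeps working if the page layout shifts the value to a different text node, at the cost of assuming the first non-blank node is the one wanted.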

Error in program: No valid option generated in #select!

> Thanks! Clearly we need some non-english based unit tests. A PR would be much appreciated! (and if you do send in a PR can you also make sure whatever...

The code crashes partway through a run: CUDA out of memory

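Texts are truncated internally, but uniem still allocates memory dynamically, so a single very long sample in the data can cause an OOM partway through a run. Consider reducing batch_size or manually balancing the text lengths.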
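As a rough illustration of balancing text lengths by hand, the sketch below groups texts so that long samples end up in smaller batches. It uses character count as a stand-in for token count, and `budgeted_batches`, `max_batch_size`, and `max_padded_chars` are hypothetical names, not uniem parameters.

```python
def budgeted_batches(texts, max_batch_size, max_padded_chars):
    """Yield batches whose padded size stays under a budget.

    Padded size is approximated as (longest text in batch) * (batch length),
    with character length assumed as a proxy for token count.
    """
    batch = []
    for text in sorted(texts, key=len):
        # Texts arrive in ascending length, so `text` would be the longest
        # item if it were added to the current batch.
        padded = len(text) * (len(batch) + 1)
        if batch and (len(batch) >= max_batch_size or padded > max_padded_chars):
            yield batch
            batch = []
        batch.append(text)
    if batch:
        yield batch


batches = list(budgeted_batches(["a" * 10, "b", "cc", "d" * 8], 3, 10))
```

Because the batches come out sorted by length, shuffling the list of batches before training is probably a good idea. Note also that batch size affects the number of in-batch negatives, so this changes the training signal slightly.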

Question

Negatives are sampled from the other samples in the same batch, and the model is trained with contrastive learning.
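To make the idea concrete, here is a minimal pure-Python sketch of a contrastive loss with in-batch negatives: for each query, its own paired text is the positive and every other text in the batch is treated as a negative. This is a generic InfoNCE-style formulation, not necessarily the exact loss used in the library, and the `temperature` value is an assumption.

```python
import math


def in_batch_contrastive_loss(queries, positives, temperature=0.05):
    """Mean cross-entropy over cosine similarities.

    positives[i] is the positive for queries[i]; every positives[j] with
    j != i acts as a negative. The temperature default is an assumed value.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


aligned = in_batch_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = in_batch_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is small when each query is closest to its own positive and large when it is closer to another item in the batch.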

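Question

Most likely not, unless a batch happens to contain a positive sample among the negatives. Interference that rare has little real effect.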

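Checkpoint model cannot be loaded

A checkpoint stores the weights and is designed for resuming training, whereas a complete model also includes the tokenizer, the model config, and other files, and is designed for loading and inference. Put simply, the two serve different purposes and are different things, so their file contents differ.

Checkpoint model cannot be loaded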

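Q: I want to save both at the same time. Is there a parameter for that?
A: There is currently no parameter that controls this behavior.
Q: Can I just put the .bin file into the fine-tuned model's files, replacing the old .bin?
A: You only need to replace pytorch_model.bin; the other two files are only needed at "runtime".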
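The replacement described in the last answer could be scripted roughly as follows. This is only a sketch: `replace_model_weights` is a hypothetical helper, and it assumes the checkpoint directory contains a weights file with the same name, `pytorch_model.bin`, which may not match your checkpoint layout.

```python
import shutil
from pathlib import Path


def replace_model_weights(checkpoint_dir, model_dir, weights_name="pytorch_model.bin"):
    """Copy the checkpoint weights over the weights in a complete model directory.

    Tokenizer and config files in `model_dir` are left untouched.
    Assumes the checkpoint stores its weights under `weights_name`.
    """
    source = Path(checkpoint_dir) / weights_name
    target = Path(model_dir) / weights_name
    if not source.is_file():
        raise FileNotFoundError(f"no {weights_name} in {checkpoint_dir}")
    shutil.copy2(source, target)
    return target
```

Backing up the original `pytorch_model.bin` first is advisable, since the copy overwrites it.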

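ONNX conversion question

token_type_ids does not need to be passed as input, and getting multiple outputs is normal. token_embeddings (the encoding of each token) can be ignored; just use sentence_embedding (the encoding of the whole sentence).

Error on single-machine multi-GPU runs: has parameters that were not used in producing loss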

You need to configure the Accelerator. The error message is actually quite explicit. Make the following change in the original training code:

```
# add this import
from accelerate import DistributedDataParallelKwargs

accelerator = Accelerator(
    mixed_precision=mixed_precision.value,
    gradient_accumulation_steps=gradient_accumulation_steps,
    project_config=project_config,
    log_with=['tensorboard'] if use_tensorboard else None,
    dispatch_batches=True,
    split_batches=True,
    # add the line below
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)],
)
```
...

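Error on single-machine multi-GPU runs: has parameters that were not used in producing loss

I followed the documentation and never dug into it. I mainly use FSDP now, so I have not paid much attention to DDP.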