Jiarui Fang(方佳瑞)
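### Proposal

## Motivation

The ChunkManager is currently attached to a single PyTorch model. This has two limitations:

1. It cannot handle training multiple models with Gemini, because the models' use of heterogeneous memory interferes with one another, so the information each model collects during warmup is no longer a useful guide.
2. More importantly, the usage differs from PyTorch's. As shown below, the optimizer definition must take a model as an argument.

```
model = zero_model_wrapper(model, zero_stage, gemini_config)
# the model itself has to be passed in as the first argument
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```

In PyTorch, by contrast, Optimizer initialization has nothing to do with the model (https://pytorch.org/docs/stable/optim.html). In most use cases the optimizer is built from model.parameters(), but the second form in the code below, for example, is currently not supported by Gemini.

```
optimizer = optim.SGD(model.parameters(), lr=0.01,...
```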
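To illustrate the point that a PyTorch optimizer is initialized independently of any model, here is a minimal sketch that builds `optim.SGD` from two bare tensors with no `nn.Module` involved. It is not the example from the truncated excerpt above; the tensors `w` and `b` and the toy loss are assumptions chosen only for illustration.

```python
import torch
from torch import optim

# Two standalone leaf tensors; no nn.Module is involved anywhere.
w = torch.ones(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# The optimizer only needs an iterable of tensors (or of parameter-group dicts).
optimizer = optim.SGD([w, b], lr=0.01)

# Toy loss: (sum(w) + b)^2. With w = 1 and b = 0 the gradient is 6 for every element.
loss = ((w.sum() + b) ** 2).sum()
loss.backward()
optimizer.step()  # plain SGD update: p <- p - lr * grad
```

An optimizer wrapper that requires the model as an argument cannot express this pattern, which is the mismatch the proposal describes.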
## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...
Dear authors, Thank you for the awesome work. While trying to understand some implementation details, I came across a small question. I am unsure about the meaning of the following two lines....
1. Install sentencepiece from the GitHub repo. I cannot run the .zip version on macOS.
2. Create some necessary directories during `make`.
3. Cache the wiki json.gz if has...
In this PR, we can run `python -m cc_net --config config/test_segment.json` successfully in the following directory. data_prep/cc/cc_net/cc_net depends on #36
I ran benchmark.py and got the following warning. `python benchmark.py --arch resnet18 --device cuda:0` Parsing Computation Graph with torch.jit failed, revert to manual parse_graph function
### 🐛 Describe the bug As a place to show the best practice for users, I believe it is necessary to help users to skip the annoying dataset preparation stage....
Hello, thanks for the wonderful project. Did you consider aligning the results with some commonly used ones? https://github.com/mlcommons/training https://github.com/Oneflow-Inc/DLPerf
I have noticed that LightLLM currently seems to only support decoding through **sampling**. Additional decoding methods such as **BeamSearch** and **GreedySearch** are not yet supported. I would like to know...
I fixed the gym error. However, another error occurs. ```` [ERROR:640844 training:471 2022-10-12 11:16:25,954] Exception in worker process 0 Traceback (most recent call last): File "/home/lcfjr/codes/autoshard/autoshard/training.py", line 437, in act...