Jack Chen
My notes: https://github.com/Jack47/hack-SysML/blob/master/papers/ZeRO-Offload.md
I'm reading this paper too; maybe we can share notes and discuss it together.
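> > The first author of this paper is a senior PhD student from BUPT (Beijing University of Posts and Telecommunications). Reportedly there are plans to open-source it, which is something to look forward to.
> >
> > The paper is built on the following key insights:
> >
> > * Deep learning is feedback-driven exploration: users often run a batch of training jobs and keep the one with the best result. This can be understood as scenarios such as hyperparameter search or model architecture search.
> > * Heterogeneity in resource usage makes it hard to reach an optimal solution.
> > * Intra-job predictability: this is a key concept throughout the paper. As shown in the figure below, GPU memory usage exhibits a certain periodicity.

The content was not fully uploaded.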
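I reread this over the past couple of days. The paper is really impressive, and it is probably standard practice for large-model training by now. The main appeal is that it is easy to use: it fits seamlessly with existing approaches such as DDP, and, compared with large-model training approaches such as MP and PP, researchers do not need to modify the model, so adoption is painless. Its compute efficiency is also high.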
So do we still need this workaround? I found that when signal 11 is received, the health check still reports OK 😂
For example: the Python backend process hangs with this signal 11 error, but Triton readiness still reports OK, so requests keep coming in ``` DownCropResizer is doing nothing! {"pod_name": "sd15-triton-5cc495c8cc-zjhmx", "namespace": "production",...
We really need this to allow fast development and quick feedback by running unit tests locally. The Docker image for Triton Inference Server is very large and not friendly for the...
@Tabrizian Yes, that would be a great help. Could you share a link where we can track the progress of the feature request?