Jack Chen comments

Results: 28 comments by Jack Chen

ZeRO-Offload: Democratizing Billion-Scale Model Training

My notes: https://github.com/Jack47/hack-SysML/blob/master/papers/ZeRO-Offload.md

Gandiva: Introspective Cluster Scheduling for Deep Learning

I'm reading this paper too, maybe we can share and discuss together.

Gandiva: Introspective Cluster Scheduling for Deep Learning

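> > The first author of this paper is a PhD student from BUPT (Beijing University of Posts and Telecommunications), a senior of mine. Reportedly there are plans to open-source it, which is something to look forward to.
> >
> > The paper builds on the following key insights:
> >
> > * Deep learning is feedback-driven exploration: users often launch a batch of training jobs and keep the one with the best result. This is similar to scenarios such as hyperparameter search or model architecture search.
> > * Heterogeneity in resource usage makes it hard to reach an optimal solution.
> > * Intra-job predictability: this is one of the key concepts of the whole paper. As the figure below shows, GPU memory usage follows a certain periodic pattern.
> >
> > ![screenshot from...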

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

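👍 I have read it before as well and summarized it here: https://github.com/Jack47/hack-SysML/blob/master/papers/ZeRO.md. My first read was fairly rough, and I will probably go through it again soon. I'll refer to your notes afterwards 👏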

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

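> I was wondering why nobody had proposed a design like this before. @VoVAllen suggested that data parallelism used to be applied only to small models, so there was no such need. That sounds reasonable to me. The need only arose after the Transformer came out.
>
> Another question is whether this work could be applied to the recommendation domain. That still needs investigation.

Besides models getting larger (larger models have been found to give better results), the steady growth of datasets is another reason. The ZeRO series mainly addresses running out of GPU memory under data parallelism, by sharding the optimizer states, parameters, and so on. In the recommendation domain, it seems everything uses a Parameter Server style architecture with asynchronous updates. Such models already have hundreds of millions of features, so they are sharded to begin with. That differs from synchronous-update (all-reduce) approaches like DDP and is a different architecture within machine learning.

WeChat's recent PatrickStar makes some improvements on the sharding in Microsoft's DeepSpeed. I looked at it a while ago, but my [notes](https://github.com/Jack47/hack-SysML/blob/master/memory-efficiency/patrickstar.md) are not fully uploaded yet.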
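
To make the sharding point concrete, here is a rough back-of-the-envelope sketch of per-GPU model-state memory under data parallelism. It assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer states) and ignores activations and buffers. The function name `zero_memory_gb` and its parameters are made up for illustration; they are not part of DeepSpeed.

```python
def zero_memory_gb(num_params, num_gpus, stage, k=12):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam, i.e. 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and k bytes/param of optimizer state
    (fp32 weights + momentum + variance gives k = 12).
    stage 0 = plain data parallelism (everything replicated),
    stage 1 = shard optimizer states,
    stage 2 = also shard gradients,
    stage 3 = also shard parameters.
    """
    params = 2 * num_params
    grads = 2 * num_params
    opt = k * num_params
    if stage >= 1:
        opt /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + opt) / 1e9


# Example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

With these assumptions, plain data parallelism needs about 120 GB per GPU for model states alone, while sharding everything (stage 3) brings it down to under 2 GB, which is why the approach only became interesting once models stopped fitting on a single device.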

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

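I re-read this over the last couple of days. The paper is really impressive and is probably standard equipment for large-model training by now. The main reason is that it is very convenient to use: it plugs seamlessly into existing DDP-style training, and compared with other large-model training approaches such as MP and PP, researchers do not need to change their model, so adoption is painless. Its compute efficiency is also high.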

health check should not say it's ready when cuda device-side assertion error is triggered

So we still need this workaround? I found that when signal 11 is received, the health check still reports OK 😂

health check should not say it's ready when cuda device-side assertion error is triggered

For example: the Python backend process hangs with this signal 11 error, but Triton readiness still reports OK, so requests keep coming in.

```
DownCropResizer is doing nothing!
{"pod_name": "sd15-triton-5cc495c8cc-zjhmx", "namespace": "production",...
```

Install Python Backend via pip locally

We really need this to allow fast development and quick feedback by running unit tests locally. The Docker image for Triton Inference Server is very large and not friendly for the...

Install Python Backend via pip locally

@Tabrizian yes, it would be a great help. Could you share a link that shows the progress of this feature request?