Houmin Wei comments

Results: 23 comments of Houmin Wei

cannot ssh to vm when in masquerade mode

I ran into the same problem on kernel 3.10. After I updated the node OS to Ubuntu 18.04 (kernel 4.15), it works fine.

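After finishing the configuration, clicking /blog/movies/ automatically opens a new about:blank#blocked page by default, while the original page displays normally. I don't know what is going on.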

Seconded, I am seeing the same behavior. The browser I use is Chrome.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

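> The challenge for GPU memory. Even the main memory of the largest GPU cannot hold the parameters of these models. For example, a 175B GPT-3 model needs (175B * 4 bytes) = 700 GB for the model parameters, so the gradients also take 700 GB and the optimizer states take 1400 GB, 2.8 TB in total.

This needs to be understood from the algorithm's point of view: why does a 175B GPT-3 need 2.8 TB of memory, and how is that number calculated?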
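The 2.8 TB figure in the quote can be reproduced with simple arithmetic. Below is a minimal sketch, assuming every value is stored in fp32 (4 bytes) and that the optimizer keeps two extra values per parameter, as Adam does with its two moment estimates. The quote does not name the optimizer, so that part is an assumption, and the function name is purely illustrative.

```python
def training_memory_gb(num_params, bytes_per_value=4, optimizer_values_per_param=2):
    """Estimate memory for weights, gradients, and optimizer state, in decimal GB.

    Assumptions (not stated in the quote): fp32 everywhere, and an optimizer
    that stores `optimizer_values_per_param` extra values per parameter.
    """
    gb = 10**9
    params = num_params * bytes_per_value / gb      # model weights
    grads = params                                  # one gradient per weight
    optimizer = optimizer_values_per_param * params # e.g. two Adam moments
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total": params + grads + optimizer,
    }


breakdown = training_memory_gb(175 * 10**9)
print(breakdown)
```

Under these assumptions the breakdown is 700 GB + 700 GB + 1400 GB. Activations are not included, matching the quote, which counts only parameters, gradients, and optimizer states.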

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

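> Data-parallel scaling usually works well, but it has two limitations: a) beyond a certain point, the per-GPU batch size becomes too small, which lowers GPU utilization and increases communication cost; b) the maximum number of devices that can be used is the batch size, which limits the number of accelerators available for training.

Why do these two limitations hold?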
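Both limitations follow from how a fixed global batch is divided among data-parallel replicas. The sketch below assumes the global batch size is held constant while the number of GPUs grows; the function name is hypothetical.

```python
def split_batch(global_batch, num_gpus):
    """Split a fixed global batch across data-parallel replicas.

    Returns the per-GPU batch sizes. Each replica needs at least one
    sample, so num_gpus cannot exceed global_batch (limitation b).
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if num_gpus > global_batch:
        raise ValueError("more GPUs than samples: some replicas would be idle")
    base, extra = divmod(global_batch, num_gpus)
    # The first `extra` replicas take one additional sample.
    return [base + 1 if i < extra else base for i in range(num_gpus)]


for n in (8, 64, 512):
    sizes = split_batch(512, n)
    print(n, "GPUs ->", sizes[0], "samples on the first GPU")
```

As the GPU count grows, each replica does less computation per step while still exchanging gradients for the full model, which is the utilization and communication problem in a). Once every replica holds a single sample, no further GPUs can be added, which is b).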

[Bug] MathJax layout problem in Safari

Hi, I hit a similar bug with KaTeX display in Safari; here is the [demo for the bug](https://codepen.io/librabyte/pen/MWBrWjR). I am not sure whether it is the same issue.

GaiaGPU: Sharing GPUs in Container Clouds

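Background: containers are widely used in cloud computing because they are lightweight and scalable, and GPUs are widely used to accelerate computation in scenarios such as deep learning, so how to share GPU resources among containers and raise GPU utilization has been studied extensively. GaiaGPU partitions a physical GPU into several virtual GPUs to isolate and share GPU memory and compute resources.

The GaiaGPU implementation has two main parts: the Kubernetes part and the vCUDA part.

- The Kubernetes part builds on the Kubernetes Extended Resources, Device Plugin, and Scheduler Extender mechanisms and implements the following two projects:
  - [GPU Manager](https://github.com/tkestack/gpu-manager): implemented as a Device Plugin. Compared with NVIDIA's [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin), it needs no extra `nvidia-docker2` configuration and uses the native...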

GaiaGPU: Sharing GPUs in Container Clouds

Next, these components are analyzed one by one, starting with GPU Manager. It is in fact a Device Plugin, responsible for creating vGPUs and communicating with kubelet. If you are not familiar with Device Plugins, read [this](https://houmin.cc/posts/3f069334/) first.

![image](https://user-images.githubusercontent.com/16812977/99637062-46aeb500-2a7f-11eb-940d-560ae307da24.png)

- Unlike Alibaba's [GPUShare](https://github.com/AliyunContainerService/gpushare-scheduler-extender), what GPU Manager returns to kubelet from `ListAndWatch` is `a list of vGPUs`, not the actual GPU devices.
- A GPU is virtualized along two resource dimensions, memory and computing resource
  - memory: 256M of memory is one unit, each memory...
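To make the memory dimension concrete, here is a minimal sketch of how one physical GPU could be advertised as a list of 256M memory units. It assumes "256M" means 256 MiB; the function names and the device ID format are hypothetical and are not taken from the GPU Manager source.

```python
def memory_units(gpu_memory_mib, unit_mib=256):
    """Number of whole memory units a GPU provides (assumes 256 MiB units)."""
    return gpu_memory_mib // unit_mib


def list_vgpu_memory_devices(gpu_index, gpu_memory_mib, unit_mib=256):
    """Build one virtual device ID per memory unit.

    The ID format is illustrative only; a real Device Plugin chooses its own.
    """
    count = memory_units(gpu_memory_mib, unit_mib)
    return [f"gpu{gpu_index}-mem-{i}" for i in range(count)]


# A GPU with 16 GiB (16384 MiB) of memory, advertised as 256 MiB units.
devices = list_vgpu_memory_devices(0, 16384)
print(len(devices), devices[:2])
```

A list like `devices` is the kind of thing that stands in for "a list of vGPUs" here: kubelet sees many small schedulable units rather than one physical card.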

GaiaGPU: Sharing GPUs in Container Clouds

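Next is vGPU management, which corresponds to the vGPU Manager and vGPU Library in the paper; the vGPU Library is actually implemented by [vcuda-controller](https://github.com/tkestack/vcuda-controller). vGPU Manager ended up as part of the [GPU Manager](https://github.com/tkestack/gpu-manager) project and runs on every Node as a DaemonSet. When a container requests GPU resources, the GPU Manager shown in the paper's figure sends the container's configuration, such as the amount of GPU resources requested and the container's name, to vGPU Manager. After receiving the configuration, vGPU Manager creates a unique directory on the Host for this container, named after the container, and puts this directory into the AllocateResponse, which is finally returned to kubelet. vGPU...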
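The directory step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the base directory, and the returned dictionary are hypothetical stand-ins, and the dictionary is not the real AllocateResponse type from the Kubernetes Device Plugin API.

```python
import os
import tempfile


def prepare_container_dir(base_dir, container_name):
    """Create a per-container directory on the host, named after the container.

    Returns a plain dict standing in for the part of the response that
    carries the directory back to the caller (hypothetical structure).
    """
    if not container_name or os.sep in container_name:
        raise ValueError("container name must be a single path component")
    path = os.path.join(base_dir, container_name)
    os.makedirs(path, exist_ok=True)  # idempotent if the directory exists
    return {"container": container_name, "host_dir": path}


# Demonstration in a throwaway base directory.
with tempfile.TemporaryDirectory() as base:
    response = prepare_container_dir(base, "trainer-0")
    created = os.path.isdir(response["host_dir"])
    print(response["container"], created)
```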

GaiaGPU: Sharing GPUs in Container Clouds

Next comes the most critical part, the implementation of the vCUDA Library, which isolates resources by intercepting CUDA API calls. The intercepted APIs are listed in the table below. ![image](https://user-images.githubusercontent.com/16812977/99669840-a2426800-2aaa-11eb-8f75-5837b46204ec.png) The question here is: how does the vCUDA Library get injected into the container? ``` Host | Container | | .-----------. | | allocator |----------. | ___________ '-----------' PodUID | | \ \ v |...

GaiaGPU: Sharing GPUs in Container Clouds

I have written up my analysis of this part of the paper in a [blog post](https://houmin.cc/posts/cf391335/).