Xiaoyu Xu
> when I set debug(0), there is no special log printed on the screen

Try to set debug(2); logs are expected to show at the first call of nn.Graph.

> when...
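For reference, a minimal sketch of turning on verbose logging for an nn.Graph, assuming a simple Linear module (the LinearGraph class name and tensor shapes here are illustrative, not from the original report):

```python
import oneflow as flow
import oneflow.nn as nn

class LinearGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

model = nn.Linear(3, 4)
graph = LinearGraph(model)
graph.debug(2)  # verbosity level 2; debug logs print at the first graph call
y = graph(flow.randn(2, 3))
```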
Is your code on GitHub? We need to see the code to figure out what is happening.
> When this object is created, all the memory needed for the computation is allocated at once, and it is released all together on destruction, since the computation is pure C++ and no memory allocation happens during the whole process.

You mentioned earlier that inference has a dynamic-shape issue; does it allocate memory based on the max shape?
About this set/get_global_default_device:
- Does torch have a counterpart?
- What does "global" mean here? Since we already have global tensor, reusing the word "global" needs to be thought through.
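For comparison, recent PyTorch versions do provide a default-device setter, torch.set_default_device; a minimal sketch of its behavior (the device names are illustrative):

```python
import torch

# torch.set_default_device makes newly created tensors land on the given
# device unless a device is passed explicitly.
torch.set_default_device("cuda:0")

x = torch.randn(2, 2)              # allocated on cuda:0
y = torch.zeros(2, device="cpu")   # an explicit device still wins
```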
Also, related to global tensor: we previously provided a global mode, which looks like the same class of feature (global mode is for global tensor), so it is worth considering them together: https://oneflow.readthedocs.io/en/master/generated/oneflow.utils.global_view.global_mode.html#oneflow.utils.global_view.global_mode
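A minimal sketch of using the global_mode linked above, assuming a two-rank CUDA placement with broadcast sbp (the placement/sbp values are illustrative; see the linked docs for the exact signature):

```python
import oneflow as flow
from oneflow.utils.global_view import global_mode

placement = flow.placement("cuda", ranks=[0, 1])

# Inside the scope, source ops produce global tensors with the given
# placement/sbp instead of local tensors.
with global_mode(True, placement=placement, sbp=flow.sbp.broadcast):
    x = flow.ones(2, 2)
    print(x.is_global)  # True
```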
This idea is nice; I have heard of work like this: https://medium.com/rapids-ai/pytorch-rapids-rmm-maximize-the-memory-efficiency-of-your-workflows-f475107ba4d4 But this is not a bottleneck at the moment; we haven't seen tasks limited by it. So this...
> it takes 5 minutes to compile a deep LSTM net

Is this a single-device task or a multi-device task? The time cost of each compilation stage can be shown...
```
I20230425 00:48:47.266355 20443 cost_util.h:98] [count log]{"loc":"[GraphCompile]Graph_0 OptimizationLogicalGraph","mem_rss":"11621.000000 MB","time_cost":"433 seconds"}
I20230425 00:49:50.931952 20443 cost_util.h:98] [count log]{"loc":"[GraphCompile]Graph_0 AlignStates","mem_rss":"11623.000000 MB","time_cost":"47 seconds"}
I20230425 00:54:51.325140 20443 cost_util.h:98] [count log]{"loc":"[GraphCompile]Graph_0 CompleteJob","mem_rss":"11606.000000 MB","time_cost":"300 seconds"}
```
It...
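To make logs like these easier to compare, a small hypothetical helper can pull out the per-stage time and memory; the parse_count_logs name and the log file path are assumptions, while the "[count log]" JSON payload format matches the sample above:

```python
import json
import re

# Matches the JSON payload that follows "[count log]" in each line.
COUNT_LOG = re.compile(r"\[count log\](\{.*\})")

def parse_count_logs(path):
    """Yield (loc, time_cost, mem_rss) for every count-log line in the file."""
    with open(path) as f:
        for line in f:
            m = COUNT_LOG.search(line)
            if m:
                entry = json.loads(m.group(1))
                yield entry["loc"], entry["time_cost"], entry["mem_rss"]

for loc, time_cost, mem_rss in parse_count_logs("oneflow.INFO"):
    print(f"{loc}: {time_cost} (rss {mem_rss})")
```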
Op graph log with nccl logical ops and sbp: https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/2n4g_log/output.log Search for `Operator` to find the starting point of the op graph.
What is the relationship between this and the previous branch? Does it need to be benchmarked? And does it improve performance?