OneFlow does not yet support nn.PairwiseDistance() or Variable

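OneFlow does not yet support nn.PairwiseDistance() or Variable, so I used torch.nn.PairwiseDistance() instead and also took Variable from torch. Does mixing the two affect the training results or the training speed?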

Aug 04 '22 03:08 wjy3326

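We have not tried mixing torch and oneflow, so neither efficiency nor correctness is guaranteed.

Isn't Variable going to be deprecated in future torch versions, with Parameter used uniformly instead?

Could you tell us which model needs PairwiseDistance? We will see how to support it as soon as possible.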

Aug 04 '22 03:08 yuanms2

Let's continue the replies here: https://github.com/Oneflow-Inc/OneTeam/issues/1207#issuecomment-1073432125 Variable is outdated usage, so there is no need to write it. Use Tensor instead.

Aug 04 '22 03:08 BBuf

Sorry, I cannot open that link.

Aug 04 '22 03:08 wjy3326

Hello, it is used in the TransE and TransH models. For now I can get them running with torch.nn.PairwiseDistance(), but I don't know ...

This is where I use Variable:

    if self.args.usegpu:
        with torch.cuda.device(self.args.gpunum):
            posX = Variable(torch.LongTensor(posX).cuda())
            negX = Variable(torch.LongTensor(negX).cuda())

Can I simply change Variable to Tensor here? These do not seem to be trainable parameters, though.

Aug 04 '22 03:08 wjy3326

Hello, the TransH and TransE models need it, in the final distance computation.

Aug 04 '22 03:08 wjy3326
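
For readers wondering what the missing operator computes, here is a minimal plain-Python sketch. It assumes the formula documented for torch.nn.PairwiseDistance, the p-norm of (x1 - x2 + eps) per row, and the usual TransE score ||h + r - t||. The function names `pairwise_distance` and `transe_score` are hypothetical and are not OneFlow or PyTorch APIs.

```python
def pairwise_distance(x1, x2, p=2.0, eps=1e-6):
    """Row-wise p-norm of (x1 - x2 + eps) for two equally shaped lists of rows."""
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must contain the same number of rows")
    distances = []
    for row1, row2 in zip(x1, x2):
        total = sum(abs(a - b + eps) ** p for a, b in zip(row1, row2))
        distances.append(total ** (1.0 / p))
    return distances


def transe_score(head, relation, tail, p=2.0):
    """Assumed TransE score: distance between (head + relation) and tail."""
    translated = [
        [h + r for h, r in zip(head_row, rel_row)]
        for head_row, rel_row in zip(head, relation)
    ]
    return pairwise_distance(translated, tail, p=p)


if __name__ == "__main__":
    # A 3-4-5 triangle: with eps=0 the Euclidean distance is 5.
    print(pairwise_distance([[0.0, 0.0]], [[3.0, 4.0]], eps=0.0))
```

The same arithmetic can in principle be composed from elementwise tensor operations in a single framework, which would avoid mixing torch and oneflow tensors. Whether that matches the performance of a native operator is not something this sketch can show.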

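For changing Variable to Tensor, you can refer to the PyTorch documentation, which has examples: https://pytorch.org/blog/pytorch-0_4_0-migration-guide/

Thanks. We will look into the TransH and TransE models and add the missing operators as soon as possible.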

Aug 04 '22 03:08 yuanms2

    import oneflow as torch

    posX = [[4185651, 166, 622], [1520396, 31, 1253465], [3353699, 46, 518777], [410635, 479, 812141]]
    posX = torch.LongTensor(posX).cuda()
    print("pos", posX)

The program above produces no output and raises no error.

Aug 04 '22 07:08 wjy3326

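It works fine when I try it locally. You can run python3 -m oneflow --doctor to check your oneflow version. You can also run the same code directly with torch to confirm that the GPU is working properly.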

Aug 04 '22 07:08 BBuf

Hello, my oneflow version is 0.8.0, and the GPU works fine with torch. The oneflow run still produces no output, though.

Aug 04 '22 07:08 wjy3326

Do you mean that the run produces no result with oneflow, while pytorch prints the result normally?

Aug 04 '22 07:08 BBuf

Yes, so far no result is shown. It only prints the following line:

    python test_.py
    loaded library: /lib64/libibverbs.so.1

Aug 04 '22 07:08 wjy3326

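Hello, in most cases this happens because some environment variables were set earlier, such as WORLD_SIZE. Could you check whether any such variables are still set?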

Aug 04 '22 07:08 shangguanshiyuan
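
The check suggested above can be sketched as a small standard-library script. Only WORLD_SIZE is named in the thread. The other variable names in `SUSPECT_VARS` are assumptions, chosen because they are commonly set by distributed launchers, and `find_leftover_vars` is a hypothetical helper.

```python
import os

# WORLD_SIZE comes from the reply above; the remaining names are assumptions.
SUSPECT_VARS = ("WORLD_SIZE", "RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")


def find_leftover_vars(environ=None, names=SUSPECT_VARS):
    """Return the suspect variables that are currently set, with their values."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in names if name in env}


if __name__ == "__main__":
    leftovers = find_leftover_vars()
    if leftovers:
        for name, value in leftovers.items():
            print(f"{name}={value}  (consider unsetting before a single-process run)")
    else:
        print("No suspect distributed-training variables are set.")
```

Running it in the same shell as the hanging script shows whether anything from an earlier distributed job is still exported.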

Could you paste the output? I would like to see the installed CUDA version, as well as the Driver Version shown by nvidia-smi.

Aug 04 '22 07:08 ouyangyu

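My code consists only of the few lines above, and I have not set any extra environment variables. Without moving the tensor to CUDA it prints normally. With .cuda() there is no result.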

Aug 04 '22 07:08 wjy3326

Output with torch:

    posX = [[4185651, 166, 622], [1520396, 31, 1253465], [3353699, 46, 518777], [410635, 479, 812141]]
    posX = torch.LongTensor(posX).cuda()
    print(posX)
    tensor([[4185651, 166, 622],
            [1520396, 31, 1253465],
            [3353699, 46, 518777],
            [410635, 479, 812141]], device='cuda:0')

Aug 04 '22 07:08 wjy3326

The CUDA version is 11.4.

Aug 04 '22 07:08 wjy3326

I mean the output of python3 -m oneflow --doctor, plus the Driver Version from the nvidia-smi output.

Aug 04 '22 07:08 ouyangyu

loaded library: /lib64/libibverbs.so.1
path: ['/u01/wangjunyan/anaconda3/envs/recommendation/lib/python3.7/site-packages/oneflow']
version: 0.8.0
git_commit: fa6edf31
cmake_build_type: Release
rdma: True
mlir: True

Aug 04 '22 07:08 wjy3326

Please reinstall and try again:

    python3 -m pip uninstall -y oneflow
    python3 -m pip install --find-links https://release.oneflow.info oneflow==0.8.0+cu112

After that, python3 -m oneflow --doctor should print the following (note that the version is 0.8.0+cu112):

loaded library: /lib/libibverbs.so.1
version: 0.8.0+cu112
git_commit: a6d4cb80
cmake_build_type: Release
rdma: True
mlir: True

Aug 04 '22 07:08 ouyangyu
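
In this thread the install that hung reported version 0.8.0, while the suggested reinstall reports 0.8.0+cu112. A small sketch for reading that field out of pasted doctor output follows. The helpers `parse_doctor_output` and `cuda_tag` are hypothetical, and treating a missing `+cu...` suffix as a hint that the wheel is not the intended CUDA build is an assumption based only on this exchange.

```python
def parse_doctor_output(text):
    """Turn 'key: value' lines into a dict; lines without a colon are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def cuda_tag(version):
    """Return the local version label after '+' (e.g. 'cu112'), or None if absent."""
    _, sep, local = version.partition("+")
    return local if sep else None


# Doctor output as pasted in the reply above.
sample = """loaded library: /lib/libibverbs.so.1
version: 0.8.0+cu112
git_commit: a6d4cb80
cmake_build_type: Release
rdma: True
mlir: True"""

info = parse_doctor_output(sample)
print(info["version"], cuda_tag(info["version"]))  # prints: 0.8.0+cu112 cu112
```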

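It prints the result now. Next, I need import torch.nn.functional as F, specifically the relu function in F, but after changing torch to oneflow I get this error: TypeError: relu(): missing required argument x. Can I keep using the original torch function here, or do I need to rewrite relu myself?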

Aug 04 '22 08:08 wjy3326

You can refer to our API documentation:
https://oneflow.readthedocs.io/en/master/generated/oneflow.nn.functional.relu.html?highlight=oneflow.nn.functional.relu#
https://oneflow.readthedocs.io/en/master/generated/oneflow.nn.ReLU.html#oneflow.nn.ReLU

Aug 04 '22 08:08 ouyangyu

Thanks. Now there is another problem. After changing torch.norm to oneflow I get this error: AttributeError: module 'oneflow' has no attribute 'norm'

Aug 04 '22 09:08 wjy3326

You may need to install the latest master (nightly) build of OneFlow:

    python3 -m pip uninstall -y oneflow && python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu112

Aug 04 '22 09:08 ouyangyu

Hello, the program seems to run now, but it reports an OOM error. I had the same problem before: the model parameters are too large to fit on a single card. How can I configure oneflow to load only part of the model parameters onto the GPU and keep the rest on the CPU?

W20220804 17:55:20.659793 6255 ep_backend_allocator.cpp:37] OOM error is detected, process will exit. And it will start to reset CUDA device for releasing device memory.
F20220804 17:55:20.659757 6255 virtual_machine_engine.cpp:382] out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp:382
    instruction->Prepare(): reset device
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp", line 382, in DispatchInstruction
    instruction->Prepare()
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/fuse_instruction_policy.h", line 70, in Prepare
    instruction->Prepare()
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 34, in Prepare
    TryAllocateTempStorageThenDeallocate(op_call_instruction_policy, allocator)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 96, in TryAllocateTempStorageThenDeallocate
    allocator->Allocate(&mem_ptr, byte_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend->Allocate(&mem_ptr, final_allocate_bytes)
Error Type: oneflow.ErrorProto.out_of_memory_error

Aug 04 '22 10:08 wjy3326

Hello, you can refer to https://docs.oneflow.org/master/cookies/one_embedding.html to store a large-scale embedding in tiered storage.

API documentation: https://oneflow.readthedocs.io/en/master/one_embedding.html

Aug 04 '22 10:08 shangguanshiyuan

Hello, what does the make_table_options function do in the table configuration? Is it just used to initialize the embedding?

Aug 05 '22 07:08 wjy3326

Yes, it is used to specify the initialization method.

Aug 05 '22 08:08 shangguanshiyuan

You can refer to this short snippet:

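        # table_size_array holds the size of each embedding table and is defined elsewhere.
        scales = np.sqrt(1 / np.array(table_size_array))
        # One table option per table, each with a uniform initializer in [-scale, scale].
        tables = [
            flow.one_embedding.make_table_options(
                flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
            )
            for scale in scales
        ]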

@wjy3326

Aug 05 '22 08:08 ShawnXuan

OK. What does capacity mean in the one_embedding.make_cached_host_mem_store_options() function? Is it the size of the first dimension of the embedding, or the number of embeddings?

Aug 05 '22 08:08 wjy3326

It is vocab_size, the total vocabulary size. To understand and use OneEmbedding, you can combine the examples at https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems with the API documentation at https://oneflow.readthedocs.io/en/master/one_embedding.html.

Aug 05 '22 08:08 shangguanshiyuan
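
Since capacity is given as the total vocabulary size, a rough estimate of the raw table size can help judge how much of it might fit on the GPU. The sketch below is plain arithmetic, not an OneEmbedding API. The vocabulary size and embedding dimension are hypothetical numbers, and float32 storage (4 bytes per element) is an assumption.

```python
def embedding_bytes(vocab_size, embedding_dim, bytes_per_element=4):
    """Raw size of an embedding table: rows x columns x bytes per element."""
    return vocab_size * embedding_dim * bytes_per_element


def fits_on_device(vocab_size, embedding_dim, device_bytes, bytes_per_element=4):
    """True if the raw table alone fits in the given memory budget."""
    return embedding_bytes(vocab_size, embedding_dim, bytes_per_element) <= device_bytes


if __name__ == "__main__":
    # Hypothetical example: 5 million rows with 128 float32 values each.
    size = embedding_bytes(5_000_000, 128)
    print(f"{size} bytes, about {size / 2**30:.2f} GiB")
```

This counts the weights only. Optimizer state, gradients, and activations need additional memory, so the real footprint is larger.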
