OneFlow NeuS network training is slow

Summary

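While implementing the NeuS 3D-reconstruction network, I found that overall training takes too long. I used time.time() to measure the forward and backward time of each iteration and compared them with PyTorch: the average per-step forward time () is far larger than PyTorch's (), while the backward time differs little. Running only the forward pass in OneFlow shows that roughly every 10 steps there is one especially slow step, around 1 s; apart from that, the per-step time is about the same as PyTorch.

These are the per-iteration times for the network running on OneFlow, measured with time.time() in the Python script. Most iterations take a normal amount of time, around 30 ms, but occasionally one takes around 1 s.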
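Because CUDA kernels are launched asynchronously, a host-side timer around one iteration can bill GPU work to a later step unless the device is synchronized before reading the clock. The following is a minimal sketch of per-step timing with an optional synchronization hook, plus a helper that flags steps far slower than the median. The names `time_steps`, `find_spikes`, `step`, `sync` and `factor` are hypothetical and not from the original script. `sync` is assumed to be whatever device-synchronization call the framework provides (for PyTorch that would be `torch.cuda.synchronize`; a OneFlow counterpart is assumed, not verified here). `time.perf_counter()` is used in place of `time.time()` because it is a monotonic, higher-resolution clock.

```python
import statistics
import time


def time_steps(step, n_steps, sync=None):
    """Return wall-clock seconds for each of n_steps calls to step()."""
    durations = []
    for _ in range(n_steps):
        if sync is not None:
            sync()  # drain earlier GPU work so it is not billed to this step
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait until this step's kernels have finished
        durations.append(time.perf_counter() - start)
    return durations


def find_spikes(durations, factor=5.0):
    """Return (index, seconds) for steps slower than factor x the median."""
    median = statistics.median(durations)
    return [(i, d) for i, d in enumerate(durations) if d > factor * median]


if __name__ == "__main__":
    # Illustrative durations shaped like the report (about 30 ms, one about 1 s);
    # these are not measured values.
    sample = [0.030] * 9 + [1.0]
    for index, seconds in find_spikes(sample):
        print(f"step {index}: {seconds * 1000:.0f} ms")
```

In a real run, `step` would wrap one forward pass and `sync` the framework's synchronize call; logging the indices returned by `find_spikes` shows whether the slow steps really recur about every 10 iterations.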

Aug 12 '22 09:08 yoonlee888

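Hello, and thanks for the feedback. If it is convenient, please provide the relevant code snippet; a corresponding PyTorch version would be even better.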

Aug 12 '22 10:08 shangguanshiyuan

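This should be the issue that colleagues from Zhejiang Lab reported in the group chat. One earlier guess was that printing the loss causes it; another was that some kernel is too slow and triggers the back-pressure mechanism.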

Aug 12 '22 10:08 yuanms2

These are the nsys results for the OneFlow framework:

Using report1.sqlite export for stats reports.
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

46.1      12711800961    1747010       7276.3     4118   403176487  cudaLaunchKernel
19.9       5496537524       2005    2741415.2     4929     4737916  cudaStreamSynchronize
14.6       4031587693    6078605        663.2      541      405742  cudaEventQuery
11.3       3105266453         28  110902373.3     2614  3105158569  cudaDeviceSynchronize
 2.7        743080939       1423     522193.2      491    74279641  cudaEventCreateWithFlags
 2.6        726735566         24   30280648.6     1624   718796969  cudaStreamCreateWithFlags
 1.2        342830975      21072      16269.5     4017     1605465  cudaMemcpyAsync
 0.6        178897182        139    1287030.1     2855    26218483  cudaMalloc
 0.3         72037957        147     490054.1      491    21929000  cudaFree
 0.2         51232928      10012       5117.2     3176       36960  cudaMemsetAsync
 0.1         39216972      24024       1632.4      721       29376  cudaEventRecord
 0.1         22998636          3    7666212.0    11893    22972716  cudaHostAlloc
 0.1         22652045          4    5663011.3    24857    22010224  cudaFreeHost
 0.1         15068004         12    1255667.0     1894    14636442  cudaStreamCreateWithPriority
 0.1         14050235         99     141921.6    15229      689332  cudaMemset
 0.0          2548437       1423       1790.9      441      702086  cudaEventDestroy
 0.0          1750521          1    1750521.0  1750521     1750521  cudaMallocHost
 0.0          1316224          5     263244.8    25889      460666  cudaMemcpy
 0.0           303119         39       7772.3     5421       46108  cudaStreamDestroy
 0.0           132712          3      44237.3     6031      115770  cudaStreamCreate
 0.0             1683          1       1683.0     1683        1683  cuInit
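One way to look for the occasional slow step in a summary like this is to compare each API's slowest call with its average call. The sketch below parses rows in the seven-column layout shown above and flags entries whose Maximum is at least 1000 times their Average. The helpers `parse_summary` and `skewed`, and the threshold `ratio`, are hypothetical names chosen for illustration; the four sample rows are copied from the table above.

```python
SAMPLE = """\
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
46.1      12711800961    1747010       7276.3     4118   403176487  cudaLaunchKernel
19.9       5496537524       2005    2741415.2     4929     4737916  cudaStreamSynchronize
14.6       4031587693    6078605        663.2      541      405742  cudaEventQuery
11.3       3105266453         28  110902373.3     2614  3105158569  cudaDeviceSynchronize
"""


def parse_summary(text):
    """Parse rows of the form: pct total calls avg min max name."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 6)
        if len(parts) != 7:
            continue
        try:
            pct, total, calls, avg, low, high = (float(p) for p in parts[:6])
        except ValueError:
            continue  # header or other non-data line
        rows.append({
            "name": parts[6].strip(),
            "pct": pct,
            "total_ns": total,
            "calls": int(calls),
            "avg_ns": avg,
            "min_ns": low,
            "max_ns": high,
        })
    return rows


def skewed(rows, ratio=1000.0):
    """Return rows whose slowest call is at least ratio x the average call."""
    return [r for r in rows
            if r["avg_ns"] > 0 and r["max_ns"] >= ratio * r["avg_ns"]]


if __name__ == "__main__":
    for row in skewed(parse_summary(SAMPLE)):
        print(f"{row['name']}: max {row['max_ns'] / 1e6:.1f} ms, "
              f"avg {row['avg_ns'] / 1e3:.1f} us")
```

A large Maximum on its own does not identify the cause; it only points to which API calls are worth inspecting on the nsys timeline.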

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Instances Average Minimum Maximum Name

11.2       3585473768      28000  128052.6    70337   240610  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.0       3508108269      57000   61545.8     4608   120993  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 7.6       2416975189     189000   12788.2     3488   262403  void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 4ul, int>(oneflow::ep::primit…
 4.7       1512021232      15000  100801.4     4064   130881  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 3.7       1185983039       6000  197663.8   174018   223203  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
 3.7       1181605575     114000   10365.0     4352   127200  void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 2ul>(int, oneflow::…
 3.2       1037712944      77000   13476.8     3584   147425  void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy6…
 2.8        897445950       7000  128206.6    30400   277187  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
 2.7        862318267       5000  172463.7   105313   378275  void oneflow::user_op::DoCUDADimGather<float, int>(oneflow::NdIndexOffsetHelper<int, 8>, oneflow::N…
 2.6        820490662      14000   58606.5     5376   154626  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 2.5        792329634       7000  113189.9    30240   272003  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
 2.4        781378051       6000  130229.7   117440   149473  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
 2.3        746593555     124000    6020.9     3936    11424  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 2.3        728704402       2000  364352.2   283491   468324  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 2.0        635688009      10000   63568.8    50209    84001  void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<float, int, int>::Policy700, fa…
 1.9        620273851       8000   77534.2    65152   103489  void oneflow::user_op::DoCUDADimGather<float, long>(oneflow::NdIndexOffsetHelper<long, 8>, oneflow:…
 1.9        618104618      18000   34339.1     5472    73313  void oneflow::(anonymous namespace)::MatrixColReduceByWarpBlock<oneflow::BinaryFuncSum, float, int,…
 1.9        599720207     120000    4997.7     3456   189058  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::ep::primitive::broadcast_elementwi…
 1.8        586331824      19000   30859.6    21888   129377  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
 1.8        577033690      49000   11776.2     3648   141409  void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 16ul, int>(oneflow::ep::primi…
 1.7        557819800      20000   27891.0    23392   188706  void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<float, int, int>::Policy700, tr…
 1.6        496884481       6000   82814.1    44544   238531  void oneflow::(anonymous namespace)::CumsumForwardGpuDownSpaceIs1<float, oneflow::BinaryFuncMul>(fl…
 1.3        410017621      15000   27334.5     3712   100608  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 1.2        384250064      68000    5650.7     3520    10976  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::AbsFunctor, float>(lon…
 1.2        382925471       2000  191462.7   164546   225730  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
 1.1        353719658       5000   70743.9     4224   126913  void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<cub::KeyValuePair<int, float>, cub::K…
 1.0        330259081      32000   10320.6     3424   157538  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 1.0        326702311       4000   81675.6    52480   119585  void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Min>::Policy6…
 1.0        316943119      60000    5282.4     3904    10336  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.9        281895644      54000    5220.3     3488    10400  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::CosFunctor, float>(lon…
 0.9        281361897      54000    5210.4     3360    11712  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SinFunctor, float>(lon…
 0.9        276974170      52000    5326.4     3712     9056  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.9        274116137      35000    7831.9     4448    15488  void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 3ul>(int, oneflow::…
 0.8        255028137       3000   85009.4    49184   113889  ampere_sgemm_128x64_tn
 0.7        225930798       1000  225930.8   201410   257186  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
 0.7        225768177      49000    4607.5     3008     9984  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.7        213776534       4000   53444.1    34496    86720  void oneflow::(anonymous namespace)::CumsumForwardGpuDownSpaceIs1<float, oneflow::BinaryFuncAdd>(fl…
 0.6        197302527      41000    4812.3     3361    10112  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.6        176546824      39000    4526.8     3328   195938  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.5        155914199      32000    4872.3     3648    10239  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.4        141813865       3000   47271.3     9152   127329  void oneflow::(anonymous namespace)::SliceBackwardGpu<unsigned int, 2>(int, oneflow::SliceParams, o…
 0.4        133793199       4000   33448.3    18912    78593  ampere_sgemm_32x128_tn
 0.3         93799280      17000    5517.6     3264     8160  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.3         93506474      22001    4250.1     3680     6944  void oneflow::ep::primitive::(anonymous namespace)::FillGpu<float, 4ul>(float*, float, unsigned lon…
 0.3         88183956       1000   88184.0    84032   200098  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
 0.3         81435406      13000    6264.3     4416     9376  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.3         80902347       1000   80902.3    74753    88194  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 0.3         79896709       1000   79896.7    76512    84353  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
 0.2         74021695       1000   74021.7    68577    81249  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
 0.2         73915671      13000    5685.8     3520     8448  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.2         69778131      11000    6343.5     3904   195202  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.2         67096953      11000    6099.7     3904    11424  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.2         64703182      12000    5391.9     4448     9505  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SigmoidFunctor, float>…
 0.2         64053090      14000    4575.2     3936     8832  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.2         54780090       4000   13695.0     5887    24352  void oneflow::(anonymous namespace)::NdarrayReduceGpuInplaceReduceAxis<float, 3, oneflow::BinaryFun…
 0.2         54248406      11000    4931.7     4288     8864  void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 8ul, int>(oneflow::ep::primit…
 0.1         45596545       9000    5066.3     4095     8128  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::NegativeFunctor, float…
 0.1         38781572       6000    6463.6     4799    10241  void oneflow::(anonymous namespace)::CudaGatherNd<float, long>(oneflow::NdIndexSliceArgs<float, lon…
 0.1         38268351       9000    4252.0     3712     6529  void oneflow::ep::primitive::(anonymous namespace)::FillGpu<long, 2ul>(long*, long, unsigned long)
 0.1         37761279       6000    6293.5     4352     7520  void oneflow::(anonymous namespace)::MathUnaryElementwiseBackwardGpu<oneflow::CosFunctor, float>(lo…
 0.1         37630304       6000    6271.7     5152     7552  void oneflow::(anonymous namespace)::MathUnaryElementwiseBackwardGpu<oneflow::SinFunctor, float>(lo…
 0.1         35944761       8000    4493.1     3903     6656  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         34553634       6000    5758.9     4575     8992  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.1         33200374       7000    4742.9     3232     9729  void oneflow::(anonymous namespace)::CudaClipForward<float, oneflow::ClipByMinMaxFunctor<float> >(o…
 0.1         27297147       6000    4549.5     3680     7168  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         25057916       4000    6264.5     5440     8832  void oneflow::DoSearchSortedLogical<float, long>(int, bool, long, long, bool, float const*, float c…
 0.1         24937028       6004    4153.4     3648     6145  void oneflow::user_op::ArangeForwardGpuKernel<float>(float, float, long, float*)
 0.1         24458428       5000    4891.7     3392     9600  void oneflow::(anonymous namespace)::WriteKeysToOutput<float>(int, cub::KeyValuePair<int, float> co…
 0.1         24092791       5000    4818.6     4096     8544  oneflow::(anonymous namespace)::InitializeIndices(int, int*, int)
 0.1         22525727       5000    4505.1     3840     5600  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.1         21431272       3000    7143.8     6336     8385  void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<long, int, 2ul>(int, oneflow::(…
 0.1         20133144       4000    5033.3     4383     9152  void oneflow::cuda::elementwise::ApplyGeneric<8, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         18871696       4000    4717.9     4160     7104  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         18704814       4000    4676.2     3168     7168  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         18489455       4000    4622.4     4000     6560  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         18481523       4000    4620.4     3713     8608  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.1         18302640       4000    4575.7     4032     6272  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.0         15382971       3000    5127.7     4383     6208  void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy…
 0.0         15127826       2000    7563.9     6528     8992  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
 0.0         14943651       3000    4981.2     4095     6080  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::ReciprocalNoNanFunctor…
 0.0         13636836       2000    6818.4     6080     7904  void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 4ul>(int, oneflow::…
 0.0         11795732       2000    5897.9     4576     6848  void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
 0.0         10968798       2000    5484.4     4735     6753  void oneflow::(anonymous namespace)::GenerateGpu<long>(curandStateXORWOW*, long, long*, long, long)
 0.0         10703931       2000    5352.0     4640     7584  void oneflow::(anonymous namespace)::GenerateGpu<float>(curandStateXORWOW*, long, float*, float, fl…
 0.0         10457145       2000    5228.6     4672     6144  void oneflow::(anonymous namespace)::NdarrayAssignReducedGpu<float, float, 3>(oneflow::XpuVarNdarra…
 0.0         10139682       2000    5069.8     4480     5793  void cub::DeviceReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy600, float…
 0.0          9964865       2000    4982.4     4192     6015  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::ExpFunctor, float>(lon…
 0.0          9604374       2000    4802.2     4032     5695  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::LogFunctor, float>(lon…
 0.0          9454449       2000    4727.2     4063     5728  void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy…
 0.0          9055546       2000    4527.8     3935     5504  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.0          6591979       1000    6592.0     5888     8800  void oneflow::(anonymous namespace)::FlipGpuForward<float>(int, long, oneflow::(anonymous namespace…
 0.0          6165883       1000    6165.9     5472     7232  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.0          5754254       1000    5754.3     5088     6720  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.0          5543745       1000    5543.7     4865     6368  void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Max>::Policy6…
 0.0          5525850       1000    5525.9     4960     6336  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::ep::primitive::broadcast_elementwi…
 0.0          5225746       1000    5225.7     3872     6432  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.0          5218867       1000    5218.9     4512     6048  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.0          5215081       1000    5215.1     4639     6112  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.0          5130604       1000    5130.6     4512     6016  void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
 0.0          4803998       1000    4804.0     4096     5632  void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SqrtFunctor, float>(lo…
 0.0          4702151       1000    4702.2     4128     5472  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.0          4609196       1000    4609.2     4096     5408  void oneflow::cuda::elementwise::ApplyGeneric<2, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.0          4553745       1000    4553.7     3648     5377  void oneflow::cuda::elementwise::ApplyGeneric<1, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.0          4509025       1000    4509.0     3744     5280  void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.0            12448          2    6224.0     6144     6304  void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
 0.0            11487          2    5743.5     5727     5760  void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
 0.0             6144          1    6144.0     6144     6144  oneflow::one::(anonymous namespace)::InitCurandStatesKernel(unsigned long, curandStateXORWOW*, onef…

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...

Time(%) Total Time (ns) Operations Average Minimum Maximum Operation

55.2        169987000       10111  16812.1     2080    79137  [CUDA memset]
29.7         91424299        9075  10074.3     2080   351780  [CUDA memcpy HtoD]
10.1         31053334        6001   5174.7     4031    10336  [CUDA memcpy DtoD]
 5.0         15493532        6001   2581.8     1312     5632  [CUDA memcpy DtoH]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...

 Total      Operations   Average   Minimum   Maximum        Operation

  4416005.906        6001    735.878     0.004     960.000  [CUDA memcpy DtoD]
    28036.000        6001      4.672     4.000      36.000  [CUDA memcpy DtoH]
203579842.031       10111  20134.491     0.020  110592.000  [CUDA memset]
  1189827.613        9075    131.110     0.004    4096.000  [CUDA memcpy HtoD]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

97.5   27323412856064      73731     370582426.1          1001  105480219776  pthread_cond_wait
 1.5     421874697623          4  105468674405.8  105461333907  105472389064  epoll_wait
 0.4     101226316698       1024      98853824.9          2525     100150542  poll
 0.3      96200177898         21    4580960852.3         18295   10000604148  sem_timedwait
 0.3      77960483493       1274      61193472.1          1122    3660687126  pthread_cond_timedwait
 0.0       4619146864       3494       1322022.6          1012     300450955  ioctl
 0.0       1372138903       6999        196047.9          1001      25037686  pthread_mutex_lock
 0.0        803399274       4461        180094.0          1001      74133533  pthread_rwlock_wrlock
 0.0        707050444       4978        142035.0          1001      64144535  read
 0.0        393542417      73180          5377.7          1002        527423  pthread_cond_signal
 0.0        119381105       1126        106022.3          1012       3189529  pthread_rwlock_rdlock
 0.0         78713267        265        297031.2         57660       8843479  pthread_join
 0.0         68576473       8550          8020.6          1603      16973389  openat
 0.0         59715695      27932          2137.9          1001         56397  sched_yield
 0.0         29977499      17611          1702.2          1002         17343  fread
 0.0         25641702        396         64751.8         38613        302195  pthread_create
 0.0         19751752        195        101291.0          1133      18217557  fopen
 0.0          8079540        948          8522.7          2024        345688  open64
 0.0          2971932        241         12331.7          1844         60004  write
 0.0          2737852        214         12793.7          1483        235298  mmap
 0.0          2415109        157         15382.9          1001        143292  fgets
 0.0          2131646        188         11338.5          1022        518296  mmap64
 0.0          1713949        441          3886.5          1674         79341  pthread_cond_broadcast
 0.0          1611398        284          5673.9          1012        113907  munmap
 0.0           531682        176          3020.9          1002         13054  fclose
 0.0           442962          4        110740.5         77398        163520  sem_wait
 0.0           344970         65          5307.2          3627         10500  fopen64
 0.0           123906          8         15488.3          1744         26390  mprotect
 0.0           111983          6         18663.8          5090         56538  open
 0.0            55966          5         11193.2          2325         25359  socket
 0.0            36829          3         12276.3          2334         17453  fflush
 0.0            30097          1         30097.0         30097         30097  pipe2
 0.0            18816          4          4704.0          1473          8536  bind
 0.0            18395          8          2299.4          1102          3597  fcntl
 0.0            14528          1         14528.0         14528         14528  connect
 0.0            13876          2          6938.0          1142         12734  fwrite
 0.0            13337          2          6668.5          2235         11102  fgetc
 0.0             4980          1          4980.0          4980          4980  epoll_ctl
 0.0             3727          2          1863.5          1413          2314  listen
 0.0             3346          1          3346.0          3346          3346  signal
 0.0             2594          2          1297.0          1051          1543  pthread_mutex_trylock
 0.0             2515          1          2515.0          2515          2515  sigaction
 0.0             2444          1          2444.0          2444          2444  fputs_unlocked

Aug 15 '22 03:08 yoonlee888

These are the nsys results for the PyTorch framework:

Using report1.sqlite export for stats reports.
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

37.9       7820974612    1100149      7109.0     4669     3560676  cudaLaunchKernel
21.8       4508601183        219  20587220.0     4468  4393702431  cudaMalloc
20.7       4272509677      31041    137640.9     7113     8738285  cudaMemcpyAsync
16.3       3360510597      15042    223408.5     2765     4192212  cudaStreamSynchronize
 2.3        482940448         18  26830024.9      821   476609355  cudaFree
 0.7        136884097      16000      8555.3     3947      206842  cudaMemsetAsync
 0.2         33074394       9000      3674.9     1272       35498  cudaEventQuery
 0.1         30974277       9000      3441.6     1473       28925  cudaEventRecord
 0.0          7241864       4061      1783.3      661       21901  cudaStreamIsCapturing_v10000
 0.0          1292359          4    323089.8     4820     1273031  cudaHostAlloc
 0.0           103446          2     51723.0    47169       56277  cudaMemcpy
 0.0            80534          8     10066.8     3016       31430  cudaDeviceSynchronize
 0.0            70489         36      1958.0      771       12453  cudaEventCreateWithFlags
 0.0            48883         36      1357.9      591        4077  cudaEventDestroy
 0.0             1362          1      1362.0     1362        1362  cuInit
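To put the two CUDA API summaries side by side, the following sketch compares call counts and total time for a few APIs that appear in both runs. The numbers are copied from the OneFlow and PyTorch tables in this thread; the helper `compare` and the dictionary names are hypothetical. It assumes both profiles cover the same number of iterations, which the thread does not state explicitly.

```python
# API name -> (number of calls, total time in ns), copied from the two tables.
ONEFLOW = {
    "cudaLaunchKernel": (1747010, 12711800961),
    "cudaStreamSynchronize": (2005, 5496537524),
    "cudaEventQuery": (6078605, 4031587693),
    "cudaMemcpyAsync": (21072, 342830975),
}
PYTORCH = {
    "cudaLaunchKernel": (1100149, 7820974612),
    "cudaStreamSynchronize": (15042, 3360510597),
    "cudaEventQuery": (9000, 33074394),
    "cudaMemcpyAsync": (31041, 4272509677),
}


def compare(a, b):
    """For APIs present in both runs, return (name, call ratio a/b, extra seconds in a)."""
    result = []
    for name in sorted(set(a) & set(b)):
        (calls_a, total_a), (calls_b, total_b) = a[name], b[name]
        result.append((name, calls_a / calls_b, (total_a - total_b) / 1e9))
    return result


if __name__ == "__main__":
    for name, call_ratio, extra_s in compare(ONEFLOW, PYTORCH):
        print(f"{name:24s} calls x{call_ratio:8.2f}  extra {extra_s:+.2f} s")
```

Totals in these tables are host-side API times summed over the whole profile, so a difference here suggests where to look rather than measuring per-step latency.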

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Instances Average Minimum Maximum Name

17.1       4743979003      28000  169427.8    99008   234688  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.7       3226121943      83002   38868.0     4416   255136  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 8.7       2419667331      49000   49381.0     5408   202976  void at::native::vectorized_elementwise_kernel<4, at::native::softplus_kernel(at::TensorIterator&, …
 5.3       1461215572      15000   97414.4     3232   128864  void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::Tens…
 4.6       1261166335      32000   39411.4     4320   235200  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 128, 1>(float*,…
 3.7       1031306847       7000  147329.5    33600   350048  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
 3.7       1020162021       6000  170027.0   164417   209408  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
 3.6        993239499       7000  141891.4   138401   144320  void at::native::vectorized_elementwise_kernel<4, at::native::softplus_backward_kernel(at::TensorIt…
 3.6        990449261     161000    6151.9     4256   188512  void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
 3.4        935567665       7000  133652.5    32512   325441  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
 3.3        914717375       7000  130673.9   119808   248960  void at::native::reduce_kernel<128, 4, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
 2.7        751298962       6000  125216.5   121888   289633  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
 2.6        731521425      17000   43030.7     5184   223424  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 64, 64>(float*,…
 2.4        671549852       2000  335774.9   284737   475680  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 2.3        625225596      19000   32906.6    24031   214177  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
 2.0        552416751      52000   10623.4     2881   259424  void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor<float, float>, at::d…
 1.5        419654208      29009   14466.3     3264   186688  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, at::detail::Array…
 1.3        370691436       2000  185345.7   171616   238112  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
 1.3        347887402      54000    6442.4     4480   203841  void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
 1.2        346015529      59014    5863.3     4321   187456  void at::native::(anonymous namespace)::weight_norm_fwd_first_dim_kernel<float, float>(float*, floa…
 1.2        319314683       3000  106438.2    60288   213664  ampere_sgemm_128x64_tn
 1.1        299271474      60000    4987.9     3232     7840  void at::native::vectorized_elementwise_kernel<4, at::native::sin_kernel_cuda(at::TensorIterator&):…
 1.1        299015827      60000    4983.6     3136     7616  void at::native::vectorized_elementwise_kernel<4, at::native::cos_kernel_cuda(at::TensorIterator&):…
 1.1        298473999      33000    9044.7     3169   156448  void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
 0.9        258512720      18000   14361.8     4928    66144  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
 0.9        255586020       1000  255586.0   252800   299328  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
 0.5        139632350       1000  139632.4   137216   142049  void at::native::unrolled_elementwise_kernel<at::native::softplus_backward_kernel(at::TensorIterato…
 0.5        134033054      32000    4188.5     3169   193888  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
 0.5        127006064       4000   31751.5    19200    77824  ampere_sgemm_32x128_tn
 0.4         98559958       1000   98560.0    92448   110400  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
 0.3         89613656       1000   89613.7    86560    95424  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
 0.3         83070191       1000   83070.2    82176    92801  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
 0.3         82813174      19000    4358.6     2976     7072  void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor<float>, at::detail::Array<…
 0.3         78349734      17000    4608.8     3360     6336  void at::native::vectorized_elementwise_kernel<4, at::native::DivFunctor<float>, at::detail::Array<…
 0.3         76501273       1000   76501.3    74656    86688  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 0.2         59776342       7000    8539.5     5024    15072  void at::native::unrolled_elementwise_kernel<at::native::MulScalarFunctor<float, float>, at::detail…
 0.2         59416240       8014    7414.1     5857     9952  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::NormTwoOps<float, fl…
 0.2         57802308       1000   57802.3    56512    73984  void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 1024>(TensorInfo<f…
 0.2         57679111      12000    4806.6     3392     7136  void at::native::vectorized_elementwise_kernel<4, at::native::sigmoid_kernel_cuda(at::TensorIterato…
 0.2         57532221       6000    9588.7     6656    15872  _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt10multipliesIfEEENSt9enable_ifIXntsr3c…
 0.2         55069919       8000    6883.7     5984     9120  void at::native::_scatter_gather_elementwise_kernel<64, 4, at::native::_cuda_scatter_gather_interna…
 0.2         54685328       4000   13671.3    12352    17504  void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 128>(TensorInfo<fl…
 0.2         50226806      11000    4566.1     3871   186208  void at::native::vectorized_elementwise_kernel<4, at::native::neg_kernel_cuda(at::TensorIterator&):…
 0.2         48343357       7000    6906.2     5856     9408  void at::native::unrolled_elementwise_kernel<at::native::DivFunctor<float>, at::detail::Array<char*…
 0.2         47398192       8000    5924.8     4575     7936  void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareLTFunctor…
 0.1         38678373       9000    4297.6     4095     5888  void at::native::vectorized_elementwise_kernel<4, at::native::AUnaryFunctor<at::native::AddFunctor<…
 0.1         34909773       4000    8727.4     7360    11680  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MinOps<float>, unsig…
 0.1         34577085       9032    3828.3     3392     5312  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
 0.1         33286714       8000    4160.8     3935     5728  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
 0.1         32970112       4000    8242.5     6688    12064  _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt4plusIfEEENSt9enable_ifIXntsr3c1010is_…
 0.1         32203511       5000    6440.7     4896     9345  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 128, 1>(float*,…
 0.1         31983060       3000   10661.0     9408    13792  void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_ke…
 0.1         29858411       7000    4265.5     3583     5696  void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&…
 0.1         26361151       4000    6590.3     5440     8255  void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
 0.1         25538805       4000    6384.7     5920     8097  void at::native::(anonymous namespace)::searchsorted_cuda_kernel<float, long>(long*, float const*, …
 0.1         24705557       6000    4117.6     2912     5280  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::CompareLTFu…
 0.1         24654968       4000    6163.7     4992     7776  void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 3, 128, 1>(long*, a…
 0.1         24323868       5000    4864.8     3776     6336  void fillSliceWithIndex<unsigned int, 2>(TensorInfo<long, unsigned int>, unsigned int, unsigned int…
 0.1         21275934       6004    3543.6     3360     4960  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::linspace_cuda_out(at::Te…
 0.1         18267190       3000    6089.1     5824     7616  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 0.1         17843824       3000    5947.9     4640     7808  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 0.1         16921323       4000    4230.3     4064     5792  void at::native::vectorized_elementwise_kernel<4, at::native::BitwiseOrFunctor<bool>, at::detail::A…
 0.1         16632258       4000    4158.1     3040     5376  void at::native::vectorized_elementwise_kernel<4, at::native::maximum_kernel_cuda(at::TensorIterato…
 0.1         16625110       4000    4156.3     3968     5279  void at::native::vectorized_elementwise_kernel<4, at::native::minimum_kernel_cuda(at::TensorIterato…
 0.1         16584496       4000    4146.1     4000     5152  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::MulFunctor<…
 0.1         16443999       4000    4111.0     3744     5184  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::where_kernel_i…
 0.0         13843055       2000    6921.5     5696     8896  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
 0.0         11622419       3000    3874.1     3360     4799  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(at::Tens…
 0.0         10615435       2000    5307.7     4704     6912  void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
 0.0          9529063       2000    4764.5     4224     6112  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 1, 128, 1>(float*,…
 0.0          9270520       2052    4517.8     4319     7168  _ZN2at6native90_GLOBAL__N__66_tmpxft_00005537_00000000_13_DistributionUniform_compute_86_cpp1_ii_8b…
 0.0          9209249       2000    4604.6     4384     5984  void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
 0.0          8748149       1000    8748.1     7456    11776  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MaxOps<float>, unsig…
 0.0          8487633       2000    4243.8     3392     6176  void at::native::vectorized_elementwise_kernel<4, at::native::exp_kernel_cuda(at::TensorIterator&):…
 0.0          8457925       2000    4229.0     3711     5632  void at::native::vectorized_elementwise_kernel<4, at::native::reciprocal_kernel_cuda(at::TensorIter…
 0.0          8309075       2000    4154.5     3328     5376  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::pow_tensor_sca…
 0.0          8126762       1000    8126.8     7968    10080  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 64, 64>(float*,…
 0.0          6878162       1000    6878.2     5856     8577  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MeanOps<float, float…
 0.0          6362163       1000    6362.2     6080     7872  void at::native::unrolled_elementwise_kernel<at::native::reciprocal_kernel_cuda(at::TensorIterator&…
 0.0          5839958       1000    5840.0     5600     7328  void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_scalar_k…
 0.0          5478436       1000    5478.4     5312     6881  void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 2, 128, 1>(long*, a…
 0.0          4928042       1000    4928.0     3776     6017  void at::native::kernel_pointwise_flip_apply2<float, long>(at::cuda::detail::TensorInfo<float, long…
 0.0          4098339       1000    4098.3     3776     5184  void at::native::vectorized_elementwise_kernel<4, at::native::log10_kernel_cuda(at::TensorIterator&…
 0.0          3979337       1000    3979.3     3839     5088  void at::native::vectorized_elementwise_kernel<4, at::native::AbsFunctor<float>, at::detail::Array<…
 0.0          3974670       1000    3974.7     3712     5184  void at::native::vectorized_elementwise_kernel<4, at::native::sqrt_kernel_cuda(at::TensorIterator&)…
 0.0          3671683       1000    3671.7     3520     5152  void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
 0.0            58080          8    7260.0     6848     7424  _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
 0.0            24256          4    6064.0     5568     6336  void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_tensor_k…
 0.0            11680          2    5840.0     5728     5952  void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<float>, at::detail::Array<char…
 0.0             9856          1    9856.0     9856     9856  void sgetrf_batched_smallsq_noshfl_kernel<4, 4>(float**, int, int**, int*, int)
 0.0             8991          1    8991.0     8991     8991  void trsm_template_batched_lNL_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
 0.0             8991          1    8991.0     8991     8991  slaswp_columnserial_kernel_batched(int, float**, int, int, int, int**)
 0.0             8864          1    8864.0     8864     8864  void trsm_template_batched_lNU_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
 0.0             8448          2    4224.0     4000     4448  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<int>, at::detail::Array<c…
 0.0             7712          1    7712.0     7712     7712  slaset_full_kernel_batched(int, int, float, float, float**, int)
 0.0             7040          1    7040.0     7040     7040  _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...

Time(%)  Total Time (ns)  Operations  Average  Minimum  Maximum  Operation

   84.3        380885037       16000  23805.3     3168   251008  [CUDA memcpy DtoD]
    7.6         34510530       16000   2156.9     2079     3649  [CUDA memset]
    6.3         28298124       12002   2357.8     1312   179361  [CUDA memcpy DtoH]
    1.8          8248738        3041   2712.5     2079     4576  [CUDA memcpy HtoD]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...

        Total  Operations    Average  Minimum    Maximum  Operation

      324.219       16000      0.020    0.008      0.035  [CUDA memset]
    28043.219       12002      2.337    0.004      6.000  [CUDA memcpy DtoH]
    20020.051        3041      6.583    0.004     20.000  [CUDA memcpy HtoD]
200066000.000       16000  12504.125    2.000  65792.000  [CUDA memcpy DtoD]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

35.3      45957635098        472    97367871.0     3717    100176399  poll
31.5      40984272066       2002    20471664.4    23034    124189592  pthread_cond_wait
31.4      40931464392         17  2407733199.5    14478  10000184789  sem_timedwait
 1.5       1913495171       4241      451189.6     1152    290886421  ioctl
 0.1         97439113         25     3897564.5   173359      5066508  pthread_cond_timedwait
 0.1         82557276      40664        2030.2     1001       160924  sched_yield
 0.0         50803760       1443       35207.0     1002     31465169  read
 0.0         27119248      17573        1543.2     1012       134184  fread
 0.0         22618854        257       88011.1    54072       304738  sleep
 0.0         19923053         11     1811186.6     1613     19820359  open
 0.0         14003578        196       71446.8    51397       346887  pthread_create
 0.0         10542299       2041        5165.3     1082       105701  pthread_cond_signal
 0.0          7664737       1209        6339.7     1764       372275  open64
 0.0          7222006        279       25885.3     1002        72036  fgets
 0.0          7170386         33      217284.4     1633       688264  pthread_mutex_lock
 0.0          4185312        365       11466.6     1843       129044  write
 0.0          3386553        334       10139.4     1072       122663  munmap
 0.0          3115098        283       11007.4     1012       526678  mmap64
 0.0          3056612        178       17172.0     1062       856552  fopen
 0.0          2969495          1     2969495.0  2969495      2969495  fork
 0.0          1906173        331        5758.8     1352        91363  mmap
 0.0          1238575         62       19977.0     3907       634191  pthread_join
 0.0           958327          6      159721.2    56376       314757  sem_wait
 0.0           474301        158        3001.9     1012         7173  fclose
 0.0           348921         68        5131.2     1162        10480  fopen64
 0.0            47961          1       47961.0    47961        47961  waitpid
 0.0            47068         31        1518.3     1002         9828  fflush
 0.0            41709          3       13903.0     2084        28494  pipe2
 0.0            21371          2       10685.5     6011        15360  socket
 0.0            17433          7        2490.4     1172         3637  fcntl
 0.0            16752          2        8376.0     2485        14267  fgetc
 0.0            11371          8        1421.4     1002         1874  pthread_mutex_trylock
 0.0             9489          1        9489.0     9489         9489  connect
 0.0             3828          1        3828.0     3828         3828  bind
 0.0             3678          2        1839.0     1413         2265  sigaction
 0.0             1523          1        1523.0     1523         1523  fputs_unlocked

Aug 15 '22 03:08 yoonlee888

Here are the nsys results for the PyTorch run:

Using report1.sqlite export for stats reports. Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

37.9       7820974612    1100149      7109.0     4669     3560676  cudaLaunchKernel
21.8       4508601183        219  20587220.0     4468  4393702431  cudaMalloc
20.7       4272509677      31041    137640.9     7113     8738285  cudaMemcpyAsync
16.3       3360510597      15042    223408.5     2765     4192212  cudaStreamSynchronize
 2.3        482940448         18  26830024.9      821   476609355  cudaFree
 0.7        136884097      16000      8555.3     3947      206842  cudaMemsetAsync
 0.2         33074394       9000      3674.9     1272       35498  cudaEventQuery
 0.1         30974277       9000      3441.6     1473       28925  cudaEventRecord
 0.0          7241864       4061      1783.3      661       21901  cudaStreamIsCapturing_v10000
 0.0          1292359          4    323089.8     4820     1273031  cudaHostAlloc
 0.0           103446          2     51723.0    47169       56277  cudaMemcpy
 0.0            80534          8     10066.8     3016       31430  cudaDeviceSynchronize
 0.0            70489         36      1958.0      771       12453  cudaEventCreateWithFlags
 0.0            48883         36      1357.9      591        4077  cudaEventDestroy
 0.0             1362          1      1362.0     1362        1362  cuInit
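
When reading a summary like the one above, it can help to check whether an entry's total time is spread over many calls or dominated by a single slow call. The following is a minimal sketch, not part of the original report: it parses a few rows copied from the table above and computes the mean per call and the share of the total taken by the slowest call. The names `parse_row` and `max_share` are hypothetical helpers introduced only for illustration.

```python
# Rows copied from the PyTorch cudaapisum table above. Columns:
# Time(%), Total Time (ns), Num Calls, Average, Minimum, Maximum, Name.
SAMPLE = """
37.9       7820974612    1100149      7109.0     4669     3560676  cudaLaunchKernel
21.8       4508601183        219  20587220.0     4468  4393702431  cudaMalloc
20.7       4272509677      31041    137640.9     7113     8738285  cudaMemcpyAsync
16.3       3360510597      15042    223408.5     2765     4192212  cudaStreamSynchronize
 2.3        482940448         18  26830024.9      821   476609355  cudaFree
"""


def parse_row(line):
    """Parse one summary row; the name is everything after the sixth column."""
    pct, total, calls, avg, lo, hi, name = line.split(None, 6)
    return {
        "name": name.strip(),
        "pct": float(pct),
        "total_ns": int(total),
        "calls": int(calls),
        "avg_ns": float(avg),
        "min_ns": int(lo),
        "max_ns": int(hi),
    }


def max_share(row):
    """Fraction of the entry's total time spent in its single slowest call."""
    return row["max_ns"] / row["total_ns"]


rows = [parse_row(line) for line in SAMPLE.strip().splitlines()]

for row in rows:
    mean_ns = row["total_ns"] / row["calls"]  # should match the Average column
    print(
        f'{row["name"]:<24} total={row["total_ns"] / 1e6:10.1f} ms  '
        f'mean={mean_ns / 1e3:10.1f} us  slowest call share={max_share(row):6.1%}'
    )
```

For these rows, almost all of the `cudaMalloc` and `cudaFree` time comes from one call each, while `cudaLaunchKernel` is spread evenly over about 1.1 million calls. The same check can be applied to the rows of the OneFlow summary.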

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Instances Average Minimum Maximum Name

17.1       4743979003      28000  169427.8    99008   234688  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.7       3226121943      83002   38868.0     4416   255136  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 8.7       2419667331      49000   49381.0     5408   202976  void at::native::vectorized_elementwise_kernel<4, at::native::softplus_kernel(at::TensorIterator&, …
 5.3       1461215572      15000   97414.4     3232   128864  void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::Tens…
 4.6       1261166335      32000   39411.4     4320   235200  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 128, 1>(float*,…
 3.7       1031306847       7000  147329.5    33600   350048  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
 3.7       1020162021       6000  170027.0   164417   209408  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
 3.6        993239499       7000  141891.4   138401   144320  void at::native::vectorized_elementwise_kernel<4, at::native::softplus_backward_kernel(at::TensorIt…
 3.6        990449261     161000    6151.9     4256   188512  void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
 3.4        935567665       7000  133652.5    32512   325441  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
 3.3        914717375       7000  130673.9   119808   248960  void at::native::reduce_kernel<128, 4, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
 2.7        751298962       6000  125216.5   121888   289633  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
 2.6        731521425      17000   43030.7     5184   223424  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 64, 64>(float*,…
 2.4        671549852       2000  335774.9   284737   475680  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 2.3        625225596      19000   32906.6    24031   214177  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
 2.0        552416751      52000   10623.4     2881   259424  void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor<float, float>, at::d…
 1.5        419654208      29009   14466.3     3264   186688  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, at::detail::Array…
 1.3        370691436       2000  185345.7   171616   238112  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
 1.3        347887402      54000    6442.4     4480   203841  void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
 1.2        346015529      59014    5863.3     4321   187456  void at::native::(anonymous namespace)::weight_norm_fwd_first_dim_kernel<float, float>(float*, floa…
 1.2        319314683       3000  106438.2    60288   213664  ampere_sgemm_128x64_tn
 1.1        299271474      60000    4987.9     3232     7840  void at::native::vectorized_elementwise_kernel<4, at::native::sin_kernel_cuda(at::TensorIterator&):…
 1.1        299015827      60000    4983.6     3136     7616  void at::native::vectorized_elementwise_kernel<4, at::native::cos_kernel_cuda(at::TensorIterator&):…
 1.1        298473999      33000    9044.7     3169   156448  void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
 0.9        258512720      18000   14361.8     4928    66144  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
 0.9        255586020       1000  255586.0   252800   299328  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
 0.5        139632350       1000  139632.4   137216   142049  void at::native::unrolled_elementwise_kernel<at::native::softplus_backward_kernel(at::TensorIterato…
 0.5        134033054      32000    4188.5     3169   193888  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
 0.5        127006064       4000   31751.5    19200    77824  ampere_sgemm_32x128_tn
 0.4         98559958       1000   98560.0    92448   110400  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
 0.3         89613656       1000   89613.7    86560    95424  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
 0.3         83070191       1000   83070.2    82176    92801  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
 0.3         82813174      19000    4358.6     2976     7072  void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor<float>, at::detail::Array<…
 0.3         78349734      17000    4608.8     3360     6336  void at::native::vectorized_elementwise_kernel<4, at::native::DivFunctor<float>, at::detail::Array<…
 0.3         76501273       1000   76501.3    74656    86688  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
 0.2         59776342       7000    8539.5     5024    15072  void at::native::unrolled_elementwise_kernel<at::native::MulScalarFunctor<float, float>, at::detail…
 0.2         59416240       8014    7414.1     5857     9952  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::NormTwoOps<float, fl…
 0.2         57802308       1000   57802.3    56512    73984  void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 1024>(TensorInfo<f…
 0.2         57679111      12000    4806.6     3392     7136  void at::native::vectorized_elementwise_kernel<4, at::native::sigmoid_kernel_cuda(at::TensorIterato…
 0.2         57532221       6000    9588.7     6656    15872  _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt10multipliesIfEEENSt9enable_ifIXntsr3c…
 0.2         55069919       8000    6883.7     5984     9120  void at::native::_scatter_gather_elementwise_kernel<64, 4, at::native::_cuda_scatter_gather_interna…
 0.2         54685328       4000   13671.3    12352    17504  void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 128>(TensorInfo<fl…
 0.2         50226806      11000    4566.1     3871   186208  void at::native::vectorized_elementwise_kernel<4, at::native::neg_kernel_cuda(at::TensorIterator&):…
 0.2         48343357       7000    6906.2     5856     9408  void at::native::unrolled_elementwise_kernel<at::native::DivFunctor<float>, at::detail::Array<char*…
 0.2         47398192       8000    5924.8     4575     7936  void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareLTFunctor…
 0.1         38678373       9000    4297.6     4095     5888  void at::native::vectorized_elementwise_kernel<4, at::native::AUnaryFunctor<at::native::AddFunctor<…
 0.1         34909773       4000    8727.4     7360    11680  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MinOps<float>, unsig…
 0.1         34577085       9032    3828.3     3392     5312  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
 0.1         33286714       8000    4160.8     3935     5728  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
 0.1         32970112       4000    8242.5     6688    12064  _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt4plusIfEEENSt9enable_ifIXntsr3c1010is_…
 0.1         32203511       5000    6440.7     4896     9345  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 128, 1>(float*,…
 0.1         31983060       3000   10661.0     9408    13792  void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_ke…
 0.1         29858411       7000    4265.5     3583     5696  void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&…
 0.1         26361151       4000    6590.3     5440     8255  void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
 0.1         25538805       4000    6384.7     5920     8097  void at::native::(anonymous namespace)::searchsorted_cuda_kernel<float, long>(long*, float const*, …
 0.1         24705557       6000    4117.6     2912     5280  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::CompareLTFu…
 0.1         24654968       4000    6163.7     4992     7776  void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 3, 128, 1>(long*, a…
 0.1         24323868       5000    4864.8     3776     6336  void fillSliceWithIndex<unsigned int, 2>(TensorInfo<long, unsigned int>, unsigned int, unsigned int…
 0.1         21275934       6004    3543.6     3360     4960  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::linspace_cuda_out(at::Te…
 0.1         18267190       3000    6089.1     5824     7616  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 0.1         17843824       3000    5947.9     4640     7808  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
 0.1         16921323       4000    4230.3     4064     5792  void at::native::vectorized_elementwise_kernel<4, at::native::BitwiseOrFunctor<bool>, at::detail::A…
 0.1         16632258       4000    4158.1     3040     5376  void at::native::vectorized_elementwise_kernel<4, at::native::maximum_kernel_cuda(at::TensorIterato…
 0.1         16625110       4000    4156.3     3968     5279  void at::native::vectorized_elementwise_kernel<4, at::native::minimum_kernel_cuda(at::TensorIterato…
 0.1         16584496       4000    4146.1     4000     5152  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::MulFunctor<…
 0.1         16443999       4000    4111.0     3744     5184  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::where_kernel_i…
 0.0         13843055       2000    6921.5     5696     8896  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
 0.0         11622419       3000    3874.1     3360     4799  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(at::Tens…
 0.0         10615435       2000    5307.7     4704     6912  void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
 0.0          9529063       2000    4764.5     4224     6112  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 1, 128, 1>(float*,…
 0.0          9270520       2052    4517.8     4319     7168  _ZN2at6native90_GLOBAL__N__66_tmpxft_00005537_00000000_13_DistributionUniform_compute_86_cpp1_ii_8b…
 0.0          9209249       2000    4604.6     4384     5984  void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
 0.0          8748149       1000    8748.1     7456    11776  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MaxOps<float>, unsig…
 0.0          8487633       2000    4243.8     3392     6176  void at::native::vectorized_elementwise_kernel<4, at::native::exp_kernel_cuda(at::TensorIterator&):…
 0.0          8457925       2000    4229.0     3711     5632  void at::native::vectorized_elementwise_kernel<4, at::native::reciprocal_kernel_cuda(at::TensorIter…
 0.0          8309075       2000    4154.5     3328     5376  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::pow_tensor_sca…
 0.0          8126762       1000    8126.8     7968    10080  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 64, 64>(float*,…
 0.0          6878162       1000    6878.2     5856     8577  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MeanOps<float, float…
 0.0          6362163       1000    6362.2     6080     7872  void at::native::unrolled_elementwise_kernel<at::native::reciprocal_kernel_cuda(at::TensorIterator&…
 0.0          5839958       1000    5840.0     5600     7328  void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_scalar_k…
 0.0          5478436       1000    5478.4     5312     6881  void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 2, 128, 1>(long*, a…
 0.0          4928042       1000    4928.0     3776     6017  void at::native::kernel_pointwise_flip_apply2<float, long>(at::cuda::detail::TensorInfo<float, long…
 0.0          4098339       1000    4098.3     3776     5184  void at::native::vectorized_elementwise_kernel<4, at::native::log10_kernel_cuda(at::TensorIterator&…
 0.0          3979337       1000    3979.3     3839     5088  void at::native::vectorized_elementwise_kernel<4, at::native::AbsFunctor<float>, at::detail::Array<…
 0.0          3974670       1000    3974.7     3712     5184  void at::native::vectorized_elementwise_kernel<4, at::native::sqrt_kernel_cuda(at::TensorIterator&)…
 0.0          3671683       1000    3671.7     3520     5152  void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
 0.0            58080          8    7260.0     6848     7424  _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
 0.0            24256          4    6064.0     5568     6336  void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_tensor_k…
 0.0            11680          2    5840.0     5728     5952  void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<float>, at::detail::Array<char…
 0.0             9856          1    9856.0     9856     9856  void sgetrf_batched_smallsq_noshfl_kernel<4, 4>(float**, int, int**, int*, int)
 0.0             8991          1    8991.0     8991     8991  void trsm_template_batched_lNL_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
 0.0             8991          1    8991.0     8991     8991  slaswp_columnserial_kernel_batched(int, float**, int, int, int, int**)
 0.0             8864          1    8864.0     8864     8864  void trsm_template_batched_lNU_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
 0.0             8448          2    4224.0     4000     4448  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<int>, at::detail::Array<c…
 0.0             7712          1    7712.0     7712     7712  slaset_full_kernel_batched(int, int, float, float, float**, int)
 0.0             7040          1    7040.0     7040     7040  _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...

Time(%)  Total Time (ns)  Operations  Average  Minimum  Maximum  Operation

   84.3        380885037       16000  23805.3     3168   251008  [CUDA memcpy DtoD]
    7.6         34510530       16000   2156.9     2079     3649  [CUDA memset]
    6.3         28298124       12002   2357.8     1312   179361  [CUDA memcpy DtoH]
    1.8          8248738        3041   2712.5     2079     4576  [CUDA memcpy HtoD]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...

        Total  Operations    Average  Minimum    Maximum  Operation

      324.219       16000      0.020    0.008      0.035  [CUDA memset]
    28043.219       12002      2.337    0.004      6.000  [CUDA memcpy DtoH]
    20020.051        3041      6.583    0.004     20.000  [CUDA memcpy HtoD]
200066000.000       16000  12504.125    2.000  65792.000  [CUDA memcpy DtoD]

Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...

Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name

35.3      45957635098        472    97367871.0     3717    100176399  poll
31.5      40984272066       2002    20471664.4    23034    124189592  pthread_cond_wait
31.4      40931464392         17  2407733199.5    14478  10000184789  sem_timedwait
 1.5       1913495171       4241      451189.6     1152    290886421  ioctl
 0.1         97439113         25     3897564.5   173359      5066508  pthread_cond_timedwait
 0.1         82557276      40664        2030.2     1001       160924  sched_yield
 0.0         50803760       1443       35207.0     1002     31465169  read
 0.0         27119248      17573        1543.2     1012       134184  fread
 0.0         22618854        257       88011.1    54072       304738  sleep
 0.0         19923053         11     1811186.6     1613     19820359  open
 0.0         14003578        196       71446.8    51397       346887  pthread_create
 0.0         10542299       2041        5165.3     1082       105701  pthread_cond_signal
 0.0          7664737       1209        6339.7     1764       372275  open64
 0.0          7222006        279       25885.3     1002        72036  fgets
 0.0          7170386         33      217284.4     1633       688264  pthread_mutex_lock
 0.0          4185312        365       11466.6     1843       129044  write
 0.0          3386553        334       10139.4     1072       122663  munmap
 0.0          3115098        283       11007.4     1012       526678  mmap64
 0.0          3056612        178       17172.0     1062       856552  fopen
 0.0          2969495          1     2969495.0  2969495      2969495  fork
 0.0          1906173        331        5758.8     1352        91363  mmap
 0.0          1238575         62       19977.0     3907       634191  pthread_join
 0.0           958327          6      159721.2    56376       314757  sem_wait
 0.0           474301        158        3001.9     1012         7173  fclose
 0.0           348921         68        5131.2     1162        10480  fopen64
 0.0            47961          1       47961.0    47961        47961  waitpid
 0.0            47068         31        1518.3     1002         9828  fflush
 0.0            41709          3       13903.0     2084        28494  pipe2
 0.0            21371          2       10685.5     6011        15360  socket
 0.0            17433          7        2490.4     1172         3637  fcntl
 0.0            16752          2        8376.0     2485        14267  fgetc
 0.0            11371          8        1421.4     1002         1874  pthread_mutex_trylock
 0.0             9489          1        9489.0     9489         9489  connect
 0.0             3828          1        3828.0     3828         3828  bind
 0.0             3678          2        1839.0     1413         2265  sigaction
 0.0             1523          1        1523.0     1523         1523  fputs_unlocked

Could you also attach the nsys qdrep file?

Aug 15 '22 03:08 lixinqi

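From a rough look, oneflow seems to issue more cudaLaunchKernel and cudaStreamSynchronize calls than pytorch. Some ops in oneflow may be composed from several smaller ops, which would launch extra CUDA kernels.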
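To make the call-count comparison concrete, here is a minimal sketch that parses rows of an nsys summary table and compares how often an API is called in two runs. The names `parse_summary`, `call_count_ratio` and `ApiStat` are made up for illustration, and the column layout (Time%, Total, Calls, Average, Minimum, Maximum, Name) is assumed from the tables pasted in this thread.

```python
from typing import Dict, NamedTuple


class ApiStat(NamedTuple):
    total_ns: int
    num_calls: int


def parse_summary(text: str) -> Dict[str, ApiStat]:
    """Parse data rows of an nsys stats table into {name: ApiStat}."""
    stats: Dict[str, ApiStat] = {}
    for line in text.splitlines():
        # Kernel names may contain spaces, so split off only the 6 numeric columns.
        parts = line.split(None, 6)
        if len(parts) != 7:
            continue
        try:
            total_ns, num_calls = int(parts[1]), int(parts[2])
        except ValueError:
            continue  # header or other non-data line
        stats[parts[6].strip()] = ApiStat(total_ns, num_calls)
    return stats


def call_count_ratio(a: Dict[str, ApiStat], b: Dict[str, ApiStat], name: str) -> float:
    """How many times more often `name` is called in summary `a` than in `b`."""
    return a[name].num_calls / b[name].num_calls


# Two rows copied from the oneflow cudaapisum output earlier in this thread.
ONEFLOW_ROWS = """
46.1      12711800961    1747010       7276.3     4118   403176487  cudaLaunchKernel
19.9       5496537524       2005    2741415.2     4929     4737916  cudaStreamSynchronize
"""

oneflow_stats = parse_summary(ONEFLOW_ROWS)
for name, stat in oneflow_stats.items():
    print(f"{name}: {stat.num_calls} calls, {stat.total_ns / 1e9:.2f} s total")
```

Feeding the pytorch summary through the same `parse_summary` and calling `call_count_ratio(oneflow_stats, pytorch_stats, "cudaLaunchKernel")` would put a number on the difference, provided both runs cover the same number of iterations.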

Aug 15 '22 03:08 shangguanshiyuan

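The summary is mainly useful for comparing a specific kernel before and after an optimization. To compare against pytorch, it is better to look at the nsys qdrep file first.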

Aug 15 '22 03:08 lixinqi

From https://oneflow-test.oss-cn-beijing.aliyuncs.com/NeuS/nsys/report1.qdrep you can see that there appear to be many CPU ops between the CUDA kernels. Perhaps some code hard-codes the cpu device type.
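One way to hunt for such a hard-coded device is to list every tensor that still lives on the CPU. The sketch below is framework-agnostic and only assumes that each tensor exposes `.device.type` as a string and that the model offers `named_parameters()` and `named_buffers()`, as pytorch's `nn.Module` does; that oneflow behaves the same way for local tensors is an assumption here. `find_cpu_tensors` and `report_cpu_state` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple


def find_cpu_tensors(named_tensors: Iterable[Tuple[str, object]]) -> List[str]:
    """Return the names of tensors whose device type is "cpu".

    Assumes each tensor exposes `.device.type` as a string; objects
    without that attribute are skipped.
    """
    offenders = []
    for name, tensor in named_tensors:
        device = getattr(tensor, "device", None)
        if getattr(device, "type", None) == "cpu":
            offenders.append(name)
    return offenders


def report_cpu_state(model) -> List[str]:
    """Check a module's parameters and buffers for CPU placement."""
    names = find_cpu_tensors(model.named_parameters())
    names += find_cpu_tensors(model.named_buffers())
    return names
```

Calling `report_cpu_state(model)` after the model has been moved to the GPU should return an empty list. Tensors created inside the training loop can be checked the same way by passing `(name, tensor)` pairs to `find_cpu_tensors`.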

Aug 15 '22 04:08 lixinqi

@yoonlee888 When profiling, you can run fewer iterations. Otherwise the file gets too large.
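A short run is usually enough to catch the occasional slow step described at the top of this thread (about 30 ms normally, about 1 s for the outliers). The following sketch times a small, fixed number of iterations and flags the ones that are far slower than the median. `time_steps`, `find_spikes`, `step_fn` and `sync` are illustrative names; `sync` is an optional callable that waits for queued GPU work (for example `torch.cuda.synchronize` in pytorch), and the 10x threshold is an arbitrary choice.

```python
import statistics
import time
from typing import Callable, List, Optional


def time_steps(
    step_fn: Callable[[], None],
    warmup: int = 5,
    iters: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Run warmup + iters steps and return seconds per step for the timed ones."""
    durations = []
    for i in range(warmup + iters):
        start = time.perf_counter()
        step_fn()
        if sync is not None:
            sync()  # wait for queued GPU work so the whole step is counted
        elapsed = time.perf_counter() - start
        if i >= warmup:
            durations.append(elapsed)
    return durations


def find_spikes(durations: List[float], factor: float = 10.0) -> List[int]:
    """Indices of steps that took more than `factor` times the median step."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]
```

Wrapping one forward pass in `step_fn` and running, say, 30 timed iterations under nsys keeps the report small while still covering the roughly every-10-steps spike reported in the summary.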

Aug 15 '22 04:08 lixinqi

From https://oneflow-test.oss-cn-beijing.aliyuncs.com/NeuS/nsys/report1.qdrep you can see that there appear to be many CPU ops between the CUDA kernels. Perhaps some code hard-codes the cpu device type.

OK, I'll check where something is initialized on cpu. Thanks.

Aug 15 '22 06:08 yoonlee888
