NeuS network training is slow
Summary
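Our implementation of the NeuS 3D-reconstruction network takes too long to train overall. I used time.time() to measure the forward and backward time of every iteration and compared the results with pytorch: the average forward time per step () is far larger than in pytorch (), while the backward time differs little. Running only the forward pass in oneflow, I found that roughly every 10 steps one iteration is especially slow, about 1 s; apart from those, the per-step time is about the same as in pytorch.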
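These are the per-iteration times for the network running on oneflow, measured with time.time() in the Python script. Most iterations are normal, at around 30 ms, but occasionally one takes about 1 s.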
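Since CUDA kernels are launched asynchronously, a wall-clock timer around a step only reflects GPU work if the device is synchronized before the clock is read. The following is a minimal sketch of that kind of per-step measurement, not the code used in this report. `time_steps`, `find_spikes`, and the `sync` argument are hypothetical names; in PyTorch `sync` could be `torch.cuda.synchronize`, and it is assumed here that oneflow exposes an equivalent call. It uses `time.perf_counter()` rather than `time.time()` because it is monotonic.

```python
import time
from statistics import median


def time_steps(step_fn, n_steps, sync=None):
    """Return the wall-clock duration (seconds) of each of n_steps calls.

    sync is an optional zero-argument callable that blocks until all queued
    GPU work has finished (an assumption: e.g. torch.cuda.synchronize).
    """
    durations = []
    for _ in range(n_steps):
        if sync is not None:
            sync()  # drain work left over from the previous step
        start = time.perf_counter()
        step_fn()
        if sync is not None:
            sync()  # wait for this step's kernels before stopping the clock
        durations.append(time.perf_counter() - start)
    return durations


def find_spikes(durations, factor=5.0):
    """Return indices of steps that took more than factor x the median."""
    typical = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * typical]


if __name__ == "__main__":
    # Stand-in for a forward pass; replace with the real step function.
    def fake_step():
        time.sleep(0.001)

    times = time_steps(fake_step, n_steps=20)
    print("median step (ms):", median(times) * 1e3)
    print("spike indices:", find_spikes(times))
```

Comparing the spike indices with and without `sync` would indicate whether the occasional 1 s iterations are real GPU time or host-side stalls that only show up at certain steps.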
Hello, and thank you for the feedback. If convenient, please share the relevant code snippets; the corresponding pytorch code would be even better.
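This is probably the issue that colleagues from Zhejiang Lab reported in the group chat. The guesses so far were that printing the loss is responsible, or that some kernel is too slow and triggers the back-pressure mechanism.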
These are the nsys results for the oneflow framework:
Using report1.sqlite export for stats reports. Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
46.1 12711800961 1747010 7276.3 4118 403176487 cudaLaunchKernel
19.9 5496537524 2005 2741415.2 4929 4737916 cudaStreamSynchronize
14.6 4031587693 6078605 663.2 541 405742 cudaEventQuery
11.3 3105266453 28 110902373.3 2614 3105158569 cudaDeviceSynchronize
2.7 743080939 1423 522193.2 491 74279641 cudaEventCreateWithFlags
2.6 726735566 24 30280648.6 1624 718796969 cudaStreamCreateWithFlags
1.2 342830975 21072 16269.5 4017 1605465 cudaMemcpyAsync
0.6 178897182 139 1287030.1 2855 26218483 cudaMalloc
0.3 72037957 147 490054.1 491 21929000 cudaFree
0.2 51232928 10012 5117.2 3176 36960 cudaMemsetAsync
0.1 39216972 24024 1632.4 721 29376 cudaEventRecord
0.1 22998636 3 7666212.0 11893 22972716 cudaHostAlloc
0.1 22652045 4 5663011.3 24857 22010224 cudaFreeHost
0.1 15068004 12 1255667.0 1894 14636442 cudaStreamCreateWithPriority
0.1 14050235 99 141921.6 15229 689332 cudaMemset
0.0 2548437 1423 1790.9 441 702086 cudaEventDestroy
0.0 1750521 1 1750521.0 1750521 1750521 cudaMallocHost
0.0 1316224 5 263244.8 25889 460666 cudaMemcpy
0.0 303119 39 7772.3 5421 46108 cudaStreamDestroy
0.0 132712 3 44237.3 6031 115770 cudaStreamCreate
0.0 1683 1 1683.0 1683 1683 cuInit
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
11.2 3585473768 28000 128052.6 70337 240610 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.0 3508108269 57000 61545.8 4608 120993 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
7.6 2416975189 189000 12788.2 3488 262403 void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 4ul, int>(oneflow::ep::primit…
4.7 1512021232 15000 100801.4 4064 130881 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
3.7 1185983039 6000 197663.8 174018 223203 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
3.7 1181605575 114000 10365.0 4352 127200 void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 2ul>(int, oneflow::…
3.2 1037712944 77000 13476.8 3584 147425 void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy6…
2.8 897445950 7000 128206.6 30400 277187 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
2.7 862318267 5000 172463.7 105313 378275 void oneflow::user_op::DoCUDADimGather<float, int>(oneflow::NdIndexOffsetHelper<int, 8>, oneflow::N…
2.6 820490662 14000 58606.5 5376 154626 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
2.5 792329634 7000 113189.9 30240 272003 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
2.4 781378051 6000 130229.7 117440 149473 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
2.3 746593555 124000 6020.9 3936 11424 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
2.3 728704402 2000 364352.2 283491 468324 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
2.0 635688009 10000 63568.8 50209 84001 void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<float, int, int>::Policy700, fa…
1.9 620273851 8000 77534.2 65152 103489 void oneflow::user_op::DoCUDADimGather<float, long>(oneflow::NdIndexOffsetHelper<long, 8>, oneflow:…
1.9 618104618 18000 34339.1 5472 73313 void oneflow::(anonymous namespace)::MatrixColReduceByWarpBlock<oneflow::BinaryFuncSum, float, int,…
1.9 599720207 120000 4997.7 3456 189058 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::ep::primitive::broadcast_elementwi…
1.8 586331824 19000 30859.6 21888 129377 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
1.8 577033690 49000 11776.2 3648 141409 void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 16ul, int>(oneflow::ep::primi…
1.7 557819800 20000 27891.0 23392 188706 void cub::DeviceSegmentedRadixSortKernel<cub::DeviceRadixSortPolicy<float, int, int>::Policy700, tr…
1.6 496884481 6000 82814.1 44544 238531 void oneflow::(anonymous namespace)::CumsumForwardGpuDownSpaceIs1<float, oneflow::BinaryFuncMul>(fl…
1.3 410017621 15000 27334.5 3712 100608 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
1.2 384250064 68000 5650.7 3520 10976 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::AbsFunctor, float>(lon…
1.2 382925471 2000 191462.7 164546 225730 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
1.1 353719658 5000 70743.9 4224 126913 void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<cub::KeyValuePair<int, float>, cub::K…
1.0 330259081 32000 10320.6 3424 157538 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
1.0 326702311 4000 81675.6 52480 119585 void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Min>::Policy6…
1.0 316943119 60000 5282.4 3904 10336 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.9 281895644 54000 5220.3 3488 10400 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::CosFunctor, float>(lon…
0.9 281361897 54000 5210.4 3360 11712 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SinFunctor, float>(lon…
0.9 276974170 52000 5326.4 3712 9056 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.9 274116137 35000 7831.9 4448 15488 void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 3ul>(int, oneflow::…
0.8 255028137 3000 85009.4 49184 113889 ampere_sgemm_128x64_tn
0.7 225930798 1000 225930.8 201410 257186 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
0.7 225768177 49000 4607.5 3008 9984 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.7 213776534 4000 53444.1 34496 86720 void oneflow::(anonymous namespace)::CumsumForwardGpuDownSpaceIs1<float, oneflow::BinaryFuncAdd>(fl…
0.6 197302527 41000 4812.3 3361 10112 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.6 176546824 39000 4526.8 3328 195938 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.5 155914199 32000 4872.3 3648 10239 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.4 141813865 3000 47271.3 9152 127329 void oneflow::(anonymous namespace)::SliceBackwardGpu<unsigned int, 2>(int, oneflow::SliceParams, o…
0.4 133793199 4000 33448.3 18912 78593 ampere_sgemm_32x128_tn
0.3 93799280 17000 5517.6 3264 8160 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.3 93506474 22001 4250.1 3680 6944 void oneflow::ep::primitive::(anonymous namespace)::FillGpu<float, 4ul>(float*, float, unsigned lon…
0.3 88183956 1000 88184.0 84032 200098 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
0.3 81435406 13000 6264.3 4416 9376 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.3 80902347 1000 80902.3 74753 88194 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
0.3 79896709 1000 79896.7 76512 84353 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
0.2 74021695 1000 74021.7 68577 81249 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
0.2 73915671 13000 5685.8 3520 8448 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.2 69778131 11000 6343.5 3904 195202 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.2 67096953 11000 6099.7 3904 11424 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.2 64703182 12000 5391.9 4448 9505 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SigmoidFunctor, float>…
0.2 64053090 14000 4575.2 3936 8832 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.2 54780090 4000 13695.0 5887 24352 void oneflow::(anonymous namespace)::NdarrayReduceGpuInplaceReduceAxis<float, 3, oneflow::BinaryFun…
0.2 54248406 11000 4931.7 4288 8864 void oneflow::ep::primitive::(anonymous namespace)::CopyNdKernel<2ul, 8ul, int>(oneflow::ep::primit…
0.1 45596545 9000 5066.3 4095 8128 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::NegativeFunctor, float…
0.1 38781572 6000 6463.6 4799 10241 void oneflow::(anonymous namespace)::CudaGatherNd<float, long>(oneflow::NdIndexSliceArgs<float, lon…
0.1 38268351 9000 4252.0 3712 6529 void oneflow::ep::primitive::(anonymous namespace)::FillGpu<long, 2ul>(long*, long, unsigned long)
0.1 37761279 6000 6293.5 4352 7520 void oneflow::(anonymous namespace)::MathUnaryElementwiseBackwardGpu<oneflow::CosFunctor, float>(lo…
0.1 37630304 6000 6271.7 5152 7552 void oneflow::(anonymous namespace)::MathUnaryElementwiseBackwardGpu<oneflow::SinFunctor, float>(lo…
0.1 35944761 8000 4493.1 3903 6656 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 34553634 6000 5758.9 4575 8992 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.1 33200374 7000 4742.9 3232 9729 void oneflow::(anonymous namespace)::CudaClipForward<float, oneflow::ClipByMinMaxFunctor<float> >(o…
0.1 27297147 6000 4549.5 3680 7168 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 25057916 4000 6264.5 5440 8832 void oneflow::DoSearchSortedLogical<float, long>(int, bool, long, long, bool, float const*, float c…
0.1 24937028 6004 4153.4 3648 6145 void oneflow::user_op::ArangeForwardGpuKernel<float>(float, float, long, float*)
0.1 24458428 5000 4891.7 3392 9600 void oneflow::(anonymous namespace)::WriteKeysToOutput<float>(int, cub::KeyValuePair<int, float> co…
0.1 24092791 5000 4818.6 4096 8544 oneflow::(anonymous namespace)::InitializeIndices(int, int*, int)
0.1 22525727 5000 4505.1 3840 5600 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.1 21431272 3000 7143.8 6336 8385 void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<long, int, 2ul>(int, oneflow::(…
0.1 20133144 4000 5033.3 4383 9152 void oneflow::cuda::elementwise::ApplyGeneric<8, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 18871696 4000 4717.9 4160 7104 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 18704814 4000 4676.2 3168 7168 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 18489455 4000 4622.4 4000 6560 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 18481523 4000 4620.4 3713 8608 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.1 18302640 4000 4575.7 4032 6272 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.0 15382971 3000 5127.7 4383 6208 void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy…
0.0 15127826 2000 7563.9 6528 8992 void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
0.0 14943651 3000 4981.2 4095 6080 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::ReciprocalNoNanFunctor…
0.0 13636836 2000 6818.4 6080 7904 void oneflow::(anonymous namespace)::ToContiguousForwardGpuParallel<float, int, 4ul>(int, oneflow::…
0.0 11795732 2000 5897.9 4576 6848 void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
0.0 10968798 2000 5484.4 4735 6753 void oneflow::(anonymous namespace)::GenerateGpu<long>(curandStateXORWOW*, long, long*, long, long)
0.0 10703931 2000 5352.0 4640 7584 void oneflow::(anonymous namespace)::GenerateGpu<float>(curandStateXORWOW*, long, float*, float, fl…
0.0 10457145 2000 5228.6 4672 6144 void oneflow::(anonymous namespace)::NdarrayAssignReducedGpu<float, float, 3>(oneflow::XpuVarNdarra…
0.0 10139682 2000 5069.8 4480 5793 void cub::DeviceReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy600, float…
0.0 9964865 2000 4982.4 4192 6015 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::ExpFunctor, float>(lon…
0.0 9604374 2000 4802.2 4032 5695 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::LogFunctor, float>(lon…
0.0 9454449 2000 4727.2 4063 5728 void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<float, float, int, cub::Sum>::Policy…
0.0 9055546 2000 4527.8 3935 5504 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.0 6591979 1000 6592.0 5888 8800 void oneflow::(anonymous namespace)::FlipGpuForward<float>(int, long, oneflow::(anonymous namespace…
0.0 6165883 1000 6165.9 5472 7232 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.0 5754254 1000 5754.3 5088 6720 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.0 5543745 1000 5543.7 4865 6368 void cub::DeviceSegmentedReduceKernel<cub::DeviceReducePolicy<float, float, int, cub::Max>::Policy6…
0.0 5525850 1000 5525.9 4960 6336 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::ep::primitive::broadcast_elementwi…
0.0 5225746 1000 5225.7 3872 6432 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.0 5218867 1000 5218.9 4512 6048 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.0 5215081 1000 5215.1 4639 6112 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.0 5130604 1000 5130.6 4512 6016 void oneflow::ep::primitive::broadcast_elementwise_binary::(anonymous namespace)::BroadcastElementw…
0.0 4803998 1000 4804.0 4096 5632 void oneflow::(anonymous namespace)::MathUnaryElementwiseForwardGpu<oneflow::SqrtFunctor, float>(lo…
0.0 4702151 1000 4702.2 4128 5472 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.0 4609196 1000 4609.2 4096 5408 void oneflow::cuda::elementwise::ApplyGeneric<2, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.0 4553745 1000 4553.7 3648 5377 void oneflow::cuda::elementwise::ApplyGeneric<1, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.0 4509025 1000 4509.0 3744 5280 void oneflow::cuda::elementwise::ApplyGeneric<2, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.0 12448 2 6224.0 6144 6304 void oneflow::cuda::elementwise::ApplyGeneric<4, true, oneflow::cuda::elementwise::SimpleFactory<on…
0.0 11487 2 5743.5 5727 5760 void oneflow::cuda::elementwise::ApplyGeneric<4, false, oneflow::cuda::elementwise::SimpleFactory<o…
0.0 6144 1 6144.0 6144 6144 oneflow::one::(anonymous namespace)::InitCurandStatesKernel(unsigned long, curandStateXORWOW*, onef…
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...
Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
55.2 169987000 10111 16812.1 2080 79137 [CUDA memset]
29.7 91424299 9075 10074.3 2080 351780 [CUDA memcpy HtoD]
10.1 31053334 6001 5174.7 4031 10336 [CUDA memcpy DtoD]
5.0 15493532 6001 2581.8 1312 5632 [CUDA memcpy DtoH]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...
Total Operations Average Minimum Maximum Operation
4416005.906 6001 735.878 0.004 960.000 [CUDA memcpy DtoD]
28036.000 6001 4.672 4.000 36.000 [CUDA memcpy DtoH]
203579842.031 10111 20134.491 0.020 110592.000 [CUDA memset]
1189827.613 9075 131.110 0.004 4096.000 [CUDA memcpy HtoD]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
97.5 27323412856064 73731 370582426.1 1001 105480219776 pthread_cond_wait
1.5 421874697623 4 105468674405.8 105461333907 105472389064 epoll_wait
0.4 101226316698 1024 98853824.9 2525 100150542 poll
0.3 96200177898 21 4580960852.3 18295 10000604148 sem_timedwait
0.3 77960483493 1274 61193472.1 1122 3660687126 pthread_cond_timedwait
0.0 4619146864 3494 1322022.6 1012 300450955 ioctl
0.0 1372138903 6999 196047.9 1001 25037686 pthread_mutex_lock
0.0 803399274 4461 180094.0 1001 74133533 pthread_rwlock_wrlock
0.0 707050444 4978 142035.0 1001 64144535 read
0.0 393542417 73180 5377.7 1002 527423 pthread_cond_signal
0.0 119381105 1126 106022.3 1012 3189529 pthread_rwlock_rdlock
0.0 78713267 265 297031.2 57660 8843479 pthread_join
0.0 68576473 8550 8020.6 1603 16973389 openat
0.0 59715695 27932 2137.9 1001 56397 sched_yield
0.0 29977499 17611 1702.2 1002 17343 fread
0.0 25641702 396 64751.8 38613 302195 pthread_create
0.0 19751752 195 101291.0 1133 18217557 fopen
0.0 8079540 948 8522.7 2024 345688 open64
0.0 2971932 241 12331.7 1844 60004 write
0.0 2737852 214 12793.7 1483 235298 mmap
0.0 2415109 157 15382.9 1001 143292 fgets
0.0 2131646 188 11338.5 1022 518296 mmap64
0.0 1713949 441 3886.5 1674 79341 pthread_cond_broadcast
0.0 1611398 284 5673.9 1012 113907 munmap
0.0 531682 176 3020.9 1002 13054 fclose
0.0 442962 4 110740.5 77398 163520 sem_wait
0.0 344970 65 5307.2 3627 10500 fopen64
0.0 123906 8 15488.3 1744 26390 mprotect
0.0 111983 6 18663.8 5090 56538 open
0.0 55966 5 11193.2 2325 25359 socket
0.0 36829 3 12276.3 2334 17453 fflush
0.0 30097 1 30097.0 30097 30097 pipe2
0.0 18816 4 4704.0 1473 8536 bind
0.0 18395 8 2299.4 1102 3597 fcntl
0.0 14528 1 14528.0 14528 14528 connect
0.0 13876 2 6938.0 1142 12734 fwrite
0.0 13337 2 6668.5 2235 11102 fgetc
0.0 4980 1 4980.0 4980 4980 epoll_ctl
0.0 3727 2 1863.5 1413 2314 listen
0.0 3346 1 3346.0 3346 3346 signal
0.0 2594 2 1297.0 1051 1543 pthread_mutex_trylock
0.0 2515 1 2515.0 2515 2515 sigaction
0.0 2444 1 2444.0 2444 2444 fputs_unlocked
These are the nsys results for the pytorch framework:
Using report1.sqlite export for stats reports. Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
37.9 7820974612 1100149 7109.0 4669 3560676 cudaLaunchKernel
21.8 4508601183 219 20587220.0 4468 4393702431 cudaMalloc
20.7 4272509677 31041 137640.9 7113 8738285 cudaMemcpyAsync
16.3 3360510597 15042 223408.5 2765 4192212 cudaStreamSynchronize
2.3 482940448 18 26830024.9 821 476609355 cudaFree
0.7 136884097 16000 8555.3 3947 206842 cudaMemsetAsync
0.2 33074394 9000 3674.9 1272 35498 cudaEventQuery
0.1 30974277 9000 3441.6 1473 28925 cudaEventRecord
0.0 7241864 4061 1783.3 661 21901 cudaStreamIsCapturing_v10000
0.0 1292359 4 323089.8 4820 1273031 cudaHostAlloc
0.0 103446 2 51723.0 47169 56277 cudaMemcpy
0.0 80534 8 10066.8 3016 31430 cudaDeviceSynchronize
0.0 70489 36 1958.0 771 12453 cudaEventCreateWithFlags
0.0 48883 36 1357.9 591 4077 cudaEventDestroy
0.0 1362 1 1362.0 1362 1362 cuInit
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
17.1 4743979003 28000 169427.8 99008 234688 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.7 3226121943 83002 38868.0 4416 255136 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
8.7 2419667331 49000 49381.0 5408 202976 void at::native::vectorized_elementwise_kernel<4, at::native::softplus_kernel(at::TensorIterator&, …
5.3 1461215572 15000 97414.4 3232 128864 void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::Tens…
4.6 1261166335 32000 39411.4 4320 235200 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 128, 1>(float*,…
3.7 1031306847 7000 147329.5 33600 350048 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
3.7 1020162021 6000 170027.0 164417 209408 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
3.6 993239499 7000 141891.4 138401 144320 void at::native::vectorized_elementwise_kernel<4, at::native::softplus_backward_kernel(at::TensorIt…
3.6 990449261 161000 6151.9 4256 188512 void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
3.4 935567665 7000 133652.5 32512 325441 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
3.3 914717375 7000 130673.9 119808 248960 void at::native::reduce_kernel<128, 4, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
2.7 751298962 6000 125216.5 121888 289633 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
2.6 731521425 17000 43030.7 5184 223424 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 64, 64>(float*,…
2.4 671549852 2000 335774.9 284737 475680 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
2.3 625225596 19000 32906.6 24031 214177 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
2.0 552416751 52000 10623.4 2881 259424 void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor<float, float>, at::d…
1.5 419654208 29009 14466.3 3264 186688 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, at::detail::Array…
1.3 370691436 2000 185345.7 171616 238112 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
1.3 347887402 54000 6442.4 4480 203841 void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
1.2 346015529 59014 5863.3 4321 187456 void at::native::(anonymous namespace)::weight_norm_fwd_first_dim_kernel<float, float>(float*, floa…
1.2 319314683 3000 106438.2 60288 213664 ampere_sgemm_128x64_tn
1.1 299271474 60000 4987.9 3232 7840 void at::native::vectorized_elementwise_kernel<4, at::native::sin_kernel_cuda(at::TensorIterator&):…
1.1 299015827 60000 4983.6 3136 7616 void at::native::vectorized_elementwise_kernel<4, at::native::cos_kernel_cuda(at::TensorIterator&):…
1.1 298473999 33000 9044.7 3169 156448 void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
0.9 258512720 18000 14361.8 4928 66144 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
0.9 255586020 1000 255586.0 252800 299328 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
0.5 139632350 1000 139632.4 137216 142049 void at::native::unrolled_elementwise_kernel<at::native::softplus_backward_kernel(at::TensorIterato…
0.5 134033054 32000 4188.5 3169 193888 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
0.5 127006064 4000 31751.5 19200 77824 ampere_sgemm_32x128_tn
0.4 98559958 1000 98560.0 92448 110400 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
0.3 89613656 1000 89613.7 86560 95424 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
0.3 83070191 1000 83070.2 82176 92801 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
0.3 82813174 19000 4358.6 2976 7072 void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor<float>, at::detail::Array<…
0.3 78349734 17000 4608.8 3360 6336 void at::native::vectorized_elementwise_kernel<4, at::native::DivFunctor<float>, at::detail::Array<…
0.3 76501273 1000 76501.3 74656 86688 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
0.2 59776342 7000 8539.5 5024 15072 void at::native::unrolled_elementwise_kernel<at::native::MulScalarFunctor<float, float>, at::detail…
0.2 59416240 8014 7414.1 5857 9952 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::NormTwoOps<float, fl…
0.2 57802308 1000 57802.3 56512 73984 void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 1024>(TensorInfo<f…
0.2 57679111 12000 4806.6 3392 7136 void at::native::vectorized_elementwise_kernel<4, at::native::sigmoid_kernel_cuda(at::TensorIterato…
0.2 57532221 6000 9588.7 6656 15872 _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt10multipliesIfEEENSt9enable_ifIXntsr3c…
0.2 55069919 8000 6883.7 5984 9120 void at::native::_scatter_gather_elementwise_kernel<64, 4, at::native::_cuda_scatter_gather_interna…
0.2 54685328 4000 13671.3 12352 17504 void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 128>(TensorInfo<fl…
0.2 50226806 11000 4566.1 3871 186208 void at::native::vectorized_elementwise_kernel<4, at::native::neg_kernel_cuda(at::TensorIterator&):…
0.2 48343357 7000 6906.2 5856 9408 void at::native::unrolled_elementwise_kernel<at::native::DivFunctor<float>, at::detail::Array<char*…
0.2 47398192 8000 5924.8 4575 7936 void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareLTFunctor…
0.1 38678373 9000 4297.6 4095 5888 void at::native::vectorized_elementwise_kernel<4, at::native::AUnaryFunctor<at::native::AddFunctor<…
0.1 34909773 4000 8727.4 7360 11680 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MinOps<float>, unsig…
0.1 34577085 9032 3828.3 3392 5312 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
0.1 33286714 8000 4160.8 3935 5728 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
0.1 32970112 4000 8242.5 6688 12064 _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt4plusIfEEENSt9enable_ifIXntsr3c1010is_…
0.1 32203511 5000 6440.7 4896 9345 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 128, 1>(float*,…
0.1 31983060 3000 10661.0 9408 13792 void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_ke…
0.1 29858411 7000 4265.5 3583 5696 void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&…
0.1 26361151 4000 6590.3 5440 8255 void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
0.1 25538805 4000 6384.7 5920 8097 void at::native::(anonymous namespace)::searchsorted_cuda_kernel<float, long>(long*, float const*, …
0.1 24705557 6000 4117.6 2912 5280 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::CompareLTFu…
0.1 24654968 4000 6163.7 4992 7776 void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 3, 128, 1>(long*, a…
0.1 24323868 5000 4864.8 3776 6336 void fillSliceWithIndex<unsigned int, 2>(TensorInfo<long, unsigned int>, unsigned int, unsigned int…
0.1 21275934 6004 3543.6 3360 4960 void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::linspace_cuda_out(at::Te…
0.1 18267190 3000 6089.1 5824 7616 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
0.1 17843824 3000 5947.9 4640 7808 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
0.1 16921323 4000 4230.3 4064 5792 void at::native::vectorized_elementwise_kernel<4, at::native::BitwiseOrFunctor<bool>, at::detail::A…
0.1 16632258 4000 4158.1 3040 5376 void at::native::vectorized_elementwise_kernel<4, at::native::maximum_kernel_cuda(at::TensorIterato…
0.1 16625110 4000 4156.3 3968 5279 void at::native::vectorized_elementwise_kernel<4, at::native::minimum_kernel_cuda(at::TensorIterato…
0.1 16584496 4000 4146.1 4000 5152 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::MulFunctor<…
0.1 16443999 4000 4111.0 3744 5184 void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::where_kernel_i…
0.0 13843055 2000 6921.5 5696 8896 void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
0.0 11622419 3000 3874.1 3360 4799 void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(at::Tens…
0.0 10615435 2000 5307.7 4704 6912 void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
0.0 9529063 2000 4764.5 4224 6112 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 1, 128, 1>(float*,…
0.0 9270520 2052 4517.8 4319 7168 _ZN2at6native90_GLOBAL__N__66_tmpxft_00005537_00000000_13_DistributionUniform_compute_86_cpp1_ii_8b…
0.0 9209249 2000 4604.6 4384 5984 void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
0.0 8748149 1000 8748.1 7456 11776 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MaxOps<float>, unsig…
0.0 8487633 2000 4243.8 3392 6176 void at::native::vectorized_elementwise_kernel<4, at::native::exp_kernel_cuda(at::TensorIterator&):…
0.0 8457925 2000 4229.0 3711 5632 void at::native::vectorized_elementwise_kernel<4, at::native::reciprocal_kernel_cuda(at::TensorIter…
0.0 8309075 2000 4154.5 3328 5376 void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::pow_tensor_sca…
0.0 8126762 1000 8126.8 7968 10080 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 64, 64>(float*,…
0.0 6878162 1000 6878.2 5856 8577 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MeanOps<float, float…
0.0 6362163 1000 6362.2 6080 7872 void at::native::unrolled_elementwise_kernel<at::native::reciprocal_kernel_cuda(at::TensorIterator&…
0.0 5839958 1000 5840.0 5600 7328 void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_scalar_k…
0.0 5478436 1000 5478.4 5312 6881 void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 2, 128, 1>(long*, a…
0.0 4928042 1000 4928.0 3776 6017 void at::native::kernel_pointwise_flip_apply2<float, long>(at::cuda::detail::TensorInfo<float, long…
0.0 4098339 1000 4098.3 3776 5184 void at::native::vectorized_elementwise_kernel<4, at::native::log10_kernel_cuda(at::TensorIterator&…
0.0 3979337 1000 3979.3 3839 5088 void at::native::vectorized_elementwise_kernel<4, at::native::AbsFunctor<float>, at::detail::Array<…
0.0 3974670 1000 3974.7 3712 5184 void at::native::vectorized_elementwise_kernel<4, at::native::sqrt_kernel_cuda(at::TensorIterator&)…
0.0 3671683 1000 3671.7 3520 5152 void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
0.0 58080 8 7260.0 6848 7424 _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
0.0 24256 4 6064.0 5568 6336 void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_tensor_k…
0.0 11680 2 5840.0 5728 5952 void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<float>, at::detail::Array<char…
0.0 9856 1 9856.0 9856 9856 void sgetrf_batched_smallsq_noshfl_kernel<4, 4>(float**, int, int**, int*, int)
0.0 8991 1 8991.0 8991 8991 void trsm_template_batched_lNL_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
0.0 8991 1 8991.0 8991 8991 slaswp_columnserial_kernel_batched(int, float**, int, int, int, int**)
0.0 8864 1 8864.0 8864 8864 void trsm_template_batched_lNU_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
0.0 8448 2 4224.0 4000 4448 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<int>, at::detail::Array<c…
0.0 7712 1 7712.0 7712 7712 slaset_full_kernel_batched(int, int, float, float, float**, int)
0.0 7040 1 7040.0 7040 7040 _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...
Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
84.3 380885037 16000 23805.3 3168 251008 [CUDA memcpy DtoD]
7.6 34510530 16000 2156.9 2079 3649 [CUDA memset]
6.3 28298124 12002 2357.8 1312 179361 [CUDA memcpy DtoH]
1.8 8248738 3041 2712.5 2079 4576 [CUDA memcpy HtoD]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...
Total Operations Average Minimum Maximum Operation
324.219 16000 0.020 0.008 0.035 [CUDA memset]
28043.219 12002 2.337 0.004 6.000 [CUDA memcpy DtoH]
20020.051 3041 6.583 0.004 20.000 [CUDA memcpy HtoD]
200066000.000 16000 12504.125 2.000 65792.000 [CUDA memcpy DtoD]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
35.3 45957635098 472 97367871.0 3717 100176399 poll
31.5 40984272066 2002 20471664.4 23034 124189592 pthread_cond_wait
31.4 40931464392 17 2407733199.5 14478 10000184789 sem_timedwait
1.5 1913495171 4241 451189.6 1152 290886421 ioctl
0.1 97439113 25 3897564.5 173359 5066508 pthread_cond_timedwait
0.1 82557276 40664 2030.2 1001 160924 sched_yield
0.0 50803760 1443 35207.0 1002 31465169 read
0.0 27119248 17573 1543.2 1012 134184 fread
0.0 22618854 257 88011.1 54072 304738 sleep
0.0 19923053 11 1811186.6 1613 19820359 open
0.0 14003578 196 71446.8 51397 346887 pthread_create
0.0 10542299 2041 5165.3 1082 105701 pthread_cond_signal
0.0 7664737 1209 6339.7 1764 372275 open64
0.0 7222006 279 25885.3 1002 72036 fgets
0.0 7170386 33 217284.4 1633 688264 pthread_mutex_lock
0.0 4185312 365 11466.6 1843 129044 write
0.0 3386553 334 10139.4 1072 122663 munmap
0.0 3115098 283 11007.4 1012 526678 mmap64
0.0 3056612 178 17172.0 1062 856552 fopen
0.0 2969495 1 2969495.0 2969495 2969495 fork
0.0 1906173 331 5758.8 1352 91363 mmap
0.0 1238575 62 19977.0 3907 634191 pthread_join
0.0 958327 6 159721.2 56376 314757 sem_wait
0.0 474301 158 3001.9 1012 7173 fclose
0.0 348921 68 5131.2 1162 10480 fopen64
0.0 47961 1 47961.0 47961 47961 waitpid
0.0 47068 31 1518.3 1002 9828 fflush
0.0 41709 3 13903.0 2084 28494 pipe2
0.0 21371 2 10685.5 6011 15360 socket
0.0 17433 7 2490.4 1172 3637 fcntl
0.0 16752 2 8376.0 2485 14267 fgetc
0.0 11371 8 1421.4 1002 1874 pthread_mutex_trylock
0.0 9489 1 9489.0 9489 9489 connect
0.0 3828 1 3828.0 3828 3828 bind
0.0 3678 2 1839.0 1413 2265 sigaction
0.0 1523 1 1523.0 1523 1523 fputs_unlocked
This is the nsys result for the PyTorch framework:
Using report1.sqlite export for stats reports. Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/cudaapisum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
37.9 7820974612 1100149 7109.0 4669 3560676 cudaLaunchKernel
21.8 4508601183 219 20587220.0 4468 4393702431 cudaMalloc
20.7 4272509677 31041 137640.9 7113 8738285 cudaMemcpyAsync
16.3 3360510597 15042 223408.5 2765 4192212 cudaStreamSynchronize
2.3 482940448 18 26830024.9 821 476609355 cudaFree
0.7 136884097 16000 8555.3 3947 206842 cudaMemsetAsync
0.2 33074394 9000 3674.9 1272 35498 cudaEventQuery
0.1 30974277 9000 3441.6 1473 28925 cudaEventRecord
0.0 7241864 4061 1783.3 661 21901 cudaStreamIsCapturing_v10000
0.0 1292359 4 323089.8 4820 1273031 cudaHostAlloc
0.0 103446 2 51723.0 47169 56277 cudaMemcpy
0.0 80534 8 10066.8 3016 31430 cudaDeviceSynchronize
0.0 70489 36 1958.0 771 12453 cudaEventCreateWithFlags
0.0 48883 36 1357.9 591 4077 cudaEventDestroy
0.0 1362 1 1362.0 1362 1362 cuInit
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpukernsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
17.1 4743979003 28000 169427.8 99008 234688 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_tn_align4>(cutlass_80_tensorop_s168…
11.7 3226121943 83002 38868.0 4416 255136 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
8.7 2419667331 49000 49381.0 5408 202976 void at::native::vectorized_elementwise_kernel<4, at::native::softplus_kernel(at::TensorIterator&, …
5.3 1461215572 15000 97414.4 3232 128864 void at::native::vectorized_elementwise_kernel<4, at::native::threshold_kernel_impl<float>(at::Tens…
4.6 1261166335 32000 39411.4 4320 235200 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 128, 1>(float*,…
3.7 1031306847 7000 147329.5 33600 350048 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x256_32x3_tn_align1>(cutlass_80_tensorop_s168…
3.7 1020162021 6000 170027.0 164417 209408 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x6_nt_align4>(cutlass_80_tensorop_s1688…
3.6 993239499 7000 141891.4 138401 144320 void at::native::vectorized_elementwise_kernel<4, at::native::softplus_backward_kernel(at::TensorIt…
3.6 990449261 161000 6151.9 4256 188512 void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
3.4 935567665 7000 133652.5 32512 325441 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align1>(cutlass_80_tensorop_s168…
3.3 914717375 7000 130673.9 119808 248960 void at::native::reduce_kernel<128, 4, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
2.7 751298962 6000 125216.5 121888 289633 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_16x3_nn_align4>(cutlass_80_tensorop_s168…
2.6 731521425 17000 43030.7 5184 223424 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 2, 64, 64>(float*,…
2.4 671549852 2000 335774.9 284737 475680 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_nt_align1>(cutlass_80_tensorop_s1688…
2.3 625225596 19000 32906.6 24031 214177 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_tn_align4>(cutlass_80_tensorop_s168…
2.0 552416751 52000 10623.4 2881 259424 void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor<float, float>, at::d…
1.5 419654208 29009 14466.3 3264 186688 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, at::detail::Array…
1.3 370691436 2000 185345.7 171616 238112 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x128_32x3_nn_align1>(cutlass_80_tensorop_s168…
1.3 347887402 54000 6442.4 4480 203841 void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
1.2 346015529 59014 5863.3 4321 187456 void at::native::(anonymous namespace)::weight_norm_fwd_first_dim_kernel<float, float>(float*, floa…
1.2 319314683 3000 106438.2 60288 213664 ampere_sgemm_128x64_tn
1.1 299271474 60000 4987.9 3232 7840 void at::native::vectorized_elementwise_kernel<4, at::native::sin_kernel_cuda(at::TensorIterator&):…
1.1 299015827 60000 4983.6 3136 7616 void at::native::vectorized_elementwise_kernel<4, at::native::cos_kernel_cuda(at::TensorIterator&):…
1.1 298473999 33000 9044.7 3169 156448 void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
0.9 258512720 18000 14361.8 4928 66144 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float…
0.9 255586020 1000 255586.0 252800 299328 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_16x5_tn_align4>(cutlass_80_tensorop_s168…
0.5 139632350 1000 139632.4 137216 142049 void at::native::unrolled_elementwise_kernel<at::native::softplus_backward_kernel(at::TensorIterato…
0.5 134033054 32000 4188.5 3169 193888 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
0.5 127006064 4000 31751.5 19200 77824 ampere_sgemm_32x128_tn
0.4 98559958 1000 98560.0 92448 110400 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_tn_align1>(cutlass_80_tensorop_s1688…
0.3 89613656 1000 89613.7 86560 95424 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_256x64_32x3_tn_align4>(cutlass_80_tensorop_s1688…
0.3 83070191 1000 83070.2 82176 92801 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nn_align1>(cutlass_80_tensorop_s1688…
0.3 82813174 19000 4358.6 2976 7072 void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor<float>, at::detail::Array<…
0.3 78349734 17000 4608.8 3360 6336 void at::native::vectorized_elementwise_kernel<4, at::native::DivFunctor<float>, at::detail::Array<…
0.3 76501273 1000 76501.3 74656 86688 void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_64x128_32x3_nt_align1>(cutlass_80_tensorop_s1688…
0.2 59776342 7000 8539.5 5024 15072 void at::native::unrolled_elementwise_kernel<at::native::MulScalarFunctor<float, float>, at::detail…
0.2 59416240 8014 7414.1 5857 9952 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::NormTwoOps<float, fl…
0.2 57802308 1000 57802.3 56512 73984 void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 1024>(TensorInfo<f…
0.2 57679111 12000 4806.6 3392 7136 void at::native::vectorized_elementwise_kernel<4, at::native::sigmoid_kernel_cuda(at::TensorIterato…
0.2 57532221 6000 9588.7 6656 15872 _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt10multipliesIfEEENSt9enable_ifIXntsr3c…
0.2 55069919 8000 6883.7 5984 9120 void at::native::_scatter_gather_elementwise_kernel<64, 4, at::native::_cuda_scatter_gather_interna…
0.2 54685328 4000 13671.3 12352 17504 void bitonicSortKVInPlace<float, long, 2, -1, LTComp<float, true>, unsigned int, 128>(TensorInfo<fl…
0.2 50226806 11000 4566.1 3871 186208 void at::native::vectorized_elementwise_kernel<4, at::native::neg_kernel_cuda(at::TensorIterator&):…
0.2 48343357 7000 6906.2 5856 9408 void at::native::unrolled_elementwise_kernel<at::native::DivFunctor<float>, at::detail::Array<char*…
0.2 47398192 8000 5924.8 4575 7936 void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareLTFunctor…
0.1 38678373 9000 4297.6 4095 5888 void at::native::vectorized_elementwise_kernel<4, at::native::AUnaryFunctor<at::native::AddFunctor<…
0.1 34909773 4000 8727.4 7360 11680 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MinOps<float>, unsig…
0.1 34577085 9032 3828.3 3392 5312 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
0.1 33286714 8000 4160.8 3935 5728 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…
0.1 32970112 4000 8242.5 6688 12064 _ZN2at6native32tensor_kernel_scan_innermost_dimIfLi16ELi32ESt4plusIfEEENSt9enable_ifIXntsr3c1010is_…
0.1 32203511 5000 6440.7 4896 9345 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 128, 1>(float*,…
0.1 31983060 3000 10661.0 9408 13792 void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_ke…
0.1 29858411 7000 4265.5 3583 5696 void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&…
0.1 26361151 4000 6590.3 5440 8255 void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…
0.1 25538805 4000 6384.7 5920 8097 void at::native::(anonymous namespace)::searchsorted_cuda_kernel<float, long>(long*, float const*, …
0.1 24705557 6000 4117.6 2912 5280 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::CompareLTFu…
0.1 24654968 4000 6163.7 4992 7776 void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 3, 128, 1>(long*, a…
0.1 24323868 5000 4864.8 3776 6336 void fillSliceWithIndex<unsigned int, 2>(TensorInfo<long, unsigned int>, unsigned int, unsigned int…
0.1 21275934 6004 3543.6 3360 4960 void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::linspace_cuda_out(at::Te…
0.1 18267190 3000 6089.1 5824 7616 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
0.1 17843824 3000 5947.9 4640 7808 void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…
0.1 16921323 4000 4230.3 4064 5792 void at::native::vectorized_elementwise_kernel<4, at::native::BitwiseOrFunctor<bool>, at::detail::A…
0.1 16632258 4000 4158.1 3040 5376 void at::native::vectorized_elementwise_kernel<4, at::native::maximum_kernel_cuda(at::TensorIterato…
0.1 16625110 4000 4156.3 3968 5279 void at::native::vectorized_elementwise_kernel<4, at::native::minimum_kernel_cuda(at::TensorIterato…
0.1 16584496 4000 4146.1 4000 5152 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::MulFunctor<…
0.1 16443999 4000 4111.0 3744 5184 void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::where_kernel_i…
0.0 13843055 2000 6921.5 5696 8896 void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const*, float…
0.0 11622419 3000 3874.1 3360 4799 void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(at::Tens…
0.0 10615435 2000 5307.7 4704 6912 void gemv2T_kernel_val<int, int, float, float, float, float, 128, 16, 2, 2, false, false, cublasGem…
0.0 9529063 2000 4764.5 4224 6112 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 1, 128, 1>(float*,…
0.0 9270520 2052 4517.8 4319 7168 _ZN2at6native90_GLOBAL__N__66_tmpxft_00005537_00000000_13_DistributionUniform_compute_86_cpp1_ii_8b…
0.0 9209249 2000 4604.6 4384 5984 void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
0.0 8748149 1000 8748.1 7456 11776 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MaxOps<float>, unsig…
0.0 8487633 2000 4243.8 3392 6176 void at::native::vectorized_elementwise_kernel<4, at::native::exp_kernel_cuda(at::TensorIterator&):…
0.0 8457925 2000 4229.0 3711 5632 void at::native::vectorized_elementwise_kernel<4, at::native::reciprocal_kernel_cuda(at::TensorIter…
0.0 8309075 2000 4154.5 3328 5376 void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::pow_tensor_sca…
0.0 8126762 1000 8126.8 7968 10080 void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 3, 64, 64>(float*,…
0.0 6878162 1000 6878.2 5856 8577 void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MeanOps<float, float…
0.0 6362163 1000 6362.2 6080 7872 void at::native::unrolled_elementwise_kernel<at::native::reciprocal_kernel_cuda(at::TensorIterator&…
0.0 5839958 1000 5840.0 5600 7328 void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_scalar_k…
0.0 5478436 1000 5478.4 5312 6881 void at::native::(anonymous namespace)::CatArrayBatchedCopy<long, unsigned int, 2, 128, 1>(long*, a…
0.0 4928042 1000 4928.0 3776 6017 void at::native::kernel_pointwise_flip_apply2<float, long>(at::cuda::detail::TensorInfo<float, long…
0.0 4098339 1000 4098.3 3776 5184 void at::native::vectorized_elementwise_kernel<4, at::native::log10_kernel_cuda(at::TensorIterator&…
0.0 3979337 1000 3979.3 3839 5088 void at::native::vectorized_elementwise_kernel<4, at::native::AbsFunctor<float>, at::detail::Array<…
0.0 3974670 1000 3974.7 3712 5184 void at::native::vectorized_elementwise_kernel<4, at::native::sqrt_kernel_cuda(at::TensorIterator&)…
0.0 3671683 1000 3671.7 3520 5152 void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…
0.0 58080 8 7260.0 6848 7424 _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
0.0 24256 4 6064.0 5568 6336 void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::pow_tensor_tensor_k…
0.0 11680 2 5840.0 5728 5952 void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<float>, at::detail::Array<char…
0.0 9856 1 9856.0 9856 9856 void sgetrf_batched_smallsq_noshfl_kernel<4, 4>(float**, int, int**, int*, int)
0.0 8991 1 8991.0 8991 8991 void trsm_template_batched_lNL_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
0.0 8991 1 8991.0 8991 8991 slaswp_columnserial_kernel_batched(int, float**, int, int, int, int**)
0.0 8864 1 8864.0 8864 8864 void trsm_template_batched_lNU_kernel<float, 4, 32>(magma_diag_t, int, int, float, float**, int, fl…
0.0 8448 2 4224.0 4000 4448 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<int>, at::detail::Array<c…
0.0 7712 1 7712.0 7712 7712 slaset_full_kernel_batched(int, int, float, float, float**, int)
0.0 7040 1 7040.0 7040 7040 _ZN2at6native89_GLOBAL__N__65_tmpxft_000054da_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemtimesum.py report1.sqlite] to console...
Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
84.3 380885037 16000 23805.3 3168 251008 [CUDA memcpy DtoD]
7.6 34510530 16000 2156.9 2079 3649 [CUDA memset]
6.3 28298124 12002 2357.8 1312 179361 [CUDA memcpy DtoH]
1.8 8248738 3041 2712.5 2079 4576 [CUDA memcpy HtoD]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/gpumemsizesum.py report1.sqlite] to console...
Total Operations Average Minimum Maximum Operation
324.219 16000 0.020 0.008 0.035 [CUDA memset]
28043.219 12002 2.337 0.004 6.000 [CUDA memcpy DtoH]
20020.051 3041 6.583 0.004 20.000 [CUDA memcpy HtoD]
200066000.000 16000 12504.125 2.000 65792.000 [CUDA memcpy DtoD]
Exporting [/opt/nvidia/nsight-systems/2020.4.3/target-linux-x64/reports/osrtsum.py report1.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
35.3 45957635098 472 97367871.0 3717 100176399 poll
31.5 40984272066 2002 20471664.4 23034 124189592 pthread_cond_wait
31.4 40931464392 17 2407733199.5 14478 10000184789 sem_timedwait
1.5 1913495171 4241 451189.6 1152 290886421 ioctl
0.1 97439113 25 3897564.5 173359 5066508 pthread_cond_timedwait
0.1 82557276 40664 2030.2 1001 160924 sched_yield
0.0 50803760 1443 35207.0 1002 31465169 read
0.0 27119248 17573 1543.2 1012 134184 fread
0.0 22618854 257 88011.1 54072 304738 sleep
0.0 19923053 11 1811186.6 1613 19820359 open
0.0 14003578 196 71446.8 51397 346887 pthread_create
0.0 10542299 2041 5165.3 1082 105701 pthread_cond_signal
0.0 7664737 1209 6339.7 1764 372275 open64
0.0 7222006 279 25885.3 1002 72036 fgets
0.0 7170386 33 217284.4 1633 688264 pthread_mutex_lock
0.0 4185312 365 11466.6 1843 129044 write
0.0 3386553 334 10139.4 1072 122663 munmap
0.0 3115098 283 11007.4 1012 526678 mmap64
0.0 3056612 178 17172.0 1062 856552 fopen
0.0 2969495 1 2969495.0 2969495 2969495 fork
0.0 1906173 331 5758.8 1352 91363 mmap
0.0 1238575 62 19977.0 3907 634191 pthread_join
0.0 958327 6 159721.2 56376 314757 sem_wait
0.0 474301 158 3001.9 1012 7173 fclose
0.0 348921 68 5131.2 1162 10480 fopen64
0.0 47961 1 47961.0 47961 47961 waitpid
0.0 47068 31 1518.3 1002 9828 fflush
0.0 41709 3 13903.0 2084 28494 pipe2
0.0 21371 2 10685.5 6011 15360 socket
0.0 17433 7 2490.4 1172 3637 fcntl
0.0 16752 2 8376.0 2485 14267 fgetc
0.0 11371 8 1421.4 1002 1874 pthread_mutex_trylock
0.0 9489 1 9489.0 9489 9489 connect
0.0 3828 1 3828.0 3828 3828 bind
0.0 3678 2 1839.0 1413 2265 sigaction
0.0 1523 1 1523.0 1523 1523 fputs_unlocked
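Could you also attach the nsys qdrep file?
At a rough glance, OneFlow seems to spend more on cudaLaunchKernel and cudaStreamSynchronize than PyTorch does. Some ops in OneFlow may be composed from several smaller ops, which would launch extra CUDA kernels.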
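To put rough numbers on that impression, here is a minimal sketch that compares the two cudaapisum tables posted in this thread. The figures are copied from those tables; the `compare` helper and the dictionary names are only illustrative.

```python
# (total time in ns, number of calls), copied from the two cudaapisum
# tables posted in this thread.
ONEFLOW = {
    "cudaLaunchKernel": (12711800961, 1747010),
    "cudaStreamSynchronize": (5496537524, 2005),
}
PYTORCH = {
    "cudaLaunchKernel": (7820974612, 1100149),
    "cudaStreamSynchronize": (3360510597, 15042),
}


def compare(a, b):
    """Return {api: (call_ratio, total_time_ratio)} of a relative to b."""
    return {
        name: (a[name][1] / b[name][1], a[name][0] / b[name][0])
        for name in a
        if name in b
    }


ratios = compare(ONEFLOW, PYTORCH)
for name, (calls, total) in ratios.items():
    print(f"{name}: calls x{calls:.2f}, total time x{total:.2f}")
```

On these numbers OneFlow issues more kernel launches than PyTorch, while it calls cudaStreamSynchronize less often but spends more total time in it, so the average wait per synchronization is much longer.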
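The summary is mainly useful for comparing one specific kernel before and after an optimization. To compare against PyTorch, it is better to look at the nsys qdrep file first.
From https://oneflow-test.oss-cn-beijing.aliyuncs.com/NeuS/nsys/report1.qdrep you can see that
there appear to be many CPU ops between the CUDA kernels. Perhaps some code hard-codes the cpu device type.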
@yoonlee888 When running the performance test, you can run fewer iterations. Otherwise the file gets too large.
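A minimal sketch of what a capped profiling run could look like. `run_capped` and `step_fn` are hypothetical names, not part of the NeuS code. For GPU work, `step_fn` should end with a device synchronization, otherwise the host timer only measures kernel launch time.

```python
import time


def run_capped(step_fn, warmup=3, measured=20):
    """Run a few warm-up steps, then time a small fixed number of steps."""
    for _ in range(warmup):
        step_fn()  # warm-up steps are not timed
    timings = []
    for _ in range(measured):
        start = time.perf_counter()
        step_fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in step so the sketch runs on its own; replace with one training iteration.
timings = run_capped(lambda: sum(range(1000)), warmup=1, measured=5)
print(f"steps timed: {len(timings)}, slowest: {max(timings) * 1e3:.3f} ms")
```

A few dozen measured steps are usually enough to catch a stall that shows up roughly every 10 steps, and they keep the nsys report small.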
OK, I will check where the initialization uses the CPU. Thanks.
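One way to start that check, as a minimal sketch: list every named tensor whose device type is `cpu`. The code assumes each tensor exposes `.device.type`, as PyTorch tensors do and as OneFlow tensors are assumed to do here. `FakeDevice`, `FakeTensor`, and the tensor names are stand-ins so the sketch runs without either framework installed.

```python
from collections import namedtuple

# Stand-ins for framework objects (hypothetical, for illustration only).
FakeDevice = namedtuple("FakeDevice", ["type"])
FakeTensor = namedtuple("FakeTensor", ["device"])


def find_cpu_tensors(named_tensors):
    """Return the names of tensors that live on the CPU."""
    return [name for name, t in named_tensors if t.device.type == "cpu"]


named = [
    ("sdf_network.weight", FakeTensor(FakeDevice("cuda"))),
    ("sample_steps", FakeTensor(FakeDevice("cpu"))),
]
cpu_names = find_cpu_tensors(named)
print(cpu_names)
```

With a real model, the pairs could come from `model.named_parameters()` and `model.named_buffers()`. Tensors created inside the forward pass without an explicit device argument would not show up there and need a separate look in the forward code.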