Test CuArrays on tutorial and examples, benchmark against KnetArrays
CuArray compatibility and speed on Knet tutorial/examples
Knet/tutorial
- [x] 15.quickstart
- [x] 23.learning
- [x] 30.lin
- [x] 40.mlp
- [x] 50.cnn
- [x] 60.rnn
- [x] 70.imdb
- [x] 80.charlm
- [x] 90.s2s
Knet/examples
- [ ] cifar10-cnn: CUDA ~25% slower (KnetArray 7 vs CuArray 9 secs/epoch on dy03)
- [ ] dcgan-mnist: CUDA ~40% slower (KnetArray 24 i/s vs CuArray 17 i/s on dy03)
- [x] DeepLearningFrameworks/Knet_CNN: ~CUDA 50% slower~ (15 vs 15.8 secs/epoch on dy03)
- [x] DeepLearningFrameworks/Knet_RNN
- [ ] DeepLearningFrameworks/ResNet50-Knet: needs pool mode=2, CUDA ~100%~ 25% slower (7.7 vs 9.7 secs on dy03)
- [x] dynet-benchmark/treenn
- [ ] dynet-benchmark/rnnlm-batch: CUDA 10% slower (35 vs 38 i/s on dy03)
- [x] dynet-benchmark/bilstm-tagger
- [x] dynet-benchmark/bilstm-tagger-withchar
- [x] fashion-mnist
- [x] housing-linreg
- [x] ~julia-tutorial~ not a real example
- [x] lenet: update! interface changed
- [x] mnist-mlp
- [x] optimizers
- [x] ~reinforcement-learning/dp: does not use KnetArray~
- [x] reinforcement-learning/dqn
- [x] reinforcement-learning/pg
- [x] resnet: mode=2 is not supported for CPU pool. ~Knet 50% faster~: https://github.com/FluxML/NNlib.jl/issues/218
- [x] rnnlm: (26 vs 27 secs/epoch on dy03)
- [x] ~rnn-tutorial~: this is the same as 90.s2s
- [x] synthetic-linreg
- [x] variational-autoencoder: ~binary_cross_entropy gives error: https://github.com/JuliaGPU/CUDA.jl/issues/346~
- [x] vgg: ~CuArray works but 50% slower.~
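
The ResNet-related items above hinge on `pool` mode=2 support. A minimal sketch of the difference, assuming the standard Knet `pool` keywords (`window`, `mode`); this is illustrative code, not taken from the issue:

```julia
# Knet pool modes: 0 = max, 1 = average including padding,
# 2 = average excluding padding. The ResNet examples need mode=2,
# which the NNlib-based CPU path does not support (NNlib.jl#218);
# on the GPU it is handled by cuDNN.
using Knet

x = randn(Float32, 14, 14, 256, 32)
ymax = pool(x; window=2, mode=0)   # max pool: works on CPU and GPU
# On a GPU, mode=2 works through cuDNN:
# yavg = pool(KnetArray(x); window=2, mode=2)
size(ymax)   # (7, 7, 256, 32): window=2 also defaults the stride to 2
```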
Other:
- [x] 2014-Sutskever: sequence to sequence rnn
- [x] 2015-Luong: s2s rnn with attention
- [x] 2017-Vaswani: s2s transformer ~Knet is 50% faster~.
- [ ] test/karray.jl: test for CuArray
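
The forw/diff numbers in the next section come from prof/ops20.jl; a hedged sketch of that measurement pattern, with an illustrative `f` and sizes that are not from the script itself:

```julia
# Time one Ops20 operator forward-only ("forw") and forward+backward
# ("diff") for a given array type; swap in KnetArray/CuArray on a GPU.
using Knet, AutoGrad, BenchmarkTools

f(x) = sum(Knet.logsoftmax(x; dims=1))

x = Param(randn(Float32, 1000, 32))        # Array on CPU
t_forw = @belapsed f($x)                   # forward pass only
t_diff = @belapsed grad(@diff(f($x)), $x)  # forward + backward via AutoGrad
```

The n=samples column below reflects how many such timings BenchmarkTools collected per operator (slow CPU cases get fewer samples in the same budget).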
Array/KnetArray/CuArray comparison for Knet.Ops20 operators
Knet@1.4.0/prof/ops20.jl (Aug 19, 2020)
| operator(size) | a1=forw(Array) | k1=forw(KnetArray) | c1=forw(CuArray) | a2=diff(Array) | k2=diff(KnetArray) | c2=diff(CuArray) | n=samples |
|---|---|---|---|---|---|---|---|
| identity(1000×32) | 319.025 ns | 942.316 ns | 952.389 ns | 11.798 μs | 64.614 μs | 49.882 μs | 10000-10000 |
| getindex1(1000×32) | 510.808 ns | 9.749 μs | 8.727 μs | 6.291 μs | 33.746 μs | 25.879 μs | 8155-10000 |
| sum(1000×32) | 2.179 μs | 14.486 μs | 14.476 μs | 17.009 μs | 47.072 μs | 36.536 μs | 10000-10000 |
| drop(1000×32) | 24.852 μs | 16.820 μs | 14.865 μs | 52.835 μs | 86.923 μs | 69.385 μs | 9758-10000 |
| logsoftmax(1000×32) | 469.273 μs | 15.265 μs | 12.648 μs | 702.613 μs | 135.968 μs | 117.517 μs | 1188-10000 |
| softmax(1000×32) | 243.598 μs | 14.657 μs | 13.595 μs | 324.258 μs | 134.791 μs | 117.355 μs | 2512-10000 |
| logsumexp(1000×32) | 245.609 μs | 27.933 μs | 25.499 μs | 463.436 μs | 117.638 μs | 95.872 μs | 1791-10000 |
| nll1(1000×32) | 470.326 μs | 34.868 μs | 32.265 μs | 737.573 μs | 206.748 μs | 172.005 μs | 1183-10000 |
| accuracy1(1000×32) | 120.658 μs | 156.162 μs | 153.449 μs | 120.751 μs | 153.622 μs | 152.840 μs | 5445-6760 |
| bce1(1000-element) | 32.878 μs | 59.998 μs | 55.322 μs | 104.363 μs | 403.382 μs | 371.057 μs | 1941-10000 |
| logistic1(1000-element) | 34.042 μs | 59.558 μs | 56.479 μs | 105.700 μs | 405.263 μs | 381.921 μs | 1995-10000 |
| *(1000×2048,2048×32) | 910.156 μs | 107.436 μs | 105.338 μs | 3.460 ms | 242.989 μs | 252.957 μs | 155-9195 |
| adddot(1000×32,1000-element) | 11.065 μs | 13.862 μs | 19.127 μs | 37.279 μs | 116.366 μs | 98.497 μs | 7089-10000 |
| conv4(3×3×256×256,14×14×256×32) | 33.000 ms | 929.012 μs | 887.566 μs | 112.036 ms | 1.320 ms | 1.298 ms | 9-961 |
| deconv4(3×3×256×256,14×14×256×32) | 72.921 ms | 1.230 ms | 1.167 ms | 171.488 ms | 1.550 ms | 1.557 ms | 6-690 |
| pool(14×14×256×32) | 2.648 ms | 74.450 μs | 72.991 μs | 14.309 ms | 278.388 μs | 274.323 μs | 65-6635 |
| unpool(7×7×256×32) | 2.055 ms | 148.128 μs | 215.504 μs | 3.942 ms | 301.559 μs | 399.223 μs | 163-5135 |
| mat(14×14×256×32) | 2.382 μs | 3.462 μs | 3.375 μs | 406.480 μs | 115.034 μs | 107.836 μs | 1567-10000 |
| eludot(14×14×256×32) | 1.372 ms | 102.058 μs | 100.842 μs | 3.851 ms | 285.918 μs | 286.169 μs | 222-8270 |
| reludot(14×14×256×32) | 1.351 ms | 101.732 μs | 101.529 μs | 3.378 ms | 289.566 μs | 283.286 μs | 254-8017 |
| seludot(14×14×256×32) | 2.079 ms | 121.703 μs | 122.047 μs | 4.282 ms | 301.173 μs | 306.186 μs | 199-6695 |
| sigmdot(14×14×256×32) | 14.990 ms | 100.962 μs | 101.391 μs | 17.238 ms | 290.463 μs | 284.894 μs | 56-7626 |
| bn(14×14×256×32,512-element) | 4.727 ms | 134.792 μs | 151.998 μs | 31.203 ms | 847.873 μs | 783.175 μs | 31-5601 |
| bmm(64×256×256×32,256×64×256×32) | 214.682 ms | 11.394 ms | 12.218 ms | 1.062 s | 14.622 ms | 16.899 ms | 1-71 |
| rnntest(1×1×526336,256×32×256) | 363.646 ms | 7.660 ms | 7.719 ms | 1.332 s | 10.371 ms | 10.342 ms | 1-127 |
| embed(256×10000) | 15.709 ms | 1.217 ms | 1.631 ms | 54.638 ms | 1.731 ms | 2.079 ms | 18-694 |
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
julia> CUDA.device()
CuDevice(0): GeForce GTX 1060 with Max-Q Design
julia> pkg"st"
Status `~/.julia/environments/v1.5/Project.toml`
[c7e460c6] ArgParse v1.1.0
[6710c13c] AutoGrad v1.2.4 `~/.julia/dev/AutoGrad`
[6e4b80f9] BenchmarkTools v0.5.0
[052768ef] CUDA v1.2.1 `~/.julia/dev/CUDA`
[864edb3b] DataStructures v0.17.20
[5789e2e9] FileIO v1.4.1
[587475ba] Flux v0.11.1 `~/.julia/dev/Flux`
[0c68f7d7] GPUArrays v5.1.0 `../../dev/GPUArrays`
[7073ff75] IJulia v1.21.3
[6218d12a] ImageMagick v1.1.5
[916415d5] Images v0.22.4
[c8e1da08] IterTools v1.3.0
[033835bb] JLD2 v0.1.14
[682c06a0] JSON v0.21.0
[1902f260] Knet v1.4.0 `~/.julia/dev/Knet`
[23992714] MAT v0.8.0
[eb30cadb] MLDatasets v0.5.2
[0db19996] NBInclude v2.2.0
[872c559c] NNlib v0.7.4 `~/.julia/dev/NNlib`
[69de0a69] Parsers v1.0.10
[91a5bcdd] Plots v1.6.0
[438e738f] PyCall v1.91.4
[295af30f] Revise v2.7.3
[276daf66] SpecialFunctions v0.10.3
[a5390f91] ZipFile v0.9.2