Test CuArrays on tutorial and examples, benchmark against KnetArrays
CuArray compatibility and speed on Knet tutorial/examples
Knet/tutorial
- [x] 15.quickstart
- [x] 23.learning
- [x] 30.lin
- [x] 40.mlp
- [x] 50.cnn
- [x] 60.rnn
- [x] 70.imdb
- [x] 80.charlm
- [x] 90.s2s
Knet/examples
- [ ] cifar10-cnn: CUDA ~25% slower (KnetArray 7 vs CuArray 9 secs/epoch on dy03)
- [ ] dcgan-mnist: CUDA ~40% slower (KnetArray 24 i/s vs CuArray 17 i/s on dy03)
- [x] DeepLearningFrameworks/Knet_CNN: ~CUDA 50% slower~ (15 vs 15.8 secs/epoch on dy03)
- [x] DeepLearningFrameworks/Knet_RNN
- [ ] DeepLearningFrameworks/ResNet50-Knet: needs pool mode=2, CUDA ~100%~ 25% slower (7.7 vs 9.7 secs on dy03)
- [x] dynet-benchmark/treenn
- [ ] dynet-benchmark/rnnlm-batch: CUDA 10% slower (35 vs 38 i/s on dy03)
- [x] dynet-benchmark/bilstm-tagger
- [x] dynet-benchmark/bilstm-tagger-withchar
- [x] fashion-mnist
- [x] housing-linreg
- [x] ~julia-tutorial~ not a real example
- [x] lenet: update! interface changed
- [x] mnist-mlp
- [x] optimizers
- [x] ~reinforcement-learning/dp: does not use KnetArray~
- [x] reinforcement-learning/dqn
- [x] reinforcement-learning/pg
- [x] resnet: mode=2 is not supported for CPU pool. ~Knet 50% faster~: https://github.com/FluxML/NNlib.jl/issues/218
- [x] rnnlm: (26 vs 27 secs/epoch on dy03)
- [x] ~rnn-tutorial~: this is the same as 90.s2s
- [x] synthetic-linreg
- [x] variational-autoencoder: ~binary_cross_entropy gives error: https://github.com/JuliaGPU/CUDA.jl/issues/346~
- [x] vgg: ~CuArray works but 50% slower.~
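
The ResNet-related items above hinge on `pool` mode=2 support. A minimal sketch of the difference, assuming the standard Knet `pool` keywords (`window`, `mode`); this is illustrative code, not taken from the issue:

```julia
# Knet pool modes: 0 = max, 1 = average including padding,
# 2 = average excluding padding. The ResNet examples need mode=2,
# which the NNlib-based CPU path does not support (NNlib.jl#218);
# on the GPU it is handled by cuDNN.
using Knet

x = randn(Float32, 14, 14, 256, 32)
ymax = pool(x; window=2, mode=0)   # max pool: works on CPU and GPU
# On a GPU, mode=2 works through cuDNN:
# yavg = pool(KnetArray(x); window=2, mode=2)
size(ymax)   # (7, 7, 256, 32): window=2 also defaults the stride to 2
```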
Other:
- [x] 2014-Sutskever: sequence to sequence rnn
- [x] 2015-Luong: s2s rnn with attention
- [x] 2017-Vaswani: s2s transformer ~Knet is 50% faster~.
- [ ] test/karray.jl: test for CuArray
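
The forw/diff numbers in the next section come from prof/ops20.jl; a hedged sketch of that measurement pattern, with an illustrative `f` and sizes that are not from the script itself:

```julia
# Time one Ops20 operator forward-only ("forw") and forward+backward
# ("diff") for a given array type; swap in KnetArray/CuArray on a GPU.
using Knet, AutoGrad, BenchmarkTools

f(x) = sum(Knet.logsoftmax(x; dims=1))

x = Param(randn(Float32, 1000, 32))        # Array on CPU
t_forw = @belapsed f($x)                   # forward pass only
t_diff = @belapsed grad(@diff(f($x)), $x)  # forward + backward via AutoGrad
```

The n=samples column below reflects how many such timings BenchmarkTools collected per operator (slow CPU cases get fewer samples in the same budget).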
Array/KnetArray/CuArray comparison for Knet.Ops20 operators
Knet@1.4.0/prof/ops20.jl (Aug 19, 2020)
| operator(size) | a1=forw(Array) | k1=forw(KnetArray) | c1=forw(CuArray) | a2=diff(Array) | k2=diff(KnetArray) | c2=diff(CuArray) | n=samples |
|---|---|---|---|---|---|---|---|
| identity(1000×32) | 319.025 ns | 942.316 ns | 952.389 ns | 11.798 μs | 64.614 μs | 49.882 μs | 10000-10000 |
| getindex1(1000×32) | 510.808 ns | 9.749 μs | 8.727 μs | 6.291 μs | 33.746 μs | 25.879 μs | 8155-10000 |
| sum(1000×32) | 2.179 μs | 14.486 μs | 14.476 μs | 17.009 μs | 47.072 μs | 36.536 μs | 10000-10000 |
| drop(1000×32) | 24.852 μs | 16.820 μs | 14.865 μs | 52.835 μs | 86.923 μs | 69.385 μs | 9758-10000 |
| logsoftmax(1000×32) | 469.273 μs | 15.265 μs | 12.648 μs | 702.613 μs | 135.968 μs | 117.517 μs | 1188-10000 |
| softmax(1000×32) | 243.598 μs | 14.657 μs | 13.595 μs | 324.258 μs | 134.791 μs | 117.355 μs | 2512-10000 |
| logsumexp(1000×32) | 245.609 μs | 27.933 μs | 25.499 μs | 463.436 μs | 117.638 μs | 95.872 μs | 1791-10000 |
| nll1(1000×32) | 470.326 μs | 34.868 μs | 32.265 μs | 737.573 μs | 206.748 μs | 172.005 μs | 1183-10000 |
| accuracy1(1000×32) | 120.658 μs | 156.162 μs | 153.449 μs | 120.751 μs | 153.622 μs | 152.840 μs | 5445-6760 |
| bce1(1000-element) | 32.878 μs | 59.998 μs | 55.322 μs | 104.363 μs | 403.382 μs | 371.057 μs | 1941-10000 |
| logistic1(1000-element) | 34.042 μs | 59.558 μs | 56.479 μs | 105.700 μs | 405.263 μs | 381.921 μs | 1995-10000 |
| *(1000×2048,2048×32) | 910.156 μs | 107.436 μs | 105.338 μs | 3.460 ms | 242.989 μs | 252.957 μs | 155-9195 |
| adddot(1000×32,1000-element) | 11.065 μs | 13.862 μs | 19.127 μs | 37.279 μs | 116.366 μs | 98.497 μs | 7089-10000 |
| conv4(3×3×256×256,14×14×256×32) | 33.000 ms | 929.012 μs | 887.566 μs | 112.036 ms | 1.320 ms | 1.298 ms | 9-961 |
| deconv4(3×3×256×256,14×14×256×32) | 72.921 ms | 1.230 ms | 1.167 ms | 171.488 ms | 1.550 ms | 1.557 ms | 6-690 |
| pool(14×14×256×32) | 2.648 ms | 74.450 μs | 72.991 μs | 14.309 ms | 278.388 μs | 274.323 μs | 65-6635 |
| unpool(7×7×256×32) | 2.055 ms | 148.128 μs | 215.504 μs | 3.942 ms | 301.559 μs | 399.223 μs | 163-5135 |
| mat(14×14×256×32) | 2.382 μs | 3.462 μs | 3.375 μs | 406.480 μs | 115.034 μs | 107.836 μs | 1567-10000 |
| eludot(14×14×256×32) | 1.372 ms | 102.058 μs | 100.842 μs | 3.851 ms | 285.918 μs | 286.169 μs | 222-8270 |
| reludot(14×14×256×32) | 1.351 ms | 101.732 μs | 101.529 μs | 3.378 ms | 289.566 μs | 283.286 μs | 254-8017 |
| seludot(14×14×256×32) | 2.079 ms | 121.703 μs | 122.047 μs | 4.282 ms | 301.173 μs | 306.186 μs | 199-6695 |
| sigmdot(14×14×256×32) | 14.990 ms | 100.962 μs | 101.391 μs | 17.238 ms | 290.463 μs | 284.894 μs | 56-7626 |
| bn(14×14×256×32,512-element) | 4.727 ms | 134.792 μs | 151.998 μs | 31.203 ms | 847.873 μs | 783.175 μs | 31-5601 |
| bmm(64×256×256×32,256×64×256×32) | 214.682 ms | 11.394 ms | 12.218 ms | 1.062 s | 14.622 ms | 16.899 ms | 1-71 |
| rnntest(1×1×526336,256×32×256) | 363.646 ms | 7.660 ms | 7.719 ms | 1.332 s | 10.371 ms | 10.342 ms | 1-127 |
| embed(256×10000) | 15.709 ms | 1.217 ms | 1.631 ms | 54.638 ms | 1.731 ms | 2.079 ms | 18-694 |
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
julia> CUDA.device()
CuDevice(0): GeForce GTX 1060 with Max-Q Design
julia> pkg"st"
Status `~/.julia/environments/v1.5/Project.toml`
[c7e460c6] ArgParse v1.1.0
[6710c13c] AutoGrad v1.2.4 `~/.julia/dev/AutoGrad`
[6e4b80f9] BenchmarkTools v0.5.0
[052768ef] CUDA v1.2.1 `~/.julia/dev/CUDA`
[864edb3b] DataStructures v0.17.20
[5789e2e9] FileIO v1.4.1
[587475ba] Flux v0.11.1 `~/.julia/dev/Flux`
[0c68f7d7] GPUArrays v5.1.0 `../../dev/GPUArrays`
[7073ff75] IJulia v1.21.3
[6218d12a] ImageMagick v1.1.5
[916415d5] Images v0.22.4
[c8e1da08] IterTools v1.3.0
[033835bb] JLD2 v0.1.14
[682c06a0] JSON v0.21.0
[1902f260] Knet v1.4.0 `~/.julia/dev/Knet`
[23992714] MAT v0.8.0
[eb30cadb] MLDatasets v0.5.2
[0db19996] NBInclude v2.2.0
[872c559c] NNlib v0.7.4 `~/.julia/dev/NNlib`
[69de0a69] Parsers v1.0.10
[91a5bcdd] Plots v1.6.0
[438e738f] PyCall v1.91.4
[295af30f] Revise v2.7.3
[276daf66] SpecialFunctions v0.10.3
[a5390f91] ZipFile v0.9.2