Marian 1.9.0 requires roughly 30-100% more CPU memory than 1.7.6 in GPU decoding
The RSS of Marian 1.9.0 seems to be roughly twice as high as that of 1.7.6 (02f4af4eeefa79a24cd52d279a5d4d374423d631).
We are running multiple instances of marian-server on a machine with 16GB of RAM and an Nvidia T4 GPU; with 1.9.0 it is no longer possible to run the same number of instances.
All instances are configured with an RNN translation model.
Output of ps aux for 1.9.0 looks like this:
mt 17 0.1 7.0 8739420 2281416 ? Sl 14:47 0:06 /marian/marian-server -c model/config.yml --port 8080 -w 256
mt 29 0.1 3.7 7609444 1221032 ? Sl 14:47 0:03 /marian/marian-server -c model/config.yml --port 8081 -w 256
mt 41 0.1 4.0 7877104 1317996 ? Sl 14:47 0:04 /marian/marian-server -c model/config.yml --port 8082 -w 256
mt 53 0.1 3.9 7814612 1284828 ? Sl 14:47 0:04 /marian/marian-server -c model/config.yml --port 8083 -w 256
mt 65 0.1 3.7 7612860 1226752 ? Sl 14:47 0:04 /marian/marian-server -c model/config.yml --port 8084 -w 256
mt 77 0.1 6.1 7697944 2010200 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8085 -w 256
mt 89 0.1 4.8 7811908 1566724 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8086 -w 256
mt 101 0.1 4.2 7603044 1381960 ? Sl 14:47 0:04 /marian/marian-server -c model/config.yml --port 8087 -w 256
mt 113 0.1 6.2 7724384 2031064 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8088 -w 256
mt 125 0.1 4.9 7628728 1622356 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8089 -w 256
mt 139 0.1 4.6 7609540 1498732 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8090 -w 256
mt 151 0.1 3.7 7650484 1233120 ? Sl 14:47 0:04 /marian/marian-server -c model/config.yml --port 8091 -w 256
mt 163 0.1 5.8 7581884 1896544 ? Sl 14:47 0:05 /marian/marian-server -c model/config.yml --port 8092 -w 256
While for 1.7.6 it looks like this:
mt 29 0.1 5.6 7636852 904680 ? Sl 03:48 0:19 /marian/marian-server -c model/config.yml --port 8081 -w 256
mt 41 0.1 7.7 7998440 1243596 ? Sl 03:48 0:22 /marian/marian-server -c model/config.yml --port 8082 -w 256
mt 53 0.1 5.9 7902308 965664 ? Sl 03:48 0:19 /marian/marian-server -c model/config.yml --port 8083 -w 256
mt 65 0.1 5.4 7636728 872188 ? Sl 03:48 0:20 /marian/marian-server -c model/config.yml --port 8084 -w 256
mt 77 0.1 4.8 7759832 775840 ? Sl 03:48 0:20 /marian/marian-server -c model/config.yml --port 8085 -w 256
mt 89 0.1 9.5 7906960 1538212 ? Sl 03:48 0:19 /marian/marian-server -c model/config.yml --port 8086 -w 256
mt 101 0.1 5.1 7629504 838252 ? Sl 03:48 0:20 /marian/marian-server -c model/config.yml --port 8087 -w 256
mt 113 0.1 5.7 7772152 932028 ? Sl 03:48 0:19 /marian/marian-server -c model/config.yml --port 8088 -w 256
mt 127 0.1 5.5 7651396 899736 ? Sl 03:48 0:21 /marian/marian-server -c model/config.yml --port 8089 -w 256
mt 139 0.1 5.5 7632304 898152 ? Sl 03:48 0:22 /marian/marian-server -c model/config.yml --port 8090 -w 256
mt 153 0.1 5.6 7709644 916260 ? Sl 03:48 0:19 /marian/marian-server -c model/config.yml --port 8091 -w 256
mt 165 0.1 5.5 7607724 892860 ? Sl 03:48 0:20 /marian/marian-server -c model/config.yml --port 8092 -w 256
Notice that for 1.9.0 the RSS ranges between 1.2GB and 2GB, while for 1.7.6 it ranges between 0.9GB and 1.2GB.
Both versions are compiled on identical systems against CUDA 10.1, with MKL and CPU decoding enabled. The instances in the ps output, however, have cpu-threads set to 0.
Is there a reason for the increased memory usage? Could it be decreased again?
Will take a look. In our production code we see no increase (we are actually monitoring that), but the initialization is a bit different there. If that is indeed the case this might be easy to fix and I have a hunch.
What's the size and type of your model?
It's a "nematus"-type RNN model with a ~90k vocabulary. The model file is ~600MB. I compiled 1.9.0 with a more recent version of Intel MKL; could that make a difference?
You are using that model on the GPU, right?
Yes. The GPU is also used and the process shows up in nvidia-smi.
Confirmed. I see it, too. Investigating.
All hail to git bisect :)
@frankseide This is caused by initialization of the cuSparse handle (which is ridiculous) here: https://github.com/marian-nmt/marian/blob/master/src/tensors/gpu/backend.h#L33 What do you think about doing a lazy init for all the handles, so they only get initialized on first usage, e.g. when anything factored is used?
@frzme Can you just comment out the two lines that mention cusparseCreate/Destroy in that file and check?
I agree with lazy initialization.
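Roughly something like this, just as a sketch (the class and member names below are illustrative, not the actual code in backend.h):

```cpp
// Sketch only: illustrative names, not the actual marian gpu::Backend code.
// Idea: create the cuSPARSE handle on first use instead of in the constructor,
// so processes that never touch sparse ops don't pay the host-memory cost
// of cusparseCreate().
#include <cusparse.h>
#include <mutex>

class GpuBackendSketch {
public:
  ~GpuBackendSketch() {
    if(cusparseHandle_)                   // only tear down if it was ever created
      cusparseDestroy(cusparseHandle_);
  }

  cusparseHandle_t getCusparseHandle() {  // lazily initialized on first call
    std::call_once(cusparseInit_, [this]() { cusparseCreate(&cusparseHandle_); });
    return cusparseHandle_;
  }

private:
  cusparseHandle_t cusparseHandle_{nullptr};
  std::once_flag cusparseInit_;
};
```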
@frzme you can now try the master branch from https://github.com/marian-nmt/marian-dev
Looks very good! Marian 1.9.1 (https://github.com/marian-nmt/marian-dev/commit/adba021a5e6fee65870d16eae9d88319b07fa9bb)
ps aux | grep marian-server && free -h
mt 16 0.8 5.2 7572348 853528 ? Sl 10:43 0:03 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8080
mt 28 1.0 5.5 7801620 893924 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8081
mt 40 0.9 5.9 7633008 955420 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8082
mt 52 0.8 5.4 7663456 885668 ? Sl 10:43 0:03 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8083
mt 64 1.0 8.4 7608100 1363004 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8084
mt 76 1.1 5.8 7537940 935136 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8085
mt 88 1.2 7.2 7676028 1170376 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8086
mt 100 1.0 5.6 7446772 910324 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8087
mt 114 1.0 6.2 7280376 1003104 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8088
mt 126 1.0 6.7 7329728 1095392 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8089
mt 140 1.0 7.5 7432484 1213360 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8090
mt 152 1.0 6.4 7345968 1046788 ? Sl 10:43 0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8091
mt 471 0.0 0.0 13216 1048 pts/0 S+ 10:50 0:00 grep marian-server
total used free shared buff/cache available
Mem: 15G 12G 238M 296M 2.3G 10G
Swap: 0B 0B 0B
Marian 1.7.6 (https://github.com/marian-nmt/marian/commit/02f4af4eeefa79a24cd52d279a5d4d374423d631)
ps aux | grep marian-server && free -h
mt 17 1.1 5.8 7765224 945228 ? Sl 03:46 4:38 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8080
mt 29 1.3 6.2 8002348 1004228 ? Sl 03:46 5:35 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8081
mt 41 1.2 5.9 7856064 960036 ? Sl 03:46 5:10 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8082
mt 53 1.2 6.2 7887100 1012904 ? Sl 03:46 5:15 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8083
mt 65 1.2 10.8 7804492 1751732 ? Sl 03:46 5:05 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8084
mt 77 0.4 6.9 7735904 1120588 ? Sl 03:46 1:41 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8085
mt 89 0.5 5.2 7871276 841908 ? Sl 03:46 2:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8086
mt 101 0.4 5.7 7644236 924792 ? Sl 03:46 1:40 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8087
mt 115 0.5 5.4 7562716 876304 ? Sl 03:46 2:06 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8088
mt 127 0.4 5.2 7530616 844868 ? Sl 03:46 1:58 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8089
mt 139 0.5 5.4 7626556 877988 ? Sl 03:46 2:09 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8090
mt 151 0.3 5.1 7578144 823292 ? Sl 03:46 1:20 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8091
total used free shared buff/cache available
Mem: 15G 12G 1.0G 296M 1.5G 7.1G
Swap: 0B 0B 0B
Note: I'm not sure we can conclude that Marian 1.9.1 is using significantly less memory than 1.7.6, but it is likely not using more! Can this change be brought to the stable repo?
Since you were already looking into this: Why is "available" memory so much higher than Free+Cache? Is it caused by memory mapped files?
Great. Are you OK with using marian-dev for a while? I want to wait a bit to see if other people report more problems before I do an official release with this fix and potential others. If no one complains for, say, a week, I can do a release for 1.9.1.
As for the cache: CUDA is doing something weird here, it's not caused by Marian. I never saw any actual consequences from it, so I treat it as fake. When you use the CPU-only version, for instance, that effect is gone.
BTW, @frzme If you use Marian in production, you might want to add your company logo to https://marian-nmt.github.io (bottom)
Corresponding issue: https://github.com/marian-nmt/marian/issues/230
I think I can make using marian-dev work (for a while) - I've requested approval but am rather confident that it will be possible. We have discussed the logo issue internally, but unfortunately there seem to be reasons above my level of influence preventing it from happening (for now?), sorry :(
I will keep the issue open until I update master here.
Logo, sure thing. I know about big companies :)
Hi, I just tried the 1.10 release and unfortunately it seems like memory requirements have gone up again (compared to 1.9.1). Is that expected? If not, do you have a suggestion on what I could do to pinpoint the issue?
Ah, I was messing around with that code recently. Will take a look, it might very well be the same problem. v1.11 should drop this week or next; will try to include a fix.
@frzme which commit exactly were you using until now?
@emjotde I changed my github handle in the meantime (I'm the issue creator). I've been using marian-dev 1.9.1 (adba021a5e6fee65870d16eae9d88319b07fa9bb) https://github.com/marian-nmt/marian-dev/commit/adba021a5e6fee65870d16eae9d88319b07fa9bb
When upgrading to 1.10 I also upgraded a lot of other components, so I'm not sure if that could also have made a difference (if it makes sense to try something, let me know!): Ubuntu 18.04 -> Ubuntu 20.04, CUDA 10.2 -> 11.2, Boost 1.65 -> 1.71, Intel MKL 2019.1-053 -> 2020.0-088.
I also noticed that the 1.10 binary is almost twice the size of the 1.9.1 one (because of more GPU support?) but I don't think that should cause much higher memory utilisation (?)
It might, actually; the binary still has to go into RAM. You can switch off specific GPU types with flags like -DCOMPILE_CUDA_SM80=off (this will soon be renamed to -DCOMPILE_AMPERE=off in v1.11.0). You can also use -DCMAKE_BUILD_TYPE=Slim to get rid of debug symbols etc.
I will check against that revision in the meantime.
I would expect the binary itself to be shared between the running processes, and indeed experiments didn't show memory-usage benefits from switching to the Slim build type. I tried switching off SentencePiece (which now defaults to on) and all unused CUDA targets. -DCMAKE_BUILD_TYPE=Slim decreases the binary size significantly (down to ~163Mi from ~600Mi in the Release configuration). However, neither of these changes brought a major improvement in memory usage. We are running ~13 instances of marian-server on a single GPU-enabled node with 16GB of RAM. Comparing memory usage between 1.9.1 and 1.10 shows that each marian-server instance requires 200-400Mi more RSS.
1.9.1 (CUDA 10.2)
ps aux | grep marian-server && free -h
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mt 15 0.3 3.2 6182316 516056 ? Sl 12:59 0:01 /marian/marian-server
mt 27 0.3 3.2 6251644 520308 ? Sl 12:59 0:01 /marian/marian-server
mt 39 0.3 3.1 6138676 501324 ? Sl 12:59 0:01 /marian/marian-server
mt 51 0.3 3.0 6110784 485732 ? Sl 12:59 0:01 /marian/marian-server
mt 65 0.5 4.6 6654728 740688 ? Sl 12:59 0:02 /marian/marian-server
mt 79 0.3 2.8 6066880 459428 ? Sl 12:59 0:01 /marian/marian-server
mt 93 0.5 5.4 6666000 877796 ? Sl 12:59 0:02 /marian/marian-server
mt 108 0.5 4.7 6240916 766528 ? Sl 12:59 0:02 /marian/marian-server
mt 120 0.5 5.3 6571724 856388 ? Sl 12:59 0:02 /marian/marian-server
mt 134 0.4 4.3 6151448 704788 ? Sl 12:59 0:01 /marian/marian-server
mt 146 0.4 5.3 6224868 863296 ? Sl 12:59 0:02 /marian/marian-server
mt 161 0.4 4.5 6088536 736992 ? Sl 12:59 0:01 /marian/marian-server
mt 173 0.8 8.7 7423636 1410408 ? Sl 12:59 0:03 /marian/marian-server
mt 352 0.0 0.0 13216 1112 pts/0 S+ 13:06 0:00 grep marian-server
total used free shared buff/cache available
Mem: 15G 11G 181M 131M 3.8G 8.8G
Swap: 0B 0B 0B
1.10 (Release) (CUDA 11.2)
ps aux | grep marian-server && free -h
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mt 16 0.3 5.2 6410792 840620 ? Sl 12:58 0:01 /marian/marian-server
mt 28 0.5 6.2 6987576 1009812 ? Sl 12:58 0:02 /marian/marian-server
mt 40 0.5 5.9 6914916 965648 ? Sl 12:58 0:02 /marian/marian-server
mt 53 0.5 6.0 6908508 966444 ? Sl 12:58 0:02 /marian/marian-server
mt 66 0.3 5.6 6473144 902716 ? Sl 12:58 0:01 /marian/marian-server
mt 82 0.5 5.8 6859112 945136 ? Sl 12:58 0:02 /marian/marian-server
mt 96 0.4 5.6 6483392 913672 ? Sl 12:58 0:01 /marian/marian-server
mt 108 0.4 5.5 6467912 896480 ? Sl 12:58 0:01 /marian/marian-server
mt 120 0.3 5.3 6422920 854368 ? Sl 12:58 0:01 /marian/marian-server
mt 132 0.3 5.0 6380352 805496 ? Sl 12:58 0:01 /marian/marian-server
mt 149 0.6 6.0 6960196 979292 ? Sl 12:58 0:02 /marian/marian-server
mt 162 0.3 4.6 6317456 744600 ? Sl 12:58 0:01 /marian/marian-server
mt 175 0.6 8.5 6943300 1372828 ? Sl 12:58 0:02 /marian/marian-server
mt 356 0.0 0.0 5192 724 pts/0 S+ 13:05 0:00 grep marian-server
total used free shared buff/cache available
Mem: 15Gi 12Gi 193Mi 131Mi 2.2Gi 2.1Gi
Swap: 0B 0B 0B
1.10 (Slim) (CUDA 11.2)
ps aux | grep marian-server && free -h
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mt 16 1.5 5.6 6481732 912608 ? Sl 13:12 0:01 /marian/marian-server
mt 28 1.7 5.6 6477692 909312 ? Sl 13:12 0:01 /marian/marian-server
mt 40 1.8 4.9 6368688 804292 ? Sl 13:12 0:01 /marian/marian-server
mt 54 1.8 4.7 6337260 769160 ? Sl 13:12 0:01 /marian/marian-server
mt 68 2.3 5.6 6471172 903512 ? Sl 13:12 0:01 /marian/marian-server
mt 80 1.7 4.5 6294392 730320 ? Sl 13:12 0:01 /marian/marian-server
mt 94 2.3 5.6 6481420 911684 ? Sl 13:12 0:01 /marian/marian-server
mt 107 2.4 5.5 6465936 897852 ? Sl 13:12 0:01 /marian/marian-server
mt 121 2.1 5.3 6420944 856220 ? Sl 13:12 0:01 /marian/marian-server
mt 135 2.1 5.0 6377924 813952 ? Sl 13:12 0:01 /marian/marian-server
mt 150 2.5 5.4 6450768 875932 ? Sl 13:12 0:02 /marian/marian-server
mt 162 1.9 4.6 6315028 742584 ? Sl 13:12 0:01 /marian/marian-server
mt 174 3.6 8.5 6940920 1369900 ? Sl 13:12 0:02 /marian/marian-server
mt 320 0.0 0.0 5192 736 pts/0 S+ 13:14 0:00 grep marian-server
total used free shared buff/cache available
Mem: 15Gi 13Gi 173Mi 131Mi 1.7Gi 1.5Gi
Swap: 0B 0B 0B
I will try to see if switching to CUDA 10.2 makes a difference
I tried again with CUDA 10.2 (so only upgrading Marian and not upgrading "everything") and could NOT reproduce the issue anymore. With Marian 1.10 on CUDA 10.2 I also have 8.5G available on that machine in this setup. I'll try downgrading MKL on the CUDA 11.2 setup, but I suspect it's actually caused by the different CUDA version. Does Marian do anything differently for CUDA 11, or might this just be CUDA 11 requiring more memory?
Good info. The only thing I can think of would be the switch to newer cuSparse functionality, which was involved last time, but the init is still lazy and should not happen if you don't use it, and standard models don't. Until I have a detailed look, the CUDA 11 theory might be the most probable one.
After fighting to get 1.9.1 to compile with CUDA 11, it seems it is indeed that. I see about a 10 MB difference between 1.9.1 and 1.10.0 with CUDA 11. For both versions I see a drop of about 40 MB when going back to CUDA 10.2.
Thank you for looking into it! How big was the model you tested this with? I wonder why you are seeing a 40MB difference while I am getting a ~400MB difference.
It strongly looks like this new phenomenon is not a Marian issue but rather a CUDA thing/issue(?)
Are you getting 400 MB per process?
Ah yeah, I see it above. Hm. Can you share model configs and server settings?
I hope this is what you are looking for:
model config: https://gist.github.com/patrickhuy/a5e86535debced6b390decb9bd405096
inference/server config: https://gist.github.com/patrickhuy/b164ba4cfb2848f50ea82a66803a1376
Marian is built with
cmake .. -DCOMPILE_SERVER=on -DCMAKE_BUILD_TYPE=Slim -DUSE_SENTENCEPIECE=false -DBUILD_ARCH=westmere -DCOMPILE_CUDA_SM35=false -DCOMPILE_CUDA_SM50=false -DCOMPILE_CUDA_SM60=false -DCOMPILE_CUDA_SM80=false -DINTRINSICS="-mtune=cascadelake -msse2 -msse3 -msse4.1 -msse4.2"
model.npz.best-bleu.npz is ~300MB (if this makes a difference).
There is a PyTorch issue about CUDA allocating a lot of memory to load kernels (https://github.com/pytorch/pytorch/issues/12873). I wonder if this is related and if something changed there in CUDA 11. I also wonder if it's actually possible to influence this behavior.
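If kernel/context loading is the culprit, the growth should be visible even without Marian. A minimal sketch (assuming Linux and the CUDA runtime API; hypothetical test code, not part of Marian) that compares host RSS before and after bare context creation, which could be run once under CUDA 10.2 and once under 11.2:

```cpp
// Hypothetical diagnostic: measure how much host RSS the CUDA context itself
// costs, independent of Marian. Build with e.g.: nvcc rss_probe.cu -o rss_probe
#include <cuda_runtime.h>
#include <cstdio>
#include <fstream>
#include <string>

// Read VmRSS (in kB) from /proc/self/status (Linux only).
static long rssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while(std::getline(status, line))
    if(line.rfind("VmRSS:", 0) == 0)
      return std::stol(line.substr(6));
  return -1;
}

int main() {
  std::printf("RSS before CUDA context init: %ld kB\n", rssKb());
  cudaFree(0);  // forces context creation and loading of the kernel images
  std::printf("RSS after  CUDA context init: %ld kB\n", rssKb());
  return 0;
}
```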
Note: I don't really understand how the "available" memory number is calculated (as it's higher than free + cache for CUDA 10), but it seems that the memory is actually usable.
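For what it's worth, "available" in free -h is the kernel's MemAvailable field from /proc/meminfo: an estimate of how much memory new workloads could use without swapping (it counts reclaimable page cache and slab, minus low watermarks), so it is not simply free + cache. A tiny sketch (Linux only, not Marian-specific) to print the raw fields side by side:

```cpp
// Dump the /proc/meminfo fields that free -h summarizes, so MemAvailable
// can be compared against MemFree + Cached directly.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  std::ifstream meminfo("/proc/meminfo");
  std::string line;
  while(std::getline(meminfo, line)) {
    if(line.rfind("MemTotal:", 0) == 0 || line.rfind("MemFree:", 0) == 0 ||
       line.rfind("MemAvailable:", 0) == 0 || line.rfind("Cached:", 0) == 0 ||
       line.rfind("SReclaimable:", 0) == 0)
      std::cout << line << "\n";  // MemAvailable is a kernel estimate, not MemFree + Cached
  }
  return 0;
}
```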