
Suggestions for GPUs

Open hobodrifterdavid opened this issue 4 years ago • 6 comments

Hello. A friend and I are experimenting with using Marian to translate movie subtitles for language learners. Initially we are trying to better filter the OpenSubtitles corpus and train a single es->en model on an RTX 2060 with 6 GB. If that goes well, we'd like to get a faster setup. How would you recommend spending a budget of $1500-$2000 USD on GPUs for training with Marian? (We'd probably look for deals on eBay.)

- 1x TITAN RTX 24 GB
- 2x RTX 2080 Ti 11 GB
- 3x RTX 2080 8 GB
- 4x RTX 2070 8 GB
- (or perhaps older GTX 1080 Ti 11 GB?)

I'd guess one of the top two, as I've read that GPU RAM is critical, and those cards can benefit from FP16? Thank you for the excellent software.

David

hobodrifterdavid, May 24 '20 20:05

If you intend to parallelize over GPUs, aim for a power of 2. 1 is a power of 2.

kpu, May 24 '20 20:05
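For context, the number of cards is what Marian's `--devices` option takes. A minimal sketch of picking a power-of-2 set of device IDs (the IDs here are assumptions; check your own machine):

```bash
# List the installed cards and their IDs (standard nvidia-smi option):
nvidia-smi -L

# Pick a power-of-2 number of IDs; this is what $GPUS expands to in the
# training command quoted later in the thread.
GPUS="0 1"   # two cards; "0" for one, "0 1 2 3" for four
```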

I would say two GPUs are preferable to one. With synchronous SGD the RAM in the two cards basically adds up in terms of batch size (not model size though) while doubling the speed. 8 GB is a bit small, and you will likely not get the full benefit from having four GPUs without good interconnect. I don't know those chips very well, so someone should comment on the relative benefits of the specific cards.

emjotde, May 24 '20 20:05
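Marian exposes synchronous SGD as a flag. A minimal sketch, assuming two cards with IDs 0 and 1 (the flag name is from `marian --help`; the device IDs are assumptions):

```bash
# --sync-sgd averages gradients across all listed devices every step, so
# the per-card mini-batches combine into one larger effective batch.
# Each card still holds a full copy of the model, which is why memory
# "adds up" for batch size but not for model size.
MULTI_GPU_OPTS="--devices 0 1 --sync-sgd"
# ...then append $MULTI_GPU_OPTS to the training command quoted below.
```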

Further info: we've got an old Fujitsu TX300 S7 Xeon E5 (~Sandy Bridge) server that cost ~$200. We had to cut a hole in the side of the case to get the GPU in, but that was no problem. It has two 16x slots, plus two more 8x slots that can also take GPUs. It supports 4x (inexpensive) power supplies, and RAM is about $1/GB. Very pleased with the machine, but it needs to be in a room without people; servers are noisy. Happy to give more info on this if it's of interest.

hobodrifterdavid, May 24 '20 20:05

I'm training with this, from the SentencePiece example:

```bash
$MARIAN/build/marian \
    --devices $GPUS \
    --type s2s \
    --model model/model.npz \
    --train-sets data/corpus.es data/corpus.en \
    --vocabs model/vocab.esen.spm model/vocab.esen.spm \
    --dim-vocabs 32000 32000 \
    --mini-batch-fit -w 4000 \
    --layer-normalization --tied-embeddings-all \
    --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 \
    --early-stopping 5 --max-length 100 \
    --valid-freq 10000 --save-freq 10000 --disp-freq 1000 \
    --cost-type ce-mean-words --valid-metrics ce-mean-words bleu-detok \
    --valid-sets data/subs-dev.es data/subs-dev.en \
    --log model/train.log --valid-log model/valid.log --tempdir model \
    --overwrite --keep-best \
    --seed 1111 --exponential-smoothing \
    --normalize=0.6 --beam-size=6 --quiet-translation
```

I see the marian process is using one core, stuck at 100% mostly. It looks a bit like Marian is limited by single-thread speed... unless the CPU is just polling something or collecting stats in a loop. I'll try the dev branch next, I guess.

hobodrifterdavid, May 24 '20 22:05
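For anyone reproducing the command above, a minimal setup sketch; every path and value below is inferred from the flags, not stated in the original post:

```bash
# Assumed layout for the command above (paths are hypothetical):
export MARIAN=$HOME/marian    # checkout root, compiled binary in build/
GPUS="0"                      # a single RTX 2060 in the original setup
mkdir -p model                # --model, --log and --tempdir point into model/
ls data/corpus.es data/corpus.en      # parallel training corpus
ls data/subs-dev.es data/subs-dev.en  # validation sets
```

Note that `--mini-batch-fit` sizes mini-batches automatically to fill the workspace given by `-w 4000` (MB per GPU), which is sized for the 6 GB RTX 2060 mentioned at the top of the thread.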

Initial test was very promising. I hope to become more familiar with the code and perhaps even contribute something to the project at some point.

hobodrifterdavid, May 25 '20 14:05

Did you check your GPU usage? `watch -n1 nvidia-smi`

I think you have only one GPU running; that's why you see one CPU core at 100% (from my own experience).

adjouama, Jul 08 '20 16:07
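A slightly more targeted version of that check; the query fields below are standard `nvidia-smi` options:

```bash
# Print per-GPU utilization and memory once per second. If only device 0
# shows load while the others sit at 0%, training is using a single card.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```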