marian-dev
Marian does not do Sanity Checking before inserting into SQLite
Bug description
Currently I am training a multilingual model. I tried using both Transformer-Base as well as Transformer-Big.
My training corpus has around 330M sentences, which is why I use --sqlite for shuffling in Marian.
The size of my source corpus was 39GB and target was 94GB.
The sqlite DB's size is around 154GB.
How to reproduce
- I cannot upload the whole corpus. Basically, 200M sentences in the corpus are back-translated data, and 130M sentences are real data (4X oversampled).
- I have thoroughly checked the corpus to see if it is corrupt. It looked good to me.
- I generated the SPM vocabs myself, as described here, using a subset (2M sentences) of the corpus.
Context
After some iterations, the model just repeats the same output sentence for any input sentence. Could this be because of such a large sqlite DB? Or are there any clues as to why this is happening?
Thanks!
Hm. Based on the training log I would still say there is an issue with the corpus; the training cost indicates that the model stops learning. Can you try with a random sample from the corpus, say 1M lines from each side, so it is smaller, and see if it learns anything?
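One way to draw such a sample while keeping source and target aligned is reservoir sampling over the zipped files. The following is a minimal sketch, not part of Marian; the function name, file paths and seed are hypothetical, and it assumes both files are UTF-8 with one sentence per "\n"-terminated line.

```python
import random
from itertools import zip_longest


def sample_parallel(src_path, tgt_path, out_src, out_tgt, k, seed=1111):
    """Reservoir-sample k aligned line pairs from two parallel files.

    Returns the number of pairs written. Raises ValueError if the two
    files do not have the same number of lines.
    """
    rng = random.Random(seed)
    reservoir = []
    # newline="\n" means only "\n" ends a line; a stray "\r" stays inside
    # the sentence instead of silently creating an extra line.
    with open(src_path, encoding="utf-8", newline="\n") as fs, \
         open(tgt_path, encoding="utf-8", newline="\n") as ft:
        for i, (s, t) in enumerate(zip_longest(fs, ft)):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {i + 1}")
            if i < k:
                reservoir.append((s, t))
            else:
                # Keep the new pair with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = (s, t)
    with open(out_src, "w", encoding="utf-8", newline="\n") as fs, \
         open(out_tgt, "w", encoding="utf-8", newline="\n") as ft:
        for s, t in reservoir:
            # The last line of a file may lack a trailing newline.
            fs.write(s if s.endswith("\n") else s + "\n")
            ft.write(t if t.endswith("\n") else t + "\n")
    return len(reservoir)
```

Sampling pairs rather than sampling each file independently matters here: shuffling or sampling the two sides separately would itself destroy the alignment.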
Thanks. OK, I will try that. It might also be that the monolingual data I used to generate the back-translations is much more diverse than the 2M sentences the SPM vocab was learned from. Do you think that could be the case? I ask because the report above covers less than a single full epoch.
Also, what do we generally do when the amount of back-translation data far exceeds (10X-20X) the real data?
Do we oversample the real data? Or is it recommended to use --data-weighting?
- Unless there is something wonky going on I would not blame the SPM, but who knows. I would rather suspect that at some point you have misalignments in there, like source not corresponding to target.
- I would oversample. We have used the data-weighting for quite specific things, not sure if it would work very well for balancing. The training progression, like the learning rate, would still be based on the number of actual sentences, which might make convergence even worse.
Thanks @emjotde I finally managed to find it. I think some kind of SQL(ite) Injection has happened because of a noisy line in the data.
So my corpus was perfectly aligned when I inspected it again manually.
Then I suspected the sqlite DB, so I checked for misalignments by manually going through the DB (thank you binary search).
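A binary search of this kind could be automated roughly as follows. This is only a sketch under several assumptions: the table name `lines` and the `_id` column are taken from the sqlite output in this thread, the source and target are assumed to be the second and third columns of each row, `_id` is assumed to equal the 0-based corpus line number, and the function name is hypothetical. It also assumes that once the DB goes out of step with the corpus it stays out of step, so that "row matches corpus" is true up to some point and false afterwards.

```python
import sqlite3


def first_misaligned(conn, src_lines, tgt_lines):
    """Return the smallest _id whose row differs from the corpus, or None.

    src_lines and tgt_lines are indexable sequences of corpus lines with
    line terminators already stripped, indexed by 0-based line number.
    """
    def matches(i):
        row = conn.execute(
            "SELECT * FROM lines WHERE _id = ?", (i,)
        ).fetchone()
        # A missing row counts as a mismatch.
        return (row is not None
                and row[1] == src_lines[i]
                and row[2] == tgt_lines[i])

    lo, hi = 0, len(src_lines)
    # Invariant: every index below lo matches, every index from hi on does not.
    while lo < hi:
        mid = (lo + hi) // 2
        if matches(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(src_lines) else None
```

For a corpus of this size the two sequences would need to be backed by something cheaper than in-memory lists (for example a line-offset index), but the search itself only touches about log2(N) rows.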
I found a misalignment that starts at one point and continues to the end of the DB. I have attached the sample lines which caused the issue.
When you open the files with a file editor, you will see that the number of lines does not seem to match. I did that on purpose so that you can reproduce the issue at your end.
Now run the following command to populate the data into sqlite:
~/marian/build/marian \
--devices 0 1 2 3 \
--model model/model.npz --type transformer --task transformer-big \
--train-sets tmp.en tmp.hi \
--max-length 256 \
--vocabs model/vocab.en.spm model/vocab.hi.spm \
--dim-vocabs 32000 32000 \
--mini-batch-fit -w 12000 --mini-batch 1024 --maxi-batch 1024 \
--early-stopping 20 \
--valid-freq 10000 --save-freq 20000 --disp-freq 500 \
--valid-metrics ce-mean-words bleu-detok \
--valid-sets ~/datasets/test.en ~/datasets/test.hi \
--quiet-translation \
--valid-mini-batch 64 \
--cost-type=ce-mean-words \
--beam-size 12 --normalize 1 \
--log model/train.log --valid-log model/valid.log \
--tempdir ~/dumps/tmp/ --sqlite ~/dumps/tmp/db.sqlite --sqlite-drop \
--keep-best \
--lr-report \
--tied-embeddings \
--sync-sgd --seed 1111 \
--exponential-smoothing
Now this is what we have in db.sqlite :
sqlite> SELECT * from lines;
0|<hi> The Supreme Court has refused an immediate hearing on public interest litigation seeking to investigate the claim of former Justice Kurian Joseph of the Apex Court.|शीर्ष अदालत के पूर्व जस्टिस कुरियन जोसफ के दावे की जांच की मांग वाली जनहित याचिका पर सुप्रीमकोर्ट ने तत्काल सुनवाई से इनकार कर दिया है।
1|<hi> On the Congress side, the Assembly Vice-President Rajendra Kumar Singh started the discussion and said that he had the opportunity to speak for the first time in this Assembly.|कांग्रेस की ओर
से विधानसभा उपाध्यक्ष राजेंद्र कुमार सिंह ने चर्चा की शुरूआत की और कहा कि उन्हें इस विधानसभा में पहली बार बोलने का अवसर मिला है।
2|<hi> Even from here, many people have left the institution.|यहां से भी कई लोगों ने संस्थान को बाय कर दिया है.
3|<hi> The police with the family are preparing to register a case.|परिवार वाले पुलिस में केस दर्ज कराने की तैयारी में है।
4|<hi> Let me tell you that the Election Commission has also issued a notice to PTI, the party of PM Imran, directing them to provide details of the expenditure incurred during the campaign.It may be recalled that Chairman of Pakistan Tehreek-e-Insaf Imran Khan took over as the 22nd Prime Minister of Pakistan in August last year.|आपको बता दें कि चुनाव आयोग ने पीएम इमरान की पार्टी पीटीआई को भी नोटिस जारी किया गया है। उनसे भी चुनाव प्रचार के दौरान किए गए खर्च का ब्यौरा देने के निर्देश दिए गए हैं। गौरतलब है कि पाकिस्तान तहरीक-ए-इंसाफ के चेयरमैन इमरान खान ने बीते अगस्त में ही 22वें प्रधानमंत्री के रूप में पाक की सत्ता की कमान संभाली है।
odaimaker_db_1 docker-entrypoint.sh mysqld Up 0.0.0.0:3306->3306/tcp, 33060/tcp-----
6|<hi> Name Command State Ports|दुनिया की आबादी 600करोड़ से भी ज्यादा है । तो एक यमराज के लिए संभव नहीं कि वह सभी मृत इंसानों की आत्मा यमलोक ला सके । इसीलिए ब्रह्माजी ने यमराज का क्लोन बनाया । उसमें से 'यमराम 5003' सबसे थोड़ा अलग था । ब्रह्मा जब इस यमराज को आकार दे रहे थे तभी देवी सरस्वती ने उन्हें बुलाया और वह चले गए । तभी वहां नारद मुनि पहुंच कर एक मानव का ब्रेन सेल 'यमराज नम्बर 5003' में लगा दिया जोकि एक कवि था । श्रृंगार रस का कवि । उसकी मौत दिल का दौरा पड़ने से हुई । उसी 'यमराज नम्बर 5003' को सुमति सामंत नामक लड़की की आत्मा लाने को कहा गया। वह ट्रेन के सामने आकर आत्महत्या करने वाली थी लेकिन उसके प्रेमी सुब्रत ने यमराज से सुमति का जीवनदान मांगा । लेकिन यमराज जीवनदान नहीं देते वह तो जीवन लेते । यह उनके सिद्धांतों के खिलाफ है । तो क्या 'यमराज नम्बर 5003' सुमति को जीवनदान देता है ? उसके बाद इंसानी दिमागवाले इस यमराज का क्या होता है ? इन सभी सवालों का जवाब जानने के लिए आपको किताब पढ़नी पडे़गी ।
7|<hi> -----------------------------------------------------------------------------------------------------------------|आज उनको लखनऊ में पेश किया गया।
8|<hi> odaimaker-db-1 docker-entrypoint.sh mysqld UP 0.0.0.0:3306-?3306/tcp, 33060/tcp|आध्यात्मिक उन्नति हेतुः चलते फिरते, दैनिक कार्य करते हुए भगवन्नाम का जप, सब में भगवन्नाम, हर दो कार्यों के बीच थोड़ा शांत होना, सबकी भलाई में अपना भला मानना, मन के विचारों पर निगरानी रखना, आदरपूर्वक सत्संग व स्वाध्याय करना आदि शीघ्र आध्यात्मिक उन्नति के उपाय हैं।
9|<hi> The world population is more than 600 crore. So it is not possible for a Yamraj to bring the souls of all the dead. So Brahmaji created the clone of Yamraj.|करीब एक घण्टे से ज्यादा देरी तक बारिश चलती रही।
It seems like some kind of SQL injection could have happened at _id 5, which is where the misalignment starts.
Can you please check this? Does Marian sanitize the SQL query before executing it?
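For reference, the usual way to keep corpus text from being interpreted as SQL is to pass it as bound parameters instead of pasting it into the query string. The sketch below shows the idea with Python's built-in sqlite3 module; it is not Marian's C++ code, the function name and the column names `line0` and `line1` are hypothetical, and whether Marian's insert path actually needs this is not established here.

```python
import sqlite3


def insert_pairs(conn, pairs):
    """Insert (source, target) pairs using bound parameters.

    The "?" placeholders make SQLite treat each sentence purely as data,
    so quotes, pipes or SQL keywords inside a line cannot alter the query.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lines "
        "(_id INTEGER PRIMARY KEY, line0 TEXT, line1 TEXT)"
    )
    conn.executemany(
        "INSERT INTO lines VALUES (?, ?, ?)",
        ((i, src, tgt) for i, (src, tgt) in enumerate(pairs)),
    )
    conn.commit()
```

With this pattern a line such as `'); DROP TABLE lines; --` is stored verbatim as text. If rows still go out of step with bound parameters, the cause is more likely in how lines are read from the files than in the SQL itself.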
Probably doesn't. https://github.com/marian-nmt/marian-dev/blob/f74d055d204d7ed417f3dd26ad469192d70e8112/src/data/corpus_sqlite.cpp#L57-L67
There's sqlite3_mprintf in 3rd-party/sqlitecpp/.. which can be used here.
Or simply sanitise once outside in your cleaning pipeline and ensure the corpus is clean.
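A cleaning-pipeline check along those lines might look like the sketch below. It assumes the offending line contains a control character or an alternative line separator, which would fit the line counts differing in a file editor but has not been confirmed in this thread. The character set and the function names are assumptions to adapt to your data.

```python
import re

# Control characters and alternative line separators that some tools treat
# as line breaks: C0 controls except tab and newline, DEL, NEL (U+0085),
# LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).
SUSPICIOUS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f\x85\u2028\u2029]")


def find_suspicious_lines(path):
    """Yield (line_number, repr_of_line) for lines with suspicious characters."""
    # newline="\n" so that only "\n" ends a line and "\r" stays visible.
    with open(path, encoding="utf-8", newline="\n") as f:
        for n, line in enumerate(f, start=1):
            body = line[:-1] if line.endswith("\n") else line
            if SUSPICIOUS.search(body):
                yield n, repr(body)


def clean_line(line):
    """Replace each suspicious character with a single space."""
    return SUSPICIOUS.sub(" ", line)
```

Running `find_suspicious_lines` on both sides of the corpus before training gives the 1-based line numbers to inspect; `clean_line` is one possible repair, and dropping the affected sentence pair is another.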