
Multiquery architectures broken - OpenNMT-py - score_batch


I keep running into a ValueError when using score_batch on a model architecture I've just converted. I'm using score_batch to filter translation data so I can continue training the model on better data, but whenever the function is called on the Translator instance of the converted checkpoint, it throws a shape error. The reported sizes vary from batch to batch, and when I call score_batch on a single example no error is thrown.

I've checked the structure and values of the encoded source/target I pass in, and they all seem to be handled correctly. I'm passing in sentence batches of 4096-6048 examples with a max_batch_size of 2048 tokens. The exact same code (no changes made) has worked for models with other architectures, and translating with translate_batch works just fine, so I'm not sure what's going on.
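For reference, this is essentially the call pattern; the model path and sentences below are just placeholders, and the real script is in my next comment:

import ctranslate2
import sentencepiece

translator = ctranslate2.Translator("path/to/ct2_model", device="cuda", compute_type="int8")
sp = sentencepiece.SentencePieceProcessor("path/to/spm.model")

# SentencePiece subword tokens as lists of strings, one list per sentence
src_tokens = sp.Encode(["Hello world.", "How are you?"], out_type=str)
tgt_tokens = sp.Encode(["Merhaba dünya.", "Nasılsın?"], out_type=str)

# Raises the shape ValueError unless each batch ends up containing a single example
results = translator.score_batch(
    source=src_tokens,
    target=tgt_tokens,
    max_batch_size=2048,
    batch_type="tokens",
)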

Please let me know if a converted model / OpenNMT-py checkpoint would help or if I can assist.

ArtanisTheOne avatar Sep 25 '23 17:09 ArtanisTheOne

Did some testing and so far have only found that it works when each batch contains a single example.

Test with en-tr failed
Batch size: 2048 of tokens
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (2050048) is incompatible with current size (3112960)
----
Test with en-tr failed
Batch size: 1000 of examples
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (32787456) is incompatible with current size (96988160)
----
Test with en-tr failed
Batch size: 4096 of tokens
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (4104192) is incompatible with current size (6906880)
----
Test with en-tr failed
Batch size: 100 of examples
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (5474304) is incompatible with current size (9728000)
----
TEST WITH en-tr PASSED
Batch size: 1 of examples
Total tokens: src - 32924, tgt - 31022
Example count: 997
Avg perplexity: 36.15941306393914
----
Test with en-tr failed
Batch size: 100 of examples
Total tokens: src - 3648 tgt - 3580
Example count: 100
ERROR OUTPUT: new shape size (3768320) is incompatible with current size (8704000)
----
Test with en-tr failed
Batch size: 2048 of tokens
Total tokens: src - 361 tgt - 372
Example count: 10
ERROR OUTPUT: new shape size (391168) is incompatible with current size (788480)
----
Test with en-tr failed
Batch size: 2048 of examples
Total tokens: src - 114 tgt - 125
Example count: 2
ERROR OUTPUT: new shape size (130048) is incompatible with current size (157696)
----
Test with en-tr failed
Batch size: 2 of examples
Total tokens: src - 114 tgt - 125
Example count: 2
ERROR OUTPUT: new shape size (130048) is incompatible with current size (157696)
----
TEST WITH en-tr PASSED
Batch size: 2 of examples
Total tokens: src - 59, tgt - 76
Example count: 1
Avg perplexity: 51.4047366491775
----

max_size in the script also truncates built_batches in place (the Test constructor mutates the shared lists), which is why max_size descends across the tests. Here is the script:

import ctranslate2
import sentencepiece

pivot_lang = "en"
langs = ["tr"]
base_folder = "C:/Machine_Learning/NLP"
lang_pair = "middle_east"

# CTranslate2 model converted from the OpenNMT-py checkpoint
ct2_model = ctranslate2.Translator(
    f"{base_folder}/models/{lang_pair}/ct2",
    device="cuda",
    compute_type="int8",
)
# The same SentencePiece model is used for both source and target
source_sentence = sentencepiece.SentencePieceProcessor(
    f"{base_folder}/models/{lang_pair}/general_multi.model"
)
target_sentence = sentencepiece.SentencePieceProcessor(
    f"{base_folder}/models/{lang_pair}/general_multi.model"
)

def encode(src_sents, tgt_sents, tgt_lang, src_lang):
    # Accept either a single sentence or a list of sentences
    if isinstance(src_sents, str):
        src_sents = [src_sents]
    if isinstance(tgt_sents, str):
        tgt_sents = [tgt_sents]

    # Tokenize into subword strings
    src_sents = source_sentence.Encode(src_sents, out_type=str)
    tgt_sents = target_sentence.Encode(tgt_sents, out_type=str)

    return src_sents, tgt_sents

class Test():
    def __init__(self, src_lang, tgt_lang, batch_size, batch, batch_type="tokens", max_size=2048):
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.batch_size = batch_size
        self.batch_type = batch_type
        self.batch = batch
        # Keep only the first max_size examples per side.
        # Note: this mutates the shared built_batches lists in place.
        for x in range(len(self.batch)):
            self.batch[x] = self.batch[x][:max_size]

    def run_test(self):
        src_token, tgt_token = encode(self.batch[0], self.batch[1], self.tgt_lang, self.src_lang)
        assert len(src_token) == len(tgt_token), "Source and target example count must be equal"
        src_token_amount = sum(len(x) for x in src_token)
        tgt_token_amount = sum(len(x) for x in tgt_token)
        try:
            results = ct2_model.score_batch(
                source=src_token,
                target=tgt_token,
                max_batch_size=self.batch_size,
                batch_type=self.batch_type,
            )
        except Exception as e:
            print(
                f"----\nTest with {self.src_lang}-{self.tgt_lang} failed"
                f"\nBatch size: {self.batch_size} of {self.batch_type}"
                f"\nTotal tokens: src - {src_token_amount} tgt - {tgt_token_amount}"
                f"\nExample count: {len(src_token)}\nERROR OUTPUT:", e)
            return

        # Rough per-sentence score from the token log-probs (mean of squared magnitudes),
        # then averaged over all sentences.
        simplified_perps = [x.log_probs for x in results]
        simplified_perps = [
            sum(abs(x) ** 2 for x in y) / len(y)
            for y in simplified_perps
        ]
        simplified_perps = sum(simplified_perps) / len(simplified_perps)

        print(
            f"----\nTEST WITH {self.src_lang}-{self.tgt_lang} PASSED"
            f"\nBatch size: {self.batch_size} of {self.batch_type}"
            f"\nTotal tokens: src - {src_token_amount}, tgt - {tgt_token_amount}"
            f"\nExample count: {len(src_token)}\nAvg perplexity: {simplified_perps}\n----")


src, tgt = pivot_lang, langs[0]
source_file = "C:/TranslationData/flores200_dataset/dev/eng_Latn.dev"
target_file = "C:/TranslationData/flores200_dataset/dev/tur_Latn.dev"

built_batches = []
with (open(source_file, encoding="utf8") as src_file, open(target_file, encoding="utf8") as tgt_file):
    original = [line.replace("\n", "") for line in src_file.readlines()]
    references = [line.replace("\n", "") for line in tgt_file.readlines()]
    built_batches = [original, references]

Test(src, tgt, 2048, built_batches).run_test()
Test(src, tgt, 1000, built_batches, "examples").run_test()
Test(src, tgt, 4096, built_batches).run_test()
Test(src, tgt, 100, built_batches, "examples").run_test()
Test(src, tgt, 1, built_batches, "examples").run_test()
Test(src, tgt, 100, built_batches, "examples", 100).run_test()
Test(src, tgt, 2048, built_batches, "tokens", 10).run_test()
Test(src, tgt, 2048, built_batches, "examples", 2).run_test()
Test(src, tgt, 2, built_batches, "examples", 2).run_test()
Test(src, tgt, 2, built_batches, "examples", 1).run_test()
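(Side note: the "Avg perplexity" above is just my rough filtering score, the mean of squared token log-probs. If a standard sentence perplexity is more useful for debugging, something like this over the same score_batch results should give it:)

import math

# results: list of ScoringResult from score_batch, each with per-token log_probs
perplexities = [math.exp(-sum(r.log_probs) / len(r.log_probs)) for r in results]
avg_perplexity = sum(perplexities) / len(perplexities)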

ArtanisTheOne avatar Oct 03 '23 16:10 ArtanisTheOne