CTranslate2
Multiquery architectures broken - OpenNMT-py - score_batch
I keep running into a ValueError when calling score_batch on a model I've just converted. I'm using score_batch to filter translation data so I can continue training the model on better data, but whenever the function is called on the Translator instance loaded from the converted checkpoint, it throws a shape error. The x and y in the error vary from batch to batch; when I call score_batch on a single example, no error is thrown.
I've checked the structure and values of the encoded source/target being passed in, and all of that seems to be handled correctly. I'm passing in batches of 4096-6048 sentences with a max_batch_size of 2048 tokens. The exact same code (no changes) has worked for models with other architectures, and translating with translate_batch works just fine, so I'm not sure what's going on.
Please let me know if a converted model / OpenNMT-py checkpoint would help, or if there is anything else I can do to assist.
I've done some testing and so far have only found that it works when it processes one batch of one line at a time.
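For now the only configuration that reliably works for me is scoring one example per batch, so I am falling back to something like the sketch below. The score_one_by_one helper is purely illustrative (it is not part of the CTranslate2 API); score_fn stands in for Translator.score_batch:

```python
def score_one_by_one(score_fn, src_tokens, tgt_tokens):
    """Score each (source, target) pair in its own single-example batch.

    score_fn stands in for ctranslate2.Translator.score_batch; this
    wrapper is illustrative, not a CTranslate2 API.
    """
    results = []
    for src, tgt in zip(src_tokens, tgt_tokens):
        # max_batch_size=1 with batch_type="examples" is the only
        # configuration that does not trigger the shape error here.
        results.extend(
            score_fn(source=[src], target=[tgt],
                     max_batch_size=1, batch_type="examples")
        )
    return results
```

It is slow, but it matches the one passing configuration in the tests below.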
Test with en-tr failed
Batch size: 2048 of tokens
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (2050048) is incompatible with current size (3112960)
----
Test with en-tr failed
Batch size: 1000 of examples
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (32787456) is incompatible with current size (96988160)
----
Test with en-tr failed
Batch size: 4096 of tokens
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (4104192) is incompatible with current size (6906880)
----
Test with en-tr failed
Batch size: 100 of examples
Total tokens: src - 32924 tgt - 31022
Example count: 997
ERROR OUTPUT: new shape size (5474304) is incompatible with current size (9728000)
----
TEST WITH en-tr PASSED
Batch size: 1 of examples
Total tokens: src - 32924, tgt - 31022
Example count: 997
Avg perplexity: 36.15941306393914
----
Test with en-tr failed
Batch size: 100 of examples
Total tokens: src - 3648 tgt - 3580
Example count: 100
ERROR OUTPUT: new shape size (3768320) is incompatible with current size (8704000)
----
Test with en-tr failed
Batch size: 2048 of tokens
Total tokens: src - 361 tgt - 372
Example count: 10
ERROR OUTPUT: new shape size (391168) is incompatible with current size (788480)
----
Test with en-tr failed
Batch size: 2048 of examples
Total tokens: src - 114 tgt - 125
Example count: 2
ERROR OUTPUT: new shape size (130048) is incompatible with current size (157696)
----
Test with en-tr failed
Batch size: 2 of examples
Total tokens: src - 114 tgt - 125
Example count: 2
ERROR OUTPUT: new shape size (130048) is incompatible with current size (157696)
----
TEST WITH en-tr PASSED
Batch size: 2 of examples
Total tokens: src - 59, tgt - 76
Example count: 1
Avg perplexity: 51.4047366491775
----
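A side note on the "Avg perplexity" figure the passing tests print: the script averages squared absolute token log-probs, which I am using as a rough quality score rather than perplexity in the strict sense. Conventional perplexity can be computed from the same log_probs as the exponential of the negative mean log-probability; a sketch on plain lists (no model needed):

```python
import math

def avg_perplexity(batch_log_probs):
    """Average of per-sentence perplexities.

    batch_log_probs: one list of token log-probabilities per sentence,
    as found in the log_probs attribute of score_batch results.
    """
    perps = [math.exp(-sum(lp) / len(lp)) for lp in batch_log_probs]
    return sum(perps) / len(perps)

# Two toy sentences with a uniform log-prob of -2.0 per token:
print(avg_perplexity([[-2.0, -2.0], [-2.0, -2.0, -2.0]]))  # e**2, about 7.389
```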
Note that max_size in the script truncates built_batches in place, so it also shortens the data seen by the subsequent tests; that is why max_size descends across the test calls. Here is the script:
import ctranslate2
import sentencepiece

pivot_lang = "en"
langs = ["tr"]
base_folder = "C:/Machine_Learning/NLP"
lang_pair = "middle_east"

ct2_model = ctranslate2.Translator(
    f"{base_folder}/models/{lang_pair}/ct2",
    device="cuda",
    compute_type="int8",
)
source_sentence = sentencepiece.SentencePieceProcessor(
    f"{base_folder}/models/{lang_pair}/general_multi.model"
)
target_sentence = sentencepiece.SentencePieceProcessor(
    f"{base_folder}/models/{lang_pair}/general_multi.model"
)

def encode(src_sents, tgt_sents, tgt_lang, src_lang):
    if isinstance(src_sents, str):
        src_sents = [src_sents]
    if isinstance(tgt_sents, str):
        tgt_sents = [tgt_sents]
    src_sents = source_sentence.Encode(src_sents, out_type=str)
    tgt_sents = target_sentence.Encode(tgt_sents, out_type=str)
    return src_sents, tgt_sents

class Test:
    def __init__(self, src_lang, tgt_lang, batch_size, batch, batch_type="tokens", max_size=2048):
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.batch_size = batch_size
        self.batch_type = batch_type
        self.batch = batch
        # Note: this writes the truncated lists back into the caller's
        # object, so max_size also shortens built_batches for later tests.
        for x in range(len(self.batch)):
            self.batch[x] = self.batch[x][:max_size]

    def run_test(self):
        src_token, tgt_token = encode(self.batch[0], self.batch[1], self.tgt_lang, self.src_lang)
        assert len(src_token) == len(tgt_token), "Source and target example count must be equal"
        src_token_amount = sum(len(x) for x in src_token)
        tgt_token_amount = sum(len(x) for x in tgt_token)
        try:
            results = ct2_model.score_batch(
                source=src_token,
                target=tgt_token,
                max_batch_size=self.batch_size,
                batch_type=self.batch_type,
            )
        except Exception as e:
            print(
                f"----\nTest with {self.src_lang}-{self.tgt_lang} failed"
                f"\nBatch size: {self.batch_size} of {self.batch_type}"
                f"\nTotal tokens: src - {src_token_amount} tgt - {tgt_token_amount}"
                f"\nExample count: {len(src_token)}\nERROR OUTPUT:", e)
            return
        # Average of squared absolute token log-probs per sentence,
        # then averaged over the batch (used as a rough quality score).
        simplified_perps = [x.log_probs for x in results]
        simplified_perps = [sum(abs(x) ** 2 for x in y) / len(y) for y in simplified_perps]
        simplified_perps = sum(simplified_perps) / len(simplified_perps)
        print(
            f"----\nTEST WITH {self.src_lang}-{self.tgt_lang} PASSED"
            f"\nBatch size: {self.batch_size} of {self.batch_type}"
            f"\nTotal tokens: src - {src_token_amount}, tgt - {tgt_token_amount}"
            f"\nExample count: {len(src_token)}\nAvg perplexity: {simplified_perps}\n----")

src, tgt = pivot_lang, langs[0]
source_file = "C:/TranslationData/flores200_dataset/dev/eng_Latn.dev"
target_file = "C:/TranslationData/flores200_dataset/dev/tur_Latn.dev"

with open(source_file, encoding="utf8") as src_file, open(target_file, encoding="utf8") as tgt_file:
    original = [line.replace("\n", "") for line in src_file.readlines()]
    references = [line.replace("\n", "") for line in tgt_file.readlines()]
built_batches = [original, references]

Test(src, tgt, 2048, built_batches).run_test()
Test(src, tgt, 1000, built_batches, "examples").run_test()
Test(src, tgt, 4096, built_batches).run_test()
Test(src, tgt, 100, built_batches, "examples").run_test()
Test(src, tgt, 1, built_batches, "examples").run_test()
Test(src, tgt, 100, built_batches, "examples", 100).run_test()
Test(src, tgt, 2048, built_batches, "tokens", 10).run_test()
Test(src, tgt, 2048, built_batches, "examples", 2).run_test()
Test(src, tgt, 2, built_batches, "examples", 2).run_test()
Test(src, tgt, 2, built_batches, "examples", 1).run_test()
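To make the max_size side effect mentioned above concrete: because Test.__init__ assigns the truncated lists back into the batch object it was given, the caller's built_batches is shortened in place. A minimal stand-alone illustration (truncate_in_place is just a demo helper, not from the script):

```python
def truncate_in_place(batch, max_size):
    # Mirrors what Test.__init__ does: writing truncated lists back into
    # the same outer list mutates whatever object the caller passed in.
    for x in range(len(batch)):
        batch[x] = batch[x][:max_size]

built = [["s1", "s2", "s3"], ["t1", "t2", "t3"]]
truncate_in_place(built, 2)
print(built)  # [['s1', 's2'], ['t1', 't2']] - the caller's lists are shortened
```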