The output answer is identical to the reference answer, yet the BLEU score is always 0.
Demo:

from ragas import evaluate
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import BleuScore

if __name__ == '__main__':
    query = "今天天气怎么样"      # "What's the weather like today?"
    reference = "今天天气很不错"  # "The weather is quite nice today"
    response = "今天天气很不错"   # identical to the reference

    dataset = EvaluationDataset(
        samples=[SingleTurnSample(user_input=query, response=response, reference=reference)]
    )
    metrics = [BleuScore()]

    # BleuScore compares strings directly, so no llm/embeddings are needed
    result = evaluate(dataset=dataset, metrics=metrics)
    print(result)  # bleu_score comes back as 0.0 despite the exact match
Hi @LillyChen,
Thanks for raising this!
Under the hood, the BLEU metric in Ragas uses the sacrebleu library. By default, sacrebleu uses a tokenizer that’s not optimized for Mandarin, which can lead to a BLEU score of 0 even when the output and reference are identical.
Since you’re evaluating Chinese text, you should explicitly set tokenize='zh' in the BLEU calculation. This tells sacrebleu to use a Chinese-specific tokenizer that better handles character boundaries.
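You can see the tokenizer's effect by calling sacrebleu directly. A minimal sketch (the expected scores in the comments assume a recent sacrebleu version):

import sacrebleu

reference = "今天天气很不错"
response = "今天天气很不错"

# The default '13a' tokenizer only splits on whitespace and punctuation, so
# the whole Chinese sentence stays a single token and every higher-order
# n-gram count is empty.
print(sacrebleu.corpus_bleu([response], [[reference]]).score)                  # 0.0

# The 'zh' tokenizer segments at character boundaries, so the n-grams line up.
print(sacrebleu.corpus_bleu([response], [[reference]], tokenize="zh").score)  # 100.0

In Ragas, you can forward the same option through BleuScore's kwargs, which are passed on to sacrebleu: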
from ragas import evaluate
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import BleuScore

reference = "今天天气很不错"
response = "今天天气很不错"

dataset = EvaluationDataset(samples=[SingleTurnSample(response=response, reference=reference)])
metrics = [BleuScore(kwargs={"tokenize": "zh"})]

result = evaluate(
    dataset=dataset,
    metrics=metrics,
)
print(result)
Output:
Evaluating: 100%|██████████| 1/1 [00:00<00:00, 232.56it/s]
{'bleu_score': 1.0000}
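(Note: sacrebleu itself reports BLEU on a 0-100 scale, while Ragas normalizes the score to 0-1, which is why the exact match shows up here as 1.0000 rather than 100.)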
I encountered the same problem when calculating RougeScore for Chinese. How should I handle it?
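It's likely the same root cause: RougeScore in Ragas is built on the rouge-score package, whose default tokenizer keeps only ASCII alphanumeric tokens, so Chinese characters are dropped before any overlap is counted. As far as I can tell, Ragas doesn't expose a tokenizer option for this metric, but rouge-score itself accepts a custom tokenizer. Here's a minimal sketch calling the library directly (CharTokenizer is a hypothetical helper, not part of either library):

from rouge_score import rouge_scorer

class CharTokenizer:
    # Treat each character as a token, a reasonable default for CJK text
    # where individual characters carry meaning.
    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]

reference = "今天天气很不错"
response = "今天天气很不错"

scorer = rouge_scorer.RougeScorer(["rougeL"], tokenizer=CharTokenizer())
print(scorer.score(reference, response)["rougeL"].fmeasure)  # 1.0

Alternatively, pre-segmenting both strings with a word segmenter such as jieba and joining the tokens with spaces before scoring should have a similar effect.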