The output answer is identical to the reference answer, yet the BLEU score is always 0.
Demo:

from ragas import evaluate
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import BleuScore

if __name__ == '__main__':
    query = "今天天气怎么样"      # "What's the weather like today?"
    reference = "今天天气很不错"  # "The weather is quite nice today"
    response = "今天天气很不错"   # identical to the reference

    dataset = EvaluationDataset(
        samples=[SingleTurnSample(user_input=query, response=response, reference=reference)]
    )
    metrics = [BleuScore()]

    # BleuScore compares strings directly, so no llm/embeddings are needed
    result = evaluate(dataset=dataset, metrics=metrics)
    print(result)  # bleu_score comes back as 0.0 despite the exact match
Hi @LillyChen,
Thanks for raising this!
Under the hood, the BLEU metric in Ragas uses the sacrebleu library. By default, sacrebleu uses a tokenizer that’s not optimized for Mandarin, which can lead to a BLEU score of 0 even when the output and reference are identical.
Since you’re evaluating Chinese text, you should explicitly set tokenize='zh' in the BLEU calculation. This tells sacrebleu to use a Chinese-specific tokenizer that better handles character boundaries.
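You can see the tokenizer's effect by calling sacrebleu directly. A minimal sketch (the expected scores in the comments assume a recent sacrebleu version):

import sacrebleu

reference = "今天天气很不错"
response = "今天天气很不错"

# The default '13a' tokenizer only splits on whitespace and punctuation, so
# the whole Chinese sentence stays a single token and every higher-order
# n-gram count is empty.
print(sacrebleu.corpus_bleu([response], [[reference]]).score)                  # 0.0

# The 'zh' tokenizer segments at character boundaries, so the n-grams line up.
print(sacrebleu.corpus_bleu([response], [[reference]], tokenize="zh").score)  # 100.0

In Ragas, you can forward the same option through BleuScore's kwargs, which are passed on to sacrebleu: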
from ragas import evaluate
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.metrics import BleuScore

reference = "今天天气很不错"
response = "今天天气很不错"

dataset = EvaluationDataset(samples=[SingleTurnSample(response=response, reference=reference)])
metrics = [BleuScore(kwargs={"tokenize": "zh"})]

result = evaluate(
    dataset=dataset,
    metrics=metrics,
)
print(result)
Output:
Evaluating: 100%|██████████| 1/1 [00:00<00:00, 232.56it/s]
{'bleu_score': 1.0000}
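(Note: sacrebleu itself reports BLEU on a 0-100 scale, while Ragas normalizes the score to 0-1, which is why the exact match shows up here as 1.0000 rather than 100.)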
I encountered the same problem when calculating RougeScore for Chinese. How should I handle it?
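It's likely the same root cause: RougeScore in Ragas is built on the rouge-score package, whose default tokenizer keeps only ASCII alphanumeric tokens, so Chinese characters are dropped before any overlap is counted. As far as I can tell, Ragas doesn't expose a tokenizer option for this metric, but rouge-score itself accepts a custom tokenizer. Here's a minimal sketch calling the library directly (CharTokenizer is a hypothetical helper, not part of either library):

from rouge_score import rouge_scorer

class CharTokenizer:
    # Treat each character as a token, a reasonable default for CJK text
    # where individual characters carry meaning.
    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]

reference = "今天天气很不错"
response = "今天天气很不错"

scorer = rouge_scorer.RougeScorer(["rougeL"], tokenizer=CharTokenizer())
print(scorer.score(reference, response)["rougeL"].fmeasure)  # 1.0

Alternatively, pre-segmenting both strings with a word segmenter such as jieba and joining the tokens with spaces before scoring should have a similar effect.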