OPUS
The website and the API report different sentence counts for CCMatrix
https://opus.nlpl.eu/opusapi/?source=en&target=ru&preprocessing=moses&version=latest
[
...
{
"alignment_pairs": 139937785,
"corpus": "CCMatrix",
...
"version": "v1"
},
{
"alignment_pairs": 139937785,
"corpus": "NLLB",
...
"version": "v1"
}
]
Then from: https://opus.nlpl.eu/
| corpus | doc's | sent's | en tokens | ru tokens | XCES/XML | raw | TMX | Moses | mono | raw | ud | alg | dic | freq | other files | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NLLB v1 | 1 | 139.9M | 2.7G | 2.5G | xces en ru | en ru | tmx | moses | en ru | en ru | en ru | sample | |||||
| CCMatrix v1 | 1 | 35.1M | 581.4M | 527.5M | xces en ru | en ru | tmx | moses | en ru | en ru | en ru | sample | | | | | |
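To make the mismatch concrete, here is a minimal Python sketch that compares the two sources using only the numbers quoted above. The entries are hard-coded from the excerpts rather than fetched from the API. The helper names (`humanize`, `find_mismatches`, `find_shared_counts`) are hypothetical, and the rounding rule (one decimal, `M` suffix) is an assumption about how the website formats counts.

```python
from collections import defaultdict

# Entries copied from the API response above, reduced to the fields shown there.
api_entries = [
    {"corpus": "CCMatrix", "version": "v1", "alignment_pairs": 139937785},
    {"corpus": "NLLB", "version": "v1", "alignment_pairs": 139937785},
]

# Sentence counts as displayed (rounded) in the website table above.
website_sentences = {"NLLB": "139.9M", "CCMatrix": "35.1M"}


def humanize(count):
    """Format a raw count in millions with one decimal (assumed website style)."""
    return f"{count / 1e6:.1f}M"


def find_mismatches(entries, website):
    """Return (corpus, api_value, website_value) where the two sources disagree."""
    mismatches = []
    for entry in entries:
        api_value = humanize(entry["alignment_pairs"])
        web_value = website.get(entry["corpus"])
        if web_value is not None and api_value != web_value:
            mismatches.append((entry["corpus"], api_value, web_value))
    return mismatches


def find_shared_counts(entries):
    """Group corpora that report exactly the same alignment_pairs value."""
    by_count = defaultdict(list)
    for entry in entries:
        by_count[entry["alignment_pairs"]].append(entry["corpus"])
    return {n: names for n, names in by_count.items() if len(names) > 1}


print(find_mismatches(api_entries, website_sentences))
print(find_shared_counts(api_entries))
```

With the values above, only CCMatrix disagrees between the two sources, and the API count it reports is identical to the NLLB count.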
Yes, something seems wrong here. It is better to check the redesigned website: https://opus.nlpl.eu/CCMatrix/en&ru/v1/CCMatrix I hope the numbers there are correct.