OPUS

The website and the API report different sentence counts for CCMatrix

Open • gregtatum opened this issue 1 year ago • 1 comment

https://opus.nlpl.eu/opusapi/?source=en&target=ru&preprocessing=moses&version=latest

[
  ...
  {
    "alignment_pairs": 139937785,
    "corpus": "CCMatrix",
    ...
    "version": "v1"
  },
  {
    "alignment_pairs": 139937785,
    "corpus": "NLLB",
    ...
    "version": "v1"
  }
]
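To make the API side of the comparison easy to repeat, here is a minimal sketch that pulls `alignment_pairs` per corpus out of a response shaped like the excerpt above. The function names are hypothetical, and the excerpt only shows a bare list, so the unwrapping of a `"corpora"` key in `fetch_entries` is an assumption about how the live response may be wrapped. The offline example uses only the two entries quoted in this issue.

```python
import json
from urllib.request import urlopen

API_URL = ("https://opus.nlpl.eu/opusapi/"
           "?source=en&target=ru&preprocessing=moses&version=latest")


def alignment_pairs_by_corpus(entries):
    """Map (corpus, version) -> alignment_pairs for a list of API entries."""
    counts = {}
    for entry in entries:
        key = (entry["corpus"], entry["version"])
        counts[key] = entry["alignment_pairs"]
    return counts


def fetch_entries(url=API_URL):
    """Download and decode the API response (needs network access)."""
    with urlopen(url) as response:
        data = json.load(response)
    # Assumption: if the response is an object rather than a bare list,
    # the entries are expected under a "corpora" key.
    if isinstance(data, dict):
        data = data.get("corpora", [])
    return data


# Offline example: only the two entries quoted in the issue.
sample = [
    {"alignment_pairs": 139937785, "corpus": "CCMatrix", "version": "v1"},
    {"alignment_pairs": 139937785, "corpus": "NLLB", "version": "v1"},
]
counts = alignment_pairs_by_corpus(sample)
for (corpus, version), pairs in sorted(counts.items()):
    print(f"{corpus} {version}: {pairs:,}")
```

With the quoted entries, both corpora come out with the same count, which is the suspicious part: CCMatrix appears to carry the NLLB figure.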

Then from: https://opus.nlpl.eu/

corpus        doc's  sent's  en tokens  ru tokens
NLLB v1       1      139.9M  2.7G       2.5G
CCMatrix v1   1      35.1M   581.4M     527.5M
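The mismatch can be checked mechanically by expanding the website's abbreviated counts and comparing them with the API number. The sketch below is only an illustration: the helper names are hypothetical, the suffixes are assumed to be decimal (K = 10^3, M = 10^6, G = 10^9), and the 1% tolerance is an arbitrary allowance for the rounding in the website's figures.

```python
# Assumed decimal suffixes as used in the website's abbreviated counts.
SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse_abbreviated(text):
    """Turn a string such as '139.9M' into a number (139900000.0)."""
    text = text.strip()
    factor = SUFFIXES.get(text[-1].upper())
    if factor is None:
        return float(text)
    return float(text[:-1]) * factor


def matches(api_count, site_text, tolerance=0.01):
    """True if the API count agrees with the rounded website figure."""
    site_count = parse_abbreviated(site_text)
    return abs(api_count - site_count) <= tolerance * site_count


api_pairs = 139937785  # value the API reports for both corpora

print(matches(api_pairs, "139.9M"))  # NLLB sentence count from the website -> True
print(matches(api_pairs, "35.1M"))   # CCMatrix sentence count from the website -> False
```

The API value agrees with the website's NLLB row but not with its CCMatrix row, which suggests that at least one of the two CCMatrix figures is wrong.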

gregtatum, Jan 16 '24 20:01

Yes, something seems wrong here. It is better to look at the redesigned website: https://opus.nlpl.eu/CCMatrix/en&ru/v1/CCMatrix I hope that one is correct.

jorgtied, Jul 17 '24 20:07