
feat(ml): better multilingual search with nllb models

Open mertalev opened this issue 1 year ago • 3 comments

Description

The NLLB models have the highest quality results for multilingual text in our model catalog. However, these models expect the input to explicitly indicate the language of the text. Since we don't currently provide this, their results end up worse than those of other multilingual models.

This PR makes the machine learning service accept a language option that it maps to the corresponding FLORES200 token that NLLB expects. The server is updated to accept this parameter and forward it to the machine learning service, while web and mobile are updated to provide the language based on current user settings. Search still works without it, and the option only has an impact when using an NLLB model.
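As a rough illustration of the mapping described above, here is a minimal sketch rather than the code in this PR. The names `FLORES_CODES`, `to_flores_code` and `prepare_query` are hypothetical, the map is a small subset, and prefixing the token directly to the text is an assumption about how the token reaches the model. The fallback from a regional locale such as `tr-TR` to its base language is also an assumption.

```python
from typing import Optional

# Hypothetical subset of a locale -> FLORES-200 code map; the real map
# in the machine learning service would cover many more languages.
FLORES_CODES = {
    "en": "eng_Latn",
    "tr": "tur_Latn",
    "de": "deu_Latn",
    "fr": "fra_Latn",
    "ja": "jpn_Jpan",
    "zh-CN": "zho_Hans",
}


def to_flores_code(language: Optional[str]) -> Optional[str]:
    """Return the FLORES-200 code for a locale, or None if unknown."""
    if not language:
        return None
    if language in FLORES_CODES:
        return FLORES_CODES[language]
    # Assumed fallback: strip the region, e.g. "tr-TR" -> "tr".
    base = language.split("-")[0]
    return FLORES_CODES.get(base)


def prepare_query(text: str, language: Optional[str] = None, is_nllb: bool = True) -> str:
    """Attach the language token only for NLLB models with a known language."""
    code = to_flores_code(language)
    if not is_nllb or code is None:
        # Search still works without a language; the text is passed as-is.
        return text
    # Assumption: the token is simply prefixed to the query text.
    return f"{code}{text}"
```

With this sketch, `prepare_query("kedi", "tr")` carries the Turkish token, while a call with no language or a non-NLLB model leaves the query unchanged, which matches the behaviour described above.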

How Has This Been Tested?

I tested on web and can confirm the results for Turkish are significantly better. Changing the language in the user settings has a marked effect on the ranking. I haven't tested mobile so I'm not sure if all the language codes there line up with the map in the machine learning service.

mertalev avatar Oct 17 '24 23:10 mertalev

@mertalev I think this is good to go, but it needs rebasing; there's a bunch of conflicts. If you still want it to go in, can you work on that? :slightly_smiling_face:

zackpollard avatar Mar 03 '25 14:03 zackpollard

Sure! I got caught up trying to make it work in mobile, but the language codes for localization were different IIRC. The web portion should be good though after rebasing.

mertalev avatar Mar 03 '25 15:03 mertalev

Tested on web and mobile with a variety of languages and confirmed the language is passed to the ML service, which uses it to pass the right language token to the NLLB model.

mertalev avatar Mar 28 '25 18:03 mertalev

Do the model benchmarks account for this token being passed?

bo0tzz avatar Mar 28 '25 19:03 bo0tzz

The results in the docs were obtained with the language token passed correctly. The models might not even make the list when no token is passed because the input is interpreted as English.

mertalev avatar Mar 28 '25 19:03 mertalev