datashare
Force specific language for NLP
Is your feature request related to a problem? Please describe.
I know for a fact that all my documents are in the same language, but when I run NLP on them, several different languages are auto-detected, which I guess hurts the quality of the results.
Describe the solution you'd like
It would be useful to be able to configure a specific language for NLP instead of auto-detecting it.
Describe alternatives you've considered
A workaround hack I thought of is to replace the NLP model files for the other languages with symlinks to the one we want. The pipeline would still think it is handling a different language, but it would actually load the correct model. I haven't tried this, as I fear it could have other consequences.
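For illustration only, the symlink idea could be sketched roughly as below. This has not been tested against Datashare: the models directory, the `<lang>.model` file naming, and the `link_models_to` helper are all hypothetical, and the real model layout may well differ. Existing files are renamed to `.bak` rather than deleted, so the change can be undone.

```python
import os
from pathlib import Path


def link_models_to(models_dir, wanted, others, suffix=".model"):
    """Point every other language's model file at the wanted language's model.

    Assumption (hypothetical): each language has one file named
    <lang><suffix> inside models_dir.
    """
    models_dir = Path(models_dir)
    target = models_dir / f"{wanted}{suffix}"
    if not target.is_file():
        raise FileNotFoundError(f"no model file for {wanted!r}: {target}")

    linked = []
    for lang in others:
        if lang == wanted:
            continue
        link = models_dir / f"{lang}{suffix}"
        if link.is_symlink():
            # An earlier link (possibly stale): drop it and recreate.
            link.unlink()
        elif link.exists():
            # Keep the original model so the hack can be reverted.
            os.replace(link, link.with_name(link.name + ".bak"))
        # Relative link, so the directory can be moved as a whole.
        link.symlink_to(target.name)
        linked.append(link)
    return linked
```

A call such as `link_models_to("/path/to/models", "en", ["fr", "de"])` would leave `fr.model` and `de.model` as links to `en.model`. Documents would still be tagged with the auto-detected language, so this only changes which model gets loaded, not what the language detector reports.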
Regarding language detection, another issue was opened: https://github.com/ICIJ/datashare/issues/781
Closed in favor of #938