Optimising for loops
Should we move the init_factory() call out of detect(), so that when detect() is applied to dataframes or called in loops it doesn't have to load the 55 language files over and over again? What do you think? @Mimino666
For what it's worth, I hacked around this as follows:
from langdetect import DetectorFactory, PROFILES_DIRECTORY
factory = DetectorFactory()
factory.load_profile(PROFILES_DIRECTORY)
detector = factory.create()
def detect(text, detector=detector):
    detector.text = ""
    detector.append(text)
    return detector.detect()
Obviously not a proper solution but might be useful as a temporary speed-up. Hopefully this can be fixed within langdetect itself.
The detect function in https://github.com/Mimino666/langdetect/issues/77#issuecomment-880545747 needs to be updated to something like:
def detect(text, detector=detector):
    detector.text = ""
    # clear the result cached by the previous call
    detector.langprob = None
    detector.append(text)
    return detector.detect()
because in the get_probabilities method, the previously-generated self.langprob is re-used if it's not None. This means that, if running the detect function on a list of strings from various languages, it will always return the language detected from the first string.
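To make the failure mode concrete, here is a minimal sketch of the caching behaviour described above. ToyDetector is a hypothetical stand-in written for illustration, not langdetect's actual Detector class, and its "detection" rule is deliberately trivial. Only the pattern matters: a result cached in langprob is returned again unless it is reset.

```python
class ToyDetector:
    """Hypothetical stand-in for a detector that caches its result."""

    def __init__(self):
        self.text = ""
        self.langprob = None  # cached result, filled in on first use

    def append(self, text):
        self.text += text

    def get_probabilities(self):
        # Recompute only when nothing is cached; otherwise reuse the old value.
        if self.langprob is None:
            # Trivial placeholder rule, not a real language model.
            self.langprob = "en" if "the" in self.text.split() else "other"
        return self.langprob

    def detect(self):
        return self.get_probabilities()


def detect_stale(text, detector):
    # Resets the text but leaves the cached result in place.
    detector.text = ""
    detector.append(text)
    return detector.detect()


def detect_fresh(text, detector):
    # Resets both the text and the cached result.
    detector.text = ""
    detector.langprob = None
    detector.append(text)
    return detector.detect()


texts = ["the cat sat", "le chat dort"]

shared = ToyDetector()
stale_results = [detect_stale(t, shared) for t in texts]

shared = ToyDetector()
fresh_results = [detect_fresh(t, shared) for t in texts]

print(stale_results)  # the first answer is repeated for every string
print(fresh_results)  # each string is evaluated on its own
```

With the stale version, every string after the first gets the first string's answer, which matches the symptom reported with the shared langdetect detector.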