Optimising for loops

Open vmdhhh opened this issue 5 years ago • 2 comments

Should we move the init_factory() call out of detect(), so that when detect() is applied across a dataframe or called in a loop it doesn't have to load the 55 language files over and over again? What do you think? @Mimino666

vmdhhh avatar Sep 23 '20 19:09 vmdhhh

For what it's worth, I hacked around this as follows:

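from langdetect import DetectorFactory, PROFILES_DIRECTORY

# Load the language profiles once and create a single shared detector.
factory = DetectorFactory()
factory.load_profile(PROFILES_DIRECTORY)
detector = factory.create()

def detect(text, detector=detector):
    # Clear the text left over from the previous call before reusing the detector.
    detector.text = ""
    detector.append(text)
    return detector.detect()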

Obviously not a proper solution, but it might be useful as a temporary speed-up. Hopefully this can be fixed within langdetect itself.

rafguns avatar Jul 15 '21 09:07 rafguns

The detect function in https://github.com/Mimino666/langdetect/issues/77#issuecomment-880545747 needs to be updated to something like:

def detect(text, detector=detector):
    detector.text = ""
    # Also drop the cached probabilities so they are recomputed for the new text.
    detector.langprob = None
    detector.append(text)
    return detector.detect()

because the get_probabilities method re-uses the previously generated self.langprob if it's not None. This means that, when the detect function is run on a list of strings in various languages, it always returns the language detected for the first string.
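
To illustrate the caching behaviour described above without depending on langdetect, here is a minimal sketch. ToyDetector is a hypothetical stand-in, not langdetect's Detector: it only mimics the "re-use langprob if it's not None" logic, and its Cyrillic check is an invented toy rule. The helper names detect_stale and detect_fresh are likewise made up for this example.

```python
class ToyDetector:
    """Hypothetical stand-in that only mimics the result caching (not langdetect code)."""

    def __init__(self):
        self.text = ""
        self.langprob = None  # cached result of the last detection

    def append(self, text):
        self.text += text

    def get_probabilities(self):
        # The cached value is re-used whenever it is not None.
        if self.langprob is None:
            self.langprob = self._classify()
        return self.langprob

    def _classify(self):
        # Toy rule (assumption): any Cyrillic character means "ru", otherwise "en".
        has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in self.text)
        return "ru" if has_cyrillic else "en"

    def detect(self):
        return self.get_probabilities()


def detect_stale(text, detector):
    # Resets only the text, so the cached result from the first call survives.
    detector.text = ""
    detector.append(text)
    return detector.detect()


def detect_fresh(text, detector):
    # Resets both the text and the cached result.
    detector.text = ""
    detector.langprob = None
    detector.append(text)
    return detector.detect()


samples = ["hello world", "привет мир"]

stale_detector = ToyDetector()
stale_results = [detect_stale(s, stale_detector) for s in samples]

fresh_detector = ToyDetector()
fresh_results = [detect_fresh(s, fresh_detector) for s in samples]

print(stale_results)  # ['en', 'en']
print(fresh_results)  # ['en', 'ru']
```

With only the text reset, the second string gets the answer cached from the first; resetting langprob as well gives each string its own result.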

trislee avatar Jul 07 '22 17:07 trislee