
Language detector always loads the small ngrams set

Open Toflar opened this issue 6 months ago • 3 comments

Hey, thanks for the cool library! I'm using it in Loupe.

I noticed a performance issue:

<?php

use Nitotm\Eld\LanguageDetector;

include __DIR__ . '/vendor/autoload.php';

$languageDetector = new LanguageDetector();
$languageDetector->langSubset(['de']);

var_dump($languageDetector->detect('Guten Tag.'));

On the first call, this will write a subset for just de into the vendor (/subsets) directory. This is perfect! However, the LanguageDetector will always load the small.php ngrams, even though I only need the subset for German.

Also, the dataset is already loaded in __construct(), which means you cannot instantiate the LanguageDetector object without the data being loaded into memory. So it is always loaded, even if nobody ever calls ->detect() on the object. This should ideally be converted to lazy evaluation: only load the data once it's used for the first time 😊

Toflar avatar Jun 27 '25 14:06 Toflar

I just saw that, according to the docs, one should pass the subset via the constructor. That doesn't make sense to me; why can't LanguageDetector do that itself? Something like:

$languageDetector = new LanguageDetector(); // This does not do anything at all, so no data is loaded
$languageDetector->langSubset(['de']); // This sets the internal subset choice, still no data is loaded

var_dump($languageDetector->detect('Guten Tag.')); // Now the data is loaded, either the subset or `small.php`

Also, small.php is 3 MB on disk, but it's 64 MB when PHP compiles it into its memory structures. Maybe the storage format is not the best? What's your opinion on this? 😊

Toflar avatar Jun 27 '25 15:06 Toflar

Regarding your initial concern and issue title, as you have already seen, you donโ€™t need to load a database you donโ€™t need. In your specific case, you would do:

$languageDetector = new LanguageDetector('small_1_em');
$languageDetector->detect('Guten Tag.');

The 'em' in small_1_em is just internal encoding; I agree it is not the most intuitive, it is simply the file that ->langSubset(['de']) returns. I could change this behavior and keep internal file names internal. I should also point out that subsets have a "size" too, so each database has two characteristics: size and languages.


As for the database size in memory, I imagine you have seen https://github.com/nitotm/efficient-language-detector?tab=readme-ov-file#databases where I added detailed information. For example, small.php, when cached, should use 21+4 MB of OPcache memory.

Still, yes, it's a concern, especially for the larger databases. I would say it is not a problem of the array structure, which I believe is good, but that PHP arrays are high level. I tried SplFixedArray and some other solutions, with bad results. I am open to suggestions and trying new things, but I prefer not to use extensions.


Regarding loading data on first detect, I don't have a thoughtful answer right now. I was probably optimizing for the most common/normal usage. I will think about it.

You can leave this issue open.

nitotm avatar Jun 28 '25 12:06 nitotm

The 'em' in small_1_em is just internal encoding; I agree it is not the most intuitive, it is simply the file that ->langSubset(['de']) returns. I could change this behavior and keep internal file names internal. I should also point out that subsets have a "size" too, so each database has two characteristics: size and languages.

Yes, I understood, but this means that every user of your library needs to store the result of ->langSubset() elsewhere in order to then pass it to the constructor, which seems not very intuitive 😊

As for the database size in memory, I imagine you have seen https://github.com/nitotm/efficient-language-detector?tab=readme-ov-file#databases where I added detailed information. For example, small.php, when cached, should use 21+4 MB of OPcache memory.

Yes, I did read that. OPcache doesn't help on CLI for worker processes because it's mostly disabled there.

I am open to suggestions and trying new things, but I prefer not to use extensions.

I haven't looked into the format at all, but depending on what exactly needs to be looked up, it might be interesting to see how SQLite performs. It's widely available. It would likely be slower, but depending on what percentage of the entire dataset you actually need in order to detect, it might be more efficient to load the partial data vs. loading the entire dataset into memory. It would need to be investigated, but that's one suggestion 😊
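
Just to illustrate the idea, a minimal sketch of what such a lookup could look like (the file name, schema, and scoring here are invented for illustration - this is not ELD's actual data format):

$pdo = new PDO('sqlite:' . __DIR__ . '/ngrams.sqlite'); // hypothetical database file
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema: ngrams(ngram TEXT, lang TEXT, score INTEGER)
$stmt = $pdo->prepare('SELECT lang, score FROM ngrams WHERE ngram = ?');

$scores = [];
foreach (['gut', 'ute', 'ten'] as $ngram) { // ngrams extracted from the input text
    $stmt->execute([$ngram]);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $scores[$row['lang']] = ($scores[$row['lang']] ?? 0) + $row['score'];
    }
}

arsort($scores); // best-scoring language first; only the rows actually needed were read from disk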

Regarding loading data on first detect, I don't have a thoughtful answer right now. I was probably optimizing for the most common/normal usage. I will think about it.

Sure, thanks!

Toflar avatar Jun 28 '25 12:06 Toflar

Do you want me to create a PR for this? 😊

Toflar avatar Nov 20 '25 07:11 Toflar

Do you want me to create a PR for this? 😊

Hi, sorry, these are my thoughts.

I don't think it is necessarily a must to make ELD lazy-load, but some necessary changes will fit better with lazy loading, like making the settings more intuitive. The initial confusion about how to load the ['de'] database from the start exemplifies, I think, the biggest problem to address.

These are the changes I would like to make, which would be included in version 4, as they change the usage.

  • Set the initial languages with a method, as you said, and remove arguments from the instance creation.
    • I think I would like to use languages() but I would allow langSubset() as a deprecated method for v4.
     		$eld->languages(['en','es','de']);
    
  • Then the other settings also with their own method
    		$eld->databaseSize('big');
    		$eld->outputFormat('ISO639_1');
    

Then detect() would just check with a simple IF whether the data is loaded. I would also make an array with all the ISO language codes and text names joined, so searching and setting languages will be more efficient than the current makeSubset(), since outputFormat might not be set yet.
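
A rough sketch of the shape I have in mind (simplified, not the final code; loadData() stands in for the internal loader):

class LanguageDetector
{
    private bool $dataLoaded = false;
    private array $languages = [];
    private string $databaseSize = 'small';

    public function languages(array $languages): void
    {
        $this->languages = $languages; // only stored; no data is loaded here
    }

    public function databaseSize(string $size): void
    {
        $this->databaseSize = $size;
    }

    public function detect(string $text)
    {
        if (!$this->dataLoaded) { // the "simple IF": load once, on first use
            $this->loadData($this->databaseSize, $this->languages);
            $this->dataLoaded = true;
        }
        // ... actual detection
    }
}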

What do you think?

nitotm avatar Nov 21 '25 18:11 nitotm

I don't think there's a need for a new major version and huge adjustments like that. For me, langSubset() was just fine. All I would do is probably something like this:

class LanguageDetector extends LanguageData
{
    private bool $isInitialized = false;

    public function __construct(private ?string $databaseFile = null, private ?string $outputFormat = null)
    {
        // Settings are only stored via promoted properties; no data is loaded here
    }

    private function initialize(): void
    {
        if ($this->isInitialized) {
            return;
        }

        $this->isInitialized = true;
        $this->loadData($this->databaseFile, $this->outputFormat);
    }

    public function detect(string $text): LanguageResult
    {
        $this->initialize(); // lazy: the data is loaded on the first detect() call only
        ...
    }
}

I didn't look into it in more detail though.

Toflar avatar Nov 25 '25 09:11 Toflar

Ok, I understand you just want lazy load for v3 as a minor update. I can do that.

For v4 I think I still want to hide the internal database names for subsets, like 'small_1_em', from the user's view, so subsets would always be created with the languages array.

nitotm avatar Nov 25 '25 10:11 nitotm

@Toflar Hi, it turned out to be more complicated than expected.

Problem

The issue occurs when langSubset() is called before data is loaded. Questions that arose:

  • Which languages are available?
  • What format should be returned?
  • What is the database size?
  • What should langSubset() return when no subset exists?

I have the code mostly done; I had to change it extensively to address these issues.

Current behaviour (3.0.0)

  • langSubset() currently returns the languages found in the actually loaded database.

Changes I made

  • If data is not initialized, langSubset() now returns all potentially available languages (selected) and then attempts to retrieve the languages for the full database at dataLoad(), ( when calling detec() ) even if a subset file name was passed.

Remaining issue

If the full database was deleted (for whatever reason, or if custom versions are used) and only a subset exists, langSubset() may have returned languages that are not actually available. I fixed several minor problems, but this case would require more complexity to handle perfectly.

Proposed approach

Example usage:

$languageDetector = new LanguageDetector();
$languageDetector->langSubset(['de']);

Behaviour I propose:

  • If the ['de'] subset is already stored: do not do the load, return the correct subset data (since it is already created), and defer a lazy load to detect().
  • If the ['de'] subset is not stored: create it by loading the full data and return trustworthy subset data.

This would make the functionality more robust and solve the scenario above. The implementation has become a bit messy: I currently use a pending-subset variable in loadData() that calls langSubset() and repeats some processing (at least for the first run), to avoid making the initial working version overly complex.
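
In pseudo-PHP, the decision would look roughly like this (helper names are placeholders, not the real methods):

public function langSubset(array $languages)
{
    $file = $this->subsetFileName($languages); // placeholder helper

    if (file_exists($file)) {
        // Subset already stored: trust it, remember it, and let detect() lazy-load it
        $this->pendingSubsetFile = $file;
        return $this->subsetInfo($file); // placeholder helper
    }

    // Subset not stored: one-time load of the full database to build it,
    // so the returned subset data is trustworthy
    $this->loadData();
    return $this->makeSubset($languages);
}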

Question

Do you think it is acceptable for langSubset() to perform a one-time database load on its first call when the requested subset is not stored?

nitotm avatar Nov 25 '25 22:11 nitotm

Hi @nitotm First and foremost: thank you for looking into this issue and trying to make the library better! ❤️

Do you think it is acceptable for langSubset() to perform a one-time database load on its first call when the requested subset is not stored?

I think this is an improvement, of course! But maybe you misunderstood me: I'm not asking to fix this in version 3 at all costs! If you would like to solve it properly and differently with API changes, then of course I can upgrade to a version 4!

Also, I wanted to let you know that I recently stumbled upon https://github.com/wikimedia/wikimedia-textcat. It is not mentioned in your performance comparisons, but I noticed that it uses almost no memory. So maybe there are ideas to pull from that library 😉

Toflar avatar Nov 26 '25 09:11 Toflar

Well then I will try to call loadData() if langSubset() does not find a cached subset.

Regarding wikimedia-textcat, I hope it uses almost no memory, because if I calculated correctly, the single-words_v3.txt test that takes ELD 0.33 seconds (for the full 52k detections of that file) seems to take textcat around 9 hours on my setup, roughly 0.6 seconds per detection (52,000 × 0.6 s ≈ 31,000 s ≈ 9 h), which would be about 100,000× slower.

ELD stores all ngrams in memory; that's why it is so fast (among other things). Since PHP is a high-level language, in-memory arrays are bloated, and I don't think there is much more I can do about that. What I've been considering for a while is making a C version and using it from PHP through an extension or something, but as a future project in a different repository.

nitotm avatar Nov 26 '25 12:11 nitotm

Well then I will try to call loadData() if langSubset() does not find a cached subset.

Sounds great!

Regarding wikimedia-textcat, I hope it uses almost no memory, because if I calculated correctly, the single-words_v3.txt test that takes ELD 0.33 seconds (for the full 52k detections of that file) seems to take textcat around 9 hours on my setup, roughly 0.6 seconds per detection (52,000 × 0.6 s ≈ 31,000 s ≈ 9 h), which would be about 100,000× slower.

I didn't do any in-depth performance tests, but sometimes spending a bit more time on detection in exchange for less memory can be beneficial, especially if the text is very short. In Loupe, I'm using it to detect the language of a search query, which is usually one or two words and rarely more than five. So for me it would be better to, e.g., spend 10 ms more on detection but use only 50 MB of RAM rather than 1 GB in the web request 😉

Toflar avatar Nov 26 '25 15:11 Toflar

I understand; you are using the 'small' database, right? If I'm correct, that is 76 MB when compiled (or <30 MB when cached by OPcache). Are you comfortable with that?

nitotm avatar Nov 26 '25 16:11 nitotm

Yes, all the others are not an option. Unfortunately, OPcache is disabled pretty often on CLI and a lot of stuff happens on CLI (background jobs/workers) so we cannot really rely on that either.

Toflar avatar Nov 27 '25 17:11 Toflar

Yes, all the others are not an option. Unfortunately, OPcache is disabled pretty often on CLI and a lot of stuff happens on CLI (background jobs/workers) so we cannot really rely on that either.

Hi, I implemented the lazy load. langSubset() got a bit complex with all the possibilities; I still need to add some tests for edge cases.

I've been working on a new config option called mode that introduces three new low-memory database 'modes'; they use a new blob database and change how the database data is loaded and accessed. The current modes are: array (original), string (parsed & OPcache'd), bytes (raw string, not cached), and disk (streamed from disk).

I wanted to ask you about the names, for example whether you think I should rename disk to stream and bytes to raw, but I have the feeling that disk and bytes are more self-explanatory about what they actually do.

Take a look at main and tell me what you think.

nitotm avatar Dec 15 '25 16:12 nitotm

Wow, thank you! I don't understand why there are so many modes just yet. I wanted to give dev-main a try, but it seems your packagist.org autosync does not work anymore: https://packagist.org/packages/nitotm/efficient-language-detector#dev-main (latest sync is from 2025-07-09 18:59 UTC). Let me know when that's fixed. I'll run some tests on my side then, and I think that will shed some light on the modes and also help me with the naming 😊

Toflar avatar Dec 15 '25 17:12 Toflar

Wow, thank you! I don't understand why there are so many modes just yet. I wanted to give dev-main a try, but it seems your packagist.org autosync does not work anymore: https://packagist.org/packages/nitotm/efficient-language-detector#dev-main (latest sync is from 2025-07-09 18:59 UTC). Let me know when that's fixed. I'll run some tests on my side then, and I think that will shed some light on the modes and also help me with the naming 😊

I manually synced; it should work now.

Modes

  • array: The original mode since v1, the database is a very fast array and consumes a lot of memory.

  • string and bytes: The database is accessed as a string in memory, occupying much less memory.

    • string is compiled and therefore can be cached with OPcache; the first load is slower, subsequent loads are very fast (if cached by OPcache).

    • bytes cannot be OPcache cached; it is loaded directly as a raw file, its load time is always the same, reasonable fast.

  • disk is a mode that occupies only 0.5MB of memory, basically what ELD uses. The database is read from disk and is not loaded into memory.

In conclusion, this means that by using for example bytes mode, you can now use the extralarge database with a memory peak consumption of 52MB, compared to array-small, where the peak consumption is 77MB; or with array-extralarge peak is 2083MB.

Furthermore, bytes and string are only about ~2x slower than array, so they are still quite fast.
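
Usage would be along these lines (a hypothetical sketch; the real option name and setters on main may differ):

$eld = new Nitotm\Eld\LanguageDetector();
$eld->mode('bytes');              // hypothetical setter for the new mode option
$eld->databaseSize('extralarge'); // hypothetical setter; ~52 MB peak instead of 2083 MB with array
var_dump($eld->detect('Guten Tag.'));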

nitotm avatar Dec 15 '25 18:12 nitotm

I see, I understand the 4 modes now. So I would say having bytes and string is a huge win already, because they are still reasonably fast but use a lot less memory 🥳

Here are some statistics from my project. Current version (small and array): Indexed in 61.11 s using 147.50 MiB

And here are some tests of dev-main@c4ae6e6:

database     mode    time in s  RAM in MiB
Small        ARRAY   61.48      147.50
Extra-large  DISK    78.51      104.50
Small        DISK    76.71      104.50
Small        STRING  67.44      108.69
Extra-large  STRING  72.35      167.38
Small        BYTES   70.46      108.69

So in other words, using small and string I can get very competitive speed while using a lot less RAM. Also note that these were all run on CLI without OPcache, so it's just memory_get_peak_usage(true).
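
(For reference, a simplified sketch of how I collect these numbers - the real run does the full Loupe indexing:)

$start = microtime(true);

// ... run the indexing workload, which calls $languageDetector->detect() per document ...

printf(
    "time: %.2f s, RAM: %.2f MiB\n",
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1048576 // true = real size of memory allocated from the system
);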

There are a few questions that are raised now:

  • Some combinations seem to make no sense then:
    • A small database with DISK is almost as fast (slow) as the extra-large one. Why would you want to use the small database in that case? 😊
    • A small database with STRING performs about the same as BYTES. Would you ever not want to have this in OPcache? It depends on how big the database is, right - but for small that's probably not much?
  • Would you say there's significantly better language detection between small and medium? Then we might consider shipping medium pre-compiled as well, because one could now get significantly better detection for a similar amount of RAM.

Toflar avatar Dec 15 '25 18:12 Toflar

A small database with DISK is almost as fast (slow) as the extra-large one. Why would you want to use the small database in that case? 😊

Internally, string, bytes & disk actually use the same database file; the only difference is load/access, so any DB available for bytes is also available for disk.

A small database with STRING performs about the same as BYTES. Would you ever not want to have this in OPcache? It depends on how big the database is, right - but for small that's probably not much?

As stated, they are the same files; it is not that I don't want bytes to be cached. It is an option, to get the benefit of a faster first/uncached load and a lower memory peak.

It is true that for small the differences are small... but for extralarge, if you don't have OPcache, the bytes peak is 53 MB with a 0.04 s load, vs. the string peak of 80 MB with a 0.25 s load (uncached); so in this case you might want to use bytes.

It is just that these combinations are available with no extra effort, not that they are all necessary.

Would you say there's significantly better language detection between small and medium? Then we might consider shipping medium pre-compiled as well, because one could now get significantly better detection for a similar amount of RAM.

I would ship all of them pre-compiled, but I am concerned that people will complain about the install size. I personally like the large size; it is ~24 MB in file size & memory use; medium is ~8 MB.

nitotm avatar Dec 15 '25 19:12 nitotm

I personally like the large size; it is ~24 MB in file size & memory use; medium is ~8 MB.

I would like to run some tests with medium and large to compare.

I would ship all of them pre-compiled, but I am concerned that people will complain about the install size.

dev-main is currently 55 MB as ZIP downloaded by Composer. I see your point. How long does it take to compile the files? Is it conceivable to do that on-the-fly or would that take too long?

Toflar avatar Dec 15 '25 19:12 Toflar

I would like to run some tests with medium and large to compare.

So do you want me to upload them to dev-main? You can try to build them yourself, but you will need a memory_limit that matches the array mode requirements:

$eldBuilder = new Nitotm\Eld\BlobDataBuilder('large'); // the database size to build
$eldBuilder->buildDatabase();

Or via CLI: php demo_blob_builder.php -d large

dev-main is currently 55 MB as ZIP downloaded by Composer. I see your point. How long does it take to compile the files? Is it conceivable to do that on-the-fly or would that take too long?

extralarge takes about 1 minute, small a few seconds. The problem is the RAM usage, as it first loads the array mode and then converts it to the others. So any system that cannot run array-large cannot build the "low memory" database version for large.

nitotm avatar Dec 15 '25 19:12 nitotm

Or via CLI: php demo_blob_builder.php -d large

I did that for medium and large now.

database  mode    time in s  RAM in MiB
Large     STRING  67.91      127.25
Medium    STRING  76.75      110.77

What's interesting is that medium takes longer than large. That seems unexpected to me.

But compared to the current v3 I could be using large and STRING now, which would make it about 9% slower, save 15% RAM, and give me large instead of small accuracy. That's a win in any case! Although small is maybe still good enough for me - I guess that depends 😊

Toflar avatar Dec 15 '25 20:12 Toflar

But the small STRING takes just as long as large. So it's really just about RAM there.

Toflar avatar Dec 15 '25 20:12 Toflar

I would ship all pre-compiled, but I am just concerned that people will complain about install size.

One solution to that problem could be separate Composer packages, so you install only the one(s) you need.

Toflar avatar Dec 15 '25 20:12 Toflar

What's interesting is that medium takes longer than large. That seems unexpected to me.

Well, while the mode is what determines most of the speed, the size also has an influence. It is true that medium is the slowest size and large the fastest, which is another reason why I like it. It is mostly due to the combination of ngram size and ngram count.

One solution to that problem could be separate Composer packages, so you install only the one(s) you need.

That was already on my mind, but I'm not sure what would be the best way to do the separation.

nitotm avatar Dec 15 '25 20:12 nitotm

Well, while the mode is what determines most of the speed, the size also has an influence. It is true that medium is the slowest size and large the fastest, which is another reason why I like it. It is mostly due to the combination of ngram size and ngram count.

I see. Maybe that should also be documented in the README? (forgive me if it already is and I missed it).

That was already on my mind, but I'm not sure what would be the best way to do the separation.

I guess small can be kept in the library, and it would just be about requiring nitotm/efficient-language-detector-database-medium, nitotm/efficient-language-detector-database-large, or nitotm/efficient-language-detector-database-extra-large?

Toflar avatar Dec 16 '25 08:12 Toflar

I see. Maybe that should also be documented in the README? (forgive me if it already is and I missed it).

The differences in execution time between sizes should be small; at least they are on my bench. You can see the differences in the benchmarks in the README: for example, the ELD test is 1.8 s for array-medium and 1.5 s for array-large (extralarge is very similar), but going to string mode the time goes up to ~3.9 s (3.8-4.3 s depending on the database size), and disk is around 27 s.

The fact that on your bench string-medium is as fast as disk-small seems odd to me. Well, the differences between sizes that you get, up to 15%, are expected; the differences between modes, not really. But I guess it depends on what your benchmark does; it seems to do more than just ELD's detect(): the text to be detected, the disk speed, etc.

nitotm avatar Dec 16 '25 10:12 nitotm

Yeah, doesn't really matter. I will have to see how well small detects vs. the other databases anyway. That would be the much more valuable information - but no idea if there's a way to express this "detection correctness probability".

Toflar avatar Dec 16 '25 13:12 Toflar

Yeah, doesn't really matter. I will have to see how well small detects vs. the other databases anyway. That would be the much more valuable information - but no idea if there's a way to express this "detection correctness probability".

In the README there are the accuracy benchmarks, if that is what you meant, with a tiny explanation of each one; the tested files are at dev-main, in /benchmark:

                 Tatoeba-50  ELD test  Sentences  Word pairs  Single words
Nito-ELD-S       96.8%       99.7%     99.2%      90.9%       75.1%
Nito-ELD-M       97.9%       99.7%     99.3%      93.0%       80.1%
Nito-ELD-L       98.3%       99.8%     99.4%      94.8%       83.5%
Nito-ELD-XL      98.5%       99.8%     99.5%      95.4%       85.1%
Lingua           96.1%       99.2%     98.7%      93.4%       80.7%
fasttext-subset  94.1%       98.0%     97.9%      83.1%       67.8%
fasttext-all     --          97.4%     97.6%      81.5%       65.7%
CLD2 *           92.1% *     98.1%     97.4%      85.6%       70.7%
Lingua-low       89.3%       97.3%     96.3%      84.1%       68.6%
patrickschur     84.1%       94.8%     93.6%      71.9%       57.1%
franc            76.9%       93.8%     92.3%      67.0%       53.8%

You can also make your own benchmarks with already-classified texts: use /benchmark/bench.php and edit $files to point at your own files. It expects line-separated texts, with the language and the text separated by a tab (ISO639-1\tPhrase):

en	unique internet
en	shock bumpers
en	obligations membership
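
For example, a minimal accuracy check over such a file could look like this (a simplified sketch; bench.php does more than this):

$correct = 0;
$lines = file('my-benchmark.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($lines as $line) {
    [$expected, $text] = explode("\t", $line, 2); // ISO639-1 code, tab, phrase
    if ($languageDetector->detect($text)->language === $expected) {
        $correct++;
    }
}

printf("Accuracy: %.1f%%\n", 100 * $correct / count($lines));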

nitotm avatar Dec 16 '25 15:12 nitotm

In my case, I'll probably stick to small anyway then 😊

Toflar avatar Dec 17 '25 08:12 Toflar