Language detector always loads the small ngrams set
Hey, thanks for the cool library! I'm using it in Loupe.
I noticed a performance issue:
```php
<?php

use Nitotm\Eld\LanguageDetector;

include __DIR__ . '/vendor/autoload.php';

$languageDetector = new LanguageDetector();
$languageDetector->langSubset(['de']);
var_dump($languageDetector->detect('Guten Tag.'));
```
On the first call, this will write a subset for just `de` into the vendor directory (`/subsets`). This is perfect!
However, the `LanguageDetector` will always load the `small.php` ngrams, even though I only need the subset for German.
Also, the dataset is already loaded in `__construct()`, which means you cannot instantiate the `LanguageDetector` object without the data being loaded into memory. So it's always loaded even if nobody ever calls `->detect()` on the object. This should ideally be converted to lazy evaluation, i.e. only load the data once it's used for the first time.
I just saw that, according to the docs, one should pass the subset via the constructor. That doesn't make sense to me; why can't `LanguageDetector` do that itself?
Something like:
```php
$languageDetector = new LanguageDetector();        // This does not do anything at all, so no data is loaded
$languageDetector->langSubset(['de']);             // This sets the internal subset choice, still no data is loaded
var_dump($languageDetector->detect('Guten Tag.')); // Now the data is loaded, either the subset or `small.php`
```
Also, `small.php` is 3 MB on disk, but it's 64 MB once PHP compiles it into its in-memory structures. Maybe the storage format is not the best? What's your opinion on this?
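For what it's worth, this is roughly how that figure can be reproduced; the exact path of the ngrams file inside `vendor/` is an assumption and may differ per version:

```php
<?php
// Rough sketch to compare on-disk size vs. compiled in-memory size.
// The path below is an assumption; adjust it to wherever small.php lives in your install.
$file = __DIR__ . '/vendor/nitotm/efficient-language-detector/src/ngrams/small.php';

$before = memory_get_usage(true);
$data   = include $file; // PHP materializes the ngram arrays in memory here
$after  = memory_get_usage(true);

printf("on disk:   %.1f MB\n", filesize($file) / 1e6);
printf("in memory: %.1f MB\n", ($after - $before) / 1e6);
```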
Regarding your initial concern and the issue title: as you have already seen, you don't have to load a database you don't need. In your specific case, you would do:
```php
$languageDetector = new LanguageDetector('small_1_em');
$languageDetector->detect('Guten Tag.');
```
The 'em' in `small_1_em` is just internal encoding; I agree it's not the most intuitive, it's simply the file name that `->langSubset(['de'])` returns. I could change this behavior and keep the internal file names internal.
I should also point out that subsets have a 'size' too, so each database has two characteristics: size and languages.
As for the database size in memory, I imagine you have seen https://github.com/nitotm/efficient-language-detector?tab=readme-ov-file#databases, where I added detailed information. For example, `small.php`, when cached, should use 21+4 MB of OPcache memory.
Still, yes, it's a concern, especially for the larger databases. I would say it is not a problem with the array structure, which I believe is good, but with PHP arrays being high level. I tried `SplFixedArray` and some other solutions, with bad results. I am open to suggestions and to trying new things, but I prefer not to use extensions.
Regarding loading data on first detect, I don't have a thoughtful answer right now. I was probably optimizing for the most common/normal usage. I will think about it.
You can leave this issue open.
> The 'em' in `small_1_em` is just internal encoding; I agree it's not the most intuitive, it's simply the file name that `->langSubset(['de'])` returns. I could change this behavior and keep the internal file names internal. I should also point out that subsets have a 'size' too, so each database has two characteristics: size and languages.
Yes, I understood, but this means that every user of your library needs to store the result of `->langSubset()` elsewhere in order to then pass it to the constructor, which seems not very intuitive.
> As for the database size in memory, I imagine you have seen https://github.com/nitotm/efficient-language-detector?tab=readme-ov-file#databases, where I added detailed information. For example, `small.php`, when cached, should use 21+4 MB of OPcache memory.
Yes, I did read that. OPcache doesn't help on CLI for worker processes because it's mostly disabled there.
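(Quick way to check whether a given CLI process would even benefit from OPcache:)

```php
// True only if the OPcache extension is loaded *and* enabled for CLI.
var_dump(function_exists('opcache_get_status') && (bool) ini_get('opcache.enable_cli'));
```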
> I am open to suggestions and to trying new things, but I prefer not to use extensions.
I haven't looked into the format at all, but depending on what exactly needs to be looked up, it might be interesting to see how SQLite performs. It's widely available. It would likely be slower, but depending on what percentage of the entire dataset you actually need for a detection, loading only partial data might be more efficient than loading the whole dataset into memory. It would need to be investigated, but that's one suggestion.
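Purely to illustrate the idea (the table layout and the scoring below are made up for this sketch and have nothing to do with ELD's actual ngram format):

```php
<?php
// Hypothetical SQLite-backed lookup: table ngrams(ngram TEXT, lang TEXT, score REAL).
// Illustration of the idea only, not ELD's real data model.
$pdo = new PDO('sqlite:' . __DIR__ . '/ngrams.sqlite');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('SELECT lang, score FROM ngrams WHERE ngram = ?');

$scores = [];
foreach (['gut', 'ute', 'ten', 'tag'] as $ngram) { // ngrams extracted from the input text
    $stmt->execute([$ngram]);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $scores[$row['lang']] = ($scores[$row['lang']] ?? 0) + $row['score'];
    }
}

arsort($scores);                    // highest accumulated score first
var_dump(array_key_first($scores)); // e.g. "de"
```

Only the ngrams of the input text would ever be read from disk, so memory stays tiny; the open question is whether the per-query latency is acceptable.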
> Regarding loading data on first detect, I don't have a thoughtful answer right now. I was probably optimizing for the most common/normal usage. I will think about it.
Sure, thanks!
Do you want me to create a PR for this?
> Do you want me to create a PR for this?
Hi, sorry, these are my thoughts.
I don't think lazy loading is necessarily a must for ELD, but some changes I want to make will fit better with lazy loading, like making the settings more intuitive.
The initial confusion about how to load the `['de']` database from the start exemplifies, I think, the biggest problem to address.
These are the changes I would like to do, which would be included in version 4, as they change the usage.
- Set the initial languages with a method, as you said, and remove the arguments from the instance creation.
- I think I would like to use `languages()`, but I would allow `langSubset()` as a deprecated method for v4: `$eld->languages(['en','es','de']);`
- Then the other settings would also get their own methods: `$eld->databaseSize('big'); $eld->outputFormat('ISO639_1');`
Then `detect()` just checks whether the data is loaded, with a simple `if`.
I would also make an array with all ISO language codes and text names joined, so searching and setting languages will be more efficient than the current `makeSubset()`, since `outputFormat` might not be set yet.
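Roughly, the usage I have in mind would look like this (just a sketch of the proposal above; nothing is implemented yet):

```php
$eld = new Nitotm\Eld\LanguageDetector(); // nothing is loaded here
$eld->languages(['en', 'es', 'de']);      // only stores the selection
$eld->databaseSize('big');                // ditto
$eld->outputFormat('ISO639_1');           // ditto

$result = $eld->detect('Buenos días');    // the first detect() loads the data
```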
What do you think?
I don't think there's a need for a new major version and huge adjustments like that. For me, `langSubset()` was just fine. All I would do is probably something like this:
```php
class LanguageDetector extends LanguageData
{
    private bool $isInitialized = false;

    public function __construct(private ?string $databaseFile = null, private ?string $outputFormat = null)
    {
    }

    private function initialize(): void
    {
        if ($this->isInitialized) {
            return;
        }

        $this->isInitialized = true;
        $this->loadData($this->databaseFile, $this->outputFormat);
    }

    public function detect(string $text): LanguageResult
    {
        $this->initialize();
        // ...
    }
}
```
I didn't look into it in more detail though.
Ok, I understand you just want lazy load for v3 as a minor update. I can do that.
For v4 I think I still want to hide the internal database names for subsets, like 'small_1_em', from the user's view, so subsets would always be done with the languages array.
@Toflar. Hi, it turned out to be more complicated than expected.
Problem
The issue occurs when `langSubset()` is called before data is loaded. Questions that arose:
- Which languages are available?
- What format should be returned?
- What is the database size?
- What should `langSubset()` return when no subset exists?
I have the code mostly done; I had to change it extensively to address these issues.
Current behaviour (3.0.0)
`langSubset()` currently returns the languages found in the actually loaded database.
Changes I made
- If data is not initialized, `langSubset()` now returns all potentially available (selected) languages, and then attempts to retrieve the languages of the full database at `loadData()` (when calling `detect()`), even if a subset file name was passed.
Remaining issue
If the full database was deleted (for whatever reason, or if it is a custom version) and only a subset exists, `langSubset()` may return languages that are not actually available. I fixed several minor problems, but this case would require more complexity to handle perfectly.
Proposed approach
Example usage:
```php
$languageDetector = new LanguageDetector();
$languageDetector->langSubset(['de']);
```
Behaviour I propose:
- If the `['de']` subset is already stored: do not load anything, return the correct subset data (since it is already created), and defer a lazy load to `detect()`.
- If the `['de']` subset is not stored: create it by loading the full data, and return trustworthy subset data.
This would make the functionality more robust and solve the scenario above. The implementation has become a bit messy: I currently use a pending-subset variable in `loadData()` that calls `langSubset()` and repeats some processing (at least for the first run), to avoid making it overly complex for the initial working version.
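In rough pseudocode, the decision would be something like this (the helper and property names here are placeholders, not the actual implementation):

```php
public function langSubset(array $languages)
{
    $file = $this->subsetFileNameFor($languages); // placeholder helper

    if (is_file($file)) {
        // Subset already stored: return its data as-is and only remember it;
        // the actual load is deferred until detect() is called.
        $this->pendingSubsetFile = $file;    // placeholder property
        return $this->readSubsetInfo($file); // placeholder helper
    }

    // Subset not stored yet: one-time load of the full database,
    // then build, save and return trustworthy subset data.
    $this->loadData();
    return $this->makeSubset($languages);
}
```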
Question
Do you think it is acceptable for `langSubset()` to perform a one-time database load on its first call when the requested subset is not stored?
Hi @nitotm! First and foremost: thank you for looking into this issue and trying to make the library better! ❤️
> Do you think it is acceptable for `langSubset()` to perform a one-time database load on its first call when the requested subset is not stored?
I think this is an improvement, of course! But maybe you misunderstood me: I'm not asking to fix this in version 3 at all costs! If you would like to solve it properly and differently with API changes, then of course I can upgrade to a version 4!
Also, I wanted to let you know that I recently stumbled over https://github.com/wikimedia/wikimedia-textcat. It is not mentioned in your performance comparisons, but I noticed that it uses almost no memory. So maybe there are ideas to pull from that library.
Well then, I will try to call `loadData()` if `langSubset()` does not find a cached subset.
Regarding wikimedia-textcat, I should hope it uses almost no memory, because if I calculated correctly, the single-words_v3.txt test that takes ELD 0.33 seconds (for the full 52k detections in that file) seems to take textcat about 9 hours, roughly 100,000 times slower, at around 0.6 sec per detection on my setup.
ELD stores all ngrams in memory; that's why it is so fast (among other things). Since PHP is a high-level language, arrays in memory are bloated, and I don't think there is much more I can do about it. What I've been thinking about for a while is making a C version and using it from PHP via an extension or something, but as a future project in a different repository.
> Well then, I will try to call `loadData()` if `langSubset()` does not find a cached subset.
Sounds great!
> Regarding wikimedia-textcat, I should hope it uses almost no memory, because if I calculated correctly, the single-words_v3.txt test that takes ELD 0.33 seconds (for the full 52k detections in that file) seems to take textcat about 9 hours, roughly 100,000 times slower, at around 0.6 sec per detection on my setup.
I didn't do any in-depth performance tests, but sometimes spending a bit more time on detection in exchange for less memory can be beneficial, especially if the text is very short. In Loupe, I'm using it to detect the language of a search query, which is usually one or two words and rarely more than five. So for me it would be better to, e.g., spend 10 ms more on detection but use only 50 MB of RAM rather than 1 GB in the web request.
I understand; you are using the 'small' database, right? If I'm correct, that is 76 MB when compiled (or <30 MB when cached by OPcache). Are you comfortable with that?
Yes, all the others are not an option. Unfortunately, OPcache is disabled pretty often on CLI and a lot of stuff happens on CLI (background jobs/workers) so we cannot really rely on that either.
> Yes, all the others are not an option. Unfortunately, OPcache is disabled pretty often on CLI and a lot of stuff happens on CLI (background jobs/workers) so we cannot really rely on that either.
Hi, I did the lazy load; `langSubset()` got a bit complex with all the possibilities, and I need to add some tests for edge cases.
I've also been working on a new config option called `mode` that introduces three new low-memory database 'modes'; they use a new blob database and change how the database data is loaded and accessed.
Current modes are: `array` (original), `string` (parsed & OPcache'd), `bytes` (raw string, not cached), and `disk` (streamed from disk).
I wanted to ask you about the names, for example whether you think I should call `disk` "stream" and `bytes` "raw", but I have the feeling that `disk` and `bytes` are more self-explanatory about what they actually do.
Take a look at main and tell me what you think.
Wow, thank you!
I don't understand why there are so many modes just yet. I wanted to give dev-main a try, but it seems like your packagist.org autosync does not work anymore: https://packagist.org/packages/nitotm/efficient-language-detector#dev-main (latest sync is from 2025-07-09 18:59 UTC).
Let me know when that's fixed. I'll run some tests on my side then, and I think that will shed some light on the modes and also help me with the naming.
> Wow, thank you!
>
> I don't understand why there are so many modes just yet. I wanted to give dev-main a try, but it seems like your packagist.org autosync does not work anymore: https://packagist.org/packages/nitotm/efficient-language-detector#dev-main (latest sync is from 2025-07-09 18:59 UTC). Let me know when that's fixed. I'll run some tests on my side then, and I think that will shed some light on the modes and also help me with the naming.
I manually synced, should work.
Modes
- `array`: The original mode since v1; the database is a very fast array and it consumes a lot of memory.
- `string` and `bytes`: The database is accessed as a string in memory, occupying much less memory.
  - `string` is compiled and can therefore be cached with OPcache; the first load is slower, subsequent loads are very fast (if cached by OPcache).
  - `bytes` cannot be cached by OPcache; it is loaded directly as a raw file, so its load time is always the same and reasonably fast.
- `disk` is a mode that occupies only ~0.5 MB of memory, basically just what ELD itself uses. The database is read from disk and is not loaded into memory.
In conclusion, this means that by using, for example, `bytes` mode, you can now use the extralarge database with a peak memory consumption of 52 MB, compared to array-small, where the peak consumption is 77 MB, or array-extralarge, where the peak is 2083 MB.
Furthermore, `bytes` and `string` are only about 2x slower than `array`, so they are still quite fast.
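For example (the exact option names on main may still change; this is just how I picture the configuration):

```php
$eld = new Nitotm\Eld\LanguageDetector();
$eld->databaseSize('extralarge'); // size and mode are independent choices
$eld->mode('bytes');              // 'array' | 'string' | 'bytes' | 'disk'

var_dump($eld->detect('Guten Tag.'));
```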
I see, I do understand the 4 types now. So I would say, having `bytes` and `string` is a huge win already, because they are still reasonably fast but use a lot less memory!
Here are some statistics from my project. Current version (small and array):
```
Indexed in 61.11 s using 147.50 MiB
```
And here are some tests of dev-main@c4ae6e6:
| database | mode | time in s | RAM in MiB |
|---|---|---|---|
| Small | ARRAY | 61.48 | 147.50 |
| Extra-large | DISK | 78.51 | 104.50 |
| Small | DISK | 76.71 | 104.50 |
| Small | STRING | 67.44 | 108.69 |
| Extra-large | STRING | 72.35 | 167.38 |
| Small | BYTES | 70.46 | 108.69 |
So in other words, using small and STRING I can get very competitive speed while using a lot less RAM. Also note that these all ran on CLI without OPcache, so it's just `memory_get_peak_usage(true)`.
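(The numbers above come from a full Loupe indexing run; a minimal standalone measurement of the same kind would be something like this, with a made-up `queries.txt` as input:)

```php
<?php
use Nitotm\Eld\LanguageDetector;

include __DIR__ . '/vendor/autoload.php';

// Standalone sketch only; the actual numbers come from Loupe's indexing, not this script.
$eld     = new LanguageDetector();                                // configure size/mode as needed
$queries = file(__DIR__ . '/queries.txt', FILE_IGNORE_NEW_LINES); // hypothetical sample inputs

$start = microtime(true);
foreach ($queries as $query) {
    $eld->detect($query);
}

printf("time: %.2f s\n", microtime(true) - $start);
printf("RAM:  %.2f MiB\n", memory_get_peak_usage(true) / 1048576);
```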
There are a few questions that are raised now:
- Some combinations seem to make no sense then:
  - A `small` database with DISK is almost as fast (slow) as the extra-large one. Why would you want to use the small database in that case?
  - A `small` database with STRING is about the same as BYTES. Would you ever not want to have this in OPcache? It depends on how big the database is, right? But for `small` that's probably not much?
- Would you say that there's significantly better language detection between `small` and `medium`? Because then we might consider shipping `medium` pre-compiled as well, since one could now get significantly better detection for a similar amount of RAM.
> A `small` database with DISK is almost as fast (slow) as the extra-large one. Why would you want to use the small database in that case?
Internally, `string`, `bytes` and `disk` actually use the same database file; the only difference is how it is loaded and accessed, so any DB available for `bytes` is also available for `disk`.
> A `small` database with STRING is about the same as BYTES. Would you ever not want to have this in OPcache? It depends on how big the database is, right? But for `small` that's probably not much?
As stated, they are the same files; it is not that I don't want `bytes` to be cached, it is an option to get the benefit of a faster first/uncached load and a lower memory peak.
It is true that for `small` the differences are small... but for extralarge, if you don't have OPcache, the `bytes` peak is 53 MB with a 0.04 sec load, vs. a `string` peak of 80 MB with a 0.25 sec (uncached) load; so in this case you might want to use `bytes`.
It is just that these combinations are available with no extra effort, not that they are all necessary.
> Would you say that there's significantly better language detection between `small` and `medium`? Because then we might consider shipping `medium` pre-compiled as well, since one could now get significantly better detection for a similar amount of RAM.
I would ship all pre-compiled, but I am just concerned that people will complain about install size.
I personally like the `large` size; it is roughly 24 MB in file size and memory use; `medium` is about 8 MB.
> I personally like the `large` size; it is roughly 24 MB in file size and memory use; `medium` is about 8 MB.
I would like to run some tests with medium and large to compare.
> I would ship all pre-compiled, but I am just concerned that people will complain about install size.
dev-main is currently 55 MB as ZIP downloaded by Composer. I see your point. How long does it take to compile the files? Is it conceivable to do that on-the-fly or would that take too long?
> I would like to run some tests with medium and large to compare.
So do you want me to upload them to dev-main?
You can try to build them yourself, but you will need a `memory_limit` that matches the `array` mode requirements:
```php
$eldBuilder = new Nitotm\Eld\BlobDataBuilder('large');
$eldBuilder->buildDatabase();
```
Or via CLI: `php demo_blob_builder.php -d large`
> dev-main is currently 55 MB as ZIP downloaded by Composer. I see your point. How long does it take to compile the files? Is it conceivable to do that on-the-fly or would that take too long?
extralarge takes about 1 minute, small a few seconds; the problem is the RAM usage, as it first loads `array` mode and then converts it to the others. So any system that cannot run array-large cannot build the "low memory" database version for large.
> Or via CLI: `php demo_blob_builder.php -d large`
I did that for medium and large now.
| database | mode | time in s | RAM in MiB |
|---|---|---|---|
| Large | STRING | 67.91 | 127.25 |
| Medium | STRING | 76.75 | 110.77 |
What's interesting is that medium takes longer than large. That seems unexpected to me.
But compared to the current v3, I could be using `large` and STRING now, which would make it about 9% slower, save 15% RAM, and give me `large` instead of `small` accuracy. That's a win in any case!
Although `small` is maybe still good enough for me; I guess that depends.
But the small STRING takes just as long as large. So it's really just about RAM there.
> I would ship all pre-compiled, but I am just concerned that people will complain about install size.
One solution to that problem could be separate composer packages. So you install the one(s) you need.
> What's interesting is that medium takes longer than large. That seems unexpected to me.
Well, while the mode is what determines most of the speed, the size also has an influence. It is true that `medium` is the slowest size and `large` is the fastest, which is also a reason why I like it. It is mostly due to the combination of ngram size and ngram count.
> One solution to that problem could be separate composer packages. So you install the one(s) you need.
That was already on my mind, but I'm not sure what would be the best way to do the separation.
> Well, while the mode is what determines most of the speed, the size also has an influence. It is true that `medium` is the slowest size and `large` is the fastest, which is also a reason why I like it. It is mostly due to the combination of ngram size and ngram count.
I see. Maybe that should also be documented in the README? (forgive me if it already is and I missed it).
> That was already on my mind, but I'm not sure what would be the best way to do the separation.
I guess `small` can be kept in the library, and it would just be about requiring `nitotm/efficient-language-detector-database-medium`, `nitotm/efficient-language-detector-database-large` or `nitotm/efficient-language-detector-database-extra-large`?
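So consuming projects would then just declare what they need, something like this (package names and version constraints are purely hypothetical, taken from the suggestion above):

```json
{
    "require": {
        "nitotm/efficient-language-detector": "^4.0",
        "nitotm/efficient-language-detector-database-large": "^4.0"
    }
}
```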
> I see. Maybe that should also be documented in the README? (forgive me if it already is and I missed it).
The differences in execution time between sizes should be small; at least they are on my bench.
You can see the differences in the benchmarks in the README: for example, the ELD test is 1.8" for array-medium and 1.5" for array-large (extralarge is very similar), but going to `string` mode the time goes up to about 3.9" (3.8"-4.3" depending on model size), and `disk` is around 27".
The fact that on your bench string-medium is as fast as disk-small seems odd to me. The differences between sizes that you get, up to 15%, are expected; the differences between modes, not really.
But I guess it depends on what your benchmark does; it seems to do more than just ELD's `detect()`, plus the text being detected, the disk speed, etc.
Yeah, doesn't really matter. I will have to see how well `small` detects vs. the other databases anyway. That would be the much more valuable information, but no idea if there's a way to express this "detection correctness probability".
> Yeah, doesn't really matter. I will have to see how well `small` detects vs. the other databases anyway. That would be the much more valuable information, but no idea if there's a way to express this "detection correctness probability".
The README has the accuracy benchmarks, if that is what you meant, with a tiny explanation of each one; the tested files are on dev-main, in /benchmark:
| Detector | Tatoeba-50 | ELD test | Sentences | Word pairs | Single words |
|---|---|---|---|---|---|
| Nito-ELD-S | 96.8% | 99.7% | 99.2% | 90.9% | 75.1% |
| Nito-ELD-M | 97.9% | 99.7% | 99.3% | 93.0% | 80.1% |
| Nito-ELD-L | 98.3% | 99.8% | 99.4% | 94.8% | 83.5% |
| Nito-ELD-XL | 98.5% | 99.8% | 99.5% | 95.4% | 85.1% |
| Lingua | 96.1% | 99.2% | 98.7% | 93.4% | 80.7% |
| fasttext-subset | 94.1% | 98.0% | 97.9% | 83.1% | 67.8% |
| fasttext-all | -- | 97.4% | 97.6% | 81.5% | 65.7% |
| CLD2 * | 92.1% * | 98.1% | 97.4% | 85.6% | 70.7% |
| Lingua-low | 89.3% | 97.3% | 96.3% | 84.1% | 68.6% |
| patrickschur | 84.1% | 94.8% | 93.6% | 71.9% | 57.1% |
| franc | 76.9% | 93.8% | 92.3% | 67.0% | 53.8% |
You can also make your own benchmarks with already classified languages: use /benchmark/bench.php and edit `$files` to try your own files. It expects line-separated texts, with tab-separated language and text (`ISO639-1\tPhrase`), for example:
```
en	unique internet
en	shock bumpers
en	obligations membership
```
In my case, I'll probably stick to `small` anyway then.