Storage usage is 8x larger than other full-text-search software

Open cloudrac3r opened this issue 1 year ago • 7 comments

Describe the bug

Indexing movies.json using the process described in the tutorial creates a 217 MB folder in data.ms/indexes, which is 6-8x larger than indexing the same data with SQLite3 FTS5 (26 MB database file) or Solr 9.4.0 (36 MB index folder).

My measured 217 MB is comparable to the 224 MB usage stated in the documentation, so I assume I'm not making any mistakes with indexing.

I can also consistently reproduce this with other data sets. SQLite3 and Solr are within 2x of each other, but Meilisearch's storage is always around 6-10x larger. (All the data sets I've tried have less than 100,000 documents each.)

I am wondering whether this larger storage usage is known behaviour or whether this is a bug.

To Reproduce - Meilisearch

I am using as many default settings as possible.

  1. Run the meilisearch executable
  2. curl -X POST http://localhost:7700/indexes/movies/documents -H 'Content-Type: application/json' --data-binary @movies.json
  3. Wait for http://localhost:7700/tasks to succeed
  4. Run a test query: type doctor into the mini-dashboard
  5. The data.ms/indexes/e8a7bc8f-d416-42e6-88cf-248f55e29d0f folder is 217 MB.
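
For anyone repeating the measurement, here is a minimal Python sketch of how a folder size such as the 217 MB above could be computed and compared. The helper names are hypothetical, and the size is the sum of file lengths, which may differ from what the operating system reports as allocated on disk.

```python
import os

def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def size_ratio(larger_mb: float, smaller_mb: float) -> float:
    """How many times larger one index is than another."""
    return larger_mb / smaller_mb

if __name__ == "__main__":
    # Figures reported in this issue, in MB.
    print(round(size_ratio(217, 36), 1))  # Meilisearch vs Solr: 6.0
    print(round(size_ratio(217, 26), 1))  # Meilisearch vs SQLite FTS5: 8.3
```

Calling `dir_size_bytes("data.ms/indexes")` from the directory where Meilisearch was started would give the figure in bytes.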

To Reproduce - Solr

I am using as many default settings as possible.

  1. Start Solr: .\solr.cmd start -p 8983
  2. Create the movies index: .\solr.cmd create -c movies
  3. curl -X POST 'http://localhost:8983/solr/movies/update?commit=true' -H 'Content-Type: application/json' --data-binary @movies.json
  4. Run a test query: http://localhost:8983/solr/#/movies/query?q=doctor&defType=edismax&indent=true&qf=title+overview
  5. The solr-9.4.0/server/solr/movies folder is 36 MB, 6x smaller than Meilisearch storage usage

To Reproduce - SQLite FTS5

I have chosen to use the porter unicode61 tokenizer and WAL mode. Other settings are defaults. Note that this tokenizer does not remove stop words.

  1. Enable WAL mode: sqlite3 movies_fts5.db 'pragma journal_mode=wal;'
  2. Create the movies table: sqlite3 movies_fts5.db 'create virtual table movies using fts5 (title, overview, genres, poster, release_date, tokenize=\'porter unicode61\');'
  3. jq '.[] | "insert into movies (title, genres, overview, poster, release_date) values (\'" + (.title | gsub("\'"; "\'\'")) + "\', \'" + (.genres | join(",")) + "\', \'" + (.overview | gsub("\'"; "\'\'")) + "\', \'" + .poster + "\', \'" + (.release_date | tostring) + "\');"' movies.json -r | sqlite3 movies_fts5.db
  4. sqlite3 movies_fts5.db
  5. Check all movies were added: sqlite> select count(*) from movies; = 31944
  6. Run a test query: sqlite> select * from movies where movies match 'doctor' order by rank limit 5;
  7. The movies_fts5.db file is 26 MB, 8x smaller than Meilisearch storage usage.
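
The jq pipeline in step 3 builds SQL by string concatenation. As an alternative sketch (not what was used for the measurements above), the same table could be filled from Python's sqlite3 module with parameterized inserts, assuming the SQLite library bundled with Python was compiled with FTS5. The function name and the sample document are made up for illustration.

```python
import sqlite3

def build_fts5_index(conn, movies):
    """Create the FTS5 table from the steps above and insert `movies` (a list of dicts)."""
    conn.execute(
        "create virtual table movies using fts5 "
        "(title, overview, genres, poster, release_date, tokenize='porter unicode61')"
    )
    rows = [
        (
            m["title"],
            m["overview"],
            ",".join(m["genres"]),
            m["poster"],
            str(m["release_date"]),
        )
        for m in movies
    ]
    # Parameter binding avoids the manual quote escaping done with gsub in jq.
    conn.executemany(
        "insert into movies (title, overview, genres, poster, release_date) "
        "values (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Invented sample document; a real run would use json.load on movies.json.
    sample = [{
        "title": "Example Film",
        "overview": "A country doctor returns home.",
        "genres": ["Drama"],
        "poster": "https://example.com/poster.jpg",
        "release_date": 0,
    }]
    build_fts5_index(conn, sample)
    query = "select title from movies where movies match 'doctor' order by rank limit 5"
    print(conn.execute(query).fetchall())
```

Connecting to a file such as movies_fts5.db instead of ":memory:" would produce a database whose size can be compared as in step 7.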

Expected behavior

I expected one of the following:

  • Meilisearch to use a comparable (within 2x) amount of storage to other full-text-search software
  • Or a note in the documentation saying why the storage use is much greater
  • Or for the documentation to explain how to optimise storage use on the storage page.

Versions:

  • Meilisearch: v1.5.0-rc.2 on Windows 11, downloaded from GitHub releases
  • Solr: 9.4.0
  • SQLite3: 3.31.1-4ubuntu0.5 2020-01-27 19:55:54 from http://archive.ubuntu.com/ubuntu focal-updates/main with whatever compilation options they gave me

cloudrac3r avatar Nov 14 '23 21:11 cloudrac3r

Hello @cloudrac3r, I understand that it is counterintuitive that data stored in a database takes more space on disk than the same data serialized as JSON, but Meilisearch is not really a database; it is a search engine. Let me explain. Sure, you can store your data in Meilisearch, and because it's based on a transactional datastore, it shouldn't lose any data. But Meilisearch is more than that: it provides a fast search engine that aims to respond to search queries in less than 50ms.

Unfortunately, this fast search response has some drawbacks, and disk usage is one of them. Because Meilisearch has to respond as fast as possible, the engine needs shortcuts to search efficiently in the data. The best example I can give is the inverted index: this data structure is useless in terms of storage, but it allows the search engine to find all the documents containing a specific word in a few microseconds. When you search "Hello", Meilisearch never iterates over each document in the database to find all the occurrences of "hello", because that would not be affordable on tons of documents. Meilisearch contains several kinds of data structures like the inverted index to speed up search, but they have to be stored on disk, which explains the size of the data.ms folder.
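
To make the inverted index idea concrete, here is a toy Python sketch. It only illustrates the general data structure and is not Meilisearch's actual implementation; the tokenizer, function names, and sample documents are simplified assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

# Invented sample documents.
docs = {1: "Hello world", 2: "hello again", 3: "Goodbye world"}
index = build_inverted_index(docs)

# A lookup is a single dictionary access instead of a scan over every document.
print(sorted(index["hello"]))  # [1, 2]
```

The index repeats information that is already present in the documents, which is the storage cost described above.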

That being said, we know that Meilisearch uses too much disk space, and we are trying to find a better approach by choosing wisely which data structures we keep in Meilisearch. Today's release, 1.5.0, reduces disk usage by 10 to 25%. However, disk usage will always be higher than the initial data size.

What could you do to reduce the size of the database?

There is one main setting that impacts the database size: searchableAttributes. The more fields you have in the searchable attributes list, the more the database size will grow. Setting only the fields you need to search in could reduce it.
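
As a hedged illustration, the Python sketch below builds the HTTP request that would restrict the searchable attributes of the movies index. The endpoint path and the PUT method are assumptions based on the v1.x settings routes and should be checked against the API reference for your version; the helper name is hypothetical, and keeping only title and overview is just an example.

```python
import json
import urllib.request

def searchable_attributes_request(base_url, index_uid, attributes):
    """Build (but do not send) a request that sets the searchable attributes."""
    url = f"{base_url}/indexes/{index_uid}/settings/searchable-attributes"
    return urllib.request.Request(
        url,
        data=json.dumps(attributes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = searchable_attributes_request(
    "http://localhost:7700", "movies", ["title", "overview"]
)
# To send it to a running instance (this triggers a reindexing task):
# urllib.request.urlopen(req)
```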

Thank you for your report, and don't hesitate to ask more questions if you have any!

See you!

ManyTheFish avatar Nov 20 '23 13:11 ManyTheFish

@ManyTheFish Thanks for the detailed response! I understand that Meilisearch uses data structures like the inverted index that are optimised for full-text-search, so I get why the index size is larger than the initial data size.

However, my main question was why the disk space is much larger than other programs which also feature efficient full-text-search like Solr (which uses Lucene internally) and SQLite FTS5 (the full-text-search module for SQLite which adds more data structures, it's not just a plain RDBMS data store). My tests were also performed on version 1.5.0 (rc.2) and with the same searchable attributes, so I should already have the reduced index size you mentioned. I figured the disk usage would be comparable because different full-text-search programs would be using similar data structures - and in fact this is the case with SQLite3 FTS5 and Solr, which do have comparable disk usage.

I did discover that changing the ranking rules to ("words" "typo" "attribute" "sort" "exactness") (i.e. no proximity) and adding the Gensim stop words list has made the Meilisearch index 3x smaller, which is a great advantage. I think this would be good to mention in the storage documentation! However, even with this change, the reduced index files are still about twice as large as SQLite3 FTS5's full-text-search index even when FTS5 is storing proximity data.
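
For reference, a sketch in Python of what that settings change could look like as a single request. The PATCH on the index settings route and the rankingRules and stopWords keys are assumptions based on the v1.x settings API; the helper name is hypothetical, and the stop word list here is a short placeholder, not the Gensim list.

```python
import json
import urllib.request

# Ranking rules with "proximity" removed, as described above.
RANKING_RULES_NO_PROXIMITY = ["words", "typo", "attribute", "sort", "exactness"]

# Placeholder only; substitute the full stop word list you want to use.
EXAMPLE_STOP_WORDS = ["the", "a", "an", "of", "and"]

def settings_request(base_url, index_uid, ranking_rules, stop_words):
    """Build (but do not send) a request updating ranking rules and stop words."""
    body = {"rankingRules": ranking_rules, "stopWords": stop_words}
    return urllib.request.Request(
        f"{base_url}/indexes/{index_uid}/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )

req = settings_request(
    "http://localhost:7700", "movies", RANKING_RULES_NO_PROXIMITY, EXAMPLE_STOP_WORDS
)
# To send it to a running instance:
# urllib.request.urlopen(req)
```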

Do you have any comments on why Meilisearch's disk usage is much larger than other full-text-search programs?

Regardless, I'm excited to see if any database size optimisations come in future releases!

cloudrac3r avatar Nov 21 '23 03:11 cloudrac3r

Yes, 😄 Meilisearch takes more space on disk because it goes further than the other databases regarding pre-computed data structures. On top of the inverted index, we have other pre-computed data, like prefix-based data structures that allow you to search "a" and find all the documents containing a word starting with "a" in a few milliseconds. Another data structure links words with each attribute, giving the corresponding documents. This way, you can retrieve the documents matching your query in the title before the documents matching in the description, thanks to the attribute ranking rule, or search in the title only. These pre-computed data structures are not strictly necessary for a search engine, because the information can be found in the documents themselves, but to have an average response time under 50ms we had to pre-compute all of this. I only gave some examples, but we store up to 25 different kinds of data depending on your usage of Meilisearch: search, filtering, sorting, vector, geo-search...
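
A toy Python sketch of the two extra structures mentioned above, a prefix index and a word-per-attribute index. This is only an illustration of the idea under simplified assumptions (whitespace tokenization, prefixes capped at two characters, invented documents and names), not how Meilisearch stores them.

```python
from collections import defaultdict

def build_precomputed_indexes(docs, max_prefix_len=2):
    """Return (prefix -> doc ids, (word, attribute) -> doc ids) for `docs`."""
    prefix_index = defaultdict(set)
    word_attr_index = defaultdict(set)
    for doc_id, fields in docs.items():
        for attribute, text in fields.items():
            for word in text.lower().split():
                word_attr_index[(word, attribute)].add(doc_id)
                # Store every prefix of the word up to max_prefix_len characters.
                for n in range(1, min(max_prefix_len, len(word)) + 1):
                    prefix_index[word[:n]].add(doc_id)
    return prefix_index, word_attr_index

# Invented sample documents.
docs = {
    1: {"title": "Alien Harbor", "overview": "A crew meets an alien"},
    2: {"title": "Night Road", "overview": "An alien fleet arrives"},
}
prefix_index, word_attr_index = build_precomputed_indexes(docs)

print(sorted(prefix_index["a"]))                      # [1, 2]
print(sorted(word_attr_index[("alien", "title")]))     # [1]
print(sorted(word_attr_index[("alien", "overview")]))  # [1, 2]
```

Each structure answers one kind of query with a single lookup, and each one is an additional copy of information that has to live on disk.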

That being said, let's be honest and optimistic: Meilisearch is not fully optimal in terms of disk usage. We obviously store too many things compared to the real needs of our users, and we were not really watching disk usage when we created Meilisearch because we were focusing on search time, which doesn't help with Meilisearch's current disk greed. But this means that we have room for improvement, and future versions of Meilisearch should be more and more efficient. ☺️ We reduced the disk usage in version 1.5 and we already plan to reduce it further in version 1.6, but I can't say it will be as efficient as the other databases you mentioned, because Meilisearch aims to respond in around 50ms for any search query, whereas others like Solr or Elastic find it acceptable to respond in 1sec or more. 😄

ManyTheFish avatar Nov 21 '23 09:11 ManyTheFish

Thanks for the explanation! 😊

The only other thing I wanted to mention is I think it could be good to mention in the docs that searchable attributes, proximity rule, and stopwords can be used to reduce disk space. The storage documentation would be a good place since that's the first place I looked.

cloudrac3r avatar Nov 21 '23 19:11 cloudrac3r

Just started testing Meilisearch recently and, comparing it to other solutions, I wonder whether, apart from the disk space used, the other solutions mentioned, like FTS5 in SQLite or Solr, would handle well those not-yet-implemented features when fuzzy-searching.

For example, loading the movies.json sample base and then searching fuzzy variations of moths and flame, I'd expect to always find the "Skyline" film (id=42684), which contains the description "...like moths to a flame where...".

  • [moths flame] => Finds
  • [moths flam] => Finds
  • [moths klame] => Finds, but caution: because of moths and not because of klame, according to the test interface highlights.
  • [moth flame] => Does not find (even though the start of the word matches the prefix)
  • [koths flame] => Does not find.

@cloudrac3r given you already have that setup done, could you confirm if Solr or SQLite+FTS5 would find the film Skyline (id=42684) for any of those 5 cases?

PS: My particular use case is this: I have a large database of customers. Some users re-submit contact forms with typos in their names or even with similar emails. For example, one day they submit

{
    "name": "Bea González",
    "email": "[email protected]",
    "phone": "+34 677 11 22 33"
}

another day

{
    "name": "Beatriz Gonzalez",
    "email": "[email protected]",
    "phone": "677112233"
}

This is clearly the same person.

I want to automate a system so our agents can rapidly identify the second submission as "most probably being the same person as the first submission".

As the DB is large, I don't want to have all the search data in RAM like in Typesense. It seems Typesense handles fuzzy text search better than Meilisearch, but I like the Meilisearch approach of being disk-based and not RAM-based.

Finding that Meilisearch does not do fuzzy search well, I wonder if Solr or SQLite FTS5 could fit better. The first step would be to test whether any alternative solution like Solr or FTS5 makes [moth flame] or [koths flame] find the Skyline movie. If you, @cloudrac3r, still have Solr and FTS5 loaded with data, maybe you can confirm whether those also "search better" apart from having a more compact disk footprint.

Thanks to all.

xmontero avatar Dec 02 '23 00:12 xmontero

@xmontero Results for the five queries:

  • [moths flame]: SQLite FTS5 & Solr match Skyline
  • [moths flam]: neither (because stemmer doesn't map flame to flam)
  • [moths klame]: neither (because I'm using all default settings and didn't configure Solr's spellcheck)
  • [moth flame]: SQLite FTS5 only (because Solr default schema doesn't include spellcheck or Porter's stemmer)
  • [koths flame]: neither (again, no spellcheck configured)

In my own experiments on my own full-text-search data sets I found that Meilisearch did give noticeably better result ranking and matching for all kinds of queries with its default settings. I expect Solr is configurable enough that you could make result quality similar if you learned Solr.

Hope this answer helps you, though I won't answer any follow-up questions about this because this topic is about disk usage and not about typo correction.

cloudrac3r avatar Dec 03 '23 20:12 cloudrac3r

The answer is good enough! It helps a lot. Thanks @cloudrac3r 👍👍👍

xmontero avatar Dec 04 '23 00:12 xmontero

Closing this issue because the initial question was answered, see you

ManyTheFish avatar Jan 18 '24 09:01 ManyTheFish