
Weird behavior when exporting data

Paradoxu opened this issue 2 years ago • 0 comments

I have a collection in MongoDB that contains about 300k translation documents; each document has the following structure:

{
   _id: ObjectId;
   i18n: {
      en: string;
      es: string;
      pt: string;
   }
}

Since I need to make queries against this data and MongoDB doesn't support partial text search, I decided to use flexsearch to index these documents. Loading all of them every time the server restarts is a heavy operation, so I'm trying to export the indexes; on restart I would then import these indexes instead. The problem is that I can't get the export to work properly. This is what I tried:

const flex = new Document({
    preset: 'memory',
    cache: 1000,
    optimize: true,
    worker: false,
    tokenize: 'forward',
    document: {
        id: '_id',
        store: false,
        index: [
            {
                field: 'i18n:en',
                tokenize: 'forward',
                language: 'en'
            },
            {
                field: 'i18n:es',
                tokenize: 'forward',
                language: 'es'
            },
            {
                field: 'i18n:pt',
                tokenize: 'forward',
                language: 'pt'
            }
        ]
    }
});

const docs = await fs.readFile('my_collection.json', { encoding: 'utf-8' }).then(data => JSON.parse(data));

// Use int as index, as recommended by the documentation
for (let i = 0; i < docs.length; i++) {
    const doc = docs[i];
    await flex.addAsync(i, doc);
}

flex.export((id, doc) => {
    try {
        spin.info(`Exporting ${id}`);
        fs.writeFileSync(`${id}.json`,  doc ?? '');
    } catch (e) {
        console.error(e);
        console.error(`Error exporting ${id}`);
        throw e;
    }
});
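
I also wondered whether the file writes might be racing with the end of the process, so one variant I'm considering is writing the parts asynchronously and awaiting all of the writes before exiting. This is only a sketch: the export/ folder and the fsp alias are placeholders, and I haven't verified whether export() returns a promise or calls the handler before returning.

import * as fsp from 'node:fs/promises';

// Variant of the export above: write each part asynchronously, keep the
// promises, and wait for all of them before the process is allowed to exit.
// The export/ directory is just a placeholder.
await fsp.mkdir('export', { recursive: true });

const pending = [];

const maybePromise = flex.export((key, data) => {
    pending.push(fsp.writeFile(`export/${key}.json`, data ?? ''));
});

// Not sure whether export() returns a promise or fires the callback
// synchronously, so wait on both (awaiting a non-promise is a no-op).
await maybePromise;
await Promise.all(pending);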

This export only creates two files, reg.json and _id.cfg.json, which seems weird. When I limit the number of documents to 10k instead of 300k, I get a lot more files, which makes more sense.

How could I fix this? Should I make multiple exports over small chunks of the indexed documents? If so, will importing those chunks work correctly, or will my data get overridden if I import the same document twice under different indexes?
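
For reference, the restart path I have in mind would read every exported part back and feed it to import() under the same key it was exported with, roughly like this (again assuming one file per exported key in a dedicated export/ folder, matching the sketch above):

import * as fsp from 'node:fs/promises';

// On server restart: rebuild the index from the exported parts instead of
// re-adding all 300k documents.
const parts = await fsp.readdir('export');

for (const file of parts) {
    if (!file.endsWith('.json')) continue;
    const key = file.slice(0, -'.json'.length); // same key used during export
    const data = await fsp.readFile(`export/${file}`, 'utf-8');
    await flex.import(key, data);
}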

Technical info:

  • Node 16
  • OS: Windows / Ubuntu

Paradoxu • Mar 16 '22