
Duplicate tokens in highlighted results

Open burakcan opened this issue 8 months ago • 9 comments

Hi, I have an issue with highlighted tokens in search results. Example:

const index = new Document({
    document: {
      store: true,
      index: [
        {
          field: "title",
          tokenize: "forward",
          encoder: Charset.LatinBalance,
        },
        {
          field: "content",
          tokenize: "forward",
          encoder: Charset.LatinBalance,
        },
      ],
    },
  });

const search = () => index.searchAsync("h", {
        suggest: true,
        enrich: true,
        highlight: `<mark>$1</mark>`,
      })

index.add({
      id: 1,
      title: "Tips For Decorating Easter Eggs",
      content: `Published:

April 14, 2025

From bold color choices to intricate patterns, there are many ways to make your springtime holiday decorations stand out from the rest. The Onion shares tips for dyeing Easter eggs.`,
    });

Now with this setup, if I search for "h", the marked result I get looks like this:

Publis<mark>ublis</mark>ed: April 14, 2025 From bold color c<mark></mark>oices to intricate patterns, t<mark></mark>ere are many ways

As you can see, "Published" became "Publisublised". There are also empty marks where the actual "h" letters should be. Am I doing something wrong here, or is this a bug in the highlighting?
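The two symptoms described above (changed text and empty marks) can be detected mechanically. The sketch below does not use FlexSearch; `checkHighlight` and its helpers are hypothetical names introduced only for illustration. It assumes that removing the wrapper tags from a correct highlight gives back the stored text, apart from whitespace differences.

```javascript
// Remove the opening and closing highlight tags, keeping the text between them.
function stripMarks(highlighted, tag = "mark") {
  const pattern = new RegExp(`</?${tag}>`, "g");
  return highlighted.replace(pattern, "");
}

// Collapse runs of whitespace so line breaks in the stored text do not matter.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Return a list of problems found in a highlighted string (empty list = looks fine).
function checkHighlight(original, highlighted, tag = "mark") {
  const problems = [];
  if (highlighted.includes(`<${tag}></${tag}>`)) {
    problems.push("empty highlight");
  }
  const plain = normalizeWhitespace(stripMarks(highlighted, tag));
  if (plain !== normalizeWhitespace(original)) {
    problems.push("text changed");
  }
  return problems;
}

// Samples taken from the report above, shortened.
const published = "Published: April 14, 2025";
const brokenPublished = "Publis<mark>ublis</mark>ed: April 14, 2025";
const brokenChoices = "bold color c<mark></mark>oices";
const expectedPublished = "Publis<mark>h</mark>ed: April 14, 2025";

console.log(checkHighlight(published, brokenPublished));
console.log(checkHighlight("bold color choices", brokenChoices));
console.log(checkHighlight(published, expectedPublished));
```

A check like this could run in a unit test against a handful of stored documents after upgrading the library.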

burakcan avatar Apr 16 '25 15:04 burakcan

Thanks for the report. There was a typo in a substring() call. The fix is included in the latest version.

Here are some minor improvements to your example:

const index = new Document({
    // use shared encoder for same style of content
    encoder: Charset.LatinBalance,
    document: {
        store: true,
        index: [
            {
                field: "title",
                tokenize: "forward"
            },
            {
                field: "content",
                tokenize: "forward"
            }
        ],
    },
});

// use the async non-blocking model when modifying the index in a loop (bulk)
await index.addAsync({
    id: 1,
    title: "Tips For Decorating Easter Eggs",
    content: `Published: April 14, 2025 From bold color choices to intricate patterns, there are many ways to make your springtime holiday decorations stand out from the rest. The Onion shares tips for dyeing Easter eggs.`
});

// as long as you don't execute search more than 100 times per second, you can just use the sync version:
const result = index.search("h", {
    suggest: true,
    enrich: true,
    highlight: `<mark>$1</mark>`,
});

ts-thomas avatar Apr 17 '25 08:04 ts-thomas

Hi, thank you for the response @ts-thomas. Updating to the latest version solved this specific issue. I have another problem, but I'm not sure if I should create a separate issue for it.

In my setup, I use IndexedDB. When I try to add a bunch of documents to the index, it adds all of them to the :map table. However, it skips a random number of documents when adding to the reg table. So when searching, it can find results but can't match them with a stored document. This happens for both sync and async adds. I currently work around this with a mutex mechanism, but I'm not sure if there's a more elegant way. Here's my example usage:

import { Editor } from "@tiptap/core";
import Collaboration from "@tiptap/extension-collaboration";
import { Mutex } from "es-toolkit";
import { Charset, Document, IndexedDB } from "flexsearch";
import * as Y from "yjs";
import { baseExtensions } from "@/components/noteEditor/baseExtensions";
import { IndexeddbPersistence as YIndexeddbPersistence } from "./db/yIndexedDb";
import { ClientNote } from "@/types/Entities";
import {
  EnrichedDocumentSearchResults,
  NoteSearchResult,
} from "@/types/FlexSearch";

export class SearchService extends EventTarget {
  static notesIndexMutex = new Mutex();

  static notesIndex = new Document({
    encoder: Charset.LatinBalance,
    document: {
      store: true,
      index: [
        {
          field: "title",
          tokenize: "forward",
        },
        {
          field: "content",
          tokenize: "forward",
        },
      ],
    },
    context: true,
  });

  static notesDb = new IndexedDB({
    name: "search-notes",
  });

  static async init() {
    await SearchService.notesDb.mount(SearchService.notesIndex);
  }

  static async getNoteContentFromYDoc(noteId: string) {
    const yDoc = new Y.Doc();
    const persistence = new YIndexeddbPersistence(noteId, yDoc);

    await persistence.whenSynced;

    const editor = new Editor({
      extensions: [
        ...baseExtensions,
        Collaboration.configure({ document: yDoc }),
      ],
    });

    const fullText = editor.getText();
    const [title, ...content] = fullText.split("\n");

    return {
      title,
      content: content.join("\n"),
    };
  }

  static async addNote({
    id,
    title,
    content,
  }: {
    id: string;
    title: string;
    content: string;
  }) {
    await this.notesIndexMutex.acquire();
    console.log("SearchService: Adding note to search index", id, title);

    try {
      await this.notesIndex.add({
        id,
        title,
        content,
      });
      await this.notesIndex.commit();
    } catch (error) {
      console.error("SearchService: Error adding note to search index", error);
    } finally {
      this.notesIndexMutex.release();
    }
  }

  static async addNoteFromYDoc(note: ClientNote) {
    const { title, content } = await this.getNoteContentFromYDoc(note.id);

    await this.addNote({
      id: note.id,
      title,
      content,
    });
  }

  static async removeNote(id: string) {
    await this.notesIndexMutex.acquire();
    console.log("SearchService: Removing note from search index", id);

    try {
      await this.notesIndex.remove(id);
      await this.notesIndex.commit();
    } catch (error) {
      console.error(
        "SearchService: Error removing note from search index",
        error
      );
    } finally {
      this.notesIndexMutex.release();
    }
  }

  static async updateNote({
    id,
    title,
    content,
  }: {
    id: string;
    title: string;
    content: string;
  }) {
    await this.notesIndexMutex.acquire();
    console.log("SearchService: Updating note in search index", id, title);

    try {
      await this.notesIndex.update(id, {
        title,
        content,
      });
      await this.notesIndex.commit();
    } catch (error) {
      console.error(
        "SearchService: Error updating note in search index",
        error
      );
    } finally {
      this.notesIndexMutex.release();
    }
  }

  static async updateNoteFromYDoc(noteId: string) {
    const { title, content } = await this.getNoteContentFromYDoc(noteId);

    await this.updateNote({
      id: noteId,
      title,
      content,
    });
  }

  static shortenMarkedText(text: string, charsBefore = 10, charsAfter = 100) {
    const matches = text.match(/<mark>(.*?)<\/mark>/g);

    if (!matches) return text;

    const match = matches[0];

    const start = text.indexOf(match);
    const end = start + match.length;

    const startIndex = Math.max(0, start - charsBefore);
    const endIndex = Math.min(text.length, end + charsAfter);
    const shortened = text.slice(startIndex, endIndex);

    let withEllipsis = shortened;

    if (start > charsBefore) {
      withEllipsis = `...${withEllipsis}`;
    }

    if (end < text.length - charsAfter) {
      withEllipsis = `${withEllipsis}...`;
    }

    return withEllipsis;
  }

  static async searchNotes({
    query,
  }: {
    query: string;
  }): Promise<NoteSearchResult[]> {
    let searchResults: EnrichedDocumentSearchResults;

    try {
      searchResults = (await this.notesIndex.search({
        query,
        suggest: true,
        enrich: true,
        highlight: `<mark>$1</mark>`,
      })) as EnrichedDocumentSearchResults;
    } catch (error) {
      console.error("SearchService: Error searching notes", error);
      return [];
    }

    // Merge results by ID
    const mergedResultsMap = new Map<string, NoteSearchResult>();

    searchResults.forEach((group) => {
      group.result.forEach((item) => {
        if (!item.doc) return;

        if (!mergedResultsMap.has(item.id)) {
          mergedResultsMap.set(item.id, {
            id: item.id,
            doc: {
              title: item.doc.title as string,
              content: item.doc.content as string,
            },
            titleHighlight: undefined,
            contentHighlight: undefined,
            totalMatches: 0,
          });
        }

        const mergedItem = mergedResultsMap.get(item.id);

        // Keep highlight if it exists
        if (item.highlight && group.field && mergedItem) {
          if (group.field === "title") {
            mergedItem.titleHighlight = item.highlight;
          } else if (group.field === "content") {
            mergedItem.contentHighlight = this.shortenMarkedText(
              item.highlight
            );
          }

          mergedItem.totalMatches++;
        }
      });
    });

    // Convert map to array (Map iteration keeps the original result order)
    const mergedResults: NoteSearchResult[] = Array.from(
      mergedResultsMap.values()
    );

    return mergedResults;
  }
}

SearchService.init();


burakcan avatar Apr 17 '25 11:04 burakcan

This looks quite complicated :) It could probably be solved much more simply.

Below is a full working standalone example based on your code. Results are displayed in the browser console. You can change COUNT_OF_ITEMS at the top of the JavaScript part.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<script type="module">

    const COUNT_OF_ITEMS = 100;

    //import { Charset, Document, IndexedDB } from "flexsearch";
    import { Charset, Document, IndexedDB } from "https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@master/src/bundle.js";

    const notesIndex = new Document({
        encoder: Charset.LatinBalance,
        // unfortunately context does not support "forward" tokenizer yet
        // context: false,
        document: {
            store: true,
            index: [
                {
                    field: "title",
                    tokenize: "forward"
                },
                {
                    field: "content",
                    tokenize: "forward"
                }
            ]
        }
    });

    const notesDb = new IndexedDB("search-notes");
    await notesDb.mount(notesIndex);

    async function getNoteContentFromYDoc(noteId) {

        // fetch your data .....

        const title = "Title" + Math.random().toString(36).substring(2);
        const content = (new Array(10)).fill("").map(() => "Content" + Math.random().toString(36).substring(2));

        return {
            id: noteId,
            title,
            content: content.join("\n")
        };
    }

    // ------------------
    // add contents
    // ------------------

    for(let i = 0; i < COUNT_OF_ITEMS; i++){
        notesIndex.add(await getNoteContentFromYDoc(i));
    }

    await notesIndex.commit();

    // ------------------
    // update contents
    // ------------------

    for(let i = 0; i < COUNT_OF_ITEMS; i++){
        notesIndex.add(await getNoteContentFromYDoc(i));
        // or:
        //notesIndex.update(i, await getNoteContentFromYDoc(i));
    }

    await notesIndex.commit();

    // ------------------
    // search contents
    // ------------------

    for(let i = 0; i < 9; i++){
        let result = await notesIndex.search({
            query: Math.random() > 0.5 ? "title" : "content",
            suggest: true,
            enrich: true,
            highlight: "<mark>$1</mark>"
        });

        console.log(result);
    }

    // ------------------
    // remove contents
    // ------------------

    for(let i = 0; i < COUNT_OF_ITEMS; i++){
        notesIndex.remove(i);
    }

    await notesIndex.commit();

</script>
</body>
</html>

Some hints:

  1. The TypeScript package "@/types/FlexSearch" is no longer supported and was replaced by the official index.d.ts of this repository.
  2. The native result highlighting feature of this library probably handles encoding better.
  3. Use the CSS styles overflow: hidden; max-width: 200px; text-overflow: ellipsis to truncate content that is too long. Read more here: https://www.w3schools.com/cssref/css3_pr_text-overflow.php

ts-thomas avatar Apr 17 '25 16:04 ts-thomas

Addition to my hint 3: I will add a feature to limit the length of the text surrounding matches within a highlight. Good catch 👍

ts-thomas avatar Apr 17 '25 16:04 ts-thomas

Hey, thank you for the example. I think the "skipping" issue happens because I'm committing after each "add". I tried doing it like you did (add a lot, then commit once) and it didn't skip any items this way. But in my case I think I'll continue using the mutex.
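For reference, the effect of the mutex (only one "add, then commit" in flight at a time) can also be had from a plain promise chain. This is a minimal sketch, not FlexSearch code: `createSerialQueue` and `fakeAddAndCommit` are hypothetical names, and the timer only stands in for the real index write.

```javascript
// Run async tasks one at a time, in the order they were enqueued.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const run = tail.then(() => task());
    // Keep the chain usable even if a task rejects.
    tail = run.catch(() => {});
    return run;
  };
}

// Hypothetical stand-in for "add a note, then commit" against a persistent index.
const log = [];
async function fakeAddAndCommit(id) {
  log.push(`start ${id}`);
  await new Promise((resolve) => setTimeout(resolve, 5));
  log.push(`end ${id}`);
}

const enqueue = createSerialQueue();

// All three calls are issued at once, but they execute strictly one after another.
const done = Promise.all(
  [1, 2, 3].map((id) => enqueue(() => fakeAddAndCommit(id)))
);

done.then(() => console.log(log.join(", ")));
```

Each caller still gets back a promise for its own task, so errors can be handled per call while later tasks keep running.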

Also thank you for suggestions :)

  1. Yeah, there was an issue in my setup with including the official d.ts file. So as a band-aid solution, @/types/FlexSearch is actually a file within my project:

Image

  3. I did this in JS because I also wanted a leading ellipsis. So if the <mark> is in the middle of the document, I can still show a preview like "... part of the content ...". I also have a multiline summary, which is not easily supported with a CSS-only solution:

Image

Now it works quite well, I think. Thank you again for the help.

edit: also, some of the complexity is coming from "merging" the results. When I do a search, I get separate "title" and "content" results. Then I deduplicate (count duplications) and merge title and content results. This way I can do highlighting on both in a single result item like:

Image
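The merge step described here can be sketched on its own. The input shape below (one group per field, each with a `result` list of `{ id, doc, highlight }` items) is taken from the `searchNotes` code earlier in this thread; the function name `mergeByField` and the sample data are assumptions for illustration only.

```javascript
// Merge per-field result groups into one entry per document id.
function mergeByField(groups) {
  const merged = new Map();
  for (const group of groups) {
    for (const item of group.result) {
      let entry = merged.get(item.id);
      if (!entry) {
        entry = { id: item.id, doc: item.doc, field: [], highlight: {} };
        merged.set(item.id, entry);
      }
      // Record which field matched and keep its highlight, if any.
      entry.field.push(group.field);
      if (item.highlight) {
        entry.highlight[group.field] = item.highlight;
      }
    }
  }
  // Map iteration follows insertion order, so the first-seen order is kept.
  return Array.from(merged.values());
}

// Hypothetical sample input in the enriched, per-field shape.
const note = { title: "Easter Eggs", content: "Tips for dyeing Easter eggs." };
const groups = [
  { field: "title", result: [{ id: "a", doc: note, highlight: "<mark>Easter</mark> Eggs" }] },
  { field: "content", result: [{ id: "a", doc: note, highlight: "Tips for dyeing <mark>Easter</mark> eggs." }] },
];

const mergedNotes = mergeByField(groups);
console.log(mergedNotes[0].field, mergedNotes[0].highlight);
```

The number of matching fields per document is then just `entry.field.length`, which covers the "count duplications" part.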

burakcan avatar Apr 17 '25 23:04 burakcan

oh just noticed the additional comment, actually having it natively supported by the library would be perfect 💯

burakcan avatar Apr 17 '25 23:04 burakcan

@burakcan Good news, the Result Highlighting feature was extended: https://github.com/nextapps-de/flexsearch/blob/master/doc/result-highlighting.md There is one feature still in progress: combining { merge: true, highlight: ... } in a search, which merges the results of multiple fields into one result item (like GROUP BY id).

ts-thomas avatar May 12 '25 06:05 ts-thomas

oh thank you @ts-thomas this looks amazing. I'll try to replace my implementation with it this week and give feedback.

burakcan avatar May 12 '25 11:05 burakcan

@burakcan The combination of { merge: true, highlight: ... } is now supported.

Example from the tests:

// some test data
const data = [{
    "id": 1,
    "title": "Carmencita",
    "description": "Description: Carmencita"
},{
    "id": 2,
    "title": "Le clown et ses chiens",
    "description": "Description: Le clown et ses chiens"
}];

// create the document index
const index = new Document({
    encoder: Charset.LatinBalance,
    document: {
        store: true,
        index: [{
            field: "title",
            tokenize: "forward"
        },{
            field: "description",
            tokenize: "forward"
        }]
    }
});

// add test data
for(let i = 0; i < data.length; i++){
    index.add(data[i]);
}

let result = index.search({
    query: "karmen or clown or not found",
    suggest: true,
    enrich: true,
    merge: true,
    highlight: "<b>$1</b>"
});

Result:

[{
    id: 1,
    doc: {
        "id": 1,
        "title": "Carmencita",
        "description": "Description: Carmencita"
    },
    field: ["title", "description"],
    highlight: {
        "title": '<b>Carmen</b>cita',
        "description": 'Description: <b>Carmen</b>cita',
    }
},{
    id: 2,
    doc: {
        "id": 2,
        "title": "Le clown et ses chiens",
        "description": "Description: Le clown et ses chiens"
    },
    field: ["title", "description"],
    highlight: {
        "title": 'Le <b>clown</b> et ses chiens',
        "description": 'Description: Le <b>clown</b> et ses chiens',
    }
}]

Do you see any improvements?

ts-thomas avatar May 14 '25 08:05 ts-thomas