Extend Crawler queries by a custom "data-orama" attribute

Open fabiobiondi opened this issue 2 months ago • 2 comments

Problem Description

We are trying the Crawler and we noticed that our Next 14 site is not being indexed.

The problem is probably that we have many nested components that render text inside <div> instead of <p>. I realize that this is not ideal in terms of accessibility and semantics, but we have this need.

Looking at the source code (general-purpose.ts), we realized that the contents of <div> elements are ignored entirely.

https://github.com/askorama/crawly/blob/2892e473775a408495d07a0dea016ec23a85d362/src/general-purpose.ts#L34-L51

In fact, @gioboa and I ran a test in which we modified your function to add <div>s to the query, but noise and other non-useful DOM elements were indexed as well. So that doesn't seem like a decent solution.

Proposed Solution

We thought an interesting idea might be to let users decide what content to index beyond your built-in rules.

A very simple hypothetical solution could be to let site owners add a data-orama attribute to the elements they want indexed, and to extend the crawler so that it also queries those elements.

<div data-orama> content </div>
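
As a rough sketch of the idea, assuming the crawler builds one CSS selector list for the elements it indexes: the default tag list below is a placeholder (I have not copied the real one from general-purpose.ts), and `buildContentSelector`, `selectIndexable` and `Queryable` are hypothetical names used only for illustration.

```typescript
// Placeholder defaults; the actual list in general-purpose.ts may differ.
const DEFAULT_SELECTORS: readonly string[] = ["h1", "h2", "h3", "p", "li"];

// The opt-in attribute proposed in this issue.
const OPT_IN_ATTRIBUTE = "data-orama";

// Appends an attribute selector to the default list, so any element
// carrying the attribute is matched regardless of its tag name.
function buildContentSelector(
  defaults: readonly string[] = DEFAULT_SELECTORS,
  optInAttribute: string = OPT_IN_ATTRIBUTE
): string {
  return [...defaults, `[${optInAttribute}]`].join(", ");
}

// Minimal structural type, so the sketch does not depend on a
// particular DOM implementation (browser document, jsdom, etc.).
interface Queryable<T> {
  querySelectorAll(selector: string): ArrayLike<T>;
}

// Returns every element matched by the extended selector.
function selectIndexable<T>(
  root: Queryable<T>,
  selector: string = buildContentSelector()
): T[] {
  return Array.from(root.querySelectorAll(selector));
}
```

With a single selector list, an element that matches both a default tag and the attribute is returned only once. A <p> nested inside a <div data-orama> would still be matched separately from its parent, so the crawler would probably need to skip descendants of an opted-in element to avoid indexing the same text twice.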

I think it might be a simple, clean and powerful way to extend the crawler.

What do you think?

Alternatives

Another possible solution, further down the road, could be to allow users to fully customize the crawler function.

Additional Context

No response

fabiobiondi, May 14 '24 16:05