
Could we specify the document id explicitly when adding a document to the index?

Open longjiquan opened this issue 2 years ago • 8 comments

First of all, this is an awesome project. Thanks for all your great work.

From what I know so far, tantivy does not support dynamic mapping yet (please correct me if I'm wrong); the closest solution is "flatten json". As the documentation mentions, range queries are not well supported with flatten json.

So I'm trying to implement dynamic mapping on top of tantivy's fundamental index types. The approach is very naive: every document is parsed and then indexed under every possible flattened path. If so, I must maintain a mapping between the doc id in each related index and the doc id of the original document. So is there any chance that we could specify the document id explicitly for a document?

longjiquan avatar Nov 06 '23 08:11 longjiquan

What do you mean with "flatten Json"? We have the JSON type which reflects dynamic typing on a field, and there's the idea to have a global flag on the schema https://github.com/quickwit-oss/tantivy/issues/2215. Internally we flatten nested JSON structs on the JSON type.

A docid is an internal id in tantivy: it is the position of a document in a segment. It changes after merges, so it doesn't make sense for it to be user provided, but you can store any ids or data on your documents.
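To illustrate why a position-based id is unstable, here is a toy model in Python. It is not tantivy code: a "segment" is just a list, a doc id is an index into that list, and `my_id` is a hypothetical user-supplied field stored on each document.

```python
# Toy model, not tantivy code: a "segment" is a list of stored documents,
# and a doc id is simply the position of a document in that list.

def merge(segments, deleted_keys):
    """Concatenate segments into one, dropping deleted documents."""
    merged = []
    for segment in segments:
        for doc in segment:
            if doc["my_id"] not in deleted_keys:
                merged.append(doc)
    return merged


def doc_id_of(segment, my_id):
    """Return the position (toy doc id) of the document carrying `my_id`."""
    for position, doc in enumerate(segment):
        if doc["my_id"] == my_id:
            return position
    return None


seg_a = [{"my_id": 10, "title": "a"}, {"my_id": 11, "title": "b"}]
seg_b = [{"my_id": 12, "title": "c"}, {"my_id": 13, "title": "d"}]

before = doc_id_of(seg_b, 13)                      # position inside seg_b
merged = merge([seg_a, seg_b], deleted_keys={11})  # merge drops document 11
after = doc_id_of(merged, 13)                      # position inside the merged segment
```

In this sketch the position of the document changes across the merge, while the stored `my_id` still identifies it.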

PSeitz avatar Nov 06 '23 11:11 PSeitz

@PSeitz thanks for the explanation, I'll read https://github.com/quickwit-oss/tantivy/issues/2215. "flatten json" is from here.

By the way, is the order of docid consistent with the input order of the documents? In our use case, we use tantivy for filtering, and only the offset of a document matters to us. We can ensure our index has only one segment. Of course, you're right that we can store the offset attached to the document, but that would cause significant read/write amplification.

longjiquan avatar Nov 06 '23 12:11 longjiquan

Range queries on JSON can be supported, but aren't currently: https://github.com/quickwit-oss/tantivy/issues/1709.

By the way, is the order of docid consistent with the input order of the documents?

If you have only one single threaded writer and no merges, I think yes.

So I'm trying to implement dynamic mapping on top of tantivy's fundamental index types. The approach is very naive: every document is parsed and then indexed under every possible flattened path. If so, I must maintain a mapping between the doc id in each related index and the doc id of the original document.

I don't really understand what you are trying to do. Can you give an example?

PSeitz avatar Nov 06 '23 12:11 PSeitz

In fact, what we want to do is implementation 2 in https://github.com/quickwit-oss/tantivy/issues/1050.

Suppose we have the two documents below:

{
  "title": "Google is a useful search engine",
  "url": "google.com",
  "num_clicks": 1026,
  "meta": {
    "num_staffs": 100,
    "description": ["desc1", "desc2"]
  }
}
{
  "title": "Databricks",
  "url": "databricks.com",
  "num_clicks": 88,
  "meta": {
    "num_staffs": 46,
    "created_at": "2022-06-22T13:00:00.22Z"
  }
}

Then we will create 6 indexes for them:

index_name        data_type
title             text
url               text
num_clicks        u64
meta.num_staffs   u64
meta.description  Multivalued text
meta.created_at   date
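As a rough sketch of how such a table could be derived from the two documents, the Python below walks each document and guesses a type per flattened path. The helper names (`walk`, `infer_type`, `infer_schema`) and the type-guessing rules are assumptions made for illustration; they are not part of tantivy.

```python
import re

# Loose ISO 8601 date-time pattern, used only to guess "date" fields.
ISO_DATETIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def walk(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs; lists are kept as leaves."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from walk(child, path)
    else:
        yield prefix, value


def infer_type(value):
    """Guess a type name for a leaf value (illustrative rules only)."""
    if isinstance(value, list):
        return "multivalued " + (infer_type(value[0]) if value else "text")
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "u64" if value >= 0 else "i64"
    if isinstance(value, float):
        return "f64"
    if isinstance(value, str) and ISO_DATETIME.match(value):
        return "date"
    return "text"


def infer_schema(docs):
    """Map every flattened path to the type of its first occurrence."""
    schema = {}
    for doc in docs:
        for path, value in walk(doc):
            schema.setdefault(path, infer_type(value))
    return schema


docs = [
    {"title": "Google is a useful search engine", "url": "google.com",
     "num_clicks": 1026,
     "meta": {"num_staffs": 100, "description": ["desc1", "desc2"]}},
    {"title": "Databricks", "url": "databricks.com", "num_clicks": 88,
     "meta": {"num_staffs": 46, "created_at": "2022-06-22T13:00:00.22Z"}},
]
schema = infer_schema(docs)
```

The first occurrence of a path fixes its type here; conflicting types across documents are not handled in this sketch.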

When a query arrives, for example "meta.num_staffs > 58", we translate it into a query on the index "meta.num_staffs". As I mentioned above, we want to know the offsets of the hit documents. Here document 0 (meta.num_staffs = 100) is hit, so I want offset 0 to be returned. To do so, as you pointed out, we can add a new field to every index that stores the offset.
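A minimal sketch of that lookup, in plain Python rather than tantivy: each per-path "index" is modeled as a list of (offset, value) pairs, and the query is evaluated by a linear scan. The `columns` layout, the `range_filter` helper, and the tiny query syntax are all hypothetical.

```python
import operator

# Hypothetical per-path "indexes": each maps a flattened path to
# (offset, value) pairs, where offset is the input position of the document.
columns = {
    "num_clicks": [(0, 1026), (1, 88)],
    "meta.num_staffs": [(0, 100), (1, 46)],
}

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def range_filter(columns, query):
    """Evaluate a query such as 'meta.num_staffs > 58'; return matching offsets."""
    path, op, bound = query.split()
    compare = OPS[op]
    return [
        offset
        for offset, value in columns.get(path, [])
        if compare(value, int(bound))
    ]


hits = range_filter(columns, "meta.num_staffs > 58")
```

Only document 0 has `meta.num_staffs` above 58, so `hits` contains just offset 0.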

I'd appreciate it a lot if you could provide a more detailed design for the dynamic schema.

longjiquan avatar Nov 07 '23 03:11 longjiquan

Why do you create 6 indices and not just one? You can flatten the object on a preprocessing to circumvent the JSON Range query limitation (or create a PR that fixes it).

I'd appreciate it a lot if you could provide a more detailed design for the dynamic schema.

What details are missing? This is just an outline of how it could work, not a specification, so going into too much detail upfront doesn't make sense imo.

PSeitz avatar Nov 07 '23 07:11 PSeitz

OK, thanks a lot. According to https://github.com/quickwit-oss/tantivy/issues/1709#issuecomment-1510665022, I guess the only thing missing for range queries on JSON fields is support for the syntax in the query parser. I also noticed that someone was already trying to fix it. Of course, I'm glad to contribute to this project as well.

By the way, no offense, what do you mean by "flatten the object on a preprocessing"?

longjiquan avatar Nov 07 '23 08:11 longjiquan

OK, thanks a lot. According to https://github.com/quickwit-oss/tantivy/issues/1709#issuecomment-1510665022, I guess the only thing missing for range queries on JSON fields is support for the syntax in the query parser. I also noticed that someone was already trying to fix it. Of course, I'm glad to contribute to this project as well.

I think no one is working on that currently.

By the way, no offense, what do you mean by "flatten the object on a preprocessing"?

If that's the structure of your document, it should be simple.

{
  "title": "Google is a useful search engine",
  "meta": {
    "num_staffs": 100,
    "description": ["desc1", "desc2"]
  }
}
{
  "title": "Google is a useful search engine",
  "meta.num_staffs": 100,
  "meta.description": ["desc1", "desc2"]
}
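The transformation from the first object to the second could be sketched like this in Python. `flatten` is a hypothetical preprocessing helper, not a tantivy function, and it assumes keys do not themselves contain dots.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys.

    Lists and scalar values are kept as they are.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "title": "Google is a useful search engine",
    "meta": {
        "num_staffs": 100,
        "description": ["desc1", "desc2"],
    },
}
flat = flatten(doc)
```

Each flattened key can then be declared as a regular typed field in a single index, which is what makes range queries on it possible.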

PSeitz avatar Nov 07 '23 14:11 PSeitz

OK, I got it. I noticed that tantivy's query code has a very clear abstraction. We need the range query anyway (though maybe not for a few months). I'll try to add an implementation of the interface, and if it runs well in our environment I'll open a PR for it.

longjiquan avatar Nov 08 '23 08:11 longjiquan