manticoresearch icon indicating copy to clipboard operation
manticoresearch copied to clipboard

Secondary indexes for JSON

Open sanikolaev opened this issue 11 months ago • 4 comments

The task is to implement secondary indexes for JSON.

We made a research of performance issues of one Manticore Search users and we have determined that the key to enhancing the queries performance lies in utilizing secondary indexes for JSON attributes. This necessitates partial JSON flattening. In order to proceed, we need to design and implement the following features:

  • before generating secondary indexes for attributes we need to apply some flattening (that will be the source for the secondary indexes) and type deduction rules
    • we need to determine all available attribute names (e.g. “json.subclass1.subclass2.property”)
    • we need to determine types for all attributes, even if they have different types in different documents
  • json flattening rules:
    • int/float/string types: as-is, already supported in daemon and MCL
    • subobject: "json.subclass.property = value"
    • array of mixed types: not supported now, not planned
    • array of ints: not currently supported; support will be implemented
    • array of strings: required to be supported in SI, will NOT be supported in daemon/MCL (i.e. there won’t be a new data type)
    • array of arrays: not supported now, not planned
    • array of objects: not supported now, not planned
  • type deduction rules:
    • after we do json flattening we have a list of all flattened attributes which may have different types
    • we can take several approaches here and will need to figure out which is the most optimal:
      • Cast to widest encountered type (e.g. int to bigint, bigint to float, float to string)
      • Cast only numerics with no data loss (e.g. int to bigint); ignore non-numeric data
    • we can’t use daemon expressions to access JSON data because we will have to generate SI at indexer level which doesn’t have any expressions
    • we’ll have to use lower level BSON access code
  • searching:
    • to use these new generated attributes we need to preprocess query filters and select expressions
    • currently there are attributes in select expressions (e.g. in(xx_sph_attr_xxattributes['9xxxxxxc704d781733bxxxxxxxx0'], 'xonxtixe') as col_1)
      • because daemon does not support expressions in filters, we also need to track select list attributes that are used in filters and analyze them
    • json expressions will be replaced by attributes by means of plain string matching
      • e.g. "xx_sph_attr_xxattributes['9xxxxxxc704d781733bxxxxxxxx0']" -> “xx_sph_attr_xxattributes.9xxxxxxc704d781733bxxxxxxxx0"
      • if a secondary index with such a name exists, it will replace the filter
    • complex JSON expressions cannot be directly used as filters
    • e.g. select *, any(x >= 1 and x <= 2 for x in xx_sph_attr_xxattributes['2443xxxxxxxxb76xxxxxxxxxxxxc4d']) a from test where a>0 will need to become: select * from test where any(xx_sph_attr_xxattributes.2443xxxxxxxxb76xxxxxxxxxxxxc4d) between 1 and 2
    • and this modification should only work when an attribute xx_sph_attr_xxattributes.2443xxxxxxxxb76xxxxxxxxxxxxc4d exists in SI and JSON expression can be translated into an attribute filter
    • otherwise, query should not be modified and should work as it did before
  • we need to add such indexes to CBO rules (probably to always use them if available because JSON scan is slow)

In summary, we will dynamically generate secondary indexes for JSON keys, which will then serve as shortcuts to bypass most documents when filtering using these keys.

sanikolaev avatar Mar 08 '24 14:03 sanikolaev

not quite sure about array of float as if it work similar to mva there is no point to store individual float values and build index for every value

if float vector used for knn there no need to add SI for that type

tomatolog avatar Mar 08 '24 18:03 tomatolog

Done in https://github.com/manticoresoftware/manticoresearch/commit/155d335b503294a01895ef7ae4919b7142661b65

glookka avatar Jun 29 '24 14:06 glookka

The task to document this is https://github.com/manticoresoftware/manticoresearch/issues/2348

sanikolaev avatar Jul 01 '24 08:07 sanikolaev

Reopening to update OpenAPI yaml, so it's possible to control creating secondary indexes via clients.

sanikolaev avatar Aug 07 '24 11:08 sanikolaev