manticoresearch
manticoresearch copied to clipboard
Auto table creation on INSERT/REPLACE (aka dynamic mapping)
The task is to make it possible to insert into a table which doesn't exist yet, or in other words for a table to be created automatically so it's possible to use Manticore Search with logstash and other log shipping tools that don't create indexes implicitly.
Preliminary specification (to be discussed)
searchd.dynamic_mapping = 1/0(1 by default) to disable the functionality- create schema automatically based on the first document / batch of documents (then the schema should fit all the documents in the batch, otherwise an error should be returned (e.g. in case of string/int conflict)). In the future we want to support higher heterogeneity, but it's out of scope of this issue.
- auto-redetection. This is needed to make a better schema in the following cases:
- seldom floats:
1, 2, 3, 4, 5.6, 6, 7, 8. W/o the redetection we would create an int attribute while it should be a float - unknown size:
12300, 123035, 542342, 34359738368. W/o the redetection we would create an int attribute while it should be a bigint - timestamp or not?
1605599944, 1605599945, 1605599946, 433. W/o the redetection we would detect it as a timestamp, but it's actually just an int. This can be solved by buffering first N (e.g. 1000) documents for auto-redetection when the needed number of documents has been hit. For that we can perhaps store everything in raw format in a new data structure in memory while we are accumulating and recreate the whole memory chunk once we have enough. Or we can trigger redetection as soon as it becomes inevitable, i.e. the type defined after the first document is not compatible with the new value. As it can cause some instabilities this should be configurable insearchd.data_type_redetection_buffer = 0(disable)/1000(default).
- seldom floats:
➤ Sergey Nikolaev commented:
Call notes after dev call of Aug 18 2022
- Q: for a string how do we understand what underlying data type to create: field or string attribute or stored only?
- A: there doesn't seem to be a general rule. We need to research how it's made in other databases to make a decision.
- As for the auto redetection here's what seems more or less optimal:
- trigger schema update on:
rt_mem_limitexceeded orsearchd.data_type_redetection_buffer = N(about 1000 by default) (whichever happens first)- a conflict, e.g.
INSERT INTO t values(1); INSERT INTO t values(1.1). Here at the 2nd insert it should change the schema to float.
- on schema update do not do internal ALTER, instead just run internal INSERT with the new schema
- it may be easier to have all the fields (and attrs) stored temporarily in docstore while the schema is not stable, so it's easier to read the values that should be reinserted after the schema is updated
- trigger schema update on:
Thinking long term, would be great if this functionality could be 'abused' to power
CREATE TABLE index2 SELECT id,title FROM index1 WHERE images> 1000
kind of query. ie create a index based on another query. Obviously max_matches would be a limit! Without having to download all the data, only to immediately reupload it.
(for background, create a number tables in mysql as 'material views' and then create a (currently 'plain') index on that table. Would be cool to short cut that and create the index directly in manticore. Of course would only work for stored fields/attributes from index1! The real query would be more complicated, probably involving a GROUP BY!)
There's one thing we are disussing now, which can help with implementation of things like CREATE TABLE index2 SELECT id,title FROM index1 WHERE images> 1000, INSERT INTO ... SELECT * , running backup/restore/indexer etc. from inside Manticore Search etc. The idea is simple:
- little sidecar daemon written in PHP which has a communication channel with the searchd. Written in PHP, so it's easy to implement high-level logic.
- when Manticore gets a query which it doesn't know how to process before failing it first asks the sidecar if it knows anything about it.
- if the sidecar can handle it - it does that and returns a response which is passed over back to the user.
So for example:
- searchd receives
CREATE TABLE index2 SELECT id,title FROM index1 WHERE images> 1000 - before failing it sends it over to the sidecar
- the sidecar parses it and does a number of simple things:
desc index1create table index2 (...)- in a loop:
select id, title from index1 where ...insert into index2 values(...)
- returns "OK" to the searchd
- the searchd returns "OK" to the user
Hmm, that could work. But seems like it would still mean transfering all the data out of searchd, then the PHP app, has to repackage all the data (and re encoding it as SQL), jsut send it all back again. Ok, if the sidecar is on the same network as searchd - not quite so terrible, but still.
If not going to be internal, seems like it would be easier to just leave it upto the application developer. Could provide a libary of scripts to automate common tasks.
My own script https://github.com/barryhunter/indexdump does most of it now - dump the output of a manticore query as sq-script. Which can be piped directly back to searchd with the 'mysql' command line client. its creation of the 'create table' is a bit rudimentary at the moment tho.
php indexdump.php -hbalancer \"select * from table where title != 'Other'\" | mysql -hworker -P9306
(not quite functional, but close)
➤ Nick Sergeev commented:
The information on this feature is now added to the documentation: https://manual.manticoresearch.com/dev/Adding_documents_to_a_table/Adding_documents_to_a_real-time_table#Auto-schema