backup script
We need an official backup script shipped along with the other tools.
After discussing with the core team here's what seems to be most promising:
- we need LOCK/UNLOCK <idx> in Manticore Search to make sure nothing can modify <idx> during backup. Rel. issue https://github.com/manticoresoftware/manticoresearch/issues/822
- we need a bash script which does LOCK/UNLOCK and just uses rsync with proper checksumming.
- why bash? PHP/python would be more convenient, but we would then need to make it a dependency in .rpm/.deb packages, which is what we'd like to avoid doing.
Alternative ideas discussed:
- a BACKUP ... command - ok, but will take much longer to implement than the above. Perhaps such a command could just be a wrapper for the above script once it's ready.
- a tool which uses replication to connect to a node and sync an index from it - ok too, but it'll also take longer to implement and there may be replication issues (e.g. running out of replication ports).
Perl?
No, thank you:
- We don't write in perl
- We would like to avoid a dependency like this:

...
Transaction Summary
============================================================================================================================================================================
Install 158 Packages
Upgrade 3 Packages
Total download size: 25 M
PHP looks more promising then:
but it's still something we'd like to avoid having as a dependency.
Installing php-fpm is enough, then it won't pull in the whole zoo of Apache and the rest.
php-fpm, php-mysqli, php-pdo
are most likely enough (I don't have CentOS, so I can't check)
➤ Sergey Nikolaev commented:
... install php-cli may indeed be a good option since:
- it says it + its dependencies download 3MB from the Internet and take about 15MB of disk space
- and it takes about 10 sec to install them all
- the dependencies seem to be all sane, no massive chain reaction
root@14b071c20238:/# apt install php-cli
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libbsd0 libedit2 libsodium23 php-common php7.4-cli php7.4-common php7.4-json php7.4-opcache php7.4-readline psmisc ucf
Suggested packages:
php-pear
The following NEW packages will be installed:
libbsd0 libedit2 libsodium23 php-cli php-common php7.4-cli php7.4-common php7.4-json php7.4-opcache php7.4-readline psmisc ucf
0 upgraded, 12 newly installed, 0 to remove and 112 not upgraded.
Need to get 3036 kB of archives.
After this operation, 14.6 MB of additional disk space will be used.
but it still requires more thought and research.
Overall I would say bash/shell would be preferable, as simpler. But on the other hand PHP would be more portable to Windows!
The logic of the script seems pretty simple: it just needs to run the lock (which admittedly means that a mysql-client is a dependency! although it could perhaps use the HTTP interface), copy the files, and unlock.
But if the script is to fetch a list of all indexes, and even has to look for the path for each index (from the .conf file, and perhaps even the .json file for RT mode?), then that begins to get beyond a simple bash script.
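That simple lock / copy / unlock flow might look roughly like this (a sketch only: the index name, paths and the mysql client below are assumptions, not part of any shipped script):

#!/usr/bin/env bash
# Minimal sketch: index name, paths and port are assumptions.
INDEX=test_index
SRC=/var/lib/manticore/$INDEX
DST=/backup/manticore/$INDEX

mysql -h127.0.0.1 -P9306 -e "LOCK $INDEX"     # freeze the index files on disk
mkdir -p "$DST"
rsync -a "$SRC/" "$DST/"                      # copy the files
mysql -h127.0.0.1 -P9306 -e "UNLOCK $INDEX"   # release the lock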
Draft ideas/questions for the script:
- "could perhaps use the HTTP interface" - yes, shouldn't be a big deal to use
/sql?mode=raw
endpoint. In case we use bashcurl
may be also unavailable, but if we go withphp
stream_context_create() and file_get_contents() are available out of the box. - use
select * from idx.files
to get list of index files (it works for plain/pq indexes too) - checksumming is important.
rsync
can be a dependency (800kb unarchived: rsync + one lib) - it should flush binlogs and make sure they are all flushed, i.e. no binlogs left
- To think: what will point to the instance to backup from?
  - path to config (e.g. /etc/manticoresearch/manticore.conf, and then we have to parse it)
  - host / host:port (e.g. localhost or localhost:9312, then we may need to find what pid it is, which may be a problem)
  - searchd pid (e.g. 12345: doesn't seem to be a good practice for backup scripts)
- if no instance specified we should try with the default one
- mode 1: should be able to backup the whole instance, then:
  - perhaps set maintenance=1
  - lock all indexes
  - rsync all
  - unlock all
- mode 2: should be able to backup specified indexes only, then:
  - if one of the indexes is distributed, warn that it may change during the backup (not sure if LOCK will cover it)
- always include the config and manticore.json (if it exists) in the backup, since it's not a big deal
- preserve file permissions
- To think: how do we backup wordforms, exceptions etc of plain indexes? It may require parsing config which can be complicated.
- To think: same question about plugin_dir, lemmatizer_base etc.
- catch TERM/INT and unlock before terminating the backup script
- implement mode backup --unlock just to unlock and exit if the above didn't help
- show backup progress nicely
- backup searchd/MCL version info too (may be important when you restore from the backup)
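To illustrate a few of the points above (the /sql?mode=raw endpoint, rsync and catching TERM/INT to unlock), here is a rough sketch; the request format, port 9308 and the paths are assumptions and may differ between Manticore versions:

#!/usr/bin/env bash
# Illustrative sketch only; not the final script.
INDEX=test_index
SRC=/var/lib/manticore/$INDEX
DST=/backup/manticore/$INDEX

# run a SQL statement over the HTTP interface (request format is an assumption)
run_sql() { curl -s "http://localhost:9308/sql?mode=raw" --data-urlencode "query=$1"; }
unlock()  { run_sql "UNLOCK $INDEX"; }

# catch TERM/INT and unlock before terminating the backup
trap 'unlock; exit 130' INT TERM

run_sql "LOCK $INDEX"
mkdir -p "$DST"
rsync -a --checksum "$SRC/" "$DST/"   # copy with checksumming
unlock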
My English isn't good enough for long discussions, so this will be in Russian.
use select * from idx.files to get list of index files (it works for plain/pq indexes too)
This seems like a fragile solution to me. Better:
wombat@source:~$ mysql --host 127.0.0.1 --port=9306 --prompt='Manticore> ' -e 'show tables;'
+-----------------------------+------+
| Index | Type |
+-----------------------------+------+
| rt_47newsru_articles | rt |
| rt_47newsru_articles_stored | rt |
| rt_47newsru_reports | rt |
+-----------------------------+------+
But then you have to parse the output lines. Is it possible to request the list of indexes in JSON format? Then they could be parsed with jq.
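Something like this might work, for illustration (the JSON layout returned by /sql?mode=raw differs between versions, so the jq path below is an assumption that may need adjusting):

curl -s "http://localhost:9308/sql?mode=raw" --data-urlencode "query=SHOW TABLES" \
  | jq -r '.[0].data[].Index'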
It may require parsing config which can be complicated.
it's not a problem...
Summary: I think PHP or any compiled language is better than bash+powershell scripts.
And Perl is good, but you don't want to know this word
@sanikolaev that looks like a good list!
I agree that rsync is good, as it should allow the backup to be taken as quickly as possible, without locking the server for too long. (It could even do one 'dirty' rsync outside the lock, then do another rsync inside the lock, which will only copy stuff that actually changed. More wasteful, as some data might be transferred needlessly, but it would in general keep the lock for a shorter period.)
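That two-pass approach might look roughly like this (a sketch; the index name, paths and the mysql client are assumptions):

SRC=/var/lib/manticore/test_index/
DST=/backup/manticore/test_index/

rsync -a "$SRC" "$DST"                            # pass 1: "dirty" copy, no lock held
mysql -h127.0.0.1 -P9306 -e "LOCK test_index"
rsync -a --delete "$SRC" "$DST"                   # pass 2: only what changed, under the lock
mysql -h127.0.0.1 -P9306 -e "UNLOCK test_index"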
Of course the other possibility is to do a 'mysqldump' style export. At least for indexes where fields have been stored in 'docstore'. Otherwise it would only be attributes, but for a plain index it's only the attributes that change, so they need backing up :)
Doing a 'dump' style export would also have the benefit of being more portable (load into different server versions!), and even work remotely (i.e. not running on the same server as searchd) - but it would in general be MUCH slower than an rsync copy. (Importing the 'dump' file would also be much slower, if you need to 'restore' the backup.)
Anyway, I just committed a script I created a while ago for creating an 'sql dump' file from a running searchd instance, by doing 'SELECT * FROM'. https://github.com/barryhunter/indexdump It's only one table at a time (it has to be named), and it will add a 'create table' for the index. If 'show create table' is not available, it tries to replicate a CREATE TABLE from 'DESCRIBE', but I admit that part isn't well tested. And it certainly won't get all the extra options, just the fields/attributes. (Based on a similar script that more closely replicates mysqldump for mysql databases, but can dump views.)
I still think concentrating on an rsync/file based copy is best for the official script. But I can make the above script more robust if it's of interest to others. (Maybe the backup script could use something like 'indexdump.php' as an option instead of rsync.)
Hello. In my opinion the best idea is to be compatible with the mysqldump utility. This is the most natural way to make a logical backup.
Or create a manticoredump)
Or
SELECT ... INTO OUTFILE 'file_name' [CHARACTER SET charset_name] [export_options]
And
LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name' [REPLACE | IGNORE] INTO TABLE tbl_name
FFR: there's another issue or, to be more correct, a general solution for the mysqldump compatibility problem - https://github.com/manticoresoftware/manticoresearch/issues/710#issuecomment-1039957828
For now, for logical backups of Manticore Search (Google, please index it) I recommend trying the tool Barry mentioned above - https://github.com/barryhunter/fakedump/blob/master/indexdump.php - and helping him improve it.
Within this issue we'd like to focus on the physical kind of backups.
Looks like Barry's code (and mysqldump-like) can't backup unstored RT indexes...
That's right. As Barry wrote "At least for indexes where fields have been stored in 'docstore'".
Logical backup of non-stored fields is impossible in the general case, since you may be using dict=crc, so even if you decode the dictionary/doclist/wordlist you won't be able to turn it back into the original text. Same if you use min_word_len > 1, stopwords, morphology etc.
That's why since Manticore 5 everything is stored by default. In the future it should also make it possible to do retokenization on the fly (we already have a specification of this task).
That's why since Manticore 5 everything is stored by default.
So, the backup utility must check the source Manticore version and index structure.
My indexdump.php script does check the output of DESCRIBE, and emits a warning that non-stored fields were in the original index and won't be in the dump. (It does account for ones duplicated into a string attribute!) In theory it should be able to dump any plain or RT index, regardless of searchd version. But the 'create table' will only be usable with more recent versions!
On the other hand, if you have non-stored fields, you're almost certainly not using Manticore as a 'database' (it's indexing a copy of data stored somewhere else), so backing up the index is less critical.
And if you are backing up the index(es) for 'quick' recovery in case of a catastrophic failure, then a physical backup is already far more practical (you could run 'inserts' from the real database). Restoring a physical backup is much quicker than a logical backup.
Oh, and as indexdump.php started life as a clone of mysqldump, it already has LOCK support :) - at least in locking the index while dumping. It can't add any locking to the dumped output.
Some problem with paths to exceptions, stopwords and wordforms files: when a backup is created, in SHOW CREATE TABLE these paths point to the index directory. Is there any way to get the original paths to the files? (The ones used when creating the index.) Or a way to set read access to these files from the index for a non-root user?
Is there any way to get the original paths to the files?
read and parse the config?
It's probably not the best solution. A backup is a snapshot of the current state of an index, including wordforms files. Also, one of the main ways of using Manticore Search is RT mode without listing indexes in the configuration file.
The correct thing is just to copy them from the directory with the index. And there are no problems other than access (a backup is usually created not as the root user).
lock <IDX>[,<IDX2>,<IDX3...] will return you a table with all relevant files. That is already published, you can try it. No need to manually parse the config/whatever. Don't forget also to call unlock when finished.
Yes, there is a list, but it could also be obtained from the result of SHOW CREATE TABLE (exceptions, stopwords and wordforms). The problem is accessing these files: they are available only to the owner - user manticore or root.
lock <IDX>[,<IDX2>,<IDX3...] will return you a table with all relevant files. That is already published, you can try it. No need to manually parse the config/whatever. Don't forget also to call unlock when finished.
- Please explain how the locks work in more detail. I try to lock the index, but in fact nothing happens except for listing the files.
In first session:
MySQL [(none)]> CREATE TABLE test_index(label text);
Query OK, 0 rows affected (0.001 sec)
MySQL [(none)]> LOCK test_index;
+---------------------------------------------------+---------------------------------------------------+
| file | normalized |
+---------------------------------------------------+---------------------------------------------------+
| /var/lib/manticore/test_index/test_index.meta | /var/lib/manticore/test_index/test_index.meta |
| /var/lib/manticore/test_index/test_index.ram | /var/lib/manticore/test_index/test_index.ram |
| /var/lib/manticore/test_index/test_index.settings | /var/lib/manticore/test_index/test_index.settings |
+---------------------------------------------------+---------------------------------------------------+
3 rows in set (0.000 sec)
MySQL [(none)]> INSERT INTO test_index VALUES (0, 'Test text');
Query OK, 1 row affected (0.001 sec)
in second session:
MySQL [(none)]> INSERT INTO test_index VALUES (0, 'Row 2');
Query OK, 1 row affected (0.000 sec)
In first session:
MySQL [(none)]> SELECT * FROM test_index;
+---------------------+-----------+
| id | label |
+---------------------+-----------+
| 8864805720680300547 | Row 2 |
| 8864805720680300546 | Test text |
+---------------------+-----------+
2 rows in set (0.000 sec)
MySQL [(none)]> UNLOCK test_index;
Query OK, 1 row affected (0.000 sec)
Looks like writes continue unhindered
- Do I have to do FLUSH RTINDEX test_index; before backup?
A comment on the dedicated Lock/unlock thread, https://github.com/manticoresoftware/manticoresearch/issues/822 suggests the LOCK command only blocks statements from replication.
... stas seems to imply it would be an additional feature to also block direct updates from insert etc.
But it does seem that would be required before the 'lock' function can be called done, and suitable for use with this backup.
Although I guess you could argue that the 'concurrent inserts' etc. happening on the index while locked aren't affecting the files on disk (they go into memory until flushed - either the ram chunk, or attributes etc.) ... so really the 'lock' is only preventing writes to the index files (so rsync etc. can take a consistent snapshot) - while writes can still be made to the index.
Lock only prevents writes to the 'index files', not writes to the logical index.
In other words this LOCK function might be good enough for a physical backup, but it's NOT good enough for a logical backup! A logical backup could still see dirty data by seeing some intermediate state.
Got it.
Utility for physical backup under Node.js. In the end, if you add the command to sudoers, everything is quite convenient.
Like this:
%username ALL=(ALL)NOPASSWD:/usr/local/bin/indexbackup *
Example
yarn global add indexbackup
sudo indexbackup --all | aws s3 cp - s3://bucket-name/daily/backup.tar.gz
And my 20G indexes flew to AWS s3)
Thanks everyone
For some cases, a manticore.json backup is needed.
Is there a way to get the data_dir parameter?
It's not possible to get it from inside a running Manticore instance now, but we already discussed with the core team that it shouldn't be a big deal to implement this and similar things to make it easier for the backup script to understand more about the instance it's connected to.
If possible, discuss changing the default permissions for index files as well. For example read access for the manticore group. Then it would be possible to add a backup user to the group and not play with sudo and root
For example read access for the manticore group.
Good idea! Should be ok in terms of performance. I'll create a separate issue about it.
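Something along these lines would then be possible (hypothetical user name, and assuming indexes live under /var/lib/manticore):

usermod -aG manticore backupuser      # add the backup user to the manticore group
chmod -R g+rX /var/lib/manticore      # give the group read access to the index files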
we need LOCK/UNLOCK in Manticore Search to make sure nothing can modify <idx> during backup. Rel. issue https://github.com/manticoresoftware/manticoresearch/issues/822
LOCK/UNLOCK has been implemented in https://github.com/manticoresoftware/manticoresearch/commit/f6fe6408f24251eb1dd65e284c7c9b42593d7792 , but requires testing and documenting.