
backup script

Open sanikolaev opened this issue 2 years ago • 43 comments

We need an official backup script shipped along with the other tools.

After discussing with the core team here's what seems to be most promising:

  • we need LOCK/UNLOCK <idx> in Manticore Search to make sure nothing can modify <idx> during backup. Rel. issue https://github.com/manticoresoftware/manticoresearch/issues/822
  • we need a bash script which does LOCK/UNLOCK and just uses rsync with proper checksumming.
    • why bash? PHP/python would be more convenient, but we would then need to make it a dependency in .rpm/.deb packages, which is what we'd like to avoid doing.

Alternative discussed ideas:

  • BACKUP ... command - ok, but will take much longer to implement than the above. Perhaps such a command could be just a wrapper for the above script once it's ready.
  • a tool which uses replication to connect to a node and sync the index from it - ok too, but it would also take longer to implement and has possible replication issues (e.g. running out of replication ports).

sanikolaev avatar Jul 14 '22 10:07 sanikolaev

Perl?

KarelWintersky avatar Jul 19 '22 16:07 KarelWintersky

Perl?

No, thank you:

  1. We don't write in perl
  2. We would like to avoid a dependency like this:
...
Transaction Summary
============================================================================================================================================================================
Install  158 Packages
Upgrade    3 Packages

Total download size: 25 M

PHP looks more promising then, but it's still something we'd like to avoid having as a dependency.

sanikolaev avatar Jul 19 '22 16:07 sanikolaev

Installing php-fpm is enough; then it won't drag in the whole zoo of Apache and the rest.

php-fpm, php-mysqli, php-pdo

is most likely sufficient (I don't have CentOS, so I can't check)

KarelWintersky avatar Jul 19 '22 16:07 KarelWintersky

➤ Sergey Nikolaev commented:

... install php-cli may indeed be a good option since:

  • it says it + its dependencies download 3MB from the Internet and take about 15MB of disk space
  • and it takes about 10 sec to install them all
  • the dependencies seem to be all sane, no massive chain reaction
root@14b071c20238:/# apt install php-cli 
Reading package lists... Done 
Building dependency tree 
Reading state information... Done 
The following additional packages will be installed: 
  libbsd0 libedit2 libsodium23 php-common php7.4-cli php7.4-common php7.4-json php7.4-opcache php7.4-readline psmisc ucf 
Suggested packages: 
  php-pear 
The following NEW packages will be installed: 
  libbsd0 libedit2 libsodium23 php-cli php-common php7.4-cli php7.4-common php7.4-json php7.4-opcache php7.4-readline psmisc ucf 
0 upgraded, 12 newly installed, 0 to remove and 112 not upgraded. 
Need to get 3036 kB of archives. 
After this operation, 14.6 MB of additional disk space will be used. 

but it still requires better thinking and research.

githubmanticore avatar Jul 20 '22 09:07 githubmanticore

Overall I would say bash/shell would be preferable, as simpler. But on the other hand PHP would be more portable to Windows!

The logic of the script seems pretty simple: run the lock (which admittedly means that a mysql client is a dependency, although it could perhaps use the HTTP interface), copy the files, and unlock.

But if the script is to fetch a list of all indexes, and even has to look for the path for each index (from the .conf file, and perhaps even the .json file for RT mode?) then that begins to get beyond a simple bash script.
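A minimal sketch of that lock, copy, unlock flow in shell (the LOCK/UNLOCK syntax is taken from this thread; the table name, ports, and destination are hypothetical placeholders, not a finished tool):

```shell
#!/bin/sh
# Sketch only: assumes a local searchd listening on the default SQL port
# and the LOCK/UNLOCK syntax discussed in this thread.
TABLE="${TABLE:-test_index}"          # hypothetical table name
DEST="${DEST:-/backup/manticore}"     # hypothetical destination

sql() {
    # mysql-client is a dependency here; the HTTP endpoint would avoid it
    mysql -h127.0.0.1 -P9306 -N -e "$1"
}

backup_table() {
    # unlock even if rsync fails or we receive TERM/INT mid-copy
    trap 'sql "UNLOCK $TABLE"' EXIT INT TERM
    # LOCK returns the list of files belonging to the table (first column)
    files=$(sql "LOCK $TABLE" | awk '{print $1}')
    # -a preserves permissions and times, -c compares files by checksum
    rsync -ac $files "$DEST/"
}

# only run when asked explicitly, so the file can be sourced safely
if [ "${1:-}" = "run" ]; then
    backup_table
fi
```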

barryhunter avatar Jul 20 '22 10:07 barryhunter

Draft ideas/questions for the script:

  • "could perhaps use the HTTP interface" - yes, it shouldn't be a big deal to use the /sql?mode=raw endpoint. If we go with bash, curl may also be unavailable, but if we go with PHP, stream_context_create() and file_get_contents() are available out of the box.
  • use select * from idx.files to get list of index files (it works for plain/pq indexes too)
  • checksumming is important. rsync can be a dependency (800kb unarchived: rsync + one lib)
  • it should flush binlogs and make sure they are all flushed, i.e. no binlogs left
  • To think: what will point to the instance to backup from?
    • path to config (e.g. /etc/manticoresearch/manticore.conf and then we have to parse it)
    • host / host:port (e.g. localhost or localhost:9312; then we may need to find its pid, which may be a problem)
    • searchd pid (e.g. 12345: doesn't seem to be a good practice for backup scripts)
  • if no instance specified we should try with the default one
  • mode 1: should be able to backup the whole instance, then:
    • perhaps set maintenance=1
    • lock all indexes
    • rsync all
    • unlock all
  • mode 2: should be able to backup specified indexes only, then:
    • if one of the indexes is distributed, warn that it may change during the backup (not sure if LOCK will cover it)
  • always include the config and manticore.json (if it exists) in the backup, since it's not a big deal
  • preserve file permissions
  • To think: how do we backup wordforms, exceptions etc of plain indexes? It may require parsing config which can be complicated.
  • To think: same question about plugin_dir, lemmatizer_base etc.
  • catch TERM/INT and unlock before terminating the backup script
  • implement mode backup --unlock just to unlock and exit if the above didn't help
  • show backup progress nicely
  • backup searchd/MCL version info too (may be important when you restore from the backup)
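To illustrate the "HTTP interface with no extra dependencies" point above: if /sql?mode=raw returned the LOCK file list as JSON (the response shape below is purely an assumption for this sketch, not documented behavior), even a bare POSIX toolchain could extract the paths without jq:

```shell
#!/bin/sh
# Assumed response shape - an illustration only, not documented behavior.
json='[{"columns":[{"file":{"type":"string"}},{"normalized":{"type":"string"}}],
"data":[{"file":"/var/lib/manticore/t/t.meta","normalized":"/var/lib/manticore/t/t.meta"},
{"file":"/var/lib/manticore/t/t.ram","normalized":"/var/lib/manticore/t/t.ram"}],
"total":2,"error":"","warning":""}]'

# split on commas, then pick out every "file":"..." value with sed alone
extract_files() {
    tr ',' '\n' | sed -n 's/.*"file":"\([^"]*\)".*/\1/p'
}

printf '%s\n' "$json" | extract_files
```

A naive splitter like this breaks on paths containing commas or escaped quotes, which is one more argument for shipping the script in a language with a real JSON parser.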

sanikolaev avatar Jul 20 '22 11:07 sanikolaev

I don't know English well enough for long discussions, so this will be in Russian

use select * from idx.files to get list of index files (it works for plain/pq indexes too)

This seems like a fragile approach to me. Better:

wombat@source:~$ mysql --host 127.0.0.1 --port=9306 --prompt='Manticore> ' -e 'show tables;'
+-----------------------------+------+
| Index                       | Type |
+-----------------------------+------+
| rt_47newsru_articles        | rt   |
| rt_47newsru_articles_stored | rt   |
| rt_47newsru_reports         | rt   |
+-----------------------------+------+

But then we'd have to parse the lines. Could the list of indexes be requested in JSON format? Then it could be parsed with jq
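Until a JSON variant is available, the table-style output above can be reduced to bare index names with awk; a sketch (assuming the two-column Index/Type layout shown, and mysql's -t flag to force table output when piped):

```shell
#!/bin/sh
# strip the mysql ASCII-art table down to bare index names;
# assumes the two-column Index/Type layout shown above
list_tables() {
    awk -F'|' '/^\|/ && $2 !~ /Index/ { gsub(/ /, "", $2); print $2 }'
}

# against a live server this might look like (not run here):
#   mysql --host 127.0.0.1 --port=9306 -t -e 'show tables;' | list_tables
list_tables <<'EOF'
+-----------------------------+------+
| Index                       | Type |
+-----------------------------+------+
| rt_47newsru_articles        | rt   |
| rt_47newsru_reports         | rt   |
+-----------------------------+------+
EOF
```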

KarelWintersky avatar Jul 20 '22 12:07 KarelWintersky

It may require parsing config which can be complicated.

it's not a problem...

KarelWintersky avatar Jul 20 '22 12:07 KarelWintersky

Summary: I think PHP or any compiled language is better than bash+PowerShell scripts.

And Perl is good, but you don't want to hear that word

KarelWintersky avatar Jul 20 '22 12:07 KarelWintersky

@sanikolaev that looks like a good list!

I agree that rsync is good, as it should allow the backup to be taken as quickly as possible, without locking the server for too long. (It could even do one 'dirty' rsync outside the lock, then another rsync inside the lock, which would only copy what actually changed. More wasteful, as some data might be transferred needlessly, but it would generally hold the lock for a shorter period.)
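That two-pass idea might be sketched as follows (table name, paths, and the LOCK syntax are assumptions from this thread; whether a lock taken in one client session persists after the session closes would need verifying):

```shell
#!/bin/sh
# Two-pass copy sketch: pass 1 moves the bulk of the data without a lock;
# pass 2 runs under LOCK and, via rsync checksums, re-copies only what
# changed, keeping the lock window short. All names are hypothetical.
TABLE="${TABLE:-test_index}"
SRC="${SRC:-/var/lib/manticore/$TABLE}"
DEST="${DEST:-/backup/manticore/$TABLE}"

sql() { mysql -h127.0.0.1 -P9306 -e "$1"; }

two_pass_backup() {
    rsync -a "$SRC/" "$DEST/"      # pass 1: 'dirty', unlocked
    sql "LOCK $TABLE"
    trap 'sql "UNLOCK $TABLE"' EXIT INT TERM
    rsync -ac "$SRC/" "$DEST/"     # pass 2: checksummed, under the lock
}

if [ "${1:-}" = "run" ]; then
    two_pass_backup
fi
```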

Of course, the other possibility is to do a 'mysqldump'-style export. At least for indexes where fields have been stored in 'docstore'. Otherwise it would only be attributes, but for a plain index it's only the attributes that change, so they're what needs backing up :)

Doing a 'dump'-style export would also have the benefit of being more portable (load into different server versions!) and would even work remotely (i.e. not running on the same server as searchd) - but it would in general be MUCH slower than an rsync copy. (Importing the 'dump' file would also be much slower, if you need to 'restore' the backup.)

Anyway, I just committed a script I created a while ago for creating an 'sql dump' file from a running searchd instance by doing 'SELECT * FROM': https://github.com/barryhunter/indexdump. It's only one table at a time (the table has to be named), and while it will add a 'create table' for the index, if 'show create table' is not available it tries to replicate a CREATE TABLE from 'DESCRIBE', though I admit that part isn't well tested. And it certainly won't get all the extra options; it's just the fields/attributes. (It's based on a similar script that more closely replicates mysqldump for mysql databases, but can dump views.)

I still think concentrating on an rsync/file-based copy is best for the official script. But I can make the above script more robust if it's of interest to others. (Maybe the backup script could use something like 'indexdump.php' as an option instead of rsync.)

barryhunter avatar Jul 20 '22 16:07 barryhunter

Hello. In my opinion, the best idea is to be compatible with the mysqldump utility. That's the most natural way to make a logical backup.

Or create a manticoredump)

Or

SELECT ... INTO OUTFILE 'file_name' [CHARACTER SET charset_name] [export_options]

And

LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name' [REPLACE | IGNORE] INTO TABLE tbl_name

webigorkiev avatar Jul 28 '22 16:07 webigorkiev

FFR: there's another issue or, to be more correct, a general solution for the mysqldump compatibility problem - https://github.com/manticoresoftware/manticoresearch/issues/710#issuecomment-1039957828

For now, for logical backups of Manticore Search (Google, please index it), I recommend trying the tool Barry mentioned above - https://github.com/barryhunter/fakedump/blob/master/indexdump.php - and helping him improve it.

Within this issue we'd like to focus on the physical kind of backups.

sanikolaev avatar Jul 29 '22 04:07 sanikolaev

Looks like Barry's code (and mysqldump-like) can't backup unstored RT indexes...

KarelWintersky avatar Jul 29 '22 09:07 KarelWintersky

Looks like Barry's code (and mysqldump-like) can't backup unstored RT indexes...

That's right. As Barry wrote "At least for indexes where fields have been stored in 'docstore'".

Logical backup of non-stored fields is impossible in a general case since you may be using dict=crc, therefore even if you decode the dictionary/doclist/wordlist you won't be able to turn it into the original text. Same if you use min_word_len > 1, stopwords, morphology etc.

That's why since Manticore 5 everything is stored by default. In the future it should also make it possible to do retokenization on the fly (we already have a specification of this task).

sanikolaev avatar Jul 29 '22 10:07 sanikolaev

A package for SQL-dumping an RT index under Node.js:

indexdump

webigorkiev avatar Jul 29 '22 10:07 webigorkiev

That's why since Manticore 5 everything is stored by default.

So the backup utility must check the source Manticore version and index structure.

KarelWintersky avatar Jul 29 '22 13:07 KarelWintersky

my indexdump.php script does check the output of DESCRIBE and emits a warning that non-stored fields were in the original index and won't be in the dump (it does account for ones duplicated into a string attribute!). In theory it should be able to dump any plain or RT index, regardless of searchd version. But the 'create table' will only be usable with more recent versions!

On the other hand, if you have non-stored fields, you're almost certainly not using Manticore as a 'database' (it's indexing a copy of data held somewhere else), so backing up the index is less critical.

And if you are backing up the index(es) for 'quick' recovery in case of a catastrophic failure, then a physical backup is already far more practical (the alternative being to re-run the 'inserts' from the real database); restoring a physical backup is much quicker than restoring a logical one.

Oh, and as indexdump.php started life as a clone of mysqldump, it already has LOCK support :) - at least in locking the index while dumping. It can't add any locking to the dumped output.

barryhunter avatar Jul 29 '22 13:07 barryhunter

There's a problem with the paths to exceptions, stopwords and wordforms files. When a backup is created, in SHOW CREATE TABLE these paths point to the index directory. Is there any way to get the original paths to the files (the ones used when creating the index)? Or a way to give a non-root user read access to these files in the index directory?

webigorkiev avatar Jul 31 '22 08:07 webigorkiev

Is there any way to get the original paths to the files?

read and parse the config?

KarelWintersky avatar Jul 31 '22 12:07 KarelWintersky

It's probably not the best solution. A backup is a snapshot of the current state of an index, including wordforms files. Also, one of the main ways of using Manticore Search is RT mode, with no indexes listed in a configuration file.

It's correct to just copy them from the index directory. And there are no problems other than access (backups are usually not created as the root user).

webigorkiev avatar Jul 31 '22 13:07 webigorkiev

lock <IDX>[,<IDX2>,<IDX3>...] will return a table with all relevant files. It's already published, so you can try it. No need to manually parse the config or anything else. Don't forget to call unlock when finished.

klirichek avatar Jul 31 '22 13:07 klirichek

Yes, there is a list, but it could also be obtained from the result of SHOW CREATE TABLE (exceptions, stopwords and wordforms). The problem is accessing these files: they are available only to the owner - the manticore user or root.

webigorkiev avatar Jul 31 '22 13:07 webigorkiev

lock <IDX>[,<IDX2>,<IDX3>...] will return a table with all relevant files. It's already published, so you can try it. No need to manually parse the config or anything else. Don't forget to call unlock when finished.

  1. Could you describe in more detail how the locks work? I try to lock the index, but in fact nothing happens except listing the files

In first session:

MySQL [(none)]> CREATE TABLE test_index(label text);
Query OK, 0 rows affected (0.001 sec)

MySQL [(none)]> LOCK test_index;
+---------------------------------------------------+---------------------------------------------------+
| file                                              | normalized                                        |
+---------------------------------------------------+---------------------------------------------------+
| /var/lib/manticore/test_index/test_index.meta     | /var/lib/manticore/test_index/test_index.meta     |
| /var/lib/manticore/test_index/test_index.ram      | /var/lib/manticore/test_index/test_index.ram      |
| /var/lib/manticore/test_index/test_index.settings | /var/lib/manticore/test_index/test_index.settings |
+---------------------------------------------------+---------------------------------------------------+
3 rows in set (0.000 sec)

MySQL [(none)]> INSERT INTO test_index VALUES (0, 'Test text');
Query OK, 1 row affected (0.001 sec)

in second session:

MySQL [(none)]> INSERT INTO test_index VALUES (0, 'Row 2');
Query OK, 1 row affected (0.000 sec)

In first session:

MySQL [(none)]> SELECT * FROM test_index;
+---------------------+-----------+
| id                  | label     |
+---------------------+-----------+
| 8864805720680300547 | Row 2     |
| 8864805720680300546 | Test text |
+---------------------+-----------+
2 rows in set (0.000 sec)

MySQL [(none)]> UNLOCK test_index;
Query OK, 1 row affected (0.000 sec)

Looks like writes continue unhindered

  2. Do I have to do FLUSH RTINDEX test_index; before the backup?

webigorkiev avatar Jul 31 '22 17:07 webigorkiev

A comment on the dedicated Lock/unlock thread, https://github.com/manticoresoftware/manticoresearch/issues/822 suggests the LOCK command only blocks statements from replication.

... stas seems to imply it would be an additional feature to also block direct updates from INSERT etc.

But it does seem that would be required before we can call the 'lock' function done and suitable for use with this backup.

Although I guess one could argue that the 'concurrent inserts' etc. happening on the index while locked aren't affecting the files on disk (they go into memory until flushed - either the RAM chunk or attributes etc.)... so really the 'lock' is only preventing writes to the index files (so rsync etc. can take a consistent snapshot), while writes can still be made to the index.

Lock only prevents writes to the 'index files', not writes to the logical index.

In other words this LOCK function might be good enough for a physical backup, but it's NOT good enough for a logical backup! A logical backup could still see dirty data by seeing some intermediate state.

barryhunter avatar Jul 31 '22 17:07 barryhunter

Got it.

indexbackup

A utility for physical backup under Node.js. In the end, if you add the command to sudoers, everything is quite convenient.

Like this:

%username ALL=(ALL)NOPASSWD:/usr/local/bin/indexbackup *

Example

yarn global add indexbackup
sudo indexbackup  --all | aws s3 cp - s3://buket-name/daily/backup.tar.gz

And my 20G indexes flew to AWS s3)

Thanks everyone

webigorkiev avatar Jul 31 '22 21:07 webigorkiev

For some cases, a manticore.json backup is needed.

Is there a way to get the data_dir parameter?

webigorkiev avatar Aug 01 '22 06:08 webigorkiev

Is there a way to get the data_dir parameter?

It's not possible to get it from inside a running Manticore instance now, but we already discussed with the core team that it shouldn't be a big deal to implement this and similar things to make it easier for the backup script to understand more about the instance it's connected to.

sanikolaev avatar Aug 01 '22 06:08 sanikolaev

If possible, discuss changing the default permissions for index files as well - for example, read access for the manticore group. Then it would be possible to add a backup user to that group and not have to play with sudo and root

webigorkiev avatar Aug 01 '22 06:08 webigorkiev

For example read access for the manticore group.

Good idea! Should be ok in terms of performance. I'll create a separate issue about it.

sanikolaev avatar Aug 02 '22 05:08 sanikolaev

we need LOCK/UNLOCK in Manticore Search to make sure nothing can modify during backup. Rel. issue https://github.com/manticoresoftware/manticoresearch/issues/822

LOCK/UNLOCK has been implemented in https://github.com/manticoresoftware/manticoresearch/commit/f6fe6408f24251eb1dd65e284c7c9b42593d7792 , but requires testing and documenting.

sanikolaev avatar Aug 02 '22 05:08 sanikolaev