Cron job to update search index
Really appreciate all of your work on this @yvanzo! This is more of a question or possible enhancement:
Since database updates don't include the search indexes, I'm wondering what the best way would be to schedule index rebuilds automatically.
Could this be done via a cron job similar to the replication cron? For our use case we typically only need to rebuild the 'artist' index, so scheduling that to run every day or two would be ideal. Others might want to rebuild the indexes weekly or on some other schedule.
Any tips on how to go about scheduling these?
Thank you!
Hi @justinthiele! The cron service is running by default in the container for the indexer service. (The image is based on https://github.com/phusion/baseimage-docker). You can just use that or use the cron of your host system.
Hi @justinthiele. Did you have any luck in getting the cron job working? I also want to create a cron job to rebuild the search index but cannot figure out the correct command to make it happen. I've been messing with this for an hour with no luck.
If you were able to get this working, do you mind sharing your cron job command(s) along with any scripts you wrote to accomplish this?
Thanks!
@myopenflixr I got busy with other projects and haven't worked on this at all, unfortunately.
Thanks for the reply @yvanzo!
@yvanzo. Would you have any tips on how to get the cron job working on my host machine (or even within the Docker container)? I want the cron job to run once weekly. As mentioned above, I cannot figure out the correct command to get it running on my host machine.
Here's what I have so far:
* 1 * * 6 cd /home/myusername/musicbrainz-docker && /usr/bin/docker-compose exec indexer python -m sir reindex
I added my user to the Docker group to avoid the need to run sudo. But the cron job isn't working. I've tried multiple variations of the above with no success.
Thanks.
Hi @myopenflixr, I'm not sure which implementation of cron your host system is running, but I would try replacing the initial * 1 in your line with 0 1, which means 1:00am (and * * 6 means every Saturday), as follows:
0 1 * * 6 cd /home/myusername/musicbrainz-docker && /usr/bin/docker-compose exec indexer python -m sir reindex
Other tip: You might want to redirect the standard output and error output of this command to a file.
Hi @yvanzo. Thank you for the reply. I did get it working by modifying your cron example to the following:
0 1 * * 6 cd /home/myusername/musicbrainz-docker && /usr/bin/docker-compose exec -T indexer python -m sir reindex > /home/myusername/musicbrainz-docker/reindex.log 2>&1
I had to add the -T flag (which disables pseudo-TTY allocation, since cron doesn't provide a terminal) to get it to run properly on Ubuntu 22.04.
Now, this brings up another question. Since the reindex function is resource-heavy and takes quite a long time, would it be more efficient to just download pre-built search indexes per these instructions?
sudo docker-compose run --rm musicbrainz fetch-dump.sh search
sudo docker-compose run --rm search load-search-indexes.sh
If so, how might I go about running these two commands back to back in a cron job? Can I simply separate the two commands with ; like this?
0 1 * * 6 cd /home/myusername/musicbrainz-docker && /usr/bin/docker-compose run -T --rm musicbrainz fetch-dump.sh search ; /usr/bin/docker-compose run -T --rm search load-search-indexes.sh
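(Or, as a sketch, with && instead of ; so the second command only runs if the first succeeds; the schedule and paths here are just placeholders:)
0 1 * * 6 cd /home/myusername/musicbrainz-docker && /usr/bin/docker-compose run -T --rm musicbrainz fetch-dump.sh search && /usr/bin/docker-compose run -T --rm search load-search-indexes.sh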
And if I do run this weekly in a cron job (or manually), is there a need to delete or recreate the database each time I run these commands? Or can I safely run those two commands weekly with no issues?
Thanks, Mike
@myopenflixr have you tried using the search indexing recently? There were some commits that sped up the indexing by roughly 100x. It allowed me to run search indexing hourly, and it finishes within 5 minutes after the cron downloads the updates.
What type of hardware are you running your slave server on? I'm wondering if you could schedule a daily update at around 3am when you have low utilization on your setup.
Regarding your question on "pre-built search indexes": these are only updated weekly, when the dumps also get updated. Unless you are wiping your DB and indexes to reload them via the cron, I wouldn't suggest going this route.
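(For instance, a daily 3am schedule, using the same local/replication.cron format shown further down in this thread, could be as simple as:)
0 3 * * * /usr/local/bin/replication.sh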
@JoshDi I actually just deleted and rebuilt my slave server from scratch yesterday.
I'm running my slave server on a Proxmox VM. The server itself is a Dell R620 with Dual Xeon E5-2670's and 128GB ram. I have allocated 16 vcpus and 24GB ram to the VM which is running Ubuntu 22.04. I also allocated 325GB disk space from my ZFS pool (SSD's) for the slave server.
I'm essentially looking for the optimal way of keeping everything updated on a regular basis. Daily updates aren't a requirement, but if I can complete it in 5 minutes, then hell, why not!
As of now, I have a "stock" install on my slave server. The only thing I really tweaked was the memory-settings.yml as follows:
version: "3.1"
services:
  db:
    shm_size: 8g
    command: postgres -c "shared_buffers=4GB" -c "work_mem=128MB" -c "maintenance_work_mem=4GB" -c "shared_preload_libraries=pg_amqp.so"
  search:
    environment:
      - SOLR_HEAP=8g
What would be your recommendation to keep everything up-to-date? I'll take any recommendations.
Thanks, Mike
That's a powerful enough server to run replication every hour. Look at the config files below; update those first and then run the following steps.
Your memory-settings.yml looks good. Given the memory you allocated, you probably can't go much higher than that. You may need to adjust the import_threads number down if you run into memory issues.
Please modify or create a local/indexer.ini file to look like the following:
[database]
dbname = musicbrainz_db
host = db
password = ${POSTGRES_PASSWORD}
port = 5432
user = ${POSTGRES_USER}
[solr]
uri = http://search:8983/solr
batch_size = 1000
[sir]
import_threads = 8
index_limit = 1000000
live_index_batch_size = 1000
process_delay = 5
query_batch_size = 100000
wscompat = on
[rabbitmq]
host = mq
user = sir
password = sir
vhost = /search-index-rebuilder
prefetch_count = 1000
[sentry]
dsn = ""
I also set my local/replication.cron file to the following. This will run replication packets every hour at the 9-minute mark; you can adjust as you see fit. If you restart your server often, you will have to take care that the indexes and DB match (admin/check-search-indexes all) and that there are no messages still being processed in the queue (docker-compose exec mq rabbitmqadmin -u sir -p sir -V /search-index-rebuilder list queues). If there are messages and you reboot, the index and DB won't match unless you reindex the affected tables; if you do this, make sure to disable the cron job while you fix the mismatched indexes and re-enable it when done.
SHELL=/bin/bash
BASH_ENV=/noninteractive.bash_env
9 * * * * /usr/local/bin/replication.sh
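For reference, the two consistency checks mentioned above, run from your musicbrainz-docker directory:
admin/check-search-indexes all
docker-compose exec mq rabbitmqadmin -u sir -p sir -V /search-index-rebuilder list queues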
After you have followed the above steps, make sure you have followed all of the steps in the sections below:
Indexing for the first time will take about 2-3 hours. The incremental updates will take much less time with the patches in place. You can monitor the DB and index counts at any time using the following command from within your musicbrainz-docker folder:
root@HTPC-Xeon:/storage/musicbrainz-docker# admin/./check-search-indexes all
CORE STATUS INDEX DB
editor OK 0 /0
instrument OK 1012 /1012
series OK 17642 /17642
place OK 54813 /54813
event OK 61290 /61290
area OK 118822 /118822
tag OK 203128 /203128
label OK 234150 /234150
cdstub OK 283440 /283440
annotation -- 527452 /523041
work OK 1738854 /1738854
artist OK 2049216 /2049216
release-group OK 2666877 /2666877
release OK 3405660 /3405660
url OK 9510734 /9510734
recording OK 27920736 /27920736
The annotations index will always be off; don't worry about it. It's a known bug and that table doesn't really matter much for most queries.
I've never run live indexing, simply because the documentation states that it's not yet stable. However, I'm cool with giving it a try.
I have a couple of questions for you before I enable live indexing. #1 - Is there anything else that I need to manually install or set up before following the "Enable live indexing" instructions? #2 - If my server does reboot during indexing, how do I go about reindexing the data tables to get everything back in line (i.e. fix the mismatched indexes)?
Thanks!
Good questions. Make sure you read my last reply again; I made some updates.
No, if you have followed my config instructions above and followed all of the steps in those 3 sections, live indexing will be running. Live indexing is pretty stable - the only time you can have an issue is if you have too many import threads for your memory requirements (memory-settings.yml). Your current settings look fine.
When you reboot, run the two commands I shared above. One checks how many index messages are in the queue (normally all zero, but there will be some shortly after the 9-minute mark if you use my cron job; if you use your own, run it around the time your cron fires). The other checks whether the DB and index counts match. If they ever don't, you can rerun indexing for the affected tables using the steps below; the example uses the recording and release tables:
Make sure you run these commands in a screen session or another type of session that won't close if SSH drops in the middle.
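For example, you can start a named screen session first (the session name here is just an example):
screen -S reindex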
disable cron job:
admin/configure rm replication-cron
sudo docker-compose up -d
check db / index count:
admin/./check-search-indexes all
For any tables that are not OK (the annotation table can be ignored), you will need to run an update using the commands below.
Commands to fix the recording and release tables (modify them to target whichever tables are mismatched in your case):
admin/./delete-search-indexes recording release
docker-compose exec indexer python -m sir -d reindex --entity-type recording --entity-type release
re-enable cron job:
admin/configure add replication-cron
sudo docker-compose up -d
Then you can run the following commands to confirm that the DB and index counts match:
admin/./check-search-indexes recording
admin/./check-search-indexes all
@JoshDi Thanks for the guidance. I'll take a shot at getting live indexing set up.
I may hit you up with any additional questions.
@JoshDi Looks like I just ran into my first hurdle.
When trying to enable live indexing and running this command: sudo docker-compose exec indexer python -m sir amqp_setup
I receive the following error:
2022-10-08 14:30:04,888: Connecting to RabbitMQ
Traceback (most recent call last):
File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/code/sir/__main__.py", line 134, in <module>
main()
File "/code/sir/__main__.py", line 130, in main
func(args)
File "sir/amqp/setup.py", line 22, in setup_rabbitmq
conn = util.create_amqp_connection()
File "sir/util.py", line 140, in create_amqp_connection
conn.connect()
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 317, in connect
self.drain_events(timeout=self.connect_timeout)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 505, in drain_events
while not self.blocking_read(timeout):
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 511, in blocking_read
return self.on_inbound_frame(frame)
File "/usr/local/lib/python2.7/site-packages/amqp/method_framing.py", line 55, in on_frame
callback(channel, method_sig, buf, None)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 518, in on_inbound_method
method_sig, payload, content,
File "/usr/local/lib/python2.7/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
listener(*args)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 648, in _on_close
(class_id, method_id), ConnectionError)
amqp.exceptions.AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
Any suggestions? Or did I do something wrong?
I don't think you ran all of these steps properly. Please review and ensure that you ran all of the commands:
@JoshDi I'm still running into the same errors. As of right now, I have a clean working install of the slave server (restored from yesterday's install, before I tried setting up live indexing). Search indexes were already set up and replication has already been run during the initial install. The system is fresh and working great.
The next step should then be to create the indexer.ini & replication.cron files in my musicbrainz-docker/local folder.
Then I should be ready to follow the instructions for Enable Live Indexing, correct?
After disabling the replication cron job (Step 1), I then run the command sudo docker-compose exec indexer python -m sir amqp_setup and get all of the errors described above.
Question for you... Once I add the indexer.ini & replication.cron files, do I need to run any additional docker-compose commands and/or make any edits to my .env file before I enable live indexing?
My .env file looks like this after my clean install:
COMPOSE_FILE=docker-compose.yml:local/compose/memory-settings.yml:compose/replication-token.yml:compose/replication-cron.yml
MUSICBRAINZ_WEB_SERVER_HOST=mydomain.com
MUSICBRAINZ_WEB_SERVER_PORT=80
If you follow all of the instructions in those 3 sections, it will work. My instructions plus those three sections is all you have to do.
You haven't run the Enable Live Indexing steps yet.
This may be a dumb question, but do I have to follow all the instructions on the AMQP page linked in Step 2 ("2. Make indexer goes through AMQP Setup")?
Including the installation of RabbitMQ? Or is it already installed within the initial MusicBrainz slave server installation?
I think I finally figured it out.
It appears that my mq instance did not create user sir as expected.
I followed the instructions in this post: AMQP Setup Fails connecting to RabbitMQ
I ended up having to recreate the container to reset the configuration:
sudo docker-compose up --force-recreate -d mq && \
sudo docker-compose logs --follow -t mq
After running those two commands, I was able to get through the setup of live indexing...FINALLY
This may be a dumb question, but do I have to follow all the instructions on the AMQP page linked in Step 2 ("2. Make indexer goes through AMQP Setup")?
Including the installation of RabbitMQ? Or is it already installed within the initial MusicBrainz slave server installation?
Yes, you do.
@myopenflixr Did you figure out a working solution for running a daily cron job to update the index? That's still my primary need (vs live indexing). If so, could you add it to the documentation?
@justinthiele Yes I was able to get it working! (However, I've since setup live indexing)
I created a cron job on my local machine by doing the following:
Step #1 - Added my local user to the docker group which eliminated the need to use sudo for docker commands
Step #2 - added the following line to /etc/crontab
0 1 * * 7 YOUR_USER_NAME cd /home/YOUR_USER_NAME/musicbrainz-docker && /usr/bin/docker-compose exec -T indexer python -m sir reindex
Be sure to change YOUR_USER_NAME accordingly, and adjust the timing of your cron job as you see fit. The above example runs the cron job at 1am every Sunday morning.
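If you also want a log of each run, as suggested earlier in this thread, you can redirect the output to a file (the log path is just an example):
0 1 * * 7 YOUR_USER_NAME cd /home/YOUR_USER_NAME/musicbrainz-docker && /usr/bin/docker-compose exec -T indexer python -m sir reindex > /home/YOUR_USER_NAME/musicbrainz-docker/reindex.log 2>&1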
Have you guys read the documentation? You don't need to schedule a cron to run an index update within the docker image. The cron file is exposed from docker directly.
https://github.com/metabrainz/musicbrainz-docker#customize-replication-schedule
@JoshDi: The documentation is about the cron job to update the PostgreSQL database, not to rebuild Solr search indexes. If @justinthiele prefers to avoid using the live indexer, the solution that @myopenflixr proposed makes perfect sense.
thank you for the clarification @yvanzo
Why would someone want to schedule only reindexing the database without live indexing? Are these users running replication and then reindexing the database nightly? I'd argue that live indexing is stable and fast enough now that it's most efficient (for any type of machine, powerful or not) to run replication daily with live indexing enabled. The latest changes that fixed live indexing sped up the time it takes to complete by like 100x.
Thanks @myopenflixr, that's working for me! I just made a slight adjustment to your cron job to abstract out the username in the path for easier re-use: 0 1 * * 7 YOUR_USER_NAME cd ~/musicbrainz-docker && /usr/bin/docker-compose exec -T indexer python -m sir reindex
Appreciate all the input from @JoshDi and @yvanzo as well!
I'll see if I can work this into the documentation.
I believe this issue can be closed. I didn't follow along with the discussion about getting Live Indexing to work but would suggest updating the Readme with any new insights that came to light @myopenflixr.
Thanks everybody!
@JoshDi: Even though the live indexer is much faster than it used to be, it is still not stable. There are regularly situations where it crashes, and the recovery is not always straightforward. For someone who can afford a delay in search results, using a cron job instead means less hassle.
Fair enough