whatdotheyknow-theme
Reindex InfoRequestEvents with new request_public_body_tag term
~1912247 events total
Indexes 50-100 every 5 minutes in normal operation (from the acts_as_xapian_jobs table)
Takes < 30 seconds to index them all
~300 safely in the 5 minute slot?
= index 3600 per hour = ~531.18 hours for all = ~22.13 days
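The back-of-envelope estimate above works out like this:

```ruby
# ~1,912,247 events total, queuing 300 per 5-minute slot = 3600 per hour.
total_events = 1_912_247
per_hour     = 300 * 12
hours        = total_events.fdiv(per_hour)
days         = hours / 24

puts format("%.2f hours, %.2f days", hours, days)
```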
Idea is that we drip-feed info request events to the jobs table so that it gradually reindexes the old events with the new term.
This is the gist of what I think we want:
InfoRequestEvent.find_in_batches(:batch_size => 300) do |events|
  events.each(&:xapian_mark_needs_index)
  sleep 300 # 5 mins, so that the next batch gets collected by the next indexing run
end
Need to consider:
- error logging
- how to run? Just a one-off script? (bundle exec rails runner reindex_all_events_in_batches?)
- crash recovery – if it breaks on one event, how do we avoid indexing everything again?

To pick up where we left off (or at least close to it), something like...
start_id = (ENV["START_ID"] || 0).to_i
InfoRequestEvent.where("id > ?", start_id).find_in_batches(:batch_size => 300) do |events|
  events.each(&:xapian_mark_needs_index)
  logger.info("last event indexed: #{events.last.id}")
  sleep 300 # 5 mins, so that the next batch gets collected by the next indexing run
end
So if the last thing (success) in the log is 299, set START_ID to 299 to kick the next batch off at 300.
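The resume step could be sketched like this (the log contents below are a made-up sample; the real input would be the "last event indexed" lines written by the batch loop above):

```ruby
# Sketch: derive START_ID from the last "last event indexed: N" line
# in the log, so a crashed run can restart close to where it stopped.
log = <<~LOG
  last event indexed: 299
  last event indexed: 599
LOG

start_id = log.scan(/last event indexed: (\d+)/).flatten.map(&:to_i).last
puts start_id
```

The derived id would then feed back in as START_ID=599 for the next run.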
Looks good. I think next thing to do is try this with an initial handful of batches to check that we can effectively process the jobs in the 5 minute window.
- Probably want to monitor the job queue – maybe by taking the count (SELECT COUNT(*) FROM acts_as_xapian_jobs;) every 10 seconds or so – to check that we're not getting a backlog of jobs that we're struggling to process.
- Also worth taking notes on resource usage as it's processing the batch (New Relic / cacti useful here).
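One minimal sketch of what "check for a backlog" could mean, assuming the queue size is sampled every 10 seconds with the SELECT COUNT(*) query above (the window size and growth rule are assumptions, not an agreed threshold):

```ruby
# A strictly growing run of samples suggests jobs are being queued
# faster than the indexer can drain the acts_as_xapian_jobs table.
def backlog_growing?(samples, window: 6)
  recent = samples.last(window)
  return false if recent.size < window
  recent.each_cons(2).all? { |earlier, later| later > earlier }
end
```

With 10-second sampling, a window of 6 covers one minute of sustained growth before flagging.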
Just a reminder - we need to update wdtk before we do this for real.
Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?
And is using logger to write to the existing log sufficient?
Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?
I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.
And is using logger to write to the existing log sufficient?
I was wondering about this. My first thought was to create a separate logger just for clarity, but we could just add a prefix to log messages generated by this for easy grepping. I have no real preference – whatever you think will make it easier to check every day.
Will also want to make sure exceptions are logged.
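A minimal sketch of the prefix idea (the "[reindex]" tag text is an assumption; the output here goes to a StringIO just to keep the example self-contained):

```ruby
require "logger"
require "stringio"
require "time"

# Every message from the reindex task carries a fixed "[reindex]" tag
# so its lines are easy to grep out of a shared log.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.formatter = proc do |severity, time, _progname, msg|
  "#{time.utc.iso8601} #{severity} [reindex] #{msg}\n"
end

logger.info("last event indexed: 299")
logger.error("failed to queue event 300")
```

Exceptions rescued in the batch loop would go through logger.error with the same prefix, so a daily grep for "[reindex]" catches both progress and failures.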
Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?
There doesn't seem to be anything in the deploy script, should be ok
Make the rake task part of Alaveteli itself – useful for everyone.
bundle exec rake reindex:events is now running in a screen session (under my user, sudo-ed to the app user).
Indexing has been stopped because of https://github.com/mysociety/alaveteli/issues/3604.
Abort message:
* queued batch ending: 175719
** Error while processing event 175719, last event successfully queued was: 175719
Also note that the task keeps hold of the rotated log file after logrotate runs. Marked as new for discussion alongside https://github.com/mysociety/alaveteli/issues/3604.
Some notes on what this is about:
If not completed, it would mean that the request_public_body_tag advanced search term wouldn’t have a full dataset to search on. Not sure if there’s an easy way of finding that out.
The search engine indexes events (so that it can look at historic states and whatnot). To be able to search for events where the request’s public body has the given tag, the search index needs to get updated with that information for each event (a lot of events!). Updates are handled automatically, but the initial seeding needed to be manual. (Alternatively we could just wait until every request gets updated in the normal course of things, but that would probably take tens of years for all of them.)
To reduce the set of events we could try to inspect the xapian value for the term so that we only mark for reindexing if it's empty.
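A hedged sketch of that reduction, using a hypothetical term_present_in_index? helper (acts_as_xapian doesn't ship one; a real version would have to read the event's stored Xapian document and look for the request_public_body_tag prefix — here it's stubbed against plain hashes purely for illustration):

```ruby
# Hypothetical helper, stubbed for illustration only: a real implementation
# would inspect the event's Xapian document for a request_public_body_tag term.
def term_present_in_index?(event)
  event[:has_tag_term]
end

# Only mark events that still lack the term, shrinking the ~1.9M set.
def events_needing_reindex(events)
  events.reject { |event| term_present_in_index?(event) }
end
```

The batch loop would then call xapian_mark_needs_index only on what events_needing_reindex returns.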
Just linking to https://github.com/mysociety/alaveteli/issues/1179 to reference it there.