whatdotheyknow-theme

Reindex InfoRequestEvents with new request_public_body_tag term

Open garethrees opened this issue 8 years ago • 13 comments

~1912247 events total

Indexes 50-100 every 5 minutes in normal operation (from the acts_as_xapian_jobs table). Takes < 30 seconds to index them all. ~300 safely in the 5 minute slot?

= index 3600 per hour = ~531.18 hours for all = ~22.13 days
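A quick way to re-run this estimate with different batch sizes is sketched below. The method name `reindex_estimate` is made up for illustration; the inputs are the figures from this comment.

```ruby
# Back-of-envelope estimate of how long a drip-fed reindex would take.
# `reindex_estimate` is a hypothetical helper, not part of Alaveteli.
def reindex_estimate(total_events, batch_size, interval_minutes)
  per_hour = batch_size * 60.0 / interval_minutes # batches are queued once per interval
  hours = total_events / per_hour
  { per_hour: per_hour, hours: hours.round(2), days: (hours / 24).round(2) }
end

total_events = 1_912_247 # approximate event count from this issue
estimate = reindex_estimate(total_events, 300, 5)
puts "#{estimate[:per_hour].to_i} events/hour, " \
     "#{estimate[:hours]} hours, #{estimate[:days]} days"
```

Halving the batch size to 150 doubles the duration to roughly 44 days, so the batch size is the main lever here.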

Idea is that we drip-feed info request events to the jobs table so that it gradually reindexes the old events with the new term.

garethrees Sep 28 '16 10:09

This is the gist of what I think we want:

InfoRequestEvent.find_in_batches(:batch_size => 300) do |events|
  events.each(&:xapian_mark_needs_index)
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end

Need to consider:

  • error logging
  • how to run? Just a one-off script (bundle exec rails runner reindex_all_events_in_batches)?
  • crash recovery – if it breaks on one event, how do we avoid indexing everything again?
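One possible shape for the error logging and crash recovery points, as a sketch only. `queue_batches` is a hypothetical helper; it assumes each event responds to `id` and `xapian_mark_needs_index`, and that the batches would come from `find_in_batches` as in the snippet above.

```ruby
require 'logger'

# Hypothetical helper: queue each batch, stop at the first failure, and
# return the id of the last event that was queued successfully so the
# run can be resumed from there.
def queue_batches(batches, logger:, pause: 0)
  last_ok = nil
  batches.each do |events|
    events.each do |event|
      begin
        event.xapian_mark_needs_index
        last_ok = event.id
      rescue StandardError => e
        logger.error("failed on event #{event.id} (#{e.class}: #{e.message}); " \
                     "last event queued: #{last_ok.inspect}")
        return last_ok
      end
    end
    logger.info("queued batch ending: #{last_ok}")
    sleep pause # e.g. 300 so the next indexing run collects the batch
  end
  last_ok
end
```

Stopping at the first failure keeps the log unambiguous: everything up to the returned id has been queued, and nothing after it.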

garethrees Oct 10 '16 09:10

to pick up where we left off (or at least close to it), something like...

start_id = (ENV["START_ID"] || 0).to_i # resume point; 0 means start from the beginning
InfoRequestEvent.where("id > ?", start_id).find_in_batches(:batch_size => 300) do |events|
  events.each(&:xapian_mark_needs_index)
  Rails.logger.info("last event indexed: #{events.last.id}")
  sleep 300 # 5 mins so that the next batch gets collected by the next indexing run
end

So if the last successful entry in the log is 299, set START_ID to 299 to kick the next batch off at 300.
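Finding the resume point could be automated along these lines. This is a sketch: `last_indexed_id` is a hypothetical helper, and it assumes the log lines contain the "last event indexed: <id>" message from the snippet above.

```ruby
# Hypothetical helper: scan log lines for the last "last event indexed: <id>"
# message and return that id, or 0 if no batch has been logged yet.
def last_indexed_id(log_lines)
  ids = log_lines.map { |line| line[/last event indexed: (\d+)/, 1] }.compact
  ids.empty? ? 0 : ids.last.to_i
end

sample = [
  "I, [2016-10-14] INFO -- : last event indexed: 299",
  "I, [2016-10-14] INFO -- : unrelated request log line",
  "I, [2016-10-14] INFO -- : last event indexed: 599"
]
puts last_indexed_id(sample)
```

The returned value can be passed straight in as START_ID, since the query selects ids strictly greater than it.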

lizconlan Oct 14 '16 17:10

Looks good. I think the next thing to do is to try this with an initial handful of batches to check that we can effectively process the jobs in the 5 minute window.

  • Probably want to monitor the job queue – maybe by taking the count (SELECT COUNT(*) FROM acts_as_xapian_jobs;) every 10 seconds or so – to check that we're not getting a backlog of jobs that we're struggling to process.
  • Also worth taking notes on resource usage as it's processing the batch (New Relic / cacti useful here).
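The queue monitoring could be scripted roughly as follows. Both method names are hypothetical, and `count_source` is any callable that returns the current size of the jobs table, so the sketch does not depend on a particular way of running the COUNT query.

```ruby
# Hypothetical monitor: sample the queue size a fixed number of times.
# `count_source` is any callable returning the current job count.
def monitor_queue(count_source, samples:, interval: 10, out: $stdout)
  readings = []
  samples.times do |i|
    readings << count_source.call
    out.puts "#{Time.now.utc} acts_as_xapian_jobs: #{readings.last}"
    sleep interval unless i == samples - 1 # no need to wait after the last sample
  end
  readings
end

# A backlog is suspected if the queue never shrinks between samples
# and ends up larger than it started.
def backlog_growing?(readings)
  readings.each_cons(2).all? { |a, b| b >= a } && readings.last > readings.first
end
```

In a Rails console the callable might be something like `-> { ActiveRecord::Base.connection.select_value("SELECT COUNT(*) FROM acts_as_xapian_jobs").to_i }`, which is an assumption about how one would wire it up rather than anything taken from the codebase.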

garethrees Oct 17 '16 08:10

Just a reminder - we need to update wdtk before we do this for real.

garethrees Oct 17 '16 10:10

Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

lizconlan Oct 17 '16 16:10

And is using logger to write to the existing log sufficient?

lizconlan Oct 17 '16 16:10

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?

I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

> And is using logger to write to the existing log sufficient?

I was wondering about this. My first thought was to create a separate logger just for clarity, but we could just add a prefix to log messages generated by this for easy grepping. I have no real preference – whatever you think will make it easier to check every day.

Will also want to make sure exceptions are logged.
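A sketch of the prefix option using Ruby's standard Logger. The helper names and the `[reindex-events]` prefix are invented for illustration; the same formatter would work whether the logger writes to the existing log or to a separate file.

```ruby
require 'logger'

# Hypothetical helper: a logger whose lines all carry a greppable prefix.
def build_reindex_logger(io, prefix: '[reindex-events]')
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    "#{time.utc.strftime('%Y-%m-%dT%H:%M:%SZ')} #{prefix} #{severity}: #{msg}\n"
  end
  logger
end

# Log any exception raised by the block, then re-raise it so the task
# still stops rather than carrying on silently.
def log_exceptions(logger)
  yield
rescue StandardError => e
  logger.error("#{e.class}: #{e.message}")
  raise
end

logger = build_reindex_logger($stdout)
logger.info('queued batch ending: 300')
```

Checking progress each day is then a single grep for the prefix.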

garethrees Oct 18 '16 08:10

> Would we have to do anything special to prevent something from automatically adding everything to the job queue on deploy?
>
> I don't think so, but worth a double check of the deploy tasks to make sure something like that isn't going to happen.

There doesn't seem to be anything in the deploy script, so it should be OK.

lizconlan Oct 18 '16 13:10

Make the rake task part of alaveteli itself – useful for everyone.

garethrees Oct 24 '16 12:10

bundle exec rake reindex:events is now running in a screen session (under my user, sudo-ed to app user).

garethrees Nov 02 '16 12:11

Indexing has been stopped because of https://github.com/mysociety/alaveteli/issues/3604.

Abort message:

* queued batch ending: 175719
** Error while processing event 175719, last event successfully queued was: 175719

Also note that the task keeps hold of the logrotated file. Marked as new for discussion alongside https://github.com/mysociety/alaveteli/issues/3604.

garethrees Nov 04 '16 12:11

Some notes on what this is about:

If the reindex isn’t completed, the request_public_body_tag advanced search term won’t have a full dataset to search on. Not sure if there’s an easy way of finding that out.

The search engine indexes events (so that it can look at historic states and whatnot). To be able to search for events where the request’s public body has the given tag, the search index needs to be updated with that information for each event (a lot of events!). Updates are handled automatically, but the initial seeding needed to be manual (or we could just wait until every request gets updated in the normal course, but that would probably take tens of years for all of them).

garethrees Jun 21 '23 08:06

To reduce the set of events we could try to inspect the xapian value for the term so that we only mark for reindexing if it's empty.

garethrees Jun 21 '23 08:06

Just linking to https://github.com/mysociety/alaveteli/issues/1179 to reference it there.

garethrees Oct 10 '24 18:10