Allow re-indexing to be triggered & executed via cron without WP-CLI

Open KZeni opened this issue 6 years ago • 12 comments

I've detailed this previously at https://wordpress.org/support/topic/what-are-the-best-options-when-it-comes-to-triggering-a-reindex-via-cron-job-2/, but I figured I should post this here as well for greater visibility & communication on the matter.

In short, some hosting providers don't give CLI access or otherwise support WP-CLI. Currently, this plugin has buttons in the site admin for triggering & executing the site re-index, but those cannot be handled remotely / via a cronjob (the only means to trigger a re-index via a cronjob is currently WP-CLI, which might not be available, as previously mentioned). A hosting provider that offers cronjob support without WP-CLI support/access would be the perfect use case for something like this: a site that wants to trigger full re-indexes at scheduled intervals rather than relying on real-time index updates (e.g. a site that updates a ton of data frequently in minor ways, when the search index really only needs the occasional update, which would greatly reduce search index operations).

Meanwhile, I have access to WP-CLI, but https://github.com/WebDevStudios/wp-search-with-algolia/issues/9 is making it so WP-CLI can't be used for re-indexing since that has problematic results, currently. So this would help this situation as well if this is implemented sooner than the WP-CLI re-indexing issue with SiteOrigin PageBuilder is fixed (while also still being useful for those unable to use WP-CLI as mentioned above.)

KZeni avatar Aug 27 '19 18:08 KZeni

I would love to see that!

Dimitri-Basseguy avatar Jun 28 '22 07:06 Dimitri-Basseguy

Any update on this? Would love to see if it could be reindexing every 24 hours.

1101blueli avatar Aug 19 '22 08:08 1101blueli

@1101blueli Just so you know, you already can do that if you have (or are able to get) WP-CLI on your hosting setup. This request is specifically for unique setups that can't use wp-cli for one reason or another.

Also, this is for when you disable the on-the-fly re-indexing that otherwise keeps your content fully up-to-date by updating the index whenever content is edited via the site admin. You might disable it to control/limit the amount of reindexing that happens on a large and/or busy site, and/or to speed up larger content imports: instead of updating Algolia upon import of each item, the import runs quicker and Algolia is updated later via a separate re-index.

KZeni avatar Aug 19 '22 14:08 KZeni

Nothing new on our end, that I'm aware of. I'd need to dig in to try and find when the actual index update requests are made, and how, to try and figure out how to possibly also trigger via "true" cron job. The WP-CLI integration is still probably the best way to go at this point.

tw2113 avatar Aug 22 '22 03:08 tw2113

+1 I would love to see that too

simonkoehler avatar Sep 12 '22 15:09 simonkoehler

Do not copy/paste the code example as is; add security and nonce checks of your own as needed to prevent a wide-open HTTP request access vector.

So I've been toying with this one a little bit today. I found that it's possible to trigger via an HTTP request to the admin-ajax.php endpoint that comes with WordPress.

An example cURL request could be as such:

curl --location --request POST 'https://wds.test/wp-admin/admin-ajax.php' \
--form 'action="algolia_re_index"' \
--form 'index_id="searchable_posts"' \
--form 'p="1"'

I grabbed the 3 form parameters from the "re-index" button ajax request in the admin. That's where things get a little bit hairier though.

The plugin doesn't register the "nopriv" version of the ajax callback, which would look like this:

add_action( 'wp_ajax_nopriv_algolia_re_index', [ $this, 're_index'] );

We also don't presently store the Algolia_Admin class instance on a given property as shown at https://github.com/WebDevStudios/wp-search-with-algolia/blob/main/includes/class-algolia-plugin.php#L166-L169

This is important because we'd need to access the instantiated class, to get access to the re_index method.

Alternatively, you could probably just copy/paste that method into your own function and then set a cron job to make the cURL request above.

Example copied method:

function my_re_index() {
	$plugin = Algolia_Plugin_Factory::create();

	$index_id = filter_input( INPUT_POST, 'index_id', FILTER_SANITIZE_STRING );
	$page     = filter_input( INPUT_POST, 'p', FILTER_SANITIZE_STRING );

	try {
		if ( empty( $index_id ) ) {
			throw new RuntimeException( 'Index ID should be provided.' );
		}

		if ( ! ctype_digit( $page ) ) {
			throw new RuntimeException( 'Page should be provided.' );
		}
		$page = (int) $page;

		$index = $plugin->get_index( $index_id );
		if ( null === $index ) {
			throw new RuntimeException( sprintf( 'Index named %s does not exist.', $index_id ) );
		}

		$total_pages = $index->get_re_index_max_num_pages();

		ob_start();
		if ( $page <= $total_pages || 0 === $total_pages ) {
			$index->re_index( $page );
		}
		ob_end_clean();

		$response = array(
			'totalPagesCount' => $total_pages,
			'finished'        => $page >= $total_pages,
		);

		wp_send_json( $response );
	} catch ( Exception $exception ) {
		echo esc_html( $exception->getMessage() );
		throw $exception;
	}
}
add_action( 'wp_ajax_nopriv_algolia_re_index', 'my_re_index' );

For what it's worth, we haven't touched the original re_index method in at least 2 years.

tw2113 avatar Oct 12 '22 02:10 tw2113

Thank you @tw2113

I tried it and it works. But it only updates a few items in the index. I guess that's because of the p=1. Is there a way to update all pages in only one request?

simonkoehler avatar Oct 12 '22 14:10 simonkoehler

hmmm. @simonkoehler perhaps try without that parameter.

I went with this idea based on what I saw in https://github.com/WebDevStudios/wp-search-with-algolia/blob/2.2.0/includes/admin/js/reindex-button.js

There's no foreach loop type spot, and the p parameter comes from line 49. Based on what I'm seeing, lines 68-70 should be restarting the process if Algolia's response is that things aren't finished yet.

{
    "totalPagesCount": 1,
    "finished": true
}

Example return request on a very tiny install I used.
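Since the callback sets "finished" to true only once p reaches totalPagesCount, a cron-driven script would presumably have to repeat the request with p incremented until that happens. What follows is a minimal sketch of that loop, not something the plugin ships. The function names, the AJAX_URL, INDEX_ID and MAX_PAGES variables, and the 300-second --max-time value are all assumptions made for illustration. It also assumes the endpoint answers unauthenticated requests at all, which it only does if you have registered your own callback and protected it with your own checks.

```shell
#!/bin/sh
# Hypothetical settings; adjust to your own site and index.
AJAX_URL="${AJAX_URL:-https://wds.test/wp-admin/admin-ajax.php}"
INDEX_ID="${INDEX_ID:-searchable_posts}"
MAX_PAGES="${MAX_PAGES:-1000}"   # safety cap so a bad response cannot loop forever

# Request the re-index of a single page and print the raw JSON response.
reindex_page() {
    curl --silent --show-error --max-time 300 --request POST "$AJAX_URL" \
        --form 'action=algolia_re_index' \
        --form "index_id=$INDEX_ID" \
        --form "p=$1"
}

# Succeed only if the JSON response reports "finished": true.
is_finished() {
    printf '%s' "$1" | grep -Eq '"finished"[[:space:]]*:[[:space:]]*true'
}

# Walk through pages 1, 2, 3, ... until the endpoint reports it is finished.
reindex_all() {
    page=1
    while [ "$page" -le "$MAX_PAGES" ]; do
        response=$(reindex_page "$page") || return 1
        case "$response" in
            *'"finished"'*) ;;   # looks like the expected JSON
            *) echo "Unexpected response on page $page: $response" >&2
               return 1 ;;
        esac
        if is_finished "$response"; then
            echo "Re-index finished after $page page(s)."
            return 0
        fi
        page=$((page + 1))
    done
    echo "Gave up after $MAX_PAGES pages." >&2
    return 1
}
```

Calling reindex_all at the end of the script (or from a cron entry that sources it) would run the whole sequence. Each page is a separate short request, so no single request has to survive the entire re-index.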

tw2113 avatar Oct 12 '22 14:10 tw2113

Okay, I will test further. At the moment I get the following response if I use p=1:

{"totalPagesCount":20,"finished":false}

simonkoehler avatar Oct 12 '22 14:10 simonkoehler

I wonder if it maybe timed out, since this is a single POST request as well, and 20 pages is a lot, especially if there are many items per page.

tw2113 avatar Oct 12 '22 15:10 tw2113

I don't want to spam in this thread more than necessary :-)

But as I understood your post, only the function of the existing button is copied here. And the button in the backend works fine without any timeouts.

simonkoehler avatar Oct 12 '22 15:10 simonkoehler

my_re_index is an almost exact copy of re_index, except it's been moved outside of the context of the original PHP class. It's still just a callback tied to the algolia_re_index ajax action which gets specified in the example cURL request.

I could see it being possible that the browser tab being open somehow sets the timeout to be infinite, perhaps that's a detail I missed seeing. Perhaps that's a parameter available to cURL requests?

tw2113 avatar Oct 12 '22 15:10 tw2113

Thinking over this one, while I definitely see the appeal, I also definitely don't think it should be a default behavior by any means. From any security standpoint one can think of, a public endpoint that can trigger complete re-indexes should probably not be a known vector, especially when the endpoint can be hit with successive requests, potentially taking down the server and/or quickly putting you over your Algolia account's limits.

On top of that, the plugin, to my knowledge, has never checked whether things have actually changed, so it never skips the API request when nothing has changed. It will just "destroy" the record and push the new version of the record as is.

While I'm sure my previous example would still technically work, please do NOT take it in as is. Modify it enough to add your own checks that the request should be processed, and return early if nonce and security checks are not satisfied.

tw2113 avatar Feb 15 '23 01:02 tw2113