
discussion: browse objects does not allow cursor-style paging

Open kopertop opened this issue 4 years ago • 10 comments

Previously, there was a method called browse that allowed us to start a browse request, get a response, and then call browseFrom to continue browsing once we'd processed that initial batch and were ready for another one. Now, the only option appears to be browseObjects, which takes a function to call for each "batch".

This has several issues with our current workflow:

  1. It is nothing like the "search" method
  2. It does not allow "cancelling" a browse (such as if we've hit the limit of how many we want to process)
  3. If batch operations take a while and are asynchronous, this could cause memory issues (imagine processing 100m records all at once without any delay between batches).

It does appear the browseObjects function allows passing in a cursor, but it has no option to return results and not continue.

Proposed solution:

Allow an option to browseObjects that lets us call it the "old" way, or provide a new function browse and browseFrom like the old library supported.
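For illustration, the cursor-style flow being requested could be sketched as a thin async-generator wrapper around the client's request. Everything here (browsePage, the page shape) is a hypothetical stand-in, not the real client API:

```typescript
// Hypothetical shape of one browse page; the real response carries a
// `cursor` field that is absent on the last page.
type BrowsePage<T> = { hits: T[]; cursor?: string };

// Generic cursor-style pager: the caller pulls batches one at a time,
// can simply stop iterating to "cancel" the browse (issue point 2), and
// each batch is fully awaited before the next request fires (issue point 3).
async function* browseIterator<T>(
  browsePage: (cursor?: string) => Promise<BrowsePage<T>>
): AsyncGenerator<T[]> {
  let cursor: string | undefined;
  do {
    const page = await browsePage(cursor);
    yield page.hits;
    cursor = page.cursor;
  } while (cursor !== undefined);
}

// Mock request function standing in for the real client call.
const mockPages: BrowsePage<number>[] = [
  { hits: [1, 2], cursor: 'a' },
  { hits: [3, 4], cursor: 'b' },
  { hits: [5] },
];
let call = 0;
const mockBrowse = async (_cursor?: string) => mockPages[call++];

(async () => {
  const collected: number[] = [];
  for await (const batch of browseIterator(mockBrowse)) {
    collected.push(...batch);
    if (collected.length >= 4) break; // early cancel, like the old browse/browseFrom
  }
  console.log(collected); // [ 1, 2, 3, 4 ]
})();
```

Breaking out of the for await…of loop is all the cancellation needed; no callback plumbing is involved.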

kopertop avatar Feb 03 '20 18:02 kopertop

This is not a permanent solution, but a way to get around this limitation right now would be to use the underlying createBrowsablePromise directly and modify the shouldStop condition:

https://github.com/algolia/algoliasearch-client-javascript/blob/e4c3f59afb3278b5466c7a5977b45268ef466479/packages/client-search/src/methods/index/browseObjects.ts#L19
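To make that workaround concrete, here is a simplified stand-in for the pattern behind the linked helper (the real createBrowsablePromise takes more options, so treat these signatures as an approximation). A custom shouldStop is where a caller could cap how many records get processed:

```typescript
// Approximate shape of one browse response (cursor absent on the last page).
type BrowseResponse<T> = { hits: T[]; cursor?: string };

// Simplified sketch of the browsable-promise pattern: request pages in a
// loop, hand each batch to the callback, and stop either when the caller's
// shouldStop says so or when the cursor runs out.
async function browsablePromise<T>(options: {
  request: (data: { cursor?: string }) => Promise<BrowseResponse<T>>;
  batch: (hits: T[]) => void;
  shouldStop: (response: BrowseResponse<T>) => boolean;
}): Promise<void> {
  let cursor: string | undefined;
  while (true) {
    const response = await options.request({ cursor });
    options.batch(response.hits);
    if (options.shouldStop(response) || response.cursor === undefined) return;
    cursor = response.cursor;
  }
}

// Example: stop once enough hits have been processed (mocked request).
let seen = 0;
let i = 0;
const responses: BrowseResponse<string>[] = [
  { hits: ['a', 'b'], cursor: '1' },
  { hits: ['c', 'd'], cursor: '2' },
  { hits: ['e'] },
];
browsablePromise({
  request: async () => responses[i++],
  batch: hits => { seen += hits.length; },
  shouldStop: () => seen >= 4, // cap processing instead of browsing everything
}).then(() => console.log(seen)); // 4 (the cap is checked after each batch)
```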

Haroenv avatar Feb 04 '20 09:02 Haroenv

Could you clarify a bit more why you are browsing only up to a certain number of results, rather than the whole dataset or using filters? That way we can see which changes would enable your use case best.

Haroenv avatar Feb 04 '20 09:02 Haroenv

Our specific use-case is for limited exports. We allow some of our customers to export up to 10k results (which is configurable per customer). By default we still want to keep the site-search limit at 1k, so we got around this by using browse.

Requiring a callback function is also a very different pattern from being able to consume results in an async/await style. We have a wrapper around algoliasearch that works for both the backend and frontend by applying some custom filters on top of any search we pass in, and searching multiple indexes and combining the results. It's nice to be able to have that work with both the browse and search methods (which it did before, but doesn't now).

kopertop avatar Feb 06 '20 18:02 kopertop

@kopertop Would a solution like https://github.com/algolia/algoliasearch-client-javascript/pull/1029 solve your problem?

nunomaduro avatar Feb 28 '20 09:02 nunomaduro

I agree with @kopertop's concerns, specifically about the synchronous nature of the batch option signature.

The browse methods in the Python and PHP clients appear to return an iterator.

I want to prune some excess from some Algolia indexes as well as take backups. Returning (e.g.) a readable stream would make it cleaner to compose functionality.

djake avatar Mar 07 '20 19:03 djake

Thanks for your feedback. It does make sense to use readable streams on node, but not at all in browsers (where browse also works). We thought about async iterators, but their polyfill is very big.

In the meantime, if there's something that doesn't work within the new browseObjects function, I'd advise people to use the underlying browse method directly:

https://github.com/algolia/algoliasearch-client-javascript/blob/e8af0b23bc995dd56a9d6a9c50fe60cd9de82b0c/packages/client-search/src/methods/index/browseObjects.ts#L21-L25

Haroenv avatar Mar 10 '20 09:03 Haroenv

Hi @Haroenv, are there any plans to provide a readable stream API for nodejs, or at least the pagination API so we can create our own stream honouring backpressure?

(Our use case is taking a snapshot of our index nightly, so we want to browse the entire index but we have memory constraints as we're running on a nodejs server)

richardscarrott avatar Dec 08 '20 11:12 richardscarrott

That's an interesting option. We didn't go with a stream since browser streams and Node streams are quite different, and we didn't want to commit to either of them. What API were you thinking of specifically @richardscarrott ?

Haroenv avatar Dec 08 '20 13:12 Haroenv

Ideally in nodejs we'd have a readable stream so we could pipe to a writable stream, e.g.

const fs = require('fs');
const algoliasearch = require('algoliasearch');

const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey');
const index = client.initIndex('your_index_name');

index.createReadStream().pipe(fs.createWriteStream('./export.json'));

Assuming the above readable stream honoured backpressure, this would allow huge datasets to be handled without holding all the hits in memory at once, as the Algolia docs example does. This would make it usable in a lower-memory environment like a nodejs server; additionally, it would be faster, as writing starts as soon as the first page comes in.
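One way to sketch such a backpressure-honouring stream is Readable.from() over an async generator, since the generator is only pulled when the consumer drains. fetchPage below is a mock stand-in for a real cursor-paging request, not a client API:

```typescript
import { Readable, Writable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

type BrowsePage = { hits: object[]; cursor?: string };

// Readable.from() over an async generator gives backpressure for free:
// the next page is only fetched once the consumer has drained the
// previous one.
function createBrowseReadStream(
  fetchPage: (cursor?: string) => Promise<BrowsePage>
): Readable {
  async function* pages() {
    let cursor: string | undefined;
    do {
      const page = await fetchPage(cursor);
      for (const hit of page.hits) yield JSON.stringify(hit) + '\n';
      cursor = page.cursor;
    } while (cursor !== undefined);
  }
  return Readable.from(pages());
}

// Demo with mocked pages and an in-memory sink standing in for
// fs.createWriteStream('./export.json').
const mockPages: BrowsePage[] = [
  { hits: [{ objectID: '1' }], cursor: 'next' },
  { hits: [{ objectID: '2' }] },
];
let n = 0;
const chunks: string[] = [];
const sink = new Writable({
  write(chunk, _enc, cb) { chunks.push(String(chunk)); cb(); },
});

pipeline(createBrowseReadStream(async () => mockPages[n++]), sink)
  .then(() => console.log(chunks.join('')));
```

With the real client, fetchPage would wrap whatever cursor-based request the library exposes, and the sink would be a file or network stream.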

We did this using the old client's pagination API so I think we can probably do the same with the new client, but we'd have to use the lower level transporter API?

re: whatwg streams in the browser; I'm not hugely familiar with them but IIRC they don't attempt to interoperate with nodejs streams so I think from an API perspective they'd need to be treated separately.

TBH, it seems a shame that nodejs is burdened with the constraints of the browser as they obv have very different requirements -- I wonder if it'd make sense to expose a nodejs specific SDK via a separate package which wraps this, or perhaps expose 'algoliasearch/node' similar to React with 'react-dom/server'?

richardscarrott avatar Dec 08 '20 17:12 richardscarrott

In the meantime, before we can add streams to the client, this would be the implementation of the previous browse/browseFrom method, which you should be able to use as-is in your gist:

import { encode, addMethods } from '@algolia/client-common';
import { MethodEnum } from '@algolia/requester-common';
import { RequestOptions } from '@algolia/transporter';
import {
  BrowseOptions,
  BrowseResponse,
  SearchIndex,
  SearchOptions,
} from '@algolia/client-search';
import algoliasearch from 'algoliasearch';

export const browseFrom = (base: SearchIndex) => {
  return <TObject>(
    data: { cursor?: string },
    requestOptions?: SearchOptions & BrowseOptions<TObject> & RequestOptions
  ): Readonly<Promise<BrowseResponse<TObject>>> => {
    return base.transporter.read(
      {
        method: MethodEnum.Post,
        path: encode('1/indexes/%s/browse', base.indexName),
        data,
      },
      requestOptions
    );
  };
};

// adding it
const client = algoliasearch('xxx', 'xxx');
const index = addMethods(client.initIndex('xxx'), { browseFrom });

// (top-level await requires an ES module; otherwise wrap in an async function)
const { cursor } = await index.browseFrom({});
await index.browseFrom({ cursor });

Haroenv avatar Dec 09 '20 09:12 Haroenv