Error analyzing large collection ( >150,000)

Open amineDeep94 opened this issue 6 months ago • 15 comments

I have a collection that has about 150,000 objects and I keep getting this when I try to analyze the data:

Error analyzing Approved_foods: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded. (Total time: 0s)

What is the most optimal way to process large collections of objects?

amineDeep94 avatar Aug 24 '25 19:08 amineDeep94

Hey, thanks for the report. Is this happening in the app or using the python package?

dannyjameswilliams avatar Aug 25 '25 11:08 dannyjameswilliams

It is happening in both the app and the python package.

amineDeep94 avatar Aug 25 '25 18:08 amineDeep94

I think I found the source of this issue. When trying to fetch objects above the limit of 100,000 objects, Weaviate throws an error. This is configurable in Weaviate, but depends on the user's configuration. So I've set a hard limit for the preprocessing to not go above that when sampling data.

Would you be able to try out #24 to see if this fixes the issue for you?
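
As an illustration of the idea described above (not the actual change in #24), here is a minimal sketch of capping sampled offsets so that no query reaches past the server's result limit. The function name `sample_offsets` and the `max_results` default of 100,000 are assumptions taken from the figure mentioned in this comment; the real limit depends on how Weaviate is configured.

```python
import random


def sample_offsets(collection_size, sample_size, max_results=100_000, seed=None):
    """Pick random object offsets that never exceed the query result limit.

    Hypothetical helper: `max_results` mirrors the 100,000 figure from the
    thread and should be adjusted to match the server's configuration.
    """
    rng = random.Random(seed)
    # Only the first `max_results` positions are reachable via offset paging.
    upper = min(collection_size, max_results)
    # Never ask for more samples than there are reachable objects.
    k = min(sample_size, upper)
    return sorted(rng.sample(range(upper), k))


offsets = sample_offsets(collection_size=600_000, sample_size=20, seed=0)
print(len(offsets), max(offsets) < 100_000)
```

The trade-off is that objects beyond the limit are never sampled, so the sample is only representative if the first `max_results` objects resemble the rest of the collection.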

dannyjameswilliams avatar Aug 26 '25 13:08 dannyjameswilliams

Still getting an error unfortunately:

Error analyzing Approved_foods: Error preprocessing collection: Query call with protocol GRPC search failed with message Deadline Exceeded. (Total time: 30s)

[08/26/25 09:54:44] WARNING Collection is large (greater than 50,000 objects), causing slowdown in pre-processing. Reducing maximum sample size to 20 objects. To override this, set max_sample_size as an argument to preprocess.

amineDeep94 avatar Aug 26 '25 16:08 amineDeep94

Okay, looks like a new issue, that's progress! I increased the default timeout values for the client in #24. Let me know if this fixes it.

If not, try initialising the client manager manually with increased timeout values and passing it down to the appropriate methods, e.g.

from elysia.util.client import ClientManager
client_manager = ClientManager(
    query_timeout=120, # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
)

and then

from elysia import Tree
tree = Tree()
tree("<your prompt>", client_manager=client_manager)

or

from elysia import preprocess
preprocess("<collection name>", client_manager=client_manager)
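
If it is unclear how large the timeout needs to be, one option is to retry with progressively larger values rather than jumping straight to a high one. The sketch below is a generic helper, not part of elysia: `run_with_escalating_timeouts` and its `timeouts` default are hypothetical names and values chosen for illustration, and `run` is any callable that accepts a timeout in seconds.

```python
def run_with_escalating_timeouts(run, timeouts=(30, 60, 120)):
    """Call `run(timeout)` with each timeout in turn until one succeeds.

    Hypothetical helper. It catches every Exception for simplicity; in
    practice it would be better to retry only on deadline/timeout errors.
    """
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for timeout in timeouts:
        try:
            return run(timeout)
        except Exception as exc:  # assumption: any failure may be a timeout
            last_error = exc
    # Every attempt failed, so surface the most recent error.
    raise last_error


# Stand-in for a slow query that only succeeds with a generous timeout.
def fake_query(timeout):
    if timeout < 120:
        raise TimeoutError("Deadline Exceeded")
    return f"ok with timeout={timeout}"


print(run_with_escalating_timeouts(fake_query))
```

With elysia, `run` could be a small function that builds `ClientManager(query_timeout=t, insert_timeout=t)` and passes it to `preprocess`, as in the snippets above.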

dannyjameswilliams avatar Aug 27 '25 08:08 dannyjameswilliams

Still does not work in the UI but it somehow works with:

from elysia.util.client import ClientManager
client_manager = ClientManager(
    query_timeout=120, # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
)

It would be great to be able to adjust these parameters in the UI. Thanks!

amineDeep94 avatar Sep 03 '25 17:09 amineDeep94

Thank you, we will work on adding this to the UI in a future version! This would be a good thing to be able to configure.

dannyjameswilliams avatar Sep 08 '25 07:09 dannyjameswilliams

Hello,

I am having the same problem:

Exception: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded.

I have tried setting the query_timeout, but it does not seem to change a lot.

My collection is a bit larger, but I'm not sure how that influences the preprocessing.

Is there something I need to change on the Weaviate side? It's running locally in a Docker container with nothing set, straight out of the tutorial.

BastianSpatz avatar Oct 27 '25 11:10 BastianSpatz

@BastianSpatz How many items are in your collection? Do you get any logging messages during preprocessing you wouldn't mind sharing? e.g. "Estimated token count of sample", etc.

If not, can you try running again but with your logging level as debug, using

from elysia import configure
configure(logging_level="debug")

dannyjameswilliams avatar Oct 27 '25 13:10 dannyjameswilliams

There are 600,000 items. When using the web app I get:

WARNING  Collection is large (greater than 50,000 objects), causing slowdown in pre-processing. Reducing maximum sample size to 20 objects. To override this, set max_sample_size as an argument to preprocess. (collection.py:448)

DEBUG    (process_collection) sending result with progress: 10.0% processor.py:42
DEBUG    (process_collection) FINISHED!

And when running it like this I get:

from elysia import configure, preprocess
from elysia.util.client import ClientManager

configure(
    base_provider="ollama",
    complex_provider="ollama",
    base_model="gpt-oss:20b",
    complex_model="gpt-oss:20b",
    model_api_base="http://localhost:11434",
    logging_level="DEBUG")

client_manager = ClientManager(
    query_timeout=120, # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
    weaviate_is_local=True,
    wcd_url="http://localhost:8080",  # or "localhost"
)
preprocess(COLLECTION_NAME, client_manager=client_manager)
Exception: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded.

DEBUG    Estimated token count of sample: 2260 
DEBUG    Number of objects in sample: 20

BastianSpatz avatar Oct 27 '25 13:10 BastianSpatz

Interesting. @BastianSpatz Could you try this PR: https://github.com/weaviate/elysia/pull/72 to see if it fixes your issue?

dannyjameswilliams avatar Oct 27 '25 15:10 dannyjameswilliams

It did seem to work. I got the error: Out of range float values are not JSON compliant: nan

But I think that's on my side, from the data.
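
For context, this message is what Python's `json` module reports when asked to serialize `nan` or `inf` with strict compliance. A minimal sketch of cleaning such values before they are stored or serialized, assuming it is acceptable to replace them with `None` (the helper name `sanitize` is hypothetical and not part of elysia or Weaviate):

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


record = {"name": "apple", "calories": float("nan"), "scores": [1.0, float("inf")]}
# allow_nan=False makes json.dumps reject NaN/inf instead of emitting them.
print(json.dumps(sanitize(record), allow_nan=False))
```

Whether `None` is the right replacement depends on the data; dropping the property or using a sentinel value are alternatives.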

BastianSpatz avatar Oct 28 '25 06:10 BastianSpatz

Small update: with my data fixed it worked even in the UI, and I did not have to pass a custom client_manager.

BastianSpatz avatar Oct 28 '25 09:10 BastianSpatz

OK great, and that was fixed with the new PR yes?

Out of curiosity, what needed fixing with your data?

dannyjameswilliams avatar Oct 28 '25 11:10 dannyjameswilliams

I had some properties in my collection that were None, and that seemed to cause a problem. But I am very new to Weaviate and just threw everything into the database to see what happens.

And yes the PR fixed it.

BastianSpatz avatar Oct 28 '25 14:10 BastianSpatz