Error analyzing large collection (>150,000)
I have a collection that has about 150,000 objects and I keep getting this when I try to analyze the data:
Error analyzing Approved_foods: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded. (Total time: 0s)
What is the best way to process large collections of objects?
Hey, thanks for the report. Is this happening in the app or using the Python package?
It is happening in both the app and the Python package.
I think I found the source of this issue. When trying to fetch objects above the limit of 100,000 objects, Weaviate throws an error. This is configurable in Weaviate, but depends on the user's configuration. So I've set a hard limit for the preprocessing to not go above that when sampling data.
Would you be able to try out #24 to see if this fixes the issue for you?
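To illustrate the kind of cap described above, here is a minimal sketch. It is not the actual change in #24: the function name, the constant, and the one-object-per-offset assumption are all hypothetical, and the 100,000 figure is simply the limit mentioned in this thread (the real value depends on the Weaviate configuration).

```python
import random

# Assumed limit, taken from the discussion above; the real value depends
# on how the Weaviate instance is configured.
MAX_QUERY_RESULTS = 100_000


def pick_sample_offsets(collection_size, sample_size,
                        max_results=MAX_QUERY_RESULTS, seed=None):
    """Pick random offsets for sampling that stay below the query limit.

    Hypothetical helper: assumes one object is fetched per offset, so every
    offset must be smaller than both the collection size and the limit.
    """
    reachable = min(collection_size, max_results)
    k = min(sample_size, reachable)
    rng = random.Random(seed)
    return sorted(rng.sample(range(reachable), k))


# A 150,000-object collection is only sampled from its first 100,000 positions.
offsets = pick_sample_offsets(collection_size=150_000, sample_size=20, seed=0)
```

The trade-off of a cap like this is that objects beyond the limit are never sampled, so the sample is only representative if the collection is not ordered in a meaningful way.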
Still getting an error unfortunately:
Error analyzing Approved_foods: Error preprocessing collection: Query call with protocol GRPC search failed with message Deadline Exceeded. (Total time: 30s)
[08/26/25 09:54:44] WARNING Collection is large (greater than 50,000 objects), causing slowdown in pre-processing. Reducing maximum sample size to 20 objects. To override this, set max_sample_size as an argument to preprocess.
Okay, looks like a new issue, that's progress! I increased the default timeout values for the client in #24. Let me know if this fixes it.
If not, try initialising the client manager manually with increased timeout values and passing it down to the appropriate methods, e.g.
from elysia.util.client import ClientManager
client_manager = ClientManager(
    query_timeout=120,  # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
)
and then
from elysia import Tree
tree = Tree()
tree("<your prompt>", client_manager=client_manager)
or
from elysia import preprocess
preprocess("<collection name>", client_manager=client_manager)
Still does not work in the UI but it somehow works with:
from elysia.util.client import ClientManager
client_manager = ClientManager(
    query_timeout=120,  # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
)
It would be great to be able to adjust these parameters in the UI. Thanks!
Thank you, we will work on adding this to the UI in a future version! It would be a good thing to be able to configure.
Hello,
I am having the same problem:
Exception: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded.
I have tried setting the query_timeout, but it does not seem to change much.
My collection is a bit larger, but I'm not sure how that influences preprocessing.
Is there something I need to change on the Weaviate side? It's running locally in a Docker container with nothing set, straight out of the tutorial.
@BastianSpatz How many items are in your collection? Do you get any logging messages during preprocessing you wouldn't mind sharing? e.g. "Estimated token count of sample", etc.
If not, can you try running again but with your logging level as debug, using
from elysia import configure
configure(logging_level="debug")
There are 600,000 items. When using the web app, I get:
WARNING Collection is large (greater than 50,000 objects), causing slowdown in pre-processing. Reducing maximum sample size to 20 objects. To override this, set max_sample_size as an argument to preprocess. collection.py:448
DEBUG (process_collection) sending result with progress: 10.0% processor.py:42
DEBUG (process_collection) FINISHED!
And when running it like this, I get:
from elysia import configure, preprocess
from elysia.util.client import ClientManager
configure(
    base_provider="ollama",
    complex_provider="ollama",
    base_model="gpt-oss:20b",
    complex_model="gpt-oss:20b",
    model_api_base="http://localhost:11434",
    logging_level="DEBUG",
)
client_manager = ClientManager(
    query_timeout=120,  # increase if necessary, but wouldn't recommend going too high
    insert_timeout=120,
    weaviate_is_local=True,
    wcd_url="http://localhost:8080",  # or "localhost"
)
preprocess(COLLECTION_NAME, client_manager=client_manager)
Exception: Error preprocessing collection: Query call with protocol GRPC search failed with message explorer: list class: search: invalid pagination params: query maximum results exceeded.
DEBUG Estimated token count of sample: 2260
DEBUG Number of objects in sample: 20
Interesting. @BastianSpatz Could you try this PR: https://github.com/weaviate/elysia/pull/72 to see if it fixes your issue?
It did seem to work. I got the error:
Out of range float values are not JSON compliant: nan
But I think that's on my side, from the data.
Small update: with my data fixed, it worked even in the UI, and I did not have to pass a custom client_manager.
OK great, and that was fixed with the new PR, yes?
Out of curiosity, what needed fixing with your data?
I had some properties in my collection which were None, and that seemed to cause a problem.
But I am very new to Weaviate and just threw everything into the database and wanted to see what happens.
And yes, the PR fixed it.
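For anyone else who hits the "Out of range float values are not JSON compliant: nan" error, here is a minimal sketch of cleaning records before inserting them. It assumes the records are plain Python dicts; the helper name and the example fields are hypothetical, and it is not confirmed in this thread whether None values, NaN values, or both were the actual trigger.

```python
import json
import math


def clean_properties(props):
    """Return a copy of props without None or NaN values (hypothetical helper)."""
    cleaned = {}
    for key, value in props.items():
        if value is None:
            continue  # drop missing values instead of storing them
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN cannot be encoded as strict JSON
        cleaned[key] = value
    return cleaned


# Made-up example record, for illustration only.
record = {"name": "apple", "calories": float("nan"), "brand": None}
cleaned = clean_properties(record)

# Strict JSON encoding succeeds on the cleaned record; on the original
# record it would raise ValueError because of the NaN.
encoded = json.dumps(cleaned, allow_nan=False)
```

Dropping the keys is only one option; replacing them with a default value may suit some schemas better.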