Error while clustering
After starting clustering, I get this error:
[local/evol1][1 shards] map "extract_text" to "('prompt__cluster',)": 100%|████████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 12033.30it/s]
Wrote map output to prompt__cluster-00000-of-00001.parquet
[local/evol1][1 shards] map "cluster_documents" to "('prompt__cluster',)": 0%| | 0/319 [00:00<?, ?it/s]jinaai/jina-embeddings-v2-small-en using device: mps:0
Computing embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319/319 [00:22<00:00, 14.27it/s]
Computing embeddings took 27.761s.
/Users/peter/miniconda3/envs/vis/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
UMAP: Reducing dim from 512 to 5 of 319 vectors took 2.297s.
HDBSCAN: Clustering took 0.005s.
99 noise points (31.0%) will be assigned to nearest cluster.
HDBSCAN: Computing membership for the noise points took 0.004s.
[local/evol1][1 shards] map "cluster_documents" to "('prompt__cluster',)": 100%|██████████████████████████████████████████████████████████████████████████████| 319/319 [00:32<00:00, 9.89it/s]
Wrote map output to prompt__cluster-00000-of-00001.parquet
[local/evol1][1 shards] map "title_clusters" to "('prompt__cluster',)": 0%| | 0/319 [00:00<?, ?it/s]
Error code: 400 - {'error': {'message': 'you must provide a model parameter', 'type': 'invalid_request_error', 'param': None, 'code': None}}
Hi. For anyone who runs into this problem (like me): you need to set the environment variable "API_MODEL" to the OpenAI model you want to use (e.g. GPT-3.5 or GPT-4). The error comes from the OpenAI API call made in the "title_clusters" step. By the way, does anyone know how to set up the complete .env file for the project? Thanks!
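A minimal sketch of the fix, assuming lilac reads the model name from API_MODEL and the API key from OPENAI_API_KEY (the first name is taken from the comment above, the second from lilac's OpenAI integration; verify both against the lilac version you run). Set the variables before re-running the clustering step:

import os

# Assumed variable names; check lilac's environment docs for your version.
os.environ["API_MODEL"] = "gpt-3.5-turbo"  # model used to title clusters
os.environ["OPENAI_API_KEY"] = "sk-..."    # placeholder; substitute your real key

import lilac as ll

ll.set_project_dir("./data")           # hypothetical project directory
ds = ll.get_dataset("local", "evol1")  # namespace/name taken from the log above
ds.cluster("prompt")                   # re-run clustering; titling should now succeed

The same values can also live in a .env file at the project root (API_MODEL=gpt-3.5-turbo, OPENAI_API_KEY=sk-...); lilac projects typically load it on startup, but check the docs for the full list of supported variables.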
I get the same issue when just using gte-small to cluster a dataset.