[Issue]: ValueError: Cannot take a larger sample than population when 'replace=False'
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the issue
Running python -m graphrag prompt-tune ... fails with:
ValueError: Cannot take a larger sample than population when 'replace=False'
Steps to reproduce
Running the command without --limit N, or with --limit 15, produces the error:
python -m graphrag prompt-tune \
--root "." \
--config "graphrag_settings.yaml" \
--domain "policy interpretation" \
--n-subset-max 512 \
--k 15 \
--limit 15 \
--max-tokens 2048 \
--min-examples-required 3 \
--chunk-size 1024 \
--overlap 128 \
--language English \
--no-discover-entity-types \
--output "prompts/index"
The same command with --limit 1 works:
python -m graphrag prompt-tune \
--root "." \
--config "graphrag_settings.yaml" \
--domain "policy interpretation" \
--n-subset-max 512 \
--k 15 \
--limit 1 \
--max-tokens 2048 \
--min-examples-required 3 \
--chunk-size 1024 \
--overlap 128 \
--language English \
--no-discover-entity-types \
--output "prompts/index"
GraphRAG Config Used
# Paste your config here
Logs and screenshots
Error messages:
...
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\graphrag\prompt_tune\loader\input.py:76 in load_docs_in_chunks │
│ │
│ 73 │ if select_method == DocSelectionType.TOP: │
│ 74 │ │ chunks_df = chunks_df[:limit] │
│ 75 │ elif select_method == DocSelectionType.RANDOM: │
│ ❱ 76 │ │ chunks_df = chunks_df.sample(n=limit) │
│ 77 │ elif select_method == DocSelectionType.AUTO: │
│ 78 │ │ if k is None or k <= 0: │
│ 79 │ │ │ msg = "k must be an integer > 0" │
│ │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ chunk_config = ChunkingConfig( │ │
...
│ │ ) │ │
│ │ k = 15 │ │
│ │ limit = 15 │ │
│ │ logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001D109787200> │ │
│ │ n_subset_max = 512 │ │
│ │ overlap = 128 │ │
│ │ root = 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\src\\samples\\d01_agentchat_graphrag' │ │
│ │ select_method = <DocSelectionType.RANDOM: 'random'> │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\generic.py:6118 in sample │
│ │
│ 6115 │ │ if weights is not None: │
│ 6116 │ │ │ weights = sample.preprocess_weights(self, weights, axis) │
│ 6117 │ │ │
│ ❱ 6118 │ │ sampled_indices = sample.sample(obj_len, size, replace, weights, rs) │
│ 6119 │ │ result = self.take(sampled_indices, axis=axis) │
│ 6120 │ │ │
│ 6121 │ │ if ignore_index: │
│ │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ axis = 0 │ │
│ │ frac = None │ │
│ │ ignore_index = False │ │
│ │ n = 15 │ │
│ │ obj_len = 1 │ │
│ │ random_state = None │ │
│ │ replace = False │ │
│ │ rs = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'> │ │
│ │ self = │ │ │ │ │ │ │ │ │ │ │ │ id text document_ids │ │
│ │ n_tokens │ │
│ │ [6f309183ea3173d4ab2aea65e824607a1aeedb27142bd... None │ │
│ │ size = 15 │ │
│ │ weights = None │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\sample.py:152 in sample │
│ │
│ 149 │ │ else: │
│ 150 │ │ │ raise ValueError("Invalid weights: weights sum to zero") │
│ 151 │ │
│ ❱ 152 │ return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype( │
│ 153 │ │ np.intp, copy=False │
│ 154 │ ) │
│ 155 │
│ │
│ ╭─────────────────────────────────────────────────────────────────────── locals ───────────────────────────────────────────────────────────────────────╮ │
│ │ obj_len = 1 │ │
│ │ random_state = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'> │ │
│ │ replace = False │ │
│ │ size = 15 │ │
│ │ weights = None │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ in numpy.random.mtrand.RandomState.choice:1001 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot take a larger sample than population when 'replace=False'
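The traceback reduces to a minimal pandas reproduction: the input produced only one chunk (obj_len = 1 in the locals above), but --limit 15 asks DataFrame.sample for 15 rows, and with replace=False (the pandas default) that raises this exact error:

```python
import pandas as pd

# One chunk in the population, limit=15 requested, mirroring the
# traceback locals: obj_len = 1, size = 15, replace = False.
chunks_df = pd.DataFrame({"text": ["only one chunk"]})

try:
    chunks_df.sample(n=15)  # replace=False is the pandas default
except ValueError as err:
    print(err)  # Cannot take a larger sample than population when 'replace=False'
```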
Additional Information
- GraphRAG Version: v2.3.0
- Operating System: Windows 11 Home Edition 24H2
- Python Version: 3.12.10
- Related Issues: https://github.com/microsoft/graphrag/issues/664
Can this really be fixed? In random mode, if the configured limit is larger than the number of chunks produced from the input, sampling always fails, and the default limit is 15. Unless the default is changed, the problem cannot be permanently resolved.
Another approach is to modify the code in loader/input.
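A minimal sketch of what such a change could look like, assuming nothing about graphrag's actual fix (the function name sample_chunks is hypothetical, not part of graphrag's API): clamp the requested sample size to the population before calling DataFrame.sample.

```python
import pandas as pd

def sample_chunks(chunks_df: pd.DataFrame, limit: int) -> pd.DataFrame:
    """Sample up to `limit` rows, never requesting more rows than exist."""
    # min() prevents the ValueError raised when n > len(df) and
    # replace=False (the pandas default).
    return chunks_df.sample(n=min(limit, len(chunks_df)))
```

With this guard, --limit 15 on a one-chunk input would simply return the single chunk instead of crashing.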