graphrag [Issue]: ValueError: Cannot take a larger sample than population when 'replace=False'

Do you need to file an issue?

[x] I have searched the existing issues and this bug is not already filed.
[x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

Running python -m graphrag prompt-tune ... fails with the following error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Steps to reproduce

Running the command without --limit N, or with --limit 15, raises the error:

python -m graphrag prompt-tune \
  --root "." \
  --config "graphrag_settings.yaml" \
  --domain "policy interpretation" \
  --n-subset-max 512 \
  --k 15 \
  --limit 15 \
  --max-tokens 2048 \
  --min-examples-required 3 \
  --chunk-size 1024 \
  --overlap 128 \
  --language English \
  --no-discover-entity-types \
  --output "prompts/index"

The same command with --limit 1 succeeds:

python -m graphrag prompt-tune \
  --root "." \
  --config "graphrag_settings.yaml" \
  --domain "policy interpretation" \
  --n-subset-max 512 \
  --k 15 \
  --limit 1 \
  --max-tokens 2048 \
  --min-examples-required 3 \
  --chunk-size 1024 \
  --overlap 128 \
  --language English \
  --no-discover-entity-types \
  --output "prompts/index"

GraphRAG Config Used

# Paste your config here

Logs and screenshots

Error messages:

...
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\graphrag\prompt_tune\loader\input.py:76 in load_docs_in_chunks                                                    │
│                                                                                                                                                                                       │
│    73 │   if select_method == DocSelectionType.TOP:                                                                                                                                   │
│    74 │   │   chunks_df = chunks_df[:limit]                                                                                                                                           │
│    75 │   elif select_method == DocSelectionType.RANDOM:                                                                                                                              │
│ ❱  76 │   │   chunks_df = chunks_df.sample(n=limit)                                                                                                                                   │
│    77 │   elif select_method == DocSelectionType.AUTO:                                                                                                                                │
│    78 │   │   if k is None or k <= 0:                                                                                                                                                 │
│    79 │   │   │   msg = "k must be an integer > 0"                                                                                                                                    │
│                                                                                                                                                                                       │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │            chunk_config = ChunkingConfig(                                                                                                                                         │ │
...
│ │                           )                                                                                                                                                       │ │
│ │                       k = 15                                                                                                                                                      │ │
│ │                   limit = 15                                                                                                                                                      │ │
│ │                  logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001D109787200>                                                                         │ │
│ │            n_subset_max = 512                                                                                                                                                     │ │
│ │                 overlap = 128                                                                                                                                                     │ │
│ │                    root = 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\src\\samples\\d01_agentchat_graphrag'                                                                 │ │
│ │           select_method = <DocSelectionType.RANDOM: 'random'>                                                                                                                     │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                                                       │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\generic.py:6118 in sample                                                                             │
│                                                                                                                                                                                       │
│    6115 │   │   if weights is not None:                                                                                                                                               │
│    6116 │   │   │   weights = sample.preprocess_weights(self, weights, axis)                                                                                                          │
│    6117 │   │                                                                                                                                                                         │
│ ❱  6118 │   │   sampled_indices = sample.sample(obj_len, size, replace, weights, rs)                                                                                                  │
│    6119 │   │   result = self.take(sampled_indices, axis=axis)                                                                                                                        │
│    6120 │   │                                                                                                                                                                         │
│    6121 │   │   if ignore_index:                                                                                                                                                      │
│                                                                                                                                                                                       │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │         axis = 0                                                                                                                                                                  │ │
│ │         frac = None                                                                                                                                                               │ │
│ │ ignore_index = False                                                                                                                                                              │ │
│ │            n = 15                                                                                                                                                                 │ │
│ │      obj_len = 1                                                                                                                                                                  │ │
│ │ random_state = None                                                                                                                                                               │ │
│ │      replace = False                                                                                                                                                              │ │
│ │           rs = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'>                              │ │
│ │         self = │   │   │   │   │   │   │   │   │   │   │   │     id                                               text                                       document_ids         │ │
│ │                n_tokens                                                                                                                                                           │ │
│ │                [6f309183ea3173d4ab2aea65e824607a1aeedb27142bd...     None                                                                                                         │ │
│ │         size = 15                                                                                                                                                                 │ │
│ │      weights = None                                                                                                                                                               │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                                                       │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\sample.py:152 in sample                                                                               │
│                                                                                                                                                                                       │
│   149 │   │   else:                                                                                                                                                                   │
│   150 │   │   │   raise ValueError("Invalid weights: weights sum to zero")                                                                                                            │
│   151 │                                                                                                                                                                               │
│ ❱ 152 │   return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(                                                                                          │
│   153 │   │   np.intp, copy=False                                                                                                                                                     │
│   154 │   )                                                                                                                                                                           │
│   155                                                                                                                                                                                 │
│                                                                                                                                                                                       │
│ ╭─────────────────────────────────────────────────────────────────────── locals ───────────────────────────────────────────────────────────────────────╮                              │
│ │      obj_len = 1                                                                                                                                     │                              │
│ │ random_state = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'> │                              │
│ │      replace = False                                                                                                                                 │                              │
│ │         size = 15                                                                                                                                    │                              │
│ │      weights = None                                                                                                                                  │                              │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                              │
│                                                                                                                                                                                       │
│ in numpy.random.mtrand.RandomState.choice:1001                                                                                                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot take a larger sample than population when 'replace=False'

Additional Information

GraphRAG Version: v2.3.0
Operating System: Windows 11 Home Edition 24H2
Python Version: 3.12.10
Related Issues: https://github.com/microsoft/graphrag/issues/664

Jun 04 '25 02:06 rui8832

Can this really be solved? In random selection mode the limit defaults to 15, so the error appears whenever that limit is greater than the number of text chunks produced from your data. Unless the default limit is changed, it cannot be permanently resolved.

Jun 24 '25 08:06 wangsiyu666

Another option is to modify the code in graphrag/prompt_tune/loader/input.py.
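
A minimal sketch of what such a change might look like, assuming the goal is simply to stop chunks_df.sample(n=limit) from requesting more rows than exist. The sample_chunks helper is hypothetical and is not part of GraphRAG; it only illustrates clamping the limit to the population size:

```python
import pandas as pd


def sample_chunks(chunks_df: pd.DataFrame, limit: int) -> pd.DataFrame:
    """Hypothetical helper: randomly pick at most `limit` rows from chunks_df."""
    # Clamp the requested size to the number of available rows so that
    # sampling without replacement never asks for more than the population.
    n = min(limit, len(chunks_df))
    return chunks_df.sample(n=n)


# One chunk available, 15 requested: returns the single row instead of raising.
one_chunk = pd.DataFrame({"text": ["only chunk"]})
print(len(sample_chunks(one_chunk, 15)))

# Twenty chunks available, 15 requested: behaves like the unclamped call.
many_chunks = pd.DataFrame({"text": [f"chunk {i}" for i in range(20)]})
print(len(sample_chunks(many_chunks, 15)))
```

The trade-off is that the caller silently receives fewer chunks than requested, so logging a warning when the limit is clamped may be worth considering.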

Jun 24 '25 08:06 wangsiyu666