[Issue]: ValueError: Cannot take a larger sample than population when 'replace=False'

Open rui8832 opened this issue 6 months ago • 2 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

Running python -m graphrag prompt-tune ... fails with this error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Steps to reproduce

Running the command without --limit N, or with --limit 15, raises the error:

python -m graphrag prompt-tune \
  --root "." \
  --config "graphrag_settings.yaml" \
  --domain "policy interpretation" \
  --n-subset-max 512 \
  --k 15 \
  --limit 15 \
  --max-tokens 2048 \
  --min-examples-required 3 \
  --chunk-size 1024 \
  --overlap 128 \
  --language English \
  --no-discover-entity-types \
  --output "prompts/index"

The same command with --limit 1 works:

python -m graphrag prompt-tune \
  --root "." \
  --config "graphrag_settings.yaml" \
  --domain "policy interpretation" \
  --n-subset-max 512 \
  --k 15 \
  --limit 1 \
  --max-tokens 2048 \
  --min-examples-required 3 \
  --chunk-size 1024 \
  --overlap 128 \
  --language English \
  --no-discover-entity-types \
  --output "prompts/index"

GraphRAG Config Used

# Paste your config here

Logs and screenshots

Error messages:

...
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\graphrag\prompt_tune\loader\input.py:76 in load_docs_in_chunks                                                    │
│                                                                                                                                                                                       │
│    73 │   if select_method == DocSelectionType.TOP:                                                                                                                                   │
│    74 │   │   chunks_df = chunks_df[:limit]                                                                                                                                           │
│    75 │   elif select_method == DocSelectionType.RANDOM:                                                                                                                              │
│ ❱  76 │   │   chunks_df = chunks_df.sample(n=limit)                                                                                                                                   │
│    77 │   elif select_method == DocSelectionType.AUTO:                                                                                                                                │
│    78 │   │   if k is None or k <= 0:                                                                                                                                                 │
│    79 │   │   │   msg = "k must be an integer > 0"                                                                                                                                    │
│                                                                                                                                                                                       │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │            chunk_config = ChunkingConfig(                                                                                                                                         │ │
...
│ │                           )                                                                                                                                                       │ │
│ │                       k = 15                                                                                                                                                      │ │
│ │                   limit = 15                                                                                                                                                      │ │
│ │                  logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001D109787200>                                                                         │ │
│ │            n_subset_max = 512                                                                                                                                                     │ │
│ │                 overlap = 128                                                                                                                                                     │ │
│ │                    root = 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\src\\samples\\d01_agentchat_graphrag'                                                                 │ │
│ │           select_method = <DocSelectionType.RANDOM: 'random'>                                                                                                                     │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                                                       │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\generic.py:6118 in sample                                                                             │
│                                                                                                                                                                                       │
│    6115 │   │   if weights is not None:                                                                                                                                               │
│    6116 │   │   │   weights = sample.preprocess_weights(self, weights, axis)                                                                                                          │
│    6117 │   │                                                                                                                                                                         │
│ ❱  6118 │   │   sampled_indices = sample.sample(obj_len, size, replace, weights, rs)                                                                                                  │
│    6119 │   │   result = self.take(sampled_indices, axis=axis)                                                                                                                        │
│    6120 │   │                                                                                                                                                                         │
│    6121 │   │   if ignore_index:                                                                                                                                                      │
│                                                                                                                                                                                       │
│ ╭───────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────╮ │
│ │         axis = 0                                                                                                                                                                  │ │
│ │         frac = None                                                                                                                                                               │ │
│ │ ignore_index = False                                                                                                                                                              │ │
│ │            n = 15                                                                                                                                                                 │ │
│ │      obj_len = 1                                                                                                                                                                  │ │
│ │ random_state = None                                                                                                                                                               │ │
│ │      replace = False                                                                                                                                                              │ │
│ │           rs = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'>                              │ │
│ │         self = │   │   │   │   │   │   │   │   │   │   │   │     id                                               text                                       document_ids         │ │
│ │                n_tokens                                                                                                                                                           │ │
│ │                [6f309183ea3173d4ab2aea65e824607a1aeedb27142bd...     None                                                                                                         │ │
│ │         size = 15                                                                                                                                                                 │ │
│ │      weights = None                                                                                                                                                               │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                                                       │
│ C:\Data\workspases\ai\policy-assistant-demo\.venv\Lib\site-packages\pandas\core\sample.py:152 in sample                                                                               │
│                                                                                                                                                                                       │
│   149 │   │   else:                                                                                                                                                                   │
│   150 │   │   │   raise ValueError("Invalid weights: weights sum to zero")                                                                                                            │
│   151 │                                                                                                                                                                               │
│ ❱ 152 │   return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(                                                                                          │
│   153 │   │   np.intp, copy=False                                                                                                                                                     │
│   154 │   )                                                                                                                                                                           │
│   155                                                                                                                                                                                 │
│                                                                                                                                                                                       │
│ ╭─────────────────────────────────────────────────────────────────────── locals ───────────────────────────────────────────────────────────────────────╮                              │
│ │      obj_len = 1                                                                                                                                     │                              │
│ │ random_state = <module 'numpy.random' from 'C:\\Data\\workspases\\ai\\policy-assistant-demo\\.venv\\Lib\\site-packages\\numpy\\random\\__init__.py'> │                              │
│ │      replace = False                                                                                                                                 │                              │
│ │         size = 15                                                                                                                                    │                              │
│ │      weights = None                                                                                                                                  │                              │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                              │
│                                                                                                                                                                                       │
│ in numpy.random.mtrand.RandomState.choice:1001                                                                                                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Cannot take a larger sample than population when 'replace=False'
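
The locals in the traceback show obj_len = 1, size = 15 and replace = False, meaning the chunk table held a single row while 15 rows were requested. A minimal sketch that reproduces the same pandas behaviour outside GraphRAG follows. The one-row DataFrame and the try_sample helper are illustrative assumptions, not GraphRAG code:

```python
import pandas as pd

# Stand-in for chunks_df: one chunk only, matching obj_len = 1 in the traceback.
chunks_df = pd.DataFrame({"id": ["chunk-0"], "text": ["example text"]})


def try_sample(df: pd.DataFrame, limit: int) -> str:
    """Hypothetical helper: return 'ok' if sampling works, else the error text."""
    try:
        # replace=False is the pandas default, so n cannot exceed len(df).
        df.sample(n=limit)
    except ValueError as exc:
        return str(exc)
    return "ok"


print(try_sample(chunks_df, 1))   # limit fits the population
print(try_sample(chunks_df, 15))  # limit exceeds the population
```

This matches the report: --limit 1 succeeds because it does not exceed the single available chunk, while 15 does.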

Additional Information

  • GraphRAG Version: v2.3.0
  • Operating System: Windows 11 Home Edition 24H2
  • Python Version: 3.12.10
  • Related Issues: https://github.com/microsoft/graphrag/issues/664

rui8832 avatar Jun 04 '25 02:06 rui8832

Can it really be solved? In random mode, the error occurs whenever the limit is greater than the number of chunks in your data, and the limit defaults to 15. Unless you change the default limit, it cannot be permanently resolved.

wangsiyu666 avatar Jun 24 '25 08:06 wangsiyu666

Another method is to modify the code in loader/input.py.

wangsiyu666 avatar Jun 24 '25 08:06 wangsiyu666
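
One possible shape for the loader change suggested in the comments is to cap the sample size at the number of available chunks before calling sample. This is only a sketch under that assumption: sample_chunks is a hypothetical helper, not a function in GraphRAG, and the maintainers may prefer a different fix.

```python
import pandas as pd


def sample_chunks(chunks_df: pd.DataFrame, limit: int) -> pd.DataFrame:
    """Hypothetical helper: randomly pick at most `limit` rows.

    Capping n at len(chunks_df) avoids the ValueError that pandas raises
    when n exceeds the population and replace=False (the default).
    """
    n = min(limit, len(chunks_df))
    return chunks_df.sample(n=n)


# Usage: three chunks available, 15 requested, so all three come back.
demo_df = pd.DataFrame({"id": ["a", "b", "c"], "text": ["x", "y", "z"]})
picked = sample_chunks(demo_df, 15)
print(len(picked))
```

With this approach a small corpus silently yields fewer chunks than requested, so logging a warning when the cap applies may be worth considering.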