
[Bug]: `ValueError: cannot convert float NaN to integer` when running global search with dynamic selection

Open lsukharn opened this issue 8 months ago • 1 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

I ran this notebook on my data: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/global_search_with_dynamic_community_selection.ipynb and got the error shown below after running the following code:

api_key = os.environ["GRAPHRAG_API_KEY"]
llm_model = os.environ["GRAPHRAG_LLM_MODEL"]
api_base = os.environ["API_BASE_TEST"]
deployment_name = os.environ["GRAPHRAG_LLM_MODEL_DEPLOYMENT_NAME"]

config = LanguageModelConfig(
    api_key=api_key,
    type=ModelType.AzureOpenAIChat,
    api_base=api_base,
    api_version='2025-01-01-preview',
    model=llm_model,
    deployment_name=deployment_name,
    max_retries=20,
)
model = ModelManager().get_or_create_chat_model(
    name="global_search",
    model_type=ModelType.AzureOpenAIChat,
    config=config,
)

token_encoder = tiktoken.encoding_for_model(llm_model)

OUTPUT_DIR = "./graphrag_project/output"
COMMUNITY_REPORT_TABLE = "community_reports"
ENTITY_TABLE = "entities"
COMMUNITY_TABLE = "communities"

# we don't fix a specific community level but instead use an agent to dynamically
# search through all the community reports to check if they are relevant.
COMMUNITY_LEVEL = None

community_df = pd.read_parquet(f"{OUTPUT_DIR}/{COMMUNITY_TABLE}.parquet")
entity_df = pd.read_parquet(f"{OUTPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{OUTPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")

communities = read_indexer_communities(community_df, report_df)
reports = read_indexer_reports(
    report_df,
    community_df,
    community_level=COMMUNITY_LEVEL,
    dynamic_community_selection=True,
)
entities = read_indexer_entities(
    entity_df, community_df, community_level=COMMUNITY_LEVEL
)

print(f"Total report count: {len(report_df)}")
print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)

report_df.head()

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:161, in read_indexer_entities.<locals>.<lambda>(x)
    158 # group entities by id and degree and remove duplicated community IDs
    159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
    160 nodes_df["community"] = nodes_df["community"].apply(
--> 161     lambda x: [str(int(i)) for i in x]
    162 )
    163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
    164     subset=["id"]
    165 )
    166 # read entity dataframe to knowledge model objects

ValueError: cannot convert float NaN to integer

See full error in the logs section.

Steps to reproduce

No response

Expected Behavior

I should be able to use the "dynamic" part of global search. The script works when I set COMMUNITY_LEVEL = 2, but it fails when COMMUNITY_LEVEL is None.

GraphRAG Config Used

# Paste your config here

Logs and screenshots

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[47], line 12
      5 communities = read_indexer_communities(community_df, report_df)
      6 reports = read_indexer_reports(
      7     report_df,
      8     community_df,
      9     community_level=COMMUNITY_LEVEL,
     10     dynamic_community_selection=True,
     11 )
---> 12 entities = read_indexer_entities(
     13     entity_df, community_df, community_level=COMMUNITY_LEVEL
     14 )
     16 print(f"Total report count: {len(report_df)}")
     17 print(
     18     f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
     19 )

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:160, in read_indexer_entities(final_entities, final_communities, community_level)
    158 # group entities by id and degree and remove duplicated community IDs
    159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
--> 160 nodes_df["community"] = nodes_df["community"].apply(
    161     lambda x: [str(int(i)) for i in x]
    162 )
    163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
    164     subset=["id"]
    165 )
    166 # read entity dataframe to knowledge model objects

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\series.py:4924, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4789 def apply(
   4790     self,
   4791     func: AggFuncType,
   (...)   4796     **kwargs,
   4797 ) -> DataFrame | Series:
   4798     """
   4799     Invoke function on values of Series.
   4800 
   (...)   4915     dtype: float64
   4916     """
   4917     return SeriesApply(
   4918         self,
   4919         func,
   4920         convert_dtype=convert_dtype,
   4921         by_row=by_row,
   4922         args=args,
   4923         kwargs=kwargs,
-> 4924     ).apply()

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\apply.py:1427, in SeriesApply.apply(self)
   1424     return self.apply_compat()
   1426 # self.func is Callable
-> 1427 return self.apply_standard()

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\apply.py:1507, in SeriesApply.apply_standard(self)
   1501 # row-wise access
   1502 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1503 # we need to give `na_action="ignore"` for categorical data.
   1504 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1505 #  Categorical (GH51645).
   1506 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1507 mapped = obj._map_values(
   1508     mapper=curried, na_action=action, convert=self.convert_dtype
   1509 )
   1511 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1512     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1513     #  See also GH#25959 regarding EA support
   1514     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    918 if isinstance(arr, ExtensionArray):
    919     return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
   1741 values = arr.astype(object, copy=False)
   1742 if na_action is None:
-> 1743     return lib.map_infer(values, mapper, convert=convert)
   1744 else:
   1745     return lib.map_infer_mask(
   1746         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1747     )

File lib.pyx:2972, in pandas._libs.lib.map_infer()

File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:161, in read_indexer_entities.<locals>.<lambda>(x)
    158 # group entities by id and degree and remove duplicated community IDs
    159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
    160 nodes_df["community"] = nodes_df["community"].apply(
--> 161     lambda x: [str(int(i)) for i in x]
    162 )
    163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
    164     subset=["id"]
    165 )
    166 # read entity dataframe to knowledge model objects

ValueError: cannot convert float NaN to integer
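
The failing conversion in the last frame can be reproduced outside GraphRAG with plain pandas. The sketch below is only an illustration under assumptions: the toy frame stands in for the joined entity/community data, and `reproduce_error` and `to_community_ids` are hypothetical helpers, not part of the library. It shows that a NaN inside the aggregated set triggers the same ValueError, and one NaN-safe way to do the cast.

```python
import pandas as pd

# Toy stand-in for the joined entity/community frame (hypothetical data).
# Entity "b" has no community, so its value is NaN.
nodes_df = pd.DataFrame(
    {"id": ["a", "a", "b"], "community": [0.0, 3.0, float("nan")]}
)
grouped = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()


def reproduce_error(df):
    """Run the same cast as in the traceback and return the error text, if any."""
    try:
        df["community"].apply(lambda x: [str(int(i)) for i in x])
    except ValueError as exc:
        return str(exc)
    return None


def to_community_ids(values):
    """Hypothetical NaN-safe variant: skip missing values instead of casting them."""
    return sorted(str(int(v)) for v in values if not pd.isna(v))


error_message = reproduce_error(grouped)
safe_ids = grouped["community"].apply(to_community_ids).tolist()
print(error_message)
print(safe_ids)
```

Whether skipping missing values or mapping them to a placeholder ID is the better fix depends on how downstream code treats entities that belong to no community; the thread does not say.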

Additional Information

  • GraphRAG Version: 2.1.0
  • Operating System: Windows 11
  • Python Version: 3.12
  • Related Issues:

lsukharn avatar Apr 04 '25 20:04 lsukharn

We'll double-check the case where None is sent. In the meantime, try a high number like 6 (most community hierarchies top out at 4 levels deep). Dynamic selection should still perform as usual and we won't try to grab all reports.

natoverse avatar Apr 08 '25 19:04 natoverse
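
One plausible reading of why a fixed level avoids the error, offered as an assumption rather than something confirmed in this thread: if entities are filtered with a comparison such as `level <= COMMUNITY_LEVEL`, rows whose level is NaN compare as False and are dropped before the integer cast. The toy frame below is hypothetical and only demonstrates that pandas behavior.

```python
import pandas as pd

# Hypothetical joined frame: entity "b" matched no community, so both
# "community" and "level" are NaN for it.
nodes_df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "community": [0.0, 3.0, float("nan")],
        "level": [0.0, 1.0, float("nan")],
    }
)

COMMUNITY_LEVEL = 6  # high cap, as suggested in the comment above

# NaN <= 6 evaluates to False, so the unmatched row is filtered out.
filtered = nodes_df[nodes_df["level"] <= COMMUNITY_LEVEL]
remaining_ids = filtered["id"].unique().tolist()
community_ids = [str(int(c)) for c in filtered["community"]]
print(remaining_ids)
print(community_ids)
```

With no level filter at all, the NaN row would survive to the cast, which is consistent with the failure reported for `COMMUNITY_LEVEL = None`.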

Confirmed that this is working in the latest notebook

natoverse avatar Oct 06 '25 23:10 natoverse