[Bug]: `ValueError: cannot convert float NaN to integer` when running global search with dynamic selection
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
I ran this notebook on my data: https://github.com/microsoft/graphrag/blob/main/docs/examples_notebooks/global_search_with_dynamic_community_selection.ipynb
and got an error after running the following code:
api_key = os.environ["GRAPHRAG_API_KEY"]
llm_model = os.environ["GRAPHRAG_LLM_MODEL"]
api_base = os.environ["API_BASE_TEST"]
deployment_name = os.environ["GRAPHRAG_LLM_MODEL_DEPLOYMENT_NAME"]
config = LanguageModelConfig(
    api_key=api_key,
    type=ModelType.AzureOpenAIChat,
    api_base=api_base,
    api_version='2025-01-01-preview',
    model=llm_model,
    deployment_name=deployment_name,
    max_retries=20,
)
model = ModelManager().get_or_create_chat_model(
    name="global_search",
    model_type=ModelType.AzureOpenAIChat,
    config=config,
)
token_encoder = tiktoken.encoding_for_model(llm_model)
OUTPUT_DIR = "./graphrag_project/output"
COMMUNITY_REPORT_TABLE = "community_reports"
ENTITY_TABLE = "entities"
COMMUNITY_TABLE = "communities"
# we don't fix a specific community level but instead use an agent to dynamically
# search through all the community reports to check if they are relevant.
COMMUNITY_LEVEL = None
community_df = pd.read_parquet(f"{OUTPUT_DIR}/{COMMUNITY_TABLE}.parquet")
entity_df = pd.read_parquet(f"{OUTPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{OUTPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
communities = read_indexer_communities(community_df, report_df)
reports = read_indexer_reports(
    report_df,
    community_df,
    community_level=COMMUNITY_LEVEL,
    dynamic_community_selection=True,
)
entities = read_indexer_entities(
    entity_df, community_df, community_level=COMMUNITY_LEVEL
)
print(f"Total report count: {len(report_df)}")
print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)
report_df.head()
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:161, in read_indexer_entities.<locals>.<lambda>(x)
158 # group entities by id and degree and remove duplicated community IDs
159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
160 nodes_df["community"] = nodes_df["community"].apply(
--> 161 lambda x: [str(int(i)) for i in x]
162 )
163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
164 subset=["id"]
165 )
166 # read entity dataframe to knowledge model objects
ValueError: cannot convert float NaN to integer
See full error in the logs section.
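The failing conversion can be reproduced in isolation with plain pandas. This is a minimal sketch, not the graphrag code path itself: the small `nodes_df` table and the `to_community_ids` helper are hypothetical, and the sketch assumes the error comes from an entity whose `community` value is NaN (for example, an entity that belongs to no community). It shows the same `groupby`/`agg`/`apply` pattern from the traceback raising the error, followed by one possible NaN-tolerant variant.

```python
import pandas as pd

# Hypothetical stand-in for the joined entity/community table: entity "b"
# has no community, so its "community" value is NaN.
nodes_df = pd.DataFrame({
    "id": ["a", "a", "b"],
    "community": [0.0, 3.0, float("nan")],
})

# Same grouping step as in the traceback: one set of community IDs per entity.
grouped = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()

# Same conversion as the failing lambda; int(NaN) raises ValueError.
try:
    grouped["community"].apply(lambda x: [str(int(i)) for i in x])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)


def to_community_ids(values):
    """Convert community numbers to strings, skipping missing (NaN) entries.

    Hypothetical helper for illustration; not part of graphrag.
    """
    return sorted(str(int(v)) for v in values if not pd.isna(v))


# NaN-tolerant variant: entities without a community get an empty list.
safe = grouped["community"].apply(to_community_ids)
print(safe.tolist())
```

Whether entities without a community should end up with an empty list or with a sentinel value is a design choice for the library; the sketch only illustrates that the NaN has to be handled before `int()` is called.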
Steps to reproduce
No response
Expected Behavior
I should be able to use the "dynamic" part of global search. The script works when I specify the COMMUNITY_LEVEL=2, but it fails when it's None.
GraphRAG Config Used
# Paste your config here
Logs and screenshots
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[47], line 12
5 communities = read_indexer_communities(community_df, report_df)
6 reports = read_indexer_reports(
7 report_df,
8 community_df,
9 community_level=COMMUNITY_LEVEL,
10 dynamic_community_selection=True,
11 )
---> 12 entities = read_indexer_entities(
13 entity_df, community_df, community_level=COMMUNITY_LEVEL
14 )
16 print(f"Total report count: {len(report_df)}")
17 print(
18 f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
19 )
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:160, in read_indexer_entities(final_entities, final_communities, community_level)
158 # group entities by id and degree and remove duplicated community IDs
159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
--> 160 nodes_df["community"] = nodes_df["community"].apply(
161 lambda x: [str(int(i)) for i in x]
162 )
163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
164 subset=["id"]
165 )
166 # read entity dataframe to knowledge model objects
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\series.py:4924, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
4789 def apply(
4790 self,
4791 func: AggFuncType,
(...) 4796 **kwargs,
4797 ) -> DataFrame | Series:
4798 """
4799 Invoke function on values of Series.
4800
(...) 4915 dtype: float64
4916 """
4917 return SeriesApply(
4918 self,
4919 func,
4920 convert_dtype=convert_dtype,
4921 by_row=by_row,
4922 args=args,
4923 kwargs=kwargs,
-> 4924 ).apply()
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\apply.py:1427, in SeriesApply.apply(self)
1424 return self.apply_compat()
1426 # self.func is Callable
-> 1427 return self.apply_standard()
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\apply.py:1507, in SeriesApply.apply_standard(self)
1501 # row-wise access
1502 # apply doesn't have a `na_action` keyword and for backward compat reasons
1503 # we need to give `na_action="ignore"` for categorical data.
1504 # TODO: remove the `na_action="ignore"` when that default has been changed in
1505 # Categorical (GH51645).
1506 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1507 mapped = obj._map_values(
1508 mapper=curried, na_action=action, convert=self.convert_dtype
1509 )
1511 if len(mapped) and isinstance(mapped[0], ABCSeries):
1512 # GH#43986 Need to do list(mapped) in order to get treated as nested
1513 # See also GH#25959 regarding EA support
1514 return obj._constructor_expanddim(list(mapped), index=obj.index)
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
918 if isinstance(arr, ExtensionArray):
919 return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\pandas\core\algorithms.py:1743, in map_array(arr, mapper, na_action, convert)
1741 values = arr.astype(object, copy=False)
1742 if na_action is None:
-> 1743 return lib.map_infer(values, mapper, convert=convert)
1744 else:
1745 return lib.map_infer_mask(
1746 values, mapper, mask=isna(values).view(np.uint8), convert=convert
1747 )
File lib.pyx:2972, in pandas._libs.lib.map_infer()
File ~\Desktop\graphrag_repo\.venv\Lib\site-packages\graphrag\query\indexer_adapters.py:161, in read_indexer_entities.<locals>.<lambda>(x)
158 # group entities by id and degree and remove duplicated community IDs
159 nodes_df = nodes_df.groupby(["id"]).agg({"community": set}).reset_index()
160 nodes_df["community"] = nodes_df["community"].apply(
--> 161 lambda x: [str(int(i)) for i in x]
162 )
163 final_df = nodes_df.merge(entities_df, on="id", how="inner").drop_duplicates(
164 subset=["id"]
165 )
166 # read entity dataframe to knowledge model objects
ValueError: cannot convert float NaN to integer
Additional Information
- GraphRAG Version: 2.1.0
- Operating System: Windows 11
- Python Version: 3.12
- Related Issues:
We'll double-check the case where None is sent. In the meantime, try a high number like 6 (most community hierarchies top out at 4 levels deep). Dynamic selection should still perform as usual and we won't try to grab all reports.
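One plausible reading of why a numeric level sidesteps the error, sketched under assumptions rather than confirmed against the graphrag source: if the entity table is filtered with a comparison such as `level <= community_level`, rows whose level is NaN fail the comparison and are dropped, while `None` skips the filter and lets them through. The `nodes_df` table and the `filter_under_level` function below are hypothetical.

```python
import pandas as pd

# Hypothetical joined table: entity "b" belongs to no community, so both
# "community" and "level" are NaN for it.
nodes_df = pd.DataFrame({
    "id": ["a", "a", "b"],
    "community": [0.0, 3.0, float("nan")],
    "level": [0.0, 1.0, float("nan")],
})


def filter_under_level(df, community_level):
    """Assumed filter: keep rows at or below the level; None keeps everything."""
    if community_level is None:
        return df
    # NaN <= number evaluates to False, so rows with a missing level are dropped.
    return df[df["level"] <= community_level]


with_level = filter_under_level(nodes_df, 6)
without_level = filter_under_level(nodes_df, None)

# With a numeric level the NaN row is gone; with None it survives.
print(len(with_level), with_level["community"].isna().any())
print(len(without_level), without_level["community"].isna().any())
```

Under that assumption, any level at or above the deepest level in the hierarchy keeps every real community while still removing the NaN rows, which would match the behavior reported for `COMMUNITY_LEVEL=2` versus `None`.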
Confirmed that this is working in the latest notebook.