GraphML File Encoding Error with Chinese Text
### Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [ ] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
### Describe the bug
When the input document contains Chinese text, extracting entities and relationships with graph_intelligence produces encoding errors in the extracted entities and relationships. The second step of the workflow, create_base_extracted_entities, generates a DataFrame whose main content is GraphML. The Chinese characters in this GraphML are mis-encoded, so the entity names and descriptions in the final output are garbled.
The general format of the garbled text is similar to: ࠲
You can check the detailed data by inspecting the create_base_extracted_entities.parquet file generated by the index pipeline under output/timestamp/artifacts/. You can view the content using pd.read_parquet("create_base_extracted_entities.parquet")["entity_graph"][0].
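To check an artifact for this symptom programmatically, one option is to scan the GraphML string for numeric character references. This is only a minimal sketch: `find_char_refs` is a hypothetical helper, not part of GraphRAG, and the sample string is made up for illustration rather than taken from real pipeline output.

```python
import re

# Matches decimal (&#20013;) and hexadecimal (&#x4e2d;) character references.
CHAR_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);")


def find_char_refs(graphml: str) -> list:
    """Return every numeric character reference found in the string."""
    return CHAR_REF.findall(graphml)


# Illustrative sample only, not real pipeline output.
sample = '<data key="d1">&#20013;&#25991;</data>'
print(find_char_refs(sample))  # ['&#20013;', '&#25991;']
```

For a real run, the string to pass in would be the `entity_graph` value read from the parquet file as described above. A non-empty result suggests the text was escaped rather than written as UTF-8.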
### Steps to reproduce
Cause of the Encoding Error: GraphRAG depends on the networkx library, version 3 or higher. Testing with networkx 3.1, 3.2, and 3.3 revealed a bug in the generate_graphml() method: it does not correctly apply UTF-8 encoding when serializing the graph.
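The mechanism can be reproduced with the standard library alone. The sketch below assumes, based on the `__str__` method quoted under Solution 2, that generate_graphml() ends up calling `xml.etree.ElementTree.tostring` without an encoding argument. It does not call networkx itself, and the element name and text are made up for illustration.

```python
from xml.etree.ElementTree import Element, tostring

node = Element("data")
node.text = "中文"

# With no encoding argument, tostring() serializes as US-ASCII, so every
# non-ASCII character is written as a numeric character reference.
default_output = tostring(node).decode("utf-8")
print(default_output)  # <data>&#20013;&#25991;</data>
```

Decoding the bytes as UTF-8 afterwards does not help, because the characters were already replaced by references during serialization.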
Solution 1:
Modify the source code of Graphrag. In graphrag/index/verbs/entities/extraction/strategies/graph_intelligence/run_graph_intelligence.py, replace the line:
graph_data = "".join(nx.generate_graphml(graph))
with:
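# Write the graph to a temporary file with explicit UTF-8 encoding.
path = "./graphml"
nx.write_graphml(graph, path, encoding="utf-8")

def read_graphml_by_line(file_path):
    # Read the file back as UTF-8 and yield each stripped line.
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip() + " "

# Join the lines back into a single GraphML string.
graph_data = "".join(read_graphml_by_line(path))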
Solution 2:
Modify the generate_graphml() method in the networkx library.
In the GraphMLWriter(GraphML) class, replace:
def __str__(self):
    from xml.etree.ElementTree import tostring
    if self.prettyprint:
        self.indent(self.xml)
    s = tostring(self.xml).decode(self.encoding)
    return s
with:
def __str__(self):
    from xml.etree.ElementTree import tostring
    if self.prettyprint:
        self.indent(self.xml)
    s = tostring(self.xml, encoding=self.encoding).decode(self.encoding)
    return s
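The effect of this one-line change can be checked outside networkx. The following is a rough sketch using only `xml.etree.ElementTree`; the `encoding` variable stands in for `self.encoding` and is assumed to be `"utf-8"`, and the element is made up for illustration.

```python
from xml.etree.ElementTree import Element, fromstring, tostring

node = Element("data")
node.text = "中文"
encoding = "utf-8"  # assumption: stands in for self.encoding

# Original expression: serialize as US-ASCII, then decode.
before_bytes = tostring(node)
before = before_bytes.decode(encoding)

# Patched expression: serialize in the target encoding, then decode.
after_bytes = tostring(node, encoding=encoding)
after = after_bytes.decode(encoding)

print("&#" in before)  # True: characters were replaced by references
print("中文" in after)  # True: characters are kept as-is

# Both forms parse back to the same text, so only the string form differs.
assert fromstring(before_bytes).text == fromstring(after_bytes).text == "中文"
```

The final assertion suggests that no data is lost in either form. The difference only matters when the serialized string is displayed or stored without going through an XML parser.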
### Expected Behavior
_No response_
### GraphRAG Config Used
```yaml
# Paste your config here
```
### Logs and screenshots
_No response_
### Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
Can you try this with release 0.2.2? We believe a number of encoding errors are resolved with this release.
I'm having the same problem. Updating to the latest version didn't solve it.
Hi! We just released 0.3.0 with a fix to address Unicode characters. Can you please try with that version?
The problem remains after updating to version 0.3.0 on my local machine.
The output of nx.generate_graphml is encoded with HTML entities (numeric character references), so it looks garbled when displayed directly. It needs to be decoded with html.unescape().
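A short sketch of that decoding step follows. `html.unescape` is a standard-library function; `decode_non_ascii_refs` is a hypothetical helper added here for the case where the whole GraphML string is decoded, since `html.unescape` would also turn `&amp;` and `&lt;` back into raw characters and could leave the XML malformed. The sample strings are made up for illustration.

```python
import html
import re

# Numeric character references: decimal (group 1) or hexadecimal (group 2).
CHAR_REF = re.compile(r"&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));")


def decode_non_ascii_refs(text: str) -> str:
    """Decode only references to non-ASCII characters.

    Named entities such as &amp; and references to ASCII characters stay
    escaped, so XML markup in the string is not disturbed.
    """
    def replace(match):
        dec, hexa = match.groups()
        code = int(dec) if dec is not None else int(hexa, 16)
        return chr(code) if 127 < code <= 0x10FFFF else match.group(0)

    return CHAR_REF.sub(replace, text)


# A single field value: html.unescape is enough.
print(html.unescape("&#20013;&#25991;"))  # 中文

# A whole document: decode only the non-ASCII references.
escaped = "<data>&#20013;&#25991; &amp; A&#60;B</data>"
print(decode_non_ascii_refs(escaped))  # <data>中文 &amp; A&#60;B</data>
```

Parsing the GraphML with an XML parser is another option, since the parser resolves these references on its own.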