graphrag GraphML File Encoding Error with Chinese Text

Do you need to file an issue?

[X] I have searched the existing issues and this bug is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[ ] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When the input document contains Chinese text, extracting entities and relationships with graph_intelligence produces encoding errors. The second step of the workflow, create_base_extracted_entities, generates a DataFrame whose main content is GraphML. Chinese characters in this GraphML are encoded incorrectly, so the names and descriptions of the entities in the final output are garbled.

The general format of the garbled text is similar to: &#2098

You can check the detailed data by inspecting the create_base_extracted_entities.parquet file generated by the index pipeline under output/timestamp/artifacts/. You can view the content using pd.read_parquet("create_base_extracted_entities.parquet")["entity_graph"][0].
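For anyone who wants to check their own output, here is a rough sketch of that inspection. The helper names (find_char_refs, load_first_entity_graph) are made up for illustration and are not part of GraphRAG. It assumes pandas with a parquet engine is installed, and that the garbled text takes the form of numeric character references such as &#20013;.

```python
import re


def find_char_refs(graphml: str, limit: int = 5) -> list[str]:
    """Return up to `limit` numeric character references (e.g. '&#20013;') found in the text."""
    return re.findall(r"&#\d+;", graphml)[:limit]


def load_first_entity_graph(parquet_path: str) -> str:
    """Read the first GraphML string from the 'entity_graph' column of the artifact."""
    import pandas as pd  # imported lazily so find_char_refs works without pandas

    return pd.read_parquet(parquet_path)["entity_graph"][0]


# Example with an inline string; replace it with
# load_first_entity_graph("create_base_extracted_entities.parquet") to check real output.
sample = '<data key="d0">&#20013;&#25991;</data>'
print(find_char_refs(sample))
```

A non-empty result suggests the GraphML was serialized with escaped characters rather than UTF-8 text.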

Steps to reproduce

Cause of the Encoding Error: Graphrag depends on the networkx library, version 3 or higher. Testing with networkx 3.1, 3.2, and 3.3 revealed a bug in the generate_graphml() method: it fails to correctly use UTF-8 encoding.

Solution 1: Modify the source code of Graphrag. In graphrag/index/verbs/entities/extraction/strategies/graph_intelligence/run_graph_intelligence.py, replace the line:

graph_data = "".join(nx.generate_graphml(graph))

with:

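path = "./graphml"
# Write the graph to disk as UTF-8 instead of serializing it in memory.
nx.write_graphml(graph, path, encoding="utf-8")

def read_graphml_by_line(file_path):
    """Yield each line of the file, stripped and followed by a single space."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip() + " "

# Read the file back and join the lines into one GraphML string.
graph_data = "".join(read_graphml_by_line(path))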

Solution 2: Modify the generate_graphml() method in the networkx library.
In the GraphMLWriter(GraphML) class, replace:

def __str__(self):
    from xml.etree.ElementTree import tostring

    if self.prettyprint:
        self.indent(self.xml)
    s = tostring(self.xml).decode(self.encoding)
    return s

with:

def __str__(self):
    from xml.etree.ElementTree import tostring

    if self.prettyprint:
        self.indent(self.xml)
    s = tostring(self.xml, encoding=self.encoding).decode(self.encoding)
    return s
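The difference between the two versions can be seen without networkx at all. This is a minimal sketch using only the standard library. The element name and attribute are just placeholders that resemble a GraphML data node.

```python
from xml.etree.ElementTree import Element, tostring

# A stand-in for one GraphML <data> node carrying Chinese text.
node = Element("data", {"key": "d0"})
node.text = "中文"

# Without an explicit encoding, tostring() serializes as US-ASCII and
# replaces non-ASCII characters with numeric character references.
default_output = tostring(node).decode("utf-8")

# Passing encoding="utf-8" keeps the characters as real UTF-8 text.
utf8_output = tostring(node, encoding="utf-8").decode("utf-8")

print(default_output)
print(utf8_output)
```

The first line printed contains &#...; references, while the second contains the original characters. Both are valid XML for the same content, so an XML parser reads them identically. The garbling only shows up when the string is displayed or used as raw text.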


### Expected Behavior

_No response_

### GraphRAG Config Used

```yaml
# Paste your config here
```

Logs and screenshots

No response

Additional Information

GraphRAG Version:
Operating System:
Python Version:
Related Issues:

Aug 09 '24 07:08 gufengdong

Can you try this with release 0.2.2? We believe a number of encoding errors are resolved with this release.

Aug 09 '24 17:08 natoverse

I'm having the same problem. Updating to the latest version didn't solve it.

Aug 12 '24 09:08 zhuzixiao

Hi! We just released 0.3.0 with a fix to address Unicode characters. Can you please try with that version?

Aug 13 '24 00:08 AlonsoGuevara

> Hi! We just released 0.3.0 with a fix to address Unicode characters. Can you please try with that version?

The problem remains after updating to version 0.3.0 on my local computer.

Aug 16 '24 14:08 zijinyuan

The output of nx.generate_graphml is encoded with HTML entities (numeric character references), so it looks garbled when displayed directly. It needs to be decoded with html.unescape().
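A small sketch of that decoding step, using a made-up GraphML fragment. Note that html.unescape() also decodes named entities such as &amp;amp;, which can leave the string invalid as XML if it is parsed again afterwards. The decode_numeric_refs helper below is a hypothetical alternative (not part of any library) that only touches numeric references.

```python
import html
import re

_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Decode only numeric character references, leaving &amp;, &lt; etc. untouched."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, text)


escaped = '<data key="d0">&#20013;&#25991; &amp; more</data>'

# Decodes everything, including &amp; (fine for display, risky before re-parsing).
print(html.unescape(escaped))

# Decodes only the numeric references, so the result is still well-formed XML here.
print(decode_numeric_refs(escaped))
```

html.unescape() is probably enough for text that is only displayed. The narrower helper may be the safer choice when the GraphML string is passed to an XML parser later.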

Aug 17 '24 03:08 zijinyuan