Removing multiple spaces using regex not working in Python
API
Python
Description
The common regex to match one or more whitespace characters is \s+. When using the Python API in Kùzu, this doesn't seem to work. I'm unsure whether this is due to passing the regular expression from Python into the C++ layer, or whether the functionality isn't implemented at the C++ level.
Here's what I'm trying to do. I have this JSON file, persons.json:
[
  {
    "name": "Xavier Williams",
    "date_of_birth": "1989-06-28",
    "residence": {
      "address": "901 Oak Lane",
      "city": "San Francisco",
      "state": "CA",
      "zip_code": "94122"
    }
  }
]
I want to generate a unique string primary key that's a concatenation of all the address fields, that I can then use for relationship creation downstream.
During this process, I want to remove all whitespace. As seen in the raw data example above, there's an arbitrary number of whitespace characters in the string 901 Oak Lane.
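As a reference for the intended result, here is a minimal pure-Python sketch of the key described above. The function name `make_residence_id` is hypothetical, and the extra spaces in the sample address are added for illustration only; nothing here touches Kùzu.

```python
import re


def make_residence_id(residence: dict) -> str:
    """Concatenate the lowercased, whitespace-free address fields plus the zip code."""
    parts = [residence["address"], residence["city"], residence["state"]]
    # r"\s+" matches one or more whitespace characters; replace each run with nothing.
    cleaned = [re.sub(r"\s+", "", part.lower()) for part in parts]
    return "".join(cleaned) + residence["zip_code"]


# Sample record modelled on persons.json (extra spaces are illustrative).
residence = {
    "address": "901  Oak   Lane",
    "city": "San Francisco",
    "state": "CA",
    "zip_code": "94122",
}

print(make_residence_id(residence))  # 901oaklanesanfranciscoca94122
```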
I'm doing this right now:
LOAD FROM 'persons.json'
WITH DISTINCT residence AS r
RETURN
    concat(
        regexp_replace(lower(r.address), '\s+', ''),
        regexp_replace(lower(r.city), '\s+', ''),
        regexp_replace(lower(r.state), '\s+', ''),
        r.zip_code
    ) AS id,
    r.address,
    r.city,
    r.state,
    r.zip_code
And I get this error:
/Users/prrao/code/ingest_kuzu.py:73: SyntaxWarning: invalid escape sequence '\s'
conn.execute("""
Traceback (most recent call last):
File "/Users/prrao/code/ingest_kuzu.py", line 73, in <module>
conn.execute("""
File "/Users/prrao/code/.venv/lib/python3.12/site-packages/kuzu/connection.py", line 130, in execute
_query_result = self._connection.query(query)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Parser exception: Invalid input <concat(
regexp_replace(lower(r.address), '>: expected rule oC_ProjectionItem (line: 7, offset: 49)
" regexp_replace(lower(r.address), '\s+', ''),"
If I run the same regex replacement in pure Python, it works as intended:
>>> import re
>>> original_string = "901 Oak Lane"
>>> cleaned_string = re.sub(r'\s+', '', original_string).strip().lower()
>>> cleaned_string
'901oaklane'
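One difference worth noting: the pure-Python call uses a raw string (r'\s+'), while the SyntaxWarning in the traceback suggests the query was written in a regular triple-quoted string, where \s is not a recognized Python escape. The sketch below builds the query text with raw strings so Python leaves the backslash untouched. Whether Kùzu's query parser also needs the backslash doubled inside the Cypher string literal is an assumption to verify, not something confirmed in this report. `build_query` and the variable names are hypothetical, and the sketch only constructs strings; it does not call Kùzu.

```python
# Raw literals: Python does not interpret the backslash, so no SyntaxWarning.
PATTERN_SINGLE = r"\s+"    # one backslash reaches the query text
PATTERN_DOUBLE = r"\\s+"   # two backslashes reach the query text (assumed escape form)


def build_query(pattern: str) -> str:
    """Return a shortened version of the query with the given regex pattern embedded."""
    return (
        "LOAD FROM 'persons.json'\n"
        "WITH DISTINCT residence AS r\n"
        f"RETURN regexp_replace(lower(r.address), '{pattern}', '') AS address_key"
    )


query_single = build_query(PATTERN_SINGLE)
query_double = build_query(PATTERN_DOUBLE)

# Inspect exactly what would be handed to conn.execute(...)
print(query_single)
print(query_double)
```

Printing the query before executing it shows exactly which characters reach the database, which helps separate a Python-level escaping problem from a parser-level one.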