Removing multiple spaces using regex not working in Python

Open • prrao87 opened this issue 4 months ago • 1 comment

API

Python

Description

The usual regex for matching a run of one or more whitespace characters is \s+. When I use it through the Python API in Kùzu, it doesn't work. I'm unsure whether the problem lies in passing the regular expression from Python into the C++ layer, or whether the functionality isn't implemented at the C++ level.

Here's what I'm trying to do. I have this JSON file persons.json:

[
  {
    "name": "Xavier Williams",
    "date_of_birth": "1989-06-28",
    "residence": {
      "address": "901  Oak    Lane",
      "city": "San Francisco",
      "state": "CA",
      "zip_code": "94122"
    }
  }
]

I want to generate a unique string primary key by concatenating all the address fields, which I can then use for relationship creation downstream.

During this process, I want to remove all whitespace. As the raw data above shows, the address string "901  Oak    Lane" contains runs of multiple whitespace characters.

I'm doing this right now:

LOAD FROM 'persons.json'
WITH DISTINCT residence AS r
RETURN
    concat(
        regexp_replace(lower(r.address), '\s+', ''),
        regexp_replace(lower(r.city), '\s+', ''),
        regexp_replace(lower(r.state), '\s+', ''),
        r.zip_code
    ) AS id,
    r.address,
    r.city,
    r.state,
    r.zip_code

And I get this error:

/Users/prrao/code/ingest_kuzu.py:73: SyntaxWarning: invalid escape sequence '\s'
  conn.execute("""
Traceback (most recent call last):
  File "/Users/prrao/code/ingest_kuzu.py", line 73, in <module>
    conn.execute("""
  File "/Users/prrao/code/.venv/lib/python3.12/site-packages/kuzu/connection.py", line 130, in execute
    _query_result = self._connection.query(query)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Parser exception: Invalid input <concat(
                regexp_replace(lower(r.address), '>: expected rule oC_ProjectionItem (line: 7, offset: 49)
"                regexp_replace(lower(r.address), '\s+', ''),"

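A note on the two messages in that output: the SyntaxWarning is raised by Python (the query sits in an ordinary, non-raw string literal), while the parser exception is raised by Kùzu. Python still passes an unrecognized escape such as \s through unchanged, so the query text Kùzu received already contained '\s+', as the quoted line in the exception shows. The sketch below only illustrates the Python side, using a cut-down version of the query; the alias address_key is hypothetical. Whether Kùzu's parser needs the backslash doubled inside the Cypher string literal is an assumption that this issue does not confirm.

```python
# Option 1: raw string literal. Python leaves the backslash untouched,
# so no SyntaxWarning is emitted.
query_raw = r"""
LOAD FROM 'persons.json'
WITH DISTINCT residence AS r
RETURN regexp_replace(lower(r.address), '\s+', '') AS address_key
"""

# Option 2: ordinary string literal with the backslash doubled for Python.
query_plain = """
LOAD FROM 'persons.json'
WITH DISTINCT residence AS r
RETURN regexp_replace(lower(r.address), '\\s+', '') AS address_key
"""

# Both literals produce identical query text containing a single backslash.
assert query_raw == query_plain

# Hypothetical variant: double the backslash in the query text itself, in case
# the Cypher parser treats backslash as its own escape character (an assumption).
query_double = query_raw.replace("\\s+", "\\\\s+")
print(query_double)
```

Either of the first two forms removes the Python warning. Only the last variant changes what the database actually receives.
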
If I run the same regex replacement in pure Python, it works as intended:

>>> import re
>>> original_string = "901  Oak    Lane"
>>> cleaned_string = re.sub(r'\s+', '', original_string).strip().lower()
>>> cleaned_string
'901oaklane'
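
For reference, here is a minimal sketch of the key I'm trying to produce, computed in pure Python from the record above. The helper name residence_key is hypothetical, and this is only meant to show the expected result, not how Kùzu should implement it.

```python
import json
import re

# The sample record from persons.json, inlined so the sketch is self-contained.
raw = """
[
  {
    "name": "Xavier Williams",
    "date_of_birth": "1989-06-28",
    "residence": {
      "address": "901  Oak    Lane",
      "city": "San Francisco",
      "state": "CA",
      "zip_code": "94122"
    }
  }
]
"""

WHITESPACE = re.compile(r"\s+")


def residence_key(residence: dict) -> str:
    """Concatenate the lowercased address fields with all whitespace removed."""
    parts = [residence["address"], residence["city"], residence["state"]]
    cleaned = [WHITESPACE.sub("", part.lower()) for part in parts]
    # zip_code is appended unchanged, mirroring the Cypher query above.
    return "".join(cleaned) + residence["zip_code"]


persons = json.loads(raw)
keys = [residence_key(person["residence"]) for person in persons]
print(keys)  # ['901oaklanesanfranciscoca94122']
```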

prrao87 • Oct 01 '24 22:10