gittables
syntactic processing
In table_annotator.py on line 632, we process the original column name to match the name against the ontologies of DBpedia and Schema. The original column names are processed using the code below:
cleaned_table_columns = [
    re.sub(r"[_-]", " ", " ".join(
        re.findall("[0-9,a-z,.,\"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)", col)
    )).lower() for col in table_columns.copy()
]
I wonder whether the replacement argument of the re.sub() call, currently a space (" "), should be an empty string (""). The regex inside findall already matches _ and -, so " ".join() already inserts a space next to the matched _ or - while keeping that character in the string. The re.sub(r"[_-]", " ", ...) call then replaces the _ or - with another " ".
For example, "Team-Name" would be converted into "team   name", with multiple spaces between 'team' and 'name'. Is this the desired behaviour and am I missing something, or is this a bug?
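To make this easier to check, here is a minimal, self-contained sketch that reproduces the expression above on a few column names. The function names `clean_current` and `clean_collapsed` are my own and not from table_annotator.py, and the pattern is written as a raw string, which I believe is equivalent to the original literal. `clean_collapsed` is only one possible fix (collapsing runs of whitespace afterwards), not necessarily the intended one.

```python
import re

# Pattern copied from the snippet above, written as a raw string
# (assumed equivalent to the original non-raw literal).
TOKEN_PATTERN = r"[0-9,a-z,.,\"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)"


def clean_current(col: str) -> str:
    """Mirror the current expression: join tokens, then turn _ and - into spaces."""
    return re.sub(r"[_-]", " ", " ".join(re.findall(TOKEN_PATTERN, col))).lower()


def clean_collapsed(col: str) -> str:
    """Hypothetical variant: additionally collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", clean_current(col)).strip()


if __name__ == "__main__":
    # repr() makes the number of spaces visible in the output.
    for name in ["Team-Name", "team_name", "TeamName"]:
        print(name, repr(clean_current(name)), repr(clean_collapsed(name)))
```

With this sketch, "team_name" comes out with a single space because the whole name is matched as one token, while "Team-Name" ends up with more than one space because the - is matched as a token of its own. Replacing " " with "" in re.sub() alone would turn "team_name" into "teamname", so collapsing whitespace might be the safer change.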