syntactic processing

Open SennR-1952135 opened this issue 2 years ago • 0 comments

In table_annotator.py, on line 632, the original column names are processed so that they can be matched against the DBpedia and Schema.org ontologies. The processing uses the code below:

cleaned_table_columns = [
    re.sub(r"[_-]", " ", " ".join(
        re.findall("[0-9,a-z,.,\"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)", col)
    )).lower() for col in table_columns.copy()
]

I wonder whether the replacement argument of the re.sub() call, currently a space (" "), should be an empty string ("") instead. The regex inside findall already matches _ and -, so a separator can come back as its own token, and " ".join() then surrounds it with spaces. The join keeps the matched _ or - in the string, and re.sub(r"[_-]", " ", ...) then replaces it with yet another space.

For example, "Team-Name" is converted into "team   name", with three spaces between "team" and "name". Is this the desired behaviour and am I missing something, or is this a bug?
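The behaviour can be checked by running the quoted expression on a few sample names. This is a minimal sketch, assuming the pattern is copied unchanged from the snippet above. The helpers clean_column and clean_column_collapsed, and the replacement parameter, are hypothetical names introduced only for this check and are not part of table_annotator.py.

```python
import re

# Pattern copied from the quoted snippet, written as a raw string so the
# backslash escapes reach the regex engine unchanged.
TOKEN_PATTERN = r'[0-9,a-z,.,"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)'


def clean_column(col, replacement=" "):
    """Hypothetical helper: apply the quoted expression to one column name."""
    tokens = re.findall(TOKEN_PATTERN, col)
    return re.sub(r"[_-]", replacement, " ".join(tokens)).lower()


def clean_column_collapsed(col):
    """Hypothetical variant: keep the current replacement, then collapse
    runs of whitespace into a single space."""
    return " ".join(clean_column(col).split())


for name in ["Team-Name", "team-name"]:
    print(repr(name))
    print("  tokens:         ", re.findall(TOKEN_PATTERN, name))
    print("  replacement ' ':", repr(clean_column(name, " ")))
    print("  replacement '': ", repr(clean_column(name, "")))
    print("  collapsed:      ", repr(clean_column_collapsed(name)))
```

In this sketch "Team-Name" is split into three tokens, so the separator is padded by the join before it is replaced. "team-name" is matched as a single token, so for all-lowercase names the re.sub() call is the only step that inserts a space, and an empty replacement would merge the two words. Collapsing whitespace afterwards is shown only as one possible alternative, not as the project's intended fix.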

SennR-1952135, Mar 15 '22 18:03