text2sql-data icon indicating copy to clipboard operation
text2sql-data copied to clipboard

Anonymised Variables should have consistent naming corresponding to their column

Open DeNeutoy opened this issue 5 years ago • 4 comments

It's a little annoying that the anonymised variable names sometimes but not always correspond to the table/column name they come from. E.g in some datasets like academic, the variable name is derived from the column name:

        "sql": [
            "SELECT JOURNALalias0.HOMEPAGE FROM JOURNAL AS JOURNALalias0 WHERE JOURNALalias0.NAME = \"journal_name0\" ;"
        ],
        "variables": [
            {
                "example": "PVLDB",
                "location": "both",
                "name": "journal_name0",
                "type": "journal_name"
            }
        ]

whereas in geography, variables are named var1, from which you cannot directly infer their type from either the name or the type key.

        "sql": [
            "SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE CITYalias0.POPULATION = ( SELECT MAX( CITYalias1.POPULATION ) FROM CITY AS CITYalias1 WHERE CITYalias1.STATE_NAME = \"var0\" ) AND CITYalias0.STATE_NAME = \"var0\" ;"
        ],
        "variables": [
            {
                "example": "arizona",
                "location": "both",
                "name": "var0",
                "type": "state"
            }
        ]

DeNeutoy avatar Nov 12 '18 23:11 DeNeutoy

I've worked on addressing this in #27 have a look and let me know what you think!

jkkummerfeld avatar Nov 15 '18 19:11 jkkummerfeld

Hm, though having something that directly maps to the DB for all cases is trickier. More thought required.

jkkummerfeld avatar Nov 15 '18 19:11 jkkummerfeld

Hmm yeah I also found this once I dug into it more - e.g the limit0 variables in the scholar dataset are really a function of the query rather than particular to the database. What you've done for geography looks like an improvement though!

DeNeutoy avatar Nov 15 '18 20:11 DeNeutoy

I've merged that for now, but will keep this open as a reminder that this issue requires more work. My thinking is that I could do the following:

  • For all variables that clearly map to a single field, rename them, specifying both the table and field.
  • For variables that can be used in multiple ways, change them to have a different name in each case.
  • For variables that do not have a single mapping (e.g. they are used in multiple ways in the query) have a special case.

That would be an improvement over the current state, though would also be a fair amount of work.

jkkummerfeld avatar Nov 17 '18 18:11 jkkummerfeld