smartnoise-sdk
smartnoise-sdk copied to clipboard
Rewriter should use NameCompare
The rewriter currently does not apply any engine-specific name comparison rules, despite being guaranteed to have a NameCompare
class available. The name comparison rules are currently applied only at symbol resolution time, which happens before and after rewriting.
This can lead to duplicate column names in the rewritten query. For example:
sample = Table("PUMS", "PUMS", [
Int('pid', is_key=True),
Int('"PiD"')
], 150)
meta = CollectionMetadata([sample], "csv")
query = 'SELECT COUNT (DISTINCT PID) AS foo, COUNT(DISTINCT "PiD") AS bar FROM PUMS.PUMS'
meta.compare = PostgresNameCompare()
reader = PostgresReader("localhost", "PUMS", "admin", "password")
private_reader = PrivateReader(reader, meta, 1.0)
inner, outer = private_reader.rewrite(query) # Fails!!
ne = outer.select.namedExpressions
assert(ne[0].expression.expression.name == 'keycount')
assert(ne[1].expression.expression.name != 'keycount')
In the above, the metadata provided by the data curator has a pid
and a PiD
, which can be separate columns according to Postgres rules. The query provided by the user, however, counts a PID
column, which will resolve to pid
at symbol resolution time, but results in duplicate columns in the rewriter. The rewriter produces 3 columns: A 'pid' (for the keycount
), a PID
(for the user query), and a PiD
for the case-sensitive column. At symbol resolution time, this fails, because the rewritten query has two columns that resolve to pid
.
This occurs because push_scope
does not apply name comparison rules, and therefore does not know that COUNT (DISTINCT pid)
and COUNT (DISTINCT PID)
are counting the same thing. If name comparison rules were used, the column would only be queried once, and would be reused as needed.
Probably the best way to fix this would be to have the hash table in NameScope
use the equality rules from the NameCompare
. The hash table for the naming scope handles things like expressions, and the name-comparison rules would only need to apply to identifiers. This would allow us to keep the current contract of NameCompare.identifier_match
, and would cover all cases, as long as the engine-specific rules are implemented faithfully.
Another fix could be to canonicalize names used in the rewriter. For example, when a user-provided query asks the rewriter to access a column named PID
, the rewriter could scan the curator-provided metadata and look for the first column name that matches (e.g. pid
), and then use the column name from the metadata in the rewritten query. This is probably a good idea even if the hash table implements name compare rules.
If the hash table doesn't implement name compare rules, canonicalization based on metadata would not cover all cases. For computed columns which don't map to TableColumn
, the rewriter would need to use some heuristic like .lower()
or .upper()
. We could extend the NameCompare
contract to include a canonicalize()
method, to allow more fine-tuned canonicalization that honors escaping rules. However, this seems like a lot of additional complexity.