datahub
datahub copied to clipboard
Parsing Power BI queries that contain comments (aka --) causes it to ignore anything after
Describe the bug When we ingest a power bi report that has datasets that contain -- (2 hyphens), the SQL parser ignores everything after it in the upstream lineage. I am using version 14.0.3 of the CLI
EDIT: It also doesn't seem to parse SQL that has hard new lines instead of #(lf) (second screenshot)
To Reproduce
Example query
select#(lf)upper(cs.customercode) as customercode#(lf), cs.ear2id as ear2id#(lf), db.branch_rollup_name#(lf) , db.subregion1#(lf) , db.subregion2#(lf)from nast_truckload_domain.broker.dim_customer cs#(lf)--join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true#(lf)join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true#(lf)join nast_customer_domain.broker.ref_branch db on db.Branch_code = so.branch_code#(lf)-- join nast_customer_domain.broker.customer_long_term_value as clv on cs.ear2id = clv.ear2_id#(lf)-- join nast_customer_domain.broker.customer_order_load cl on cl.customer_code = cs.customercode and cl.revenue>0 and cl.activity_date>= '2023-01-01'#(lf)where cs.customerstatusid = 1 --active#(lf)-- and cs.customercode = 'C8817970'#(lf)and db.primary_business_line_id in ('62','73')#(lf)-- and cs.ear2id = 199
It's a little easier to read if you replace the #(lf) with a new line.
select upper(cs.customercode) as customercode , cs.ear2id as ear2id , db.branch_rollup_name , db.subregion1 , db.subregion2 from nast_truckload_domain.broker.dim_customer cs --join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true join nast_customer_domain.broker.ref_branch db on db.Branch_code = so.branch_code -- join nast_customer_domain.broker.customer_long_term_value as clv on cs.ear2id = clv.ear2_id -- join nast_customer_domain.broker.customer_order_load cl on cl.customer_code = cs.customercode and cl.revenue>0 and cl.activity_date>= '2023-01-01' where cs.customerstatusid = 1 --active -- and cs.customercode = 'C8817970' and db.primary_business_line_id in ('62','73') -- and cs.ear2id = 199
Only the first table shows up because the line after it is commented out. The 3rd and 4th lines of the FROM clause are ignored.
I have tested removing the comments and they show up in the lineage. If there is a comment in the SELECT clause, none of the tables show upstream in the lineage.
Expected behavior I would expect that the parser removes any lines of code from a -- to a new line or #(lf) character as #(lf) is how power bi represents new lines.
Screenshots
Example of query with new lines where lineage doesn't work. If I remove the new lines, it works fine.
Desktop (please complete the following information):
- OS: [e.g. iOS] Windows 11
- Browser [e.g. chrome, safari] Chrome
- Version [e.g. 22]
Additional context Add any other context about the problem here.