datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Parsing Power BI queries that contain comments (aka --) causes it to ignore anything after

Open aaverbec opened this issue 6 months ago • 1 comments

Describe the bug When we ingest a power bi report that has datasets that contain -- (2 hyphens), the SQL parser ignores everything after it in the upstream lineage. I am using version 14.0.3 of the CLI

EDIT: It also doesn't seem to parse SQL that has hard new lines instead of #(lf) (second screenshot)

To Reproduce

Example query

select#(lf)upper(cs.customercode) as customercode#(lf), cs.ear2id as ear2id#(lf), db.branch_rollup_name#(lf) , db.subregion1#(lf) , db.subregion2#(lf)from nast_truckload_domain.broker.dim_customer cs#(lf)--join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true#(lf)join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true#(lf)join nast_customer_domain.broker.ref_branch db on db.Branch_code = so.branch_code#(lf)-- join nast_customer_domain.broker.customer_long_term_value as clv on cs.ear2id = clv.ear2_id#(lf)-- join nast_customer_domain.broker.customer_order_load cl on cl.customer_code = cs.customercode and cl.revenue>0 and cl.activity_date>= '2023-01-01'#(lf)where cs.customerstatusid = 1 --active#(lf)-- and cs.customercode = 'C8817970'#(lf)and db.primary_business_line_id in ('62','73')#(lf)-- and cs.ear2id = 199

It's a little easier to read if you replace the #(lf) with a new line.

select upper(cs.customercode) as customercode , cs.ear2id as ear2id , db.branch_rollup_name , db.subregion1 , db.subregion2 from nast_truckload_domain.broker.dim_customer cs --join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so on cs.ear2id = so.ear2_id and is_current = true join nast_customer_domain.broker.ref_branch db on db.Branch_code = so.branch_code -- join nast_customer_domain.broker.customer_long_term_value as clv on cs.ear2id = clv.ear2_id -- join nast_customer_domain.broker.customer_order_load cl on cl.customer_code = cs.customercode and cl.revenue>0 and cl.activity_date>= '2023-01-01' where cs.customerstatusid = 1 --active -- and cs.customercode = 'C8817970' and db.primary_business_line_id in ('62','73') -- and cs.ear2id = 199

Only the first table shows up because the line after it is commented out. The 3rd and 4th lines of the FROM clause are ignored. image

I have tested removing the comments and they show up in the lineage. If there is a comment in the SELECT clause, none of the tables show upstream in the lineage.

Expected behavior I would expect that the parser removes any lines of code from a -- to a new line or #(lf) character as #(lf) is how power bi represents new lines.

Screenshots

Example of query with new lines where lineage doesn't work. If I remove the new lines, it works fine. image

Desktop (please complete the following information):

  • OS: [e.g. iOS] Windows 11
  • Browser [e.g. chrome, safari] Chrome
  • Version [e.g. 22]

Additional context Add any other context about the problem here.

aaverbec avatar Aug 27 '24 15:08 aaverbec