sqllineage
sqllineage copied to clipboard
SQLLineageException when sourcing view from OPENROWSET() (T-SQL)
As mentioned in my other ticket, this library rules. Thank you again :)
T-SQL (as used in Azure Synapse Serverless) lets us define a view as being a file that sits on a filesystem. An example of such a view is the below:
CREATE OR ALTER VIEW [etdl_raw].[exo_calibration_metadata] AS
SELECT * FROM
-- Your parser bugs out here:
OPENROWSET(
BULK (
'/lake-raw/exo_calibration/metadata/**'
, '/lake-raw/this_file_doesnt_exist.parquet'
)
,DATA_SOURCE = 'datasource_etdllake'
,FORMAT = 'Parquet'
-- Contents of the file columns are defined in the below:
) WITH (
[filepath] VARCHAR(255)
,[last_calibration_time] VARCHAR(255)
,[calibration_start_time] VARCHAR(255)
,[calibration_end_time] VARCHAR(255)
) AS t
;
GO
Currently, this gives us the below output:
PS C:\proj> sqllineage -f '.\above_file.sql'
Traceback (most recent call last):
File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Python310\Scripts\sqllineage.exe\__main__.py", line 7, in <module>
File "C:\Python310\lib\site-packages\sqllineage\cli.py", line 86, in main
runner.print_table_lineage()
File "C:\Python310\lib\site-packages\sqllineage\runner.py", line 166, in print_table_lineage
print(str(self))
File "C:\Python310\lib\site-packages\sqllineage\runner.py", line 22, in wrapper
self._eval()
File "C:\Python310\lib\site-packages\sqllineage\runner.py", line 179, in _eval
self._stmt_holders = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
File "C:\Python310\lib\site-packages\sqllineage\runner.py", line 179, in <listcomp>
self._stmt_holders = [LineageAnalyzer().analyze(stmt) for stmt in self._stmt]
File "C:\Python310\lib\site-packages\sqllineage\core\analyzer.py", line 60, in analyze
self._extract_from_dml(stmt, AnalyzerContext())
File "C:\Python310\lib\site-packages\sqllineage\core\analyzer.py", line 135, in _extract_from_dml
next_handler.handle(sub_token, holder)
File "C:\Python310\lib\site-packages\sqllineage\core\handlers\base.py", line 39, in handle
self._handle(token, holder)
File "C:\Python310\lib\site-packages\sqllineage\core\handlers\source.py", line 56, in _handle
self._handle_table(token, holder)
File "C:\Python310\lib\site-packages\sqllineage\core\handlers\source.py", line 78, in _handle_table
raise SQLLineageException(
sqllineage.exceptions.SQLLineageException: An Identifier is expected, got Function[value: OPENROWSET(
BULK
(
'/lake-raw/exo_calibration/metadata/**'
, '/lake-raw/this_file_doesnt_exist.parquet'
)
,DATA_SOURCE = 'datasource_etdllake'
,FORMAT = 'Parquet'
)] instead.
From what I can tell, I need to amend this function with an extra elif
to handle the above. I'll give it a go and keep this open for tracking purposes.
With object storage / data lake becoming more and more mature, a lot of database/data warehouse are supporting the query pattern of selecting directly from files. sqllineage made some initial support on this, see #164 for details.
If we're going to support OPENROWSET, then I'd like to parse the path as source. Please provide the language specification for this statement and make sure we support each optional use cases.