
Spline Support of Statement CREATE TABLE AS SELECT

Open zacayd opened this issue 1 year ago • 11 comments

I have a statement in a notebook in Databricks that is:

create table
  lineage_data.lineagedemo.dinner_1
AS 
SELECT
  recipe_id, concat(app," + ", main," + ",dessert)
AS
  full_menu
FROM
  lineage_data.lineagedemo.menu

The lineage is not shown; it only shows on INSERT INTO. Is this by design? I use version spark_3_3_spline_agent_bundle_2_12_2_1_0_SNAPSHOT.jar.

zacayd avatar Jan 28 '24 12:01 zacayd

It should be supported. Is there any error in logs? Try "USING DELTA" (not sure if it's going to make any difference, but we use it like that).
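For reference, a minimal sketch of the statement from the original post rewritten with an explicit USING DELTA clause (table and column names are taken from the snippet above):

```sql
-- Same CTAS as in the original post, but with an explicit Delta format clause
CREATE TABLE lineage_data.lineagedemo.dinner_1
USING DELTA AS
SELECT
  recipe_id,
  concat(app, " + ", main, " + ", dessert) AS full_menu
FROM
  lineage_data.lineagedemo.menu;
```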

wajda avatar Jan 30 '24 10:01 wajda

Hi, I have run this code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

# Assuming SparkSession is already available as 'spark' in Databricks
# spark = SparkSession.builder.appName("example").getOrCreate()

# Define the schema for the table
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create a DataFrame with the schema
data = [(1, "Alice", 30), (2, "Bob", 35), (3, "Charlie", 40)]
df = spark.createDataFrame(data, schema)

# Define the table name and path (replace with your desired table name and path)
silver_table_name = "silver_table_name1"
path = "/mnt/delta/" + silver_table_name

# Write the DataFrame to the table
df.write.format("delta").mode("overwrite").save(path)

# Register the table in the metastore
spark.sql(f"CREATE TABLE {silver_table_name} USING DELTA LOCATION '{path}'")

# Create a materialized view from the silver layer table
materialized_view_name = "your_materialized_view"
create_mv_sql = f"""
CREATE VIEW {materialized_view_name}
AS SELECT * FROM {silver_table_name}
"""
spark.sql(create_mv_sql)

# Optionally, verify the contents of the table and materialized view
spark.sql(f"SELECT * FROM {silver_table_name}").show()
spark.sql(f"SELECT * FROM {materialized_view_name}").show()

According to the execution plan I got in Spline, it shows only the target /mnt/delta/silver_table_name1 with its columns. It seems that the CREATE VIEW statement is not registered in the lineage.

zacayd avatar Jan 31 '24 09:01 zacayd

> It should be supported. Is there any error in logs? Try "USING DELTA" (not sure if it's going to make any difference, but we use it like that).

@wajda also, when I added USING DELTA it didn't solve it:

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS lineage_data.lineagedemo.Accounts1;
-- Drop the table 'V_Accounts1' if it exists
DROP TABLE IF EXISTS lineage_data.lineagedemo.V_Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE lineage_data.lineagedemo.Accounts1(
    Id INT,
    Name STRING
) USING DELTA;

-- Creating or replacing the table 'V_Accounts1' based on the 'Accounts1' table, using Delta format
CREATE TABLE lineage_data.lineagedemo.V_Accounts1
USING DELTA AS
SELECT *
FROM lineage_data.lineagedemo.Accounts1;

zacayd avatar Jan 31 '24 10:01 zacayd

@wajda if I add USING DELTA, it also works in CREATE TABLE AS SELECT:

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS Accounts1;
-- Drop the table 'V_Accounts1' if it exists
DROP TABLE IF EXISTS V_Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE Accounts1(
    Id INT,
    Name STRING
) USING DELTA;

-- Creating or replacing the table 'V_Accounts1' based on the 'Accounts1' table, using Delta format
CREATE TABLE V_Accounts1
USING DELTA AS
SELECT *
FROM Accounts1;

zacayd avatar Jan 31 '24 11:01 zacayd

It might be that CREATE VIEW is either not supported by Spline, or is not trackable in Spark at all. A more thorough investigation is needed. Try to enable DEBUG logging in Spark and see if there is a message like

XXX was not recognized as a write-command

or "Write extraction failed" followed by an object structure dump.

Another small thing to try is to enable capturing of non-persistent actions like this:

spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true

If you still do not see any mention of your "VIEW" anywhere - in logs, errors, or lineage - then it's very likely that Spark simply does not provide the information about that action in the first place. Please try it and let us know what you've found.
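On Databricks, Spline agent properties like this are typically set on the cluster under Advanced options > Spark config, with a `spark.` prefix (a sketch; verify the exact key against the Spline agent configuration documentation):

```
spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled true
```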

wajda avatar Jan 31 '24 16:01 wajda

I will try spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true. But how can I enable DEBUG logging on Databricks?

zacayd avatar Feb 01 '24 06:02 zacayd

I have added spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true to the Advanced Settings. It caused additional execution plans to appear, but none of them showed the view and its columns. @wajda - do you know how I can enable DEBUG logging in Databricks?

zacayd avatar Feb 01 '24 06:02 zacayd

By the way, Unity Catalog in Databricks supports CREATE VIEW AS SELECT * FROM table.

zacayd avatar Feb 01 '24 07:02 zacayd

> Do you know how I can enable DEBUG logging in Databricks?

spark.sparkContext.setLogLevel("DEBUG")

wajda avatar Feb 01 '24 11:02 wajda

I put this at the top of my Databricks notebook:

%python
spark.sparkContext.setLogLevel("DEBUG")
%python
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
%sql

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS Accounts1;
-- Drop the table 'V_Accounts1' if it exists
DROP TABLE IF EXISTS V_Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE Accounts1(
    Id INT,
    Name STRING
) USING DELTA;


-- Creating or replacing the table 'V_Accounts1' based on the 'Accounts1' table, using Delta format
CREATE OR REPLACE view V_Accounts1
 AS
SELECT *
FROM Accounts1;

But I don't see any error in the driver logs.

zacayd avatar Feb 01 '24 12:02 zacayd

@wajda I faced a similar issue and was able to see the plan. Although I don't have the details right now of what Spline generates for CTAS, I identified that the change in Spark below caused this issue. Although this change applies starting with Spark 3.3, Databricks usually ports changes back to older versions as well. This was also true for the ListQuery expression, which I fixed earlier in Spline.

The logical plan class needs to be extracted correctly for Spline to work:

https://github.com/apache/spark/commit/aa1b16b1acd67c22544f969eb8595b02bbc1b433#diff-594addaa7ed3dd43fd0dad5fa81ade5ec2e570adf7c982c0570e3358a08a8e0a

uday1409 avatar Feb 25 '24 10:02 uday1409