dlt
Filesystem destination does not raise exception when using scd2 merge strategy
dlt version
dlt==0.5.1
Describe the problem
I have set write_disposition={'disposition': 'merge', 'strategy':'scd2'}
Initially, when I ran this with an s3 destination it worked, but when I ran it with a local filesystem it raised the exception dlt.common.destination.exceptions.DestinationCapabilitiesException: 'scd2' merge strategy not supported for 'filesystem' destination.
However, while writing the reproduction I found that it no longer raises this exception in either case.
Expected behavior
No response
Steps to reproduce
Clone this repo https://github.com/Nintorac/dlt-merge-strategy-issue-repro
and run docker compose up
Operating system
Linux
Runtime environment
Local
Python version
3.10
dlt data source
No response
dlt destination
Filesystem & buckets
Other deployment details
No response
Additional information
No response
@Nintorac thanks for taking the time to create the repo.
I cloned the repo and ran docker compose up:
As you see in the screenshot, I get exec ./run.sh: no such file or directory.
Do I need to do anything else to make it work?
Should be all there, I will double check that I can run it from a fresh clone when I am back at the computer.
But I do see the run.sh in the repo - https://github.com/Nintorac/dlt-merge-strategy-issue-repro/blob/main/run.sh
I see you are running Windows; maybe the run.sh script is losing its execute permission when you clone. You could try adding RUN chmod +x run.sh after the COPY . . line in the Dockerfile.
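To test that guess before editing the Dockerfile, the execute bit can be inspected directly. The following is only a minimal sketch, assuming it is run on a POSIX system (for example inside the container) from the directory that contains run.sh. The helper names `is_executable` and `add_execute_bit` are made up for illustration and are not part of the repro repo.

```python
import os
import stat

# All three execute bits: user, group, and other.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def is_executable(path: str) -> bool:
    """Return True if any execute bit is set on the file at `path`."""
    return bool(os.stat(path).st_mode & EXEC_BITS)


def add_execute_bit(path: str) -> None:
    """Set the execute bits on `path`, roughly what `chmod +x` does."""
    os.chmod(path, os.stat(path).st_mode | EXEC_BITS)


if __name__ == "__main__":
    script = "run.sh"  # assumed to sit in the current working directory
    if not os.path.exists(script):
        print(f"{script} not found in {os.getcwd()}")
    elif is_executable(script):
        print(f"{script} is already executable")
    else:
        add_execute_bit(script)
        print(f"execute bit added to {script}")
```

If the script reports that run.sh is already executable and the error persists, the permission bit is probably not the cause.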
I think I understand what's going on here. I don't think it has anything to do with local versus s3. I think it has to do with differences between dlt versions.
I noticed that the behavior in dlt==0.5.1 differs from dlt==0.5.2a2.
Setting write_disposition="merge" will succeed on both versions:
import dlt
from dlt.destinations import filesystem

assert dlt.__version__ in ("0.5.1", "0.5.2a2")

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination=filesystem(bucket_url="file://_storage"),
)
pipeline.run(
    [{"foo": 1}],
    table_name="my_table",
    write_disposition="merge",
    # write_disposition={"disposition": "merge", "strategy": "scd2"},
)
print(
    "I ran without errors, because I silently ignored the `merge` write"
    " disposition and used `append` instead."
)
Setting write_disposition={"disposition": "merge", "strategy": "scd2"} will succeed on 0.5.1, but fail on 0.5.2a2:
import dlt
from dlt.destinations import filesystem

assert dlt.__version__ == "0.5.2a2"

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination=filesystem(bucket_url="file://_storage"),
)
pipeline.run(
    [{"foo": 1}],
    table_name="my_table",
    # write_disposition="merge",
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
# dlt.common.destination.exceptions.DestinationCapabilitiesException: `scd2` merge strategy not supported for `filesystem` destination.
@Nintorac could it be you have been using different dlt versions?
Maybe, but I don't think so. I am using poetry and it appears to ignore the pre-release versions.
You also mentioned an exception in 0.5.1 as well, right?
@Nintorac I did mention that, but it was an incorrect statement. The check that raises that exception was introduced after 0.5.1.
I ran dlt --version back then to check my version and it showed 0.5.1, but that must have been for the dlt package I had installed in my global Python env, not the env I used to run the code that threw the exception.
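One way to avoid this kind of mix-up is to ask the interpreter that actually runs the pipeline which package version it sees, rather than relying on a CLI that may come from another environment. The snippet below is a small sketch using only the standard library. The helper names `installed_version` and `describe_environment` are made up for illustration.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the version of `package` visible to this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(package: str = "dlt") -> dict:
    """Collect the interpreter path, environment prefix, and package version."""
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        "package": package,
        "version": installed_version(package),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this with the same command used for the pipeline (for example through `poetry run python` in a Poetry project) shows which environment and which dlt version are really in use.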
In any case, the version difference is the only thing we're able to reproduce. The difference can also be explained: upsert support for the filesystem destination (when using delta table format) was added after version 0.5.1. This new feature comes with a new check on supported merge strategies, which explains why dlt.common.destination.exceptions.DestinationCapabilitiesException: scd2 merge strategy not supported for filesystem destination. is raised on 0.5.2a2 but not on 0.5.1.
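To illustrate what such a check amounts to, here is a toy sketch. It is not dlt's actual implementation: the table `SUPPORTED_MERGE_STRATEGIES`, the exception `MergeStrategyNotSupported`, and the function `verify_merge_strategy` are all hypothetical, and the table contains only what this thread states (upsert is supported for filesystem, scd2 is not).

```python
from typing import Dict, Optional, Set

# Hypothetical capability table, filled in only from what this thread states.
SUPPORTED_MERGE_STRATEGIES: Dict[str, Set[str]] = {
    "filesystem": {"upsert"},
}


class MergeStrategyNotSupported(Exception):
    """Toy stand-in for a capabilities exception; not a dlt class."""


def verify_merge_strategy(
    destination: str,
    strategy: str,
    supported: Optional[Dict[str, Set[str]]] = None,
) -> None:
    """Raise if `strategy` is not listed as supported for `destination`."""
    table = SUPPORTED_MERGE_STRATEGIES if supported is None else supported
    if strategy not in table.get(destination, set()):
        raise MergeStrategyNotSupported(
            f"`{strategy}` merge strategy not supported"
            f" for `{destination}` destination."
        )


if __name__ == "__main__":
    verify_merge_strategy("filesystem", "upsert")  # passes silently
    try:
        verify_merge_strategy("filesystem", "scd2")
    except MergeStrategyNotSupported as exc:
        print(exc)
```

Without a validation step like this, an unsupported strategy has nothing to reject it, which matches the silent success seen on 0.5.1.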