dlt
Incorrect S3 URL when using DigitalOcean Spaces as staging and ClickHouse as destination
dlt version
1.1.0
Describe the problem
Using DigitalOcean's S3-compatible Spaces storage as staging for loading into the ClickHouse destination fails due to incorrect URL building.
Expected behavior
Data should load correctly, but instead ClickHouse throws a DB exception, e.g.:
2024-09-29 15:27:21,065|[ERROR]|341017|140111794464320|dlt|reference.py|run_managed:403|Transient exception in job _dlt_pipeline_state.55206d6956.reference in file /home/g/.dlt/pipelines/s3_pipeline/load/normalized/1727612187.8839986/started_jobs/_dlt_pipeline_state.55206d6956.5.reference
Traceback (most recent call last):
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
response = execute(
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 382, in execute
rv = self.process_ordinary_query(
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
return self.receive_result(with_column_types=with_column_types,
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 212, in receive_result
return result.get_result()
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/result.py", line 50, in get_result
for packet in self.packet_generator:
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 228, in packet_generator
packet = self.receive_packet()
File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 245, in receive_packet
raise packet.exception
clickhouse_driver.errors.ServerException: Code: 499.
DB::Exception: Failed to get object info: No response body.. HTTP response code: 302: while reading _dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl: While executing S3Source. Stack trace:
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000d0fa5bb
1. DB::S3Exception::S3Exception<String const&, unsigned long>(Aws::S3::S3Errors, fmt::v9::basic_format_string<char, fmt::v9::type_identity<String const&>::type, fmt::v9::type_identity<unsigned long>::type>, String const&, unsigned long&&) @ 0x00000000103ec34a
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, bool, bool) @ 0x00000000103ec6a8
3. DB::S3ObjectStorage::getObjectMetadata(String const&) const @ 0x00000000103e2cfe
4. DB::StorageObjectStorageSource::KeysIterator::nextImpl(unsigned long) @ 0x0000000010337a05
5. DB::StorageObjectStorageSource::IIterator::next(unsigned long) @ 0x0000000010334ef9
6. DB::StorageObjectStorageSource::createReader(unsigned long, std::shared_ptr<DB::StorageObjectStorageSource::IIterator> const&, std::shared_ptr<DB::StorageObjectStorage::Configuration> const&, std::shared_ptr<DB::IObjectStorage> const&, DB::ReadFromFormatInfo const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::KeyCondition const> const&, std::shared_ptr<DB::Context const> const&, DB::SchemaCache*, std::shared_ptr<Poco::Logger> const&, unsigned long, unsigned long, bool) @ 0x00000000103329ec
7. DB::StorageObjectStorageSource::createReader() @ 0x0000000010331c87
8. DB::StorageObjectStorageSource::generate() @ 0x0000000010331e96
9. DB::ISource::tryGenerate() @ 0x0000000012fbba35
10. DB::ISource::work() @ 0x0000000012fbb4c2
11. DB::ExecutionThreadContext::executeTask() @ 0x0000000012fd5347
12. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000012fc9c30
13. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::PipelineExecutor::spawnThreads()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012fcb2ae
14. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>) @ 0x000000000d1b345b
15. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000d1b74b1
16. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000d1b6263
17. ? @ 0x00007883062ecac3
18. ? @ 0x000078830637e850
Steps to reproduce
If bucket_url is set to "s3://bucket_name/" (with a trailing slash), ClickHouse returns the error "Bucket name length is out of bounds in virtual hosted style S3 URI", because the file URL is built as "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl" (note the double slash).
Setting bucket_url to "s3://bucket_name" still doesn't work: the http endpoint is called (even if endpoint_url is set with https), and it receives an HTTP 302 redirect that is not followed.
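The double-slash URL described above can be reproduced with a naive join of the parsed bucket URL and the object key. This is only an illustration of the behavior (the variable names and the join are assumptions, not dlt's actual internals):

```python
from urllib.parse import urlparse

bucket_url = "s3://bucket_name/"  # trailing slash as configured
endpoint_host = "nyc3.digitaloceanspaces.com"
key = "dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl"

# urlparse keeps the trailing slash as the path ("/"); joining it with
# another "/" before the key yields the double slash from the error message.
parsed = urlparse(bucket_url)
url = f"http://{parsed.netloc}.{endpoint_host}{parsed.path}/{key}"
print(url)
# http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl
```

Note that the scheme is also plain http regardless of what endpoint_url specifies, which is what triggers the unhandled 302 redirect.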
Operating system
Linux
Runtime environment
Local
Python version
3.10
dlt data source
No response
dlt destination
No response
Other deployment details
No response
Additional information
Setting use_https=True in clickhouse.py fixes the 302 problem. I also recommend handling the double slashes (//) that end up in the built URLs.
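A minimal sketch of the suggested handling, using a hypothetical helper (not dlt's actual API): default to https and strip redundant slashes when assembling the virtual-hosted-style URL:

```python
from urllib.parse import urlparse

def build_s3_http_url(bucket_url: str, endpoint_host: str, key: str,
                      use_https: bool = True) -> str:
    """Hypothetical sketch: build a virtual-hosted-style object URL,
    normalizing slashes so a trailing slash in bucket_url is harmless."""
    bucket = urlparse(bucket_url).netloc.strip("/")
    scheme = "https" if use_https else "http"
    return f"{scheme}://{bucket}.{endpoint_host}/{key.lstrip('/')}"

print(build_s3_http_url(
    "s3://bucket_name/",
    "nyc3.digitaloceanspaces.com",
    "/dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl",
))
# https://bucket_name.nyc3.digitaloceanspaces.com/dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl
```

With this normalization, both "s3://bucket_name" and "s3://bucket_name/" produce the same well-formed https URL.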