
Incorrect S3 URL when using DigitalOcean Spaces as staging and ClickHouse as destination


dlt version

1.1.0

Describe the problem

Using DigitalOcean's S3-compatible Spaces storage as the staging layer for loading into the ClickHouse destination fails because dlt builds an incorrect URL for the staged files.
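For context, a minimal sketch of the setup that triggers this. The bucket name, keys, and region are placeholders, and ClickHouse credentials are assumed to be configured in secrets.toml:

```py
import os
import dlt

# Placeholder credentials for illustration only. The trailing slash on
# bucket_url is what triggers the double-slash variant of the bug below.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "s3://bucket_name/"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "<spaces-key>"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = "<spaces-secret>"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL"] = "https://nyc3.digitaloceanspaces.com"

pipeline = dlt.pipeline(
    pipeline_name="s3_pipeline",
    destination="clickhouse",
    staging="filesystem",
    dataset_name="dataset",
)

# Any small load is enough to hit the failure when ClickHouse reads from staging.
pipeline.run([{"id": 1}], table_name="items", loader_file_format="jsonl")
```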

Expected behavior

Data should load correctly; instead, ClickHouse throws a DB exception, e.g.

2024-09-29 15:27:21,065|[ERROR]|341017|140111794464320|dlt|reference.py|run_managed:403|Transient exception in job _dlt_pipeline_state.55206d6956.reference in file /home/g/.dlt/pipelines/s3_pipeline/load/normalized/1727612187.8839986/started_jobs/_dlt_pipeline_state.55206d6956.5.reference
Traceback (most recent call last):
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
    response = execute(
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 382, in execute
    rv = self.process_ordinary_query(
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 212, in receive_result
    return result.get_result()
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 228, in packet_generator
    packet = self.receive_packet()
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 245, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 499.
DB::Exception: Failed to get object info: No response body.. HTTP response code: 302: while reading _dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl: While executing S3Source. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000d0fa5bb
1. DB::S3Exception::S3Exception<String const&, unsigned long>(Aws::S3::S3Errors, fmt::v9::basic_format_string<char, fmt::v9::type_identity<String const&>::type, fmt::v9::type_identity<unsigned long>::type>, String const&, unsigned long&&) @ 0x00000000103ec34a
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, bool, bool) @ 0x00000000103ec6a8
3. DB::S3ObjectStorage::getObjectMetadata(String const&) const @ 0x00000000103e2cfe
4. DB::StorageObjectStorageSource::KeysIterator::nextImpl(unsigned long) @ 0x0000000010337a05
5. DB::StorageObjectStorageSource::IIterator::next(unsigned long) @ 0x0000000010334ef9
6. DB::StorageObjectStorageSource::createReader(unsigned long, std::shared_ptr<DB::StorageObjectStorageSource::IIterator> const&, std::shared_ptr<DB::StorageObjectStorage::Configuration> const&, std::shared_ptr<DB::IObjectStorage> const&, DB::ReadFromFormatInfo const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::KeyCondition const> const&, std::shared_ptr<DB::Context const> const&, DB::SchemaCache*, std::shared_ptr<Poco::Logger> const&, unsigned long, unsigned long, bool) @ 0x00000000103329ec
7. DB::StorageObjectStorageSource::createReader() @ 0x0000000010331c87
8. DB::StorageObjectStorageSource::generate() @ 0x0000000010331e96
9. DB::ISource::tryGenerate() @ 0x0000000012fbba35
10. DB::ISource::work() @ 0x0000000012fbb4c2
11. DB::ExecutionThreadContext::executeTask() @ 0x0000000012fd5347
12. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000012fc9c30
13. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::PipelineExecutor::spawnThreads()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012fcb2ae
14. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>) @ 0x000000000d1b345b
15. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000d1b74b1
16. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000d1b6263
17. ? @ 0x00007883062ecac3
18. ? @ 0x000078830637e850

Steps to reproduce

If bucket_url is set to "s3://bucket_name/" (with a trailing slash), ClickHouse raises "Bucket name length is out of bounds in virtual hosted style S3 URI", because the file URL is built as "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl" with a double slash in the path.

Setting bucket_url to "s3://bucket_name" (no trailing slash) still doesn't work: the URL is built with the http scheme (even when endpoint_url is configured with https), so the endpoint responds with an HTTP 302 redirect that is not followed. A sketch of both failure modes follows.
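To make the two failure modes concrete, here is an illustrative sketch (not dlt's actual code) of how naive URL joining produces both the double slash and the hardcoded http scheme:

```py
from urllib.parse import urlparse

# Illustrative values matching the reproduction above.
bucket_url = "s3://bucket_name/"  # trailing slash from config
endpoint = "https://nyc3.digitaloceanspaces.com"
key = "dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl"

bucket = urlparse(bucket_url).netloc  # "bucket_name"
host = urlparse(endpoint).netloc      # "nyc3.digitaloceanspaces.com"

# Naive join keeps the trailing slash (urlparse path is "/") and hardcodes
# the http scheme -- reproducing both bugs at once:
bad = f"http://{bucket}.{host}{urlparse(bucket_url).path}/{key}"
print(bad)   # http://bucket_name.nyc3.digitaloceanspaces.com//dataset/...

# Normalizing the path and reusing the endpoint's scheme avoids both:
good = f"{urlparse(endpoint).scheme}://{bucket}.{host}/{key.lstrip('/')}"
print(good)  # https://bucket_name.nyc3.digitaloceanspaces.com/dataset/...
```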

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

Setting use_https=True in clickhouse.py fixes the 302 problem. I also recommend handling the double slashes (//) that end up in the built URLs; a sketch of both fixes follows.
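A hypothetical helper (names and placement are my own, not dlt's internals) sketching the two suggested fixes together: force the https scheme and collapse duplicate slashes in the path before the URL is handed to ClickHouse:

```py
import re
from urllib.parse import urlsplit, urlunsplit

def normalize_staging_url(url: str, force_https: bool = True) -> str:
    """Force https and collapse repeated slashes in the URL path."""
    parts = urlsplit(url)
    scheme = "https" if force_https else parts.scheme
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))

print(normalize_staging_url(
    "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/x.jsonl"
))
# https://bucket_name.nyc3.digitaloceanspaces.com/dataset/_dlt_pipeline_state/x.jsonl
```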

gfrmin · Sep 29 '24 13:09