
Fails to ingest Parquet files without a type extension in the filename

Open · arnavawasthi opened this issue 1 year ago • 1 comment

Describe the bug

The error I get:

failed to ingest source: file type not supported :

Source YAML file:

# Source YAML
# Reference documentation: https://docs.rilldata.com/reference/project-files/sources
  
type: source

connector: "s3"
glob.max_total_size: 107374182400 # 100 GiB
uri: "s3://bucket/prefix/**"

And the file names in S3 are:

2024-07-01 07:50:18  779791358 20240701_144951_00037_v74wa_1ad6f92e-a1e0-471e-933c-0e33736726fd
2024-07-01 07:50:17  773709002 20240701_144951_00037_v74wa_21ea1e1c-5a15-462e-8631-66397d29dc69
2024-07-01 07:50:18  774956158 20240701_144951_00037_v74wa_3dab18e3-d7ef-4861-8b68-5a9069bfd57e
2024-07-01 07:50:14  902486230 20240701_144951_00037_v74wa_4ca7963c-9c84-4867-bbeb-4b595f71b0fc
2024-07-01 07:50:17  784569768 20240701_144951_00037_v74wa_52b967bd-8aab-409e-8f3f-34e4e2fa2e48
2024-07-01 07:50:19  733080044 20240701_144951_00037_v74wa_56c2ee6d-76d1-4159-af8b-038c7cf35789
2024-07-01 07:50:18  787194935 20240701_144951_00037_v74wa_5b7e9797-6d55-44cc-a24b-aa169ff34149
2024-07-01 07:50:19  749618093 20240701_144951_00037_v74wa_5ce3f8c8-6d72-4f4d-bb18-d8d850a3a534
2024-07-01 07:50:18  765216693 20240701_144951_00037_v74wa_5e4edaec-4115-4298-a5c0-9a695a2707cf
2024-07-01 07:50:18  762384548 20240701_144951_00037_v74wa_5e9ed5e7-46ae-4703-b50f-2e499d399db3
2024-07-01 07:50:18  772863606 20240701_144951_00037_v74wa_619374d9-b248-4843-bb8f-9f77ff5749d4
2024-07-01 07:50:27  584326998 20240701_144951_00037_v74wa_6a5ebd07-4743-498f-b320-c984353affe3
2024-07-01 07:50:17  799802376 20240701_144951_00037_v74wa_7e00baba-1697-495f-8e2a-ae0d1cf53a44
2024-07-01 07:50:18  768385474 20240701_144951_00037_v74wa_7e81186d-8d06-4ba1-af9a-58d0361674b2
2024-07-01 07:50:27  587906621 20240701_144951_00037_v74wa_85a08cd2-6d2d-42ef-bcea-8d0d6b428bf3
2024-07-01 07:50:18  775116585 20240701_144951_00037_v74wa_89eef1e5-6891-469d-9119-5a73e5187364
2024-07-01 07:50:18  758135930 20240701_144951_00037_v74wa_89f45cdd-5e86-405e-a403-5a147ef52894
2024-07-01 07:50:14  915606927 20240701_144951_00037_v74wa_91770820-22b7-4423-b463-f0672c8abf32
2024-07-01 07:50:14  908777260 20240701_144951_00037_v74wa_a3296f85-807e-4681-a809-0d3471a4aadf
2024-07-01 07:50:18  744792673 20240701_144951_00037_v74wa_b4063da9-2948-4b50-ad17-e1388bc0ef17
2024-07-01 07:50:19  764710075 20240701_144951_00037_v74wa_b8e15b72-1572-4421-8355-3fc132621711
2024-07-01 07:50:18  776850723 20240701_144951_00037_v74wa_bee1d1ab-2fba-4e32-bf4e-cc13f58f6af0
2024-07-01 07:50:27  582347840 20240701_144951_00037_v74wa_c1f185d5-3eee-4cbc-8bbd-356d3d15f0e5
2024-07-01 07:50:26  593165688 20240701_144951_00037_v74wa_c231fc8e-00c2-4397-8044-4f49464427ff
2024-07-01 07:50:17  797126163 20240701_144951_00037_v74wa_c32e5947-31d7-4516-9aa1-65493ddcbfb0
2024-07-01 07:50:17  795242131 20240701_144951_00037_v74wa_c80b101b-1712-4bfb-a093-2e9dabf54351
2024-07-01 07:50:25  613974954 20240701_144951_00037_v74wa_cace9f9b-9a07-4b11-b88c-b64ba8a330dc
2024-07-01 07:50:18  778014252 20240701_144951_00037_v74wa_d6f2e1c7-f69d-4c0b-81cd-532dcc62529c
2024-07-01 07:50:18  784425017 20240701_144951_00037_v74wa_e735722a-d774-4830-a252-6ecd4a05e040
2024-07-01 07:50:18  757113650 20240701_144951_00037_v74wa_e78e8123-cb79-4266-ba6f-8bff1e4fe1d8

Expected behavior

Parquet files are ingested.

Additional context

These files are generated by the "Athena" source, which runs an UNLOAD command to create Parquet files in the S3 bucket. UNLOAD doesn't append a .parquet extension to the file names, and the "Athena" source couldn't ingest more than 10 GB, so I tried the "s3" source with a glob instead, so that I could raise the limit to 100 GB via glob.max_total_size. But now I get the file type not supported error.
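For reference, a minimal sketch of the kind of UNLOAD statement that produces these extension-less objects (the table name and bucket/prefix are placeholders, not my actual setup):

-- Athena writes one object per writer, named with the query id and a UUID
-- and no .parquet suffix appended
UNLOAD (SELECT * FROM my_table)
TO 's3://bucket/prefix/'
WITH (format = 'PARQUET', compression = 'SNAPPY')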

arnavawasthi avatar Jul 01 '24 18:07 arnavawasthi

Hey @arnavawasthi

Thanks for the report. The problem is that, in the absence of any extension, the system cannot determine the file type. This works with the Athena connector because the system knows it has just exported Parquet files. You can try the following source definition, which hints explicitly that the files are Parquet:

type: source

connector: "duckdb"
sql: "select * from read_parquet('s3://bucket/prefix/**')"
glob.max_total_size: 107374182400 # 100 GiB
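Because read_parquet names the format explicitly, DuckDB does not rely on a file extension to detect the type. If you want to verify access outside Rill first, here is a minimal sketch for the DuckDB CLI (assuming the httpfs extension is available and your AWS credentials are already configured in the environment; bucket/prefix are placeholders):

INSTALL httpfs;
LOAD httpfs;
-- every object matched by the glob is read as Parquet, regardless of extension
SELECT count(*) FROM read_parquet('s3://bucket/prefix/**');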

k-anshul avatar Jul 02 '24 07:07 k-anshul