bug: Windows build fails to parse invalid URL in prepared file
Summary
The Windows build fails to parse an invalid URL because the prepared file path contains a tilde.
Initial Bug:
Below is the initial bug finding:
The Windows build has been failing since the recent change to read Parquet files directly rather than converting them to batches.
https://github.com/kaskada-ai/kaskada/actions/runs/6423963739/job/17443601949
---------------------------- Captured stderr call -----------------------------
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: \x1b[1mfailed to prepare batch\x1b[22m\n\u251c\u2574at \x1b[3mD:\\a\\kaskada\\kaskada\\crates\\sparrow-session\\src\\table.rs:153:14\x1b[23m\n\u2502\n\u251c\u2500\u25b6 \x1b[1minternal error\x1b[22m\n\u2502 \u2570\u2574at \x1b[3mD:\\a\\kaskada\\kaskada\\crates\\sparrow-runtime\\src\\prepare\\preparer.rs:133:10\x1b[23m\n\u2502\n\u251c\u2500\u25b6 \x1b[1mfailed to create Parquet file reader\x1b[22m\n\u2502 \u2570\u2574at \x1b[3mD:\\a\\kaskada\\kaskada\\crates\\sparrow-runtime\\src\\prepare.rs:52:22\x1b[23m\n\u2502\n\u251c\u2500\u25b6 \x1b[1minvalid parquet file metadata\x1b[22m\n\u2502 \u2570\u2574at \x1b[3mD:\\a\\kaskada\\kaskada\\crates\\sparrow-runtime\\src\\read\\parquet_file.rs:63:18\x1b[23m\n\u2502\n\u2570\u2500\u25b6 \x1b[1mGeneric LocalFileSystem error: Unable to access metadata for D:/a/kaskada/kaskada/python/D:/a/kaskada/kaskada/testdata/purchases/purchases_part1.parquet: The filename, directory name, or volume label syntax is incorrect. (os error 123)\x1b[22m\n \u2570\u2574at \x1b[3mD:\\a\\kaskada\\kaskada\\crates\\sparrow-runtime\\src\\read\\parquet_file.rs:62:18\x1b[23m', src\\table.rs:94:24\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
_______________________ test_read_parquet_with_subsort ________________________
golden = <conftest.GoldenFixture object at 0x0000018C117B2910>
async def test_read_parquet_with_subsort(golden) -> None:
> source = await kd.sources.Parquet.create(
"../testdata/purchases/purchases_part1.parquet",
time_column="purchase_time",
key_column="customer_id",
subsort_column="subsort_id",
)
pytests\parquet_source_test.py:17:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv\Lib\site-packages\kaskada\sources\arrow.py:582: in create
await source.add_file(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <kaskada.sources.arrow.Parquet object at 0x0000018C117A42D0>
path = 'D:\\a\\kaskada\\kaskada\\python/../testdata/purchases/purchases_part1.parquet'
async def add_file(self, path: str) -> None:
"""Add data to the source."""
> await self._ffi_table.add_parquet(str(Source._get_absolute_path(path)))
E pyo3_asyncio.RustPanic: rust future panicked
.venv\Lib\site-packages\kaskada\sources\arrow.py:587: RustPanic
Failing here: https://github.com/kaskada-ai/kaskada/blob/c700afe114b6bac4d2bd92659a9438f395f36223/crates/sparrow-runtime/src/read/parquet_file.rs#L59
Which means that it correctly created the `Url` and was able to get the `path` from it. One possible source of the issue is how the path is stored in the `SourceData` proto object. We do create the `ObjectStoreUrl` successfully right before calling into the `ParquetFile`, though I don't know whether that indicates it is a valid URL.
https://github.com/kaskada-ai/kaskada/blob/2e5494e23655af8290c6d11dbd874d79be3b5391/crates/sparrow-runtime/src/prepare.rs#L50
Example failed run: https://github.com/kaskada-ai/kaskada/actions/runs/6436537441/job/17480075331
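The doubled root in the error above (`D:/a/kaskada/kaskada/python/D:/a/kaskada/kaskada/testdata/...`) is the shape you get when an already-absolute Windows path is appended to a base directory as if it were relative. The sketch below only reproduces that symptom in Python. `naive_join` and `drive_aware_join` are hypothetical helpers for illustration, not functions from the kaskada codebase, and the actual cause in `sparrow-runtime` may differ.

```python
from pathlib import PureWindowsPath


def naive_join(base: str, path: str) -> str:
    """Append `path` to `base` as plain strings (hypothetical, shows the bug shape)."""
    return base.rstrip("/") + "/" + path


def drive_aware_join(base: str, path: str) -> str:
    """Join using Windows path rules: an absolute `path` replaces `base`."""
    # PureWindowsPath applies Windows semantics on any host OS.
    return PureWindowsPath(base, path).as_posix()


if __name__ == "__main__":
    base = "D:/a/kaskada/kaskada/python"
    absolute = "D:/a/kaskada/kaskada/testdata/purchases/purchases_part1.parquet"
    print(naive_join(base, absolute))        # root appears twice
    print(drive_aware_join(base, absolute))  # absolute path kept as-is
```

With the inputs above, `naive_join` produces the same doubled path that appears in the `LocalFileSystem` error, while `drive_aware_join` returns the absolute path unchanged.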
Capture the current state of the issue, then move it back to the backlog.
Current status:
The Windows build fails when attempting to parse an invalid URL. Despite our best efforts to use `Url`s and `Path`s (instead of manual string manipulation), the path to the prepared file on the Windows build includes a tilde:
Error parsing Path "/C:/Users/RUNNER~1/AppData/Local/Temp/.tmpzmwvaw/3390f96b-b364-4e95-aaaf-11a41153f3e8/part-0.parquet": Encountered illegal character sequence "~" whilst parsing path segment "RUNNER~1"
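`RUNNER~1` has the shape of a Windows 8.3 short name (a truncated name followed by `~` and a number), which Windows can report in place of a longer directory name. Whether that is what happens on the CI runner is an assumption, not something confirmed in this issue. A minimal sketch for spotting such segments in a path before it is turned into a URL; `short_name_segments` and the regular expression are illustrative and not part of kaskada:

```python
import re

# Heuristic for 8.3 short names: up to six characters, "~", digits,
# and an optional extension of at most three characters.
SHORT_NAME = re.compile(r"^[^~\\/.]{1,6}~\d+(\.[^.\\/]{0,3})?$")


def short_name_segments(path: str) -> list[str]:
    """Return the path segments that look like Windows 8.3 short names."""
    segments = re.split(r"[\\/]+", path)
    return [segment for segment in segments if SHORT_NAME.match(segment)]


if __name__ == "__main__":
    prepared = (
        "/C:/Users/RUNNER~1/AppData/Local/Temp/.tmpzmwvaw/"
        "3390f96b-b364-4e95-aaaf-11a41153f3e8/part-0.parquet"
    )
    print(short_name_segments(prepared))
```

If the segment does turn out to be a short name, expanding the temp directory to its long form before building the URL (Win32 provides `GetLongPathNameW` for this) might sidestep the parse error. That is untested here.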
Action Items:
- [x] Double appending the root to the URL
- [ ] Still prepending `file://` manually in some cases
- [ ] Clean up `SourceData` proto
- [ ] Figure out why URL includes the tilde for windows builds
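
For the `file://` item, a small Python sketch of why manual prepending is fragile on Windows: with `file://` followed directly by `D:/...`, a URL parser reads `D:` as the authority (host) rather than as part of the path. `naive_file_url` and `to_file_url` are hypothetical helpers shown only for illustration. The Rust code would presumably rely on the `url` crate's own conversion (for example `Url::from_file_path`) rather than anything like this.

```python
from urllib.parse import quote, urlsplit


def naive_file_url(path: str) -> str:
    """Prepend the scheme by hand (hypothetical, shows the failure mode)."""
    return "file://" + path.replace("\\", "/")


def to_file_url(path: str) -> str:
    """Build a file URL with an empty authority and the drive inside the path."""
    posix = path.replace("\\", "/")
    if not posix.startswith("/"):
        posix = "/" + posix  # yields file:///D:/... for drive-letter paths
    return "file://" + quote(posix, safe="/:")


if __name__ == "__main__":
    path = "D:\\a\\kaskada\\kaskada\\testdata\\purchases\\purchases_part1.parquet"
    for url in (naive_file_url(path), to_file_url(path)):
        parts = urlsplit(url)
        print(url, "| authority:", repr(parts.netloc), "| path:", parts.path)
```

In the naive form the drive letter ends up in the authority and is lost from the path. The second form keeps the authority empty and carries `/D:/...` as the path.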