bug: Windows build fails to parse invalid URL in prepared file

Open jordanrfrazier opened this issue 1 year ago • 4 comments

Summary

The Windows build fails to parse a URL that is invalid because the prepared file path contains a tilde.

Initial Bug:

The initial finding is below:

The Windows build has been failing since recent changes that read Parquet files directly rather than converting them to batches.

https://github.com/kaskada-ai/kaskada/actions/runs/6423963739/job/17443601949

   ---------------------------- Captured stderr call -----------------------------
  thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: failed to prepare batch
  ├╴at D:\a\kaskada\kaskada\crates\sparrow-session\src\table.rs:153:14
  │
  ├─▶ internal error
  │   ╰╴at D:\a\kaskada\kaskada\crates\sparrow-runtime\src\prepare\preparer.rs:133:10
  │
  ├─▶ failed to create Parquet file reader
  │   ╰╴at D:\a\kaskada\kaskada\crates\sparrow-runtime\src\prepare.rs:52:22
  │
  ├─▶ invalid parquet file metadata
  │   ╰╴at D:\a\kaskada\kaskada\crates\sparrow-runtime\src\read\parquet_file.rs:63:18
  │
  ╰─▶ Generic LocalFileSystem error: Unable to access metadata for D:/a/kaskada/kaskada/python/D:/a/kaskada/kaskada/testdata/purchases/purchases_part1.parquet: The filename, directory name, or volume label syntax is incorrect. (os error 123)
      ╰╴at D:\a\kaskada\kaskada\crates\sparrow-runtime\src\read\parquet_file.rs:62:18', src\table.rs:94:24
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  _______________________ test_read_parquet_with_subsort ________________________
  
  golden = <conftest.GoldenFixture object at 0x0000018C117B2910>
  
      async def test_read_parquet_with_subsort(golden) -> None:
  >       source = await kd.sources.Parquet.create(
              "../testdata/purchases/purchases_part1.parquet",
              time_column="purchase_time",
              key_column="customer_id",
              subsort_column="subsort_id",
          )
  
  pytests\parquet_source_test.py:17: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  .venv\Lib\site-packages\kaskada\sources\arrow.py:582: in create
      await source.add_file(path)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  
  self = <kaskada.sources.arrow.Parquet object at 0x0000018C117A42D0>
  path = 'D:\\a\\kaskada\\kaskada\\python/../testdata/purchases/purchases_part1.parquet'
  
      async def add_file(self, path: str) -> None:
          """Add data to the source."""
  >       await self._ffi_table.add_parquet(str(Source._get_absolute_path(path)))
  E       pyo3_asyncio.RustPanic: rust future panicked
  
  .venv\Lib\site-packages\kaskada\sources\arrow.py:587: RustPanic

jordanrfrazier avatar Oct 05 '23 22:10 jordanrfrazier

Failing here: https://github.com/kaskada-ai/kaskada/blob/c700afe114b6bac4d2bd92659a9438f395f36223/crates/sparrow-runtime/src/read/parquet_file.rs#L59

That means the Url was created correctly and we were able to get the path from it. One possible source of the issue is how the path is stored in the SourceData proto object. We do create the ObjectStoreUrl successfully right before calling into the ParquetFile, though I don't know whether that indicates it is a valid URL.

https://github.com/kaskada-ai/kaskada/blob/2e5494e23655af8290c6d11dbd874d79be3b5391/crates/sparrow-runtime/src/prepare.rs#L50
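For illustration only: the doubled path in the stderr output (`D:/a/kaskada/kaskada/python/D:/a/...`) is the shape you get when a base directory is concatenated onto a path that is already absolute. The sketch below reproduces that shape in Python with `pathlib.PureWindowsPath`, which runs on any OS. The helper names `naive_join` and `safe_join` are hypothetical, and this is not a diagnosis of where the kaskada code performs the join.

```python
from pathlib import PureWindowsPath

# Values copied from the CI log; the helper names below are hypothetical.
BASE_DIR = "D:/a/kaskada/kaskada/python"
ABSOLUTE_FILE = "D:/a/kaskada/kaskada/testdata/purchases/purchases_part1.parquet"


def naive_join(base: str, path: str) -> str:
    """Concatenate unconditionally, even when `path` is already absolute."""
    return f"{base}/{path}"


def safe_join(base: str, path: str) -> str:
    """Join with pathlib, which discards `base` when `path` is absolute."""
    return (PureWindowsPath(base) / PureWindowsPath(path)).as_posix()


if __name__ == "__main__":
    print(naive_join(BASE_DIR, ABSOLUTE_FILE))  # doubled root, as in the log
    print(safe_join(BASE_DIR, ABSOLUTE_FILE))   # absolute path kept as-is
```

Checking whether the incoming path is already absolute before joining (or letting a path library do the join) avoids the doubled root, while relative inputs such as `../testdata/...` are still resolved against the base directory.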

jordanrfrazier avatar Oct 05 '23 22:10 jordanrfrazier

Example failed run: https://github.com/kaskada-ai/kaskada/actions/runs/6436537441/job/17480075331

jordanrfrazier avatar Oct 09 '23 17:10 jordanrfrazier

Capture the current state of the issue and then move it back to the backlog.

epinzur avatar Oct 09 '23 17:10 epinzur

Current status:

The Windows build fails when attempting to parse an invalid URL. Despite our best efforts to use URLs and Paths (instead of manual string manipulation), the path to the prepared file in the Windows build includes a tilde:

Error parsing Path "/C:/Users/RUNNER~1/AppData/Local/Temp/.tmpzmwvaw/3390f96b-b364-4e95-aaaf-11a41153f3e8/part-0.parquet": Encountered illegal character sequence "~" whilst parsing path segment "RUNNER~1"
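A possible lead, stated as an assumption rather than a finding: `RUNNER~1` has the shape of a Windows 8.3 short file name (up to six characters, a tilde, then a number). That would suggest the temp directory was reported in its short form instead of the long one. Note that `~` is an unreserved character in URLs under RFC 3986, so the rejection appears to come from the stricter path parser and not from URL syntax itself. The sketch below is a heuristic check for short-name segments; `short_name_segments` is a hypothetical helper, not part of kaskada.

```python
import re

# Heuristic for Windows 8.3 short names such as "RUNNER~1" or "PROGRA~1.EXE":
# a short base name, a tilde, a number, and an optional extension of up to
# three characters. This is an approximation, not a full 8.3 validator.
SHORT_NAME = re.compile(r"^[^~\s]{1,6}~\d+(\.[^.\s]{1,3})?$")


def short_name_segments(path: str) -> list[str]:
    """Return the path segments that look like 8.3 short names."""
    segments = re.split(r"[\\/]+", path)
    return [seg for seg in segments if SHORT_NAME.match(seg)]


if __name__ == "__main__":
    prepared = (
        "/C:/Users/RUNNER~1/AppData/Local/Temp/.tmpzmwvaw/"
        "3390f96b-b364-4e95-aaaf-11a41153f3e8/part-0.parquet"
    )
    print(short_name_segments(prepared))
```

If the short form does turn out to be the cause, expanding the temp directory to its long form before building the URL would be one option to investigate; on Windows the Win32 `GetLongPathNameW` function exists for that purpose.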

Action Items:

  • [x] Fix double appending of the root to the URL
  • [ ] Remove the remaining cases where file:// is prepended manually
  • [ ] Clean up SourceData proto
  • [ ] Figure out why the URL includes the tilde for Windows builds
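On the item about prepending file:// manually, here is a minimal Python sketch of why string prepending is fragile for Windows paths. With a drive letter, the manual form places `D:` where a URL parser expects the host, while a path-aware conversion emits the empty-host form `file:///D:/...`. The helper names are hypothetical and the example path is shortened from the CI log. The Rust `url` crate offers `Url::from_file_path` for the same purpose, though whether it fits every call site here is an assumption.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse


def manual_file_url(path: str) -> str:
    """Build a file URL by string prepending (the pattern to avoid)."""
    return "file://" + path


def file_url(path: str) -> str:
    """Build a file URL from an absolute Windows path via pathlib."""
    return PureWindowsPath(path).as_uri()


if __name__ == "__main__":
    path = "D:/a/kaskada/kaskada/testdata/purchases/purchases_part1.parquet"

    manual = urlparse(manual_file_url(path))
    proper = urlparse(file_url(path))

    # The manual form parses the drive letter as the network location.
    print(repr(manual.netloc), manual.path)
    # The pathlib form keeps the drive letter inside the URL path.
    print(repr(proper.netloc), proper.path)
```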

jordanrfrazier avatar Oct 09 '23 17:10 jordanrfrazier