James Knighton
This PR was split out of a larger Parquet streaming PR, to follow.

1. Implement `allow_schema_mismatch` -- checks all shards to verify that their schema (column name and type signatures)...
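A minimal sketch of the kind of cross-shard schema check described above. The PR's actual implementation is not shown here, so everything in it is an assumption: the `Schema` representation (column name mapped to a type signature string), the helper names `find_schema_mismatches` and `check_schemas`, and the idea that the flag downgrades a mismatch from an error to a returned report.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: column name -> type signature string.
Schema = Dict[str, str]


def find_schema_mismatches(shard_schemas: List[Schema]) -> List[Tuple[int, Schema]]:
    """Return (shard index, schema) for each shard that differs from shard 0."""
    if not shard_schemas:
        return []
    reference = shard_schemas[0]
    return [(idx, schema)
            for idx, schema in enumerate(shard_schemas[1:], start=1)
            if schema != reference]


def check_schemas(shard_schemas: List[Schema],
                  allow_schema_mismatch: bool = False) -> List[Tuple[int, Schema]]:
    """Raise on any mismatch unless the caller explicitly allows it (assumed behavior)."""
    mismatches = find_schema_mismatches(shard_schemas)
    if mismatches and not allow_schema_mismatch:
        bad = ', '.join(str(idx) for idx, _ in mismatches)
        raise ValueError(f'Shards with a schema different from shard 0: {bad}')
    return mismatches
```

Comparing every shard against the first keeps the check linear in the number of shards; a real implementation might instead report the full set of distinct schemas.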
## Description of changes:

## Issue #, if available:

## Merge Checklist:

_Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please...
Add `varint`, `varuint` encodings to MDS.
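For readers unfamiliar with variable-length integers, here is a sketch of one common scheme: LEB128-style 7-bit groups for unsigned values, plus zigzag mapping for signed values. This is an assumption for illustration only; the PR summary does not say which byte layout MDS uses, and the function names below are hypothetical.

```python
from typing import Tuple


def encode_varuint(value: int) -> bytes:
    """Encode a non-negative int as 7-bit groups, least significant group first."""
    if value < 0:
        raise ValueError('varuint cannot encode negative values')
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # High bit set: more bytes follow.
        else:
            out.append(byte)
            return bytes(out)


def decode_varuint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one varuint starting at offset; return (value, next offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_varint(value: int) -> bytes:
    """Zigzag-map a signed int (0, -1, 1, -2 -> 0, 1, 2, 3), then encode as varuint."""
    zigzag = value * 2 if value >= 0 else -value * 2 - 1
    return encode_varuint(zigzag)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one zigzag varint starting at offset; return (value, next offset)."""
    zigzag, offset = decode_varuint(data, offset)
    value = -((zigzag + 1) >> 1) if zigzag & 1 else zigzag >> 1
    return value, offset
```

Small magnitudes take one byte under this scheme, which is the usual motivation for offering variable-length encodings alongside fixed-width integer types.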
Add the option to pre-generate the epoch. This should save us a lot of time when there is a lot of work happening between creating the StreamingDataset and iterating it....
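One way to picture pre-generation is to start computing the epoch's sample order in a background thread at construction time and only wait for it when iteration begins. The sketch below illustrates that idea under stated assumptions: `PregeneratingDataset`, `generate_epoch`, and the `pregenerate` flag are hypothetical names, and the PR's actual mechanism inside `StreamingDataset` is not shown in this summary.

```python
import random
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterator, List, Optional


def generate_epoch(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Stand-in for the expensive step: a deterministic shuffled sample order."""
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)
    return order


class PregeneratingDataset:
    """Hypothetical dataset that can start generating its first epoch at init."""

    def __init__(self, num_samples: int, seed: int = 0, pregenerate: bool = True) -> None:
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0
        self._pending: Optional[Future] = None
        if pregenerate:
            executor = ThreadPoolExecutor(max_workers=1)
            self._pending = executor.submit(generate_epoch, num_samples, seed, self.epoch)
            # Already-submitted work still runs; this only stops new submissions.
            executor.shutdown(wait=False)

    def __iter__(self) -> Iterator[int]:
        if self._pending is not None:
            # Usually finished by now if other setup work ran in between.
            order = self._pending.result()
            self._pending = None
        else:
            order = generate_epoch(self.num_samples, self.seed, self.epoch)
        self.epoch += 1
        return iter(order)
```

With or without pre-generation the first epoch has the same order, since both paths call `generate_epoch` with the same arguments; the only difference is when the work happens.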
*In which we blow away 1) torch dist, 2) shared memory, and 3) filelock #YOLO*

## Nuke torch dist

Can we better contain or even eliminate Streaming's dependencies on PyTorch,...
- [x] replace prefix registration/lookup aka local dir collision detection dist
- [ ] replace shared memory, shared array, shared scalar, and shared barrier
- [x] replace streaming dataset init...
In `streaming/base/format/base/writer.py`:

```py
@classmethod
def _get_timer(cls) -> Timer:
    """Get a timer tree for the process of writing a dataset.

    Returns:
        Timer: The tree of timers.
    """
    return Timer([
        ('write', Timer([...
```