Ronan Lamy
IIUC, that issue only happens when the result set of the query is non-deterministic, i.e. only when there's a non-deterministic filter on the rows (such as `LIMIT`). So I can see...
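To make the `LIMIT` point concrete, here is a minimal, self-contained sketch (plain `sqlite3`, not datachain) of why a query with `LIMIT` but no `ORDER BY` has no single well-defined result set:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT)")
con.executemany("INSERT INTO files VALUES (?)", [("a",), ("b",), ("c",)])

# Without an ORDER BY, SQL does not specify which two rows are returned,
# so two executions of the same query may legitimately yield different rows
# (even though a given engine often happens to be stable in practice).
first = con.execute("SELECT path FROM files LIMIT 2").fetchall()
second = con.execute("SELECT path FROM files LIMIT 2").fetchall()
print(first, second)  # equality is not guaranteed by SQL semantics
```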
Note that with the current architecture, `pre_fetch` won't do much, since only one `File` object exists at a time (assuming no batching).
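For illustration, a rough sketch (hypothetical names, not the actual datachain code) of why there is nothing to prefetch in that setup: each `File` is materialized, handed to the UDF, and dropped before the next one appears.

```python
def run_udf(udf, rows):
    # `rows` stands in for the query result iterator. Only one row/File
    # exists at a time: it is produced, processed, and discarded before
    # the next one is created, so a prefetcher has no "upcoming" files
    # to warm up.
    for row in rows:
        yield udf(row)
```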
Some notes:

* In order to implement this, we need to insert logic similar to `DatasetQuery.extract()` before (or maybe in) `udf.run()` (rough sketch below).
* Fetching should be implemented, or at least controlled...
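As a rough sketch of what that could look like (hypothetical `prefetched` helper and `fetch` callable, not an actual datachain API), the row iterator feeding `udf.run()` could be wrapped so that up to `depth` downloads run ahead of the UDF:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def prefetched(rows, fetch, depth=2):
    """Yield `rows` in order while running `fetch` up to `depth` rows ahead."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        window = deque()
        for row in rows:
            window.append((row, pool.submit(fetch, row)))
            if len(window) > depth:
                ready, fut = window.popleft()
                fut.result()  # wait for the oldest download to finish
                yield ready
        while window:
            ready, fut = window.popleft()
            fut.result()
            yield ready
```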
After probably too much refactoring, I can confirm that this can be implemented inside `udf.run()` (rough sketch below), which means that:

* we don't need any (significant) changes to parallel or distributed code...
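Conceptually, the change stays local to `run()`. A hypothetical sketch, reusing the `prefetched` helper from above (not the real class):

```python
class Mapper:
    prefetch = 2  # hypothetical knob, mirroring .settings(prefetch=2)

    def process(self, row):
        raise NotImplementedError

    def run(self, rows, fetch):
        # The wrapping happens inside run(), so parallel and distributed
        # runners keep calling run() exactly as they did before.
        if self.prefetch:
            rows = prefetched(rows, fetch, depth=self.prefetch)
        for row in rows:
            yield self.process(row)
```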
The proposed implementation has a problem: it hangs when run in distributed mode, i.e. when using something like `.settings(prefetch=2, workers=2)`. Here's what happens (with some simplifications!) when running a mapper...
Using threading in `AsyncMapper.produce()` runs into the issue that iteration needs to be thread-safe, but that seems fixable; see #521. That PR only deals with `Mapper` and `Generator`, though. Regarding...
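The thread-safety part is the easy bit. A minimal sketch of the kind of wrapper that makes a shared iterator safe to advance from a producer thread (illustrative only, not necessarily what #521 does):

```python
import threading

class ThreadSafeIterator:
    """Serialize next() calls so a single iterator can be shared
    between the event-loop side and a producer thread."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)
```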
@skshetry I think you've understood all the issues by now, but to clarify: my first attempt was hanging in distributed mode, which I then fixed in #521, but that introduced...
In the case of `.map()`, being able to replace a compound signal (i.e. a column whose type is a `DataModel`) with another requires us to be able to tell the...
`.persist()` is the name of the method in the [dataframe API standard](https://data-apis.org/dataframe-api/draft/API_specification/dataframe_object.html#dataframe_api.DataFrame.persist). I think that's what we should use, assuming it works exactly as described in the standard.
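For reference, the semantics described there in a toy, self-contained form (hypothetical `LazyChain`, not the real DataChain class): `persist()` is a hint to compute the pending steps eagerly, and downstream operations then start from the materialized result instead of recomputing them.

```python
from typing import Callable, Iterable

class LazyChain:
    """Toy lazy pipeline illustrating persist() as described in the standard."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps

    def map(self, fn: Callable) -> "LazyChain":
        return LazyChain(self._source, self._steps + (fn,))

    def persist(self) -> "LazyChain":
        # Compute everything up to this point once; later operations
        # build on the materialized rows rather than re-running the steps.
        return LazyChain(list(self._compute()))

    def _compute(self):
        for item in self._source:
            for step in self._steps:
                item = step(item)
            yield item

    def to_list(self) -> list:
        return list(self._compute())

# Usage: `expensive` runs once thanks to persist(); both branches reuse it.
expensive_calls = []
def expensive(x):
    expensive_calls.append(x)
    return x * 10

chain = LazyChain(range(3)).map(expensive).persist()
a = chain.map(lambda x: x + 1).to_list()
b = chain.map(lambda x: x - 1).to_list()
assert expensive_calls == [0, 1, 2]  # not re-run for the second branch
```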