podium
podium copied to clipboard
Remove `from_tabular_file` in ArrowDataset
ArrowDataset.from_tabular_file is similar to TabularDataset's __init__
. I think it's safe to remove this function. The same effect can be achieved with:
ArrowDataset.from_dataset(TabularDataset(...), ...)
E.g. #267 introduced some changes to TabularDataset's __init__
and this logic was not added to ArrowDataset. By deleting from_tabular_file, we won't have such problems in the future.
cc @ivansmokovic @mttk @FilipBolt
Agreed but I think the difference is that .from_dataset has the examples in-memory, while from_tabular_file doesn't actually load them (they are lazily loaded via a genexp). Correct me if I'm wrong @ivansmokovic I guess we can abstract the loading logic out of TabularDataset and reuse it instead of having the code duplicated.
Btw; what's the purpose of having a disk backed dataset if the dataset is already loaded in memory (e.g., using from_dataset
)?
@mttk You are right. The reason to reimplement this in ArrowDataset
was to avoid loading the whole dataset into memory.
We could abstract the loading code out in some way, but that just brings additional complexity to the user interface. Let's update this method and leave this as-is for now.
I'm fine with abstacting the logic out to avoid duplication or we can wait (to follow the Rule of three)
that just brings additional complexity to the user interface
Not if we keep it private.
@mariosasko @ivansmokovic @FilipBolt please take a look at #274