podium icon indicating copy to clipboard operation
podium copied to clipboard

Remove `from_tabular_file` in ArrowDataset

Open mariosasko opened this issue 4 years ago • 4 comments

ArrowDataset.from_tabular_file is similar to TabularDataset's __init__. I think it's safe to remove this function. The same effect can be achieved with:

ArrowDataset.from_dataset(TabularDataset(...), ...)

E.g. #267 introduced some changes to TabularDataset's __init__ and this logic was not added to ArrowDataset. By deleting from_tabular_file, we won't have such problems in the future.

cc @ivansmokovic @mttk @FilipBolt

mariosasko avatar Jan 14 '21 11:01 mariosasko

Agreed but I think the difference is that .from_dataset has the examples in-memory, while from_tabular_file doesn't actually load them (they are lazily loaded via a genexp). Correct me if I'm wrong @ivansmokovic I guess we can abstract the loading logic out of TabularDataset and reuse it instead of having the code duplicated.

Btw; what's the purpose of having a disk backed dataset if the dataset is already loaded in memory (e.g., using from_dataset)?

mttk avatar Jan 14 '21 16:01 mttk

@mttk You are right. The reason to reimplement this in ArrowDataset was to avoid loading the whole dataset into memory. We could abstract the loading code out in some way, but that just brings additional complexity to the user interface. Let's update this method and leave this as-is for now.

ivansmokovic avatar Jan 14 '21 16:01 ivansmokovic

I'm fine with abstacting the logic out to avoid duplication or we can wait (to follow the Rule of three)

that just brings additional complexity to the user interface

Not if we keep it private.

mariosasko avatar Jan 14 '21 17:01 mariosasko

@mariosasko @ivansmokovic @FilipBolt please take a look at #274

mttk avatar Jan 14 '21 17:01 mttk