Parquet file format

Open denisbeslic opened this issue 8 months ago • 1 comments

Dear authors,

I used your datasets on HuggingFace to train InstaNovo. As far as I can see, only .csv and .ipc files are accepted for training. Could you add .parquet as a supported file format for training? Or would you recommend manually converting the .parquet files to .csv/.ipc files? Something like this would be enough:

...
elif train_path.endswith(".parquet"):
    train_df = pd.read_parquet(train_path)
    train_df = train_df.sample(frac=config["train_subset"], random_state=0)
    valid_df = pd.read_parquet(valid_path)
    valid_df = valid_df.sample(frac=config["valid_subset"], random_state=0)
...

denisbeslic, Oct 19 '23 11:10

Hi, we currently do not intend to officially support the Parquet format, since Parquet datasets are generally distributed as multiple files. We recommend concatenating the files and saving the result as a single .csv or .ipc file.
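
A minimal sketch of that conversion, assuming the shards sit in one local directory, share the same columns, and that pandas (with pyarrow) is installed. The function names, the `read` parameter, and the paths in the usage note are hypothetical and not part of InstaNovo; whether the resulting columns match what the training script expects is not checked here.

```python
from pathlib import Path

import pandas as pd


def concat_parquet_shards(shard_dir, pattern="*.parquet", read=pd.read_parquet):
    """Read every shard matching `pattern` in `shard_dir` and stack them row-wise.

    `read` is the per-file reader; it defaults to pandas' Parquet reader.
    """
    # Sort so the row order is reproducible across runs.
    paths = sorted(Path(shard_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {shard_dir}")
    frames = [read(path) for path in paths]
    # ignore_index=True gives a fresh 0..n-1 index, which to_feather requires.
    return pd.concat(frames, ignore_index=True)


def save_for_training(df, out_path):
    """Write `df` as .csv or .ipc, chosen by the file extension of `out_path`."""
    out_path = Path(out_path)
    if out_path.suffix == ".csv":
        df.to_csv(out_path, index=False)
    elif out_path.suffix == ".ipc":
        # Assumption: pandas' Feather V2 output is the Arrow IPC file format,
        # so an IPC reader should be able to load it.
        df.to_feather(out_path)
    else:
        raise ValueError(f"unsupported extension: {out_path.suffix!r}")
    return out_path
```

For example, `save_for_training(concat_parquet_shards("data/train_shards"), "data/train.ipc")` would produce a single training file from a hypothetical `data/train_shards` directory; the same call with a validation directory gives the validation file.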

Instead, we plan to support HuggingFace for both the training and prediction scripts, where the user may provide a link to a local or online HuggingFace dataset.

KevinEloff, Oct 20 '23 10:10