InstaNovo
Parquet file format
Dear authors,
I used your datasets on HuggingFace to train InstaNovo. As far as I can see, you only accept .csv and .ipc files for training. Could you add .parquet as a supported file format for training? Or would you recommend manually converting the .parquet files to .csv/.ipc files?
...
elif train_path.endswith(".parquet"):
    train_df = pd.read_parquet(train_path)
    train_df = train_df.sample(frac=config["train_subset"], random_state=0)
    valid_df = pd.read_parquet(valid_path)
    valid_df = valid_df.sample(frac=config["valid_subset"], random_state=0)
...
Hi, we currently do not intend to officially support the Parquet format, since Parquet datasets are generally distributed as multiple files. We recommend concatenating them and saving the result as a .csv or .ipc file.
Instead, we plan to support HuggingFace for both the training and prediction scripts, where the user may provide a link to a local or online HuggingFace dataset.
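In the meantime, the concatenation step recommended above could look roughly like the sketch below. It is not part of InstaNovo: the function name `concat_parquet_shards`, its `pattern` and `reader` parameters, and the `data/train` directory are all made up for illustration, and it assumes pandas with a Parquet engine such as pyarrow is installed.

```python
from pathlib import Path

import pandas as pd


def concat_parquet_shards(shard_dir, pattern="*.parquet", reader=pd.read_parquet):
    """Read every shard in `shard_dir` matching `pattern` and stack them row-wise.

    Hypothetical helper, not an InstaNovo function. `reader` defaults to
    pandas' Parquet reader and can be swapped for another loader.
    """
    # Sort the paths so the row order is reproducible across runs.
    paths = sorted(Path(shard_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {shard_dir}")
    frames = [reader(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Example usage (paths are placeholders):
#   train_df = concat_parquet_shards("data/train")
#   train_df.to_csv("train.csv", index=False)
# For .ipc, pandas' to_feather writes the Arrow IPC file format (needs pyarrow):
#   train_df.to_feather("train.ipc")
```

The resulting single file can then be passed to the training script in the same way as any other .csv or .ipc input.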