How do I load data from a csv file

Open nawshad opened this issue 4 years ago • 10 comments

I have a dataset containing text and labels separated by tabs. How can I load this dataset using torchtext?

nawshad avatar Mar 25 '20 06:03 nawshad

It's probably similar to the text classification datasets here.

@mttk Do you know if the current library supports loading a CSV file?

zhangguanheng66 avatar Mar 25 '20 13:03 zhangguanheng66

from torchtext import data

# One Field per column, in the same order as the columns in the file.
TEXT = data.Field()
LABEL = data.LabelField()

fields = [('text', TEXT), ('label', LABEL)]

train_data, test_data = data.TabularDataset.splits(
    path='data',
    train='train.csv',
    test='test.csv',
    format='tsv',  # 'tsv' for tabs, 'csv' for commas
    fields=fields,
)

bentrevett avatar Mar 25 '20 15:03 bentrevett
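
For readers on a torchtext release without `Field`, the same tab-separated layout can also be read with Python's standard `csv` module. The following is a minimal sketch, not a torchtext API: the helper name `load_tsv` and the sample rows are hypothetical, and it assumes every non-blank line holds exactly a text column followed by a label column.

```python
import csv
import os
import tempfile


def load_tsv(path):
    """Return a list of (text, label) pairs from a tab-separated file.

    Hypothetical helper; assumes two columns per row: text, then label.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE leaves quote characters inside the text untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines
                continue
            text, label = row
            pairs.append((text, label))
    return pairs


# Write a tiny made-up file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write("a great movie\tpos\n")
    tmp.write("not worth watching\tneg\n")
    tmp_path = tmp.name

examples = load_tsv(tmp_path)
os.remove(tmp_path)

# Map each distinct label to an integer index, in sorted order.
labels = sorted({label for _, label in examples})
label_to_index = {label: i for i, label in enumerate(labels)}
```

The resulting pairs and label mapping can then be fed to whatever tokenizer and batching code the project already uses.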

@zhangguanheng66 Could we add the example @bentrevett posted to an examples/usage section? torchtext doesn't really have any examples for external datasets. Adding a few examples for datasets that are not built into torchtext would help new users understand how to use it.

M-e-r-c-u-r-y avatar Apr 03 '20 17:04 M-e-r-c-u-r-y

We plan to eventually retire the Field class as legacy code. However, at this moment, we could land an OSS PR as an example to help with the use case above. @M-e-r-c-u-r-y

zhangguanheng66 avatar Apr 03 '20 22:04 zhangguanheng66

How can I load the AG_NEWS or DBpedia datasets from a local CSV file using 'text_classification.DATASETS' instead of from Google Drive?

GaoJiqiang avatar Apr 14 '20 15:04 GaoJiqiang

If you have the paths of train/test files for AG_NEWS and DBpedia, you could save them as a list and start from here. So why not just call the AG_NEWS and DBpedia API to load the datasets?

zhangguanheng66 avatar Apr 14 '20 21:04 zhangguanheng66

Thanks. The reason is that I have to use a VPN every time I run the project to get the data from Google Drive, which is a bit of a hassle. I want to download the CSV files and store them in a local directory for convenience. I will try your method, best wishes.

GaoJiqiang avatar Apr 15 '20 01:04 GaoJiqiang
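
Once the CSV files are downloaded, they can be read locally without any network access. The sketch below is not the torchtext implementation: it assumes the commonly distributed layout of three quoted columns (class index, title, description), and the function name `read_local_csv`, the file name, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def read_local_csv(paths):
    """Yield (label, text) pairs from a list of local CSV files.

    Hypothetical helper; assumes rows of: class index, title, description.
    """
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                label = int(row[0])
                # Join the remaining columns into one text string.
                text = " ".join(row[1:])
                yield label, text


# Made-up sample file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.csv")
with open(train_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["3", "Example title", "Example description, with a comma"])
    writer.writerow(["1", "Another title", "Another description"])

train_pairs = list(read_local_csv([train_path]))

# Clean up the temporary file and directory.
os.remove(train_path)
os.rmdir(tmp_dir)
```

Because the function takes a list of paths, the train and test files can be passed separately to keep the two splits apart.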

Is there a way to load datasets from CSV files in torchtext == 0.12? It seems like they removed legacy as well.

y12uc231 avatar Jun 12 '22 03:06 y12uc231

@y12uc231 In torchtext 0.12 we migrated our datasets on top of torchdata. You can look at the dataset implementations, which offer plenty of examples of how to work with CSV files, or refer to the torchdata documentation for additional information on the usage and functionality available in datapipes.

Datapipe for reading data from CSV files is here

from torchdata.datapipes.iter import IterableWrapper, FileOpener
dp = IterableWrapper(["my_csv_file.csv"])
dp = FileOpener(dp, mode='b')
dp = dp.parse_csv()

for sample in dp:
    print(sample)

parmeet avatar Jun 13 '22 14:06 parmeet

Thanks! This works!

This may be a slightly different question, but do you know how to load a GloVe embedding vocabulary for my dataset? The Vocab class used to have "load_vectors", which doesn't seem to exist in the latest versions of torchtext.

y12uc231 avatar Jun 13 '22 23:06 y12uc231