Doc polish: "Data Loaders" --> "Datasets"

Open ZhitingHu opened this issue 5 years ago • 6 comments

The section is titled "Data Loaders" https://texar-pytorch.readthedocs.io/en/latest/code/data.html#data-loaders

Would "Datasets" be better? Or does "Data Loaders" fit the PyTorch convention better?

@huzecong @AvinashBukkittu

ZhitingHu avatar Oct 04 '19 18:10 ZhitingHu

I personally like "Datasets" as the section heading here. All the classes described under it are datasets provided by Texar. Our data iterators share similarities with PyTorch's data loaders. Also, I see that we are missing the doc for SingleDatasetIterator. I don't know if this was intentional.

AvinashBukkittu avatar Oct 04 '19 18:10 AvinashBukkittu

The doc of Args is missing for Batch https://texar-pytorch.readthedocs.io/en/latest/code/data.html#texar.torch.data.Batch

ZhitingHu avatar Oct 04 '19 19:10 ZhitingHu

I like Dataset as well. I think the terms people use to describe data-related modules are pretty messy, so as long as we're being consistent it's fine. Let me reiterate our definitions:

  • A data source is something that reads and returns raw data examples one by one. Typical data sources include Python lists and iterators (SequenceDataSource and IterDataSource), lines from text files (TextLineDataSource), and pickled objects from binary files (PickleDataSource).
  • A dataset (or data loader) defines how data examples are preprocessed into a format suitable for the task, and how these processed examples can be batched. These are called *Data in our framework for compatibility with the TF version (although I kind of prefer names like MonoTextData to MonoTextDataset because they're shorter and still to the point). Note that a dataset does not perform any of these operations by itself.
  • A data iterator executes the process and batch operations defined in a dataset. PyTorch calls this a "data loader".
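
To make the three roles above concrete, here is a minimal plain-Python sketch. It does not use Texar's API: `ListSource`, `ToyTextData`, and `ToyIterator` are hypothetical names invented for illustration, and the whitespace tokenization is an assumption. It only shows the division of labor: the source yields raw examples, the dataset defines processing and batching, and the iterator actually runs them.

```python
from typing import Dict, Iterable, Iterator, List


class ListSource:
    """Data source (hypothetical): yields raw examples one by one."""

    def __init__(self, items: Iterable[str]) -> None:
        self._items = list(items)

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)


class ToyTextData:
    """Dataset (hypothetical): defines processing and batching, runs neither."""

    def __init__(self, source: ListSource, batch_size: int) -> None:
        self.source = source
        self.batch_size = batch_size

    def process(self, raw: str) -> List[str]:
        # Assumed preprocessing: lowercase, then split on whitespace.
        return raw.lower().split()

    def collate(self, examples: List[List[str]]) -> Dict[str, list]:
        # Group processed examples into one batch and record their lengths.
        return {"tokens": examples, "lengths": [len(e) for e in examples]}


class ToyIterator:
    """Data iterator (hypothetical): executes what the dataset defines."""

    def __init__(self, dataset: ToyTextData) -> None:
        self.dataset = dataset

    def __iter__(self) -> Iterator[Dict[str, list]]:
        buffer: List[List[str]] = []
        for raw in self.dataset.source:
            buffer.append(self.dataset.process(raw))
            if len(buffer) == self.dataset.batch_size:
                yield self.dataset.collate(buffer)
                buffer = []
        if buffer:  # emit the final, possibly smaller, batch
            yield self.dataset.collate(buffer)


source = ListSource(["Hello world", "Texar for PyTorch", "Bye"])
batches = list(ToyIterator(ToyTextData(source, batch_size=2)))
```

Nothing is processed when `ToyTextData` is constructed; work happens only when `ToyIterator` is iterated, which mirrors the note that a dataset does not perform any operations by itself.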

It is intentional that we don't include the doc for SingleDatasetIterator. Users are expected to only use the DataIterator interface.

huzecong avatar Oct 04 '19 19:10 huzecong

Thanks for the clarification. Can these definitions be added somewhere in the doc?

ZhitingHu avatar Oct 04 '19 19:10 ZhitingHu

We can probably have an "Overview" page for each set of modules, to give an overview and highlight key features. Like in TF: https://www.tensorflow.org/api_docs/python/tf/data

ZhitingHu avatar Oct 04 '19 19:10 ZhitingHu

Sure. I'll get on it.

huzecong avatar Oct 04 '19 22:10 huzecong