
Supporting list of files for `keras.utils.text_dataset_from_directory`

Open jessechancy opened this issue 3 years ago • 1 comments

Currently, `keras.utils.text_dataset_from_directory` creates a dataset from the files in a directory. It targets text classification, so it defaults to a nested directory structure in which each folder is a class. For the simpler use case of just reading text files into a dataset, the API falls short: it requires the files to be in the same directory and to have names ending in ".txt".

I'm proposing an additional function that creates a text dataset from a list of filenames. Because each file is named explicitly in the list, the ".txt" restriction could also be dropped. The function would reuse the file-to-dataset conversion code from keras.utils.text_dataset_from_directory, but expose it through a small, convenient API similar to tf.data.TextLineDataset.
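A minimal sketch of what such a function might look like, built only on existing `tf.data` and `tf.io` calls. The name `text_dataset_from_files` and its `batch_size` parameter are hypothetical, chosen here for illustration; they are not an existing Keras API, and the real implementation could differ.

```python
import tensorflow as tf


def text_dataset_from_files(filenames, batch_size=None):
    """Hypothetical helper: yields one string element per file, in list order.

    No file-extension filtering is applied, since every path is given explicitly.
    """
    # Force a string dtype so that an empty list still produces a valid dataset.
    paths = tf.data.Dataset.from_tensor_slices(
        tf.constant([str(f) for f in filenames], dtype=tf.string)
    )
    # tf.io.read_file returns the whole file content as a scalar string tensor,
    # unlike tf.data.TextLineDataset, which yields one element per line.
    ds = paths.map(tf.io.read_file, num_parallel_calls=tf.data.AUTOTUNE)
    if batch_size is not None:
        ds = ds.batch(batch_size)
    return ds
```

Under these assumptions, `text_dataset_from_files(["notes.md", "run.log"])` would produce two elements, each holding the full text of one file, regardless of where the files live or how they are named.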

People who just want to convert text files into a dataset would benefit from this API. Currently, the only two ways to do so are:

  • tf.data.TextLineDataset: reads each line as a separate element instead of reading the whole file.
  • keras.utils.text_dataset_from_directory: requires several manual steps, including moving the files into the same directory, renaming them to end in .txt, and setting the parameters labels=None and shuffle=False.
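To make the second workaround concrete, here is a rough sketch of the steps it involves today. The helper name `dataset_via_directory_workaround` and the staging-directory naming scheme are assumptions made for this illustration; only `text_dataset_from_directory` and its `labels`, `shuffle`, and `batch_size` parameters come from Keras itself.

```python
import os
import shutil
import tempfile

import tensorflow as tf


def dataset_via_directory_workaround(filenames):
    """Illustrative only: stage copies as .txt files in one directory, then load."""
    staging_dir = tempfile.mkdtemp()
    for i, src in enumerate(filenames):
        # Rename on copy: the loader only picks up files ending in ".txt".
        # The zero-padded index avoids collisions between identical basenames.
        shutil.copy(src, os.path.join(staging_dir, f"{i:06d}.txt"))
    # labels=None disables class inference from subfolders;
    # shuffle=False avoids randomizing the file order.
    return tf.keras.utils.text_dataset_from_directory(
        staging_dir, labels=None, shuffle=False, batch_size=1
    )
```

Each element of the returned dataset is a batch of one string holding a whole file. The copy-and-rename step is exactly the overhead that a filename-based API would remove.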

Contributing

  • Do you want to contribute a PR? (yes/no): yes

jessechancy avatar Jul 26 '22 23:07 jessechancy

Hi @jessechancy, could you please share reproducible code that illustrates your point, so that the issue can be easily understood? Thank you!

tilakrayal avatar Jul 27 '22 12:07 tilakrayal