
How to preprocess document data?

Haihsu opened this issue 7 years ago • 1 comment

In document classification or text summarization, each example is a document, i.e. a sequence of sentences. How can I use torchtext to preprocess a document into a matrix?

Haihsu avatar Mar 23 '18 04:03 Haihsu

To preprocess a document into an array using PyTorch and the torchtext package, you can follow these steps:

Install the torchtext package if it is not already installed. You can do this with pip:

    pip install torchtext

Import the necessary libraries:

    import torch
    from torchtext.data import Field, TabularDataset, Iterator
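
Note that Field, TabularDataset, and Iterator belong to torchtext's legacy API. It is importable as above up to torchtext 0.8.x; in 0.9–0.11 it moved under the legacy namespace, and in 0.12 it was removed entirely, so on a 0.9–0.11 install the import becomes:

    # torchtext 0.9.x-0.11.x: the same classes live under torchtext.legacy
    from torchtext.legacy.data import Field, TabularDataset, Iterator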

Define the document and label fields. The document field holds the text as a sequence of tokens; the label field is only needed if your task has a target, such as a class for classification or a reference summary for summarization.

    # tokenized text: a sequence of tokens per document
    document_field = Field(sequential=True)
    # one target per document (e.g. a class label), not tokenized
    label_field = Field(sequential=False)
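
Since your documents are sequences of sentences, it may also be worth looking at the legacy NestedField, which tokenizes a document into sentences and each sentence into words, so a batch comes out as a 3-D tensor of token indices (documents × sentences × words) rather than one flat token sequence. Here is a minimal sketch as a drop-in replacement for the flat document_field above, assuming a naive split-on-period sentence splitter (use NLTK or spaCy sentence tokenization on real data):

    from torchtext.data import NestedField  # torchtext.legacy.data on 0.9-0.11

    # inner field: tokenizes each sentence into words (default whitespace split)
    word_field = Field(sequential=True, lower=True)

    # outer field: splits the raw document into sentences, then applies word_field to each
    document_field = NestedField(
        word_field,
        tokenize=lambda doc: [s.strip() for s in doc.split('.') if s.strip()],
    )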

Load the data into a TabularDataset object. Your data files should be in a columnar format such as CSV or TSV, with one column for the document and, optionally, one for the label; if the files start with a header row, also pass skip_header=True to splits.

    train_data, valid_data, test_data = TabularDataset.splits(
        path='path_to_data_folder',
        train='train.csv',
        validation='valid.csv',
        test='test.csv',
        format='csv',
        # the (name, field) pairs must match the column order in the files
        fields=[('document', document_field), ('label', label_field)]
    )
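
To confirm that the columns were mapped to the fields as intended, you can peek at the first parsed example; the attribute names come from the field tuples above:

    # each Example carries one attribute per declared field
    first = train_data.examples[0]
    print(first.document[:20])  # tokenized document: a list of tokens
    print(first.label)          # raw label, still a string before build_vocab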

Build the vocabularies using only the training data:

    document_field.build_vocab(train_data)
    label_field.build_vocab(train_data)
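
build_vocab also takes a few useful options. For example, instead of the plain call above you can drop rare tokens and attach pretrained embeddings (the GloVe vectors named below are downloaded on first use):

    # keep tokens seen at least twice and load 100-dimensional GloVe vectors
    document_field.build_vocab(train_data, min_freq=2, vectors="glove.6B.100d")
    print(len(document_field.vocab))            # vocabulary size
    print(document_field.vocab.vectors.shape)   # (vocab_size, 100) embedding matrix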

Create an Iterator object to iterate through the data in batches during training or evaluation:

    batch_size = 32
    train_iterator, valid_iterator, test_iterator = Iterator.splits(
        (train_data, valid_data, test_data),
        batch_sizes=(batch_size, batch_size, batch_size),
        sort_key=lambda x: len(x.document),
        device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    )
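
Since a sort_key is already supplied, BucketIterator is usually the better choice for training: it batches examples of similar length together, which keeps padding to a minimum. A drop-in sketch with the same datasets:

    from torchtext.data import BucketIterator  # torchtext.legacy.data on 0.9-0.11

    train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_sizes=(batch_size, batch_size, batch_size),
        sort_key=lambda x: len(x.document),
        sort_within_batch=True,  # needed if you later pack padded sequences
        device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    )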

You can now use train_iterator, valid_iterator, and test_iterator to loop over the data in batches while training your model. Each batch contains a matrix of token indices representing the documents and, if applicable, the corresponding labels.
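
For example, a training loop pulls the numericalized tensors off each batch by field name (a sketch assuming the flat document_field defined earlier; the model itself is omitted):

    for batch in train_iterator:
        text = batch.document   # LongTensor of token indices, shape (seq_len, batch_size)
        labels = batch.label    # LongTensor of label indices, shape (batch_size,)
        # forward pass, loss computation, backward pass, and optimizer step go here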

Pietro19 avatar Jun 06 '23 22:06 Pietro19