skorch NeuralNetClassifier with DataLoader

I have created DataLoaders for train and val, but how can I use them with NeuralNetClassifier? NeuralNetClassifier.fit() needs a Dataset as a parameter.

Sep 18 '22 12:09 kamalabdul

The DataLoader should take the dataset as the first argument when being initialized. If you follow that design, you can pass your dataloader class (not instance) to a skorch net with the iterator_train and iterator_valid arguments (see docs). E.g.:

class MyDataLoader:
    def __init__(self, dataset, ...):
        self.dataset = dataset
        ...

net = NeuralNet(..., iterator_train=MyDataLoader, iterator_valid=MyDataLoader)
net.fit(X, y)

Note that the net is initialized with the classes, not with instances; this is important. If you need to pass more init arguments, use the double underscore notation:

net = NeuralNet(..., iterator_train=MyDataLoader, iterator_train__foo=bar)
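
To illustrate the double underscore notation, here is a minimal plain-Python sketch of how prefixed arguments could be routed to the loader. The helper `collect_prefixed` and the class `ToyLoader` are hypothetical names made up for this illustration; this is not skorch's actual implementation, only the general idea.

```python
# Hypothetical stand-in for prefix-based argument routing; NOT skorch's real code.
def collect_prefixed(prefix, kwargs):
    """Return the kwargs that start with '<prefix>__', with the prefix stripped."""
    start = prefix + "__"
    return {k[len(start):]: v for k, v in kwargs.items() if k.startswith(start)}


class ToyLoader:
    """Toy loader: the dataset comes first, everything else is a keyword argument."""

    def __init__(self, dataset, batch_size=2, foo=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.foo = foo


net_kwargs = {
    "iterator_train": ToyLoader,     # the class, not an instance
    "iterator_train__foo": "bar",    # routed to the training loader
    "iterator_valid__foo": "baz",    # routed to the validation loader only
}

train_kwargs = collect_prefixed("iterator_train", net_kwargs)
loader = net_kwargs["iterator_train"]([1, 2, 3], **train_kwargs)

print(train_kwargs)  # {'foo': 'bar'}
print(loader.foo)    # bar
```

The point of the prefix is that the net can own the construction of the loader while you still control its init arguments.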

Sep 18 '22 12:09 BenjaminBossan

This is my DataLoader:

train_loader = DataLoader(train_dataset, batch_size=512, collate_fn=vectorize_batch)
test_loader = DataLoader(test_dataset, batch_size=512, collate_fn=vectorize_batch)

These loaders are inside the NeuralNetClassifier.

neuralNetModel.fit(train_dataset, y=y_train)

I've got this error: TypeError: 'DataLoader' object is not callable

Sep 18 '22 13:09 kamalabdul

The error is caused by the fact that you pass the initialized DataLoader. Change your code like this:

net = NeuralNetClassifier(
    ...,
    iterator_train=DataLoader,  # <= uninitialized DataLoader
    iterator_train__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    iterator_valid=DataLoader,  # <= uninitialized DataLoader
    iterator_valid__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    batch_size=512,
)
net.fit(train_dataset, y=y_train)
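
For anyone wondering where the "not callable" error comes from, here is a small plain-Python sketch. `ToyLoader` and `build_iterator` are hypothetical names used only for illustration; the assumption is simply that the net constructs the loader itself by calling whatever was passed as the iterator with the dataset.

```python
class ToyLoader:
    """Toy stand-in for a data loader class."""

    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size


def build_iterator(iterator, dataset, **kwargs):
    # Framework side (sketch): construct the loader from whatever was passed in.
    return iterator(dataset, **kwargs)


dataset = ["a", "b", "c"]

# Passing the class works: calling it creates a new loader.
ok = build_iterator(ToyLoader, dataset, batch_size=2)

# Passing an instance fails: an already-built loader cannot be called.
try:
    build_iterator(ToyLoader(dataset), dataset, batch_size=2)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'ToyLoader' object is not callable
```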

Sep 18 '22 17:09 BenjaminBossan

Still no luck. I get IndexError: invalid index to scalar variable when calling fit. The original code was working before migrating to skorch.

from torch.utils.data import Dataset, DataLoader, random_split
from torchtext import datasets, transforms, models

class AESDataset(Dataset):

    def __init__(self, dataframe, transform):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)  # 3999

    def __getitem__(self, idx):
        essay = self.dataframe.iloc[idx]['essay']
        final_score = self.dataframe.iloc[idx]['final_score'] + 1

        sample = (final_score, essay)
        return sample

from torchtext import transforms

transform = transforms.ToTensor()  # not used yet
train_dataset = AESDataset(train_data, transform)
test_dataset = AESDataset(test_data, transform)

from skorch import NeuralNetClassifier
from torch import optim
from skorch.callbacks import EpochScoring, BatchScoring, EarlyStopping

early_stop = EarlyStopping(monitor="valid_loss", patience=3, threshold_mode="rel")
f1 = EpochScoring(scoring='f1_weighted', lower_is_better=False, name="F1-score")

max_epoch_val = 2
torch.manual_seed(360)
neuralNetModel = NeuralNetClassifier(
    neuralNetModule,
    max_epochs=max_epoch_val,  # ori 10
    iterator_train__num_workers=0,  # higher value may cause broken pipe
    iterator_valid__num_workers=0,  # higher value may cause broken pipe
    lr=1e-3,

    iterator_train=DataLoader,  # <= uninitialized DataLoader
    iterator_train__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    iterator_valid=DataLoader,  # <= uninitialized DataLoader
    iterator_valid__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    batch_size=512,

    optimizer=optim.Adam,
    # callbacks=[f1, early_stop],
    criterion=nn.CrossEntropyLoss,
    device=device,
)

y_train = [y for y,x in iter(train_dataset)]

neuralNetModel.fit(train_dataset, y=y_train)

I got this error when fitting:

File /usr/local/lib/python3.9/dist-packages/skorch/utils.py:352, in multi_indexing(data, i, indexing)
    349     return indexing(data, i)
    351 # If we don't know how to index, find out and apply
--> 352 return check_indexing(data)(data, i)

File /usr/local/lib/python3.9/dist-packages/skorch/utils.py:244, in _indexing_other(data, i)
    241 def _indexing_other(data, i):
    242     # sklearn's safe_indexing doesn't work with tuples since 0.22
    243     if isinstance(i, (int, np.integer, slice, tuple)):
--> 244         return data[i]
    245     return safe_indexing(data, i)

IndexError: invalid index to scalar variable.

Sep 19 '22 03:09 kamalabdul

Tip: For your code to show properly, please use triple backticks, like so:

```
x = 2 + 2
y = x * 3
```

becomes

x = 2 + 2
y = x * 3

Regarding your problem, I wonder if we can simplify this whole thing. Looking at these lines:

    essay = self.dataframe.iloc[idx]['essay']
    final_score = self.dataframe.iloc[idx]['final_score'] + 1

It looks to me like you're using a pandas DataFrame, where two columns are relevant, "essay" and "final_score". I assume that "essay" is a column of strings and "final_score" are integers. Could you try something like this:

X = df["essay"].values
y = df["final_score"].values
net = NeuralNetClassifier(
    neuralNetModule,
    max_epochs=max_epoch_val,
    lr=1e-3,
    batch_size=512,    
    optimizer=optim.Adam,
    #callbacks =[f1, early_stop],
    criterion=nn.CrossEntropyLoss,
    device=device,
)
net.fit(X, y)

This should be much simpler than the way you posted above. I'm not quite sure how exactly your neuralNetModule deals with the "essay" data, but I assume you have figured something out.

Sep 20 '22 23:09 BenjaminBossan

I need to vectorize the essays (with GloVe) before training. The vectorization is done in the train_loader, which is supposed to be more efficient. Is there any way to achieve that?

Thanks Benjamin for your help.

Sep 21 '22 05:09 kamalabdul

I see, this is what the vectorize_batch does, right?

Would it be possible for you to call this function on your input data once before training the net and then passing the Glove vectors to skorch? So it would look something like this:

X_essay = df["essay"].values
X_vector = vectorize_batch(X_essay)
...
net.fit(X_vector, y)

Your existing vectorize_batch may need to be rewritten a bit to make this work, but it should be the better solution. It is better because you vectorize the data only once, before training starts. In your initial solution, you would vectorize your data once per epoch, which would considerably slow down training.

Since Glove vectorization should be deterministic (same result each time you call with the same text), there is no need to call it once per epoch (it's different if there are random augmentations, as is common for image data).
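
To make the cost difference concrete, here is a toy plain-Python sketch that counts how often a vectorizer is called in each setup. Everything in it is hypothetical: `EMBEDDINGS` is a two-word stand-in for a GloVe table, and `vectorize` averages word vectors, which may not match what your vectorize_batch actually does.

```python
# Hypothetical toy embedding table standing in for GloVe vectors.
EMBEDDINGS = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
calls = {"n": 0}


def vectorize(text):
    """Average the word vectors of a text (deterministic for a given text)."""
    calls["n"] += 1
    vecs = [EMBEDDINGS.get(tok, [0.0, 0.0]) for tok in text.split()]
    n = max(len(vecs), 1)
    return [sum(v[i] for v in vecs) / n for i in range(2)]


essays = ["good good", "bad", "good bad"]
n_epochs = 5

# Setup 1: vectorize inside the loader, so every epoch repeats the work.
calls["n"] = 0
for _ in range(n_epochs):
    batch = [vectorize(e) for e in essays]
per_epoch_calls = calls["n"]

# Setup 2: vectorize once up front and reuse the result in every epoch.
calls["n"] = 0
X_vector = [vectorize(e) for e in essays]
for _ in range(n_epochs):
    batch = X_vector
upfront_calls = calls["n"]

print(per_epoch_calls, upfront_calls)  # 15 3
```

With a deterministic vectorizer, both setups feed identical vectors to the net; only the amount of repeated work differs.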

Sep 21 '22 08:09 BenjaminBossan

So, following your suggestion, I should vectorize each essay (instead of each batch of 64) and put the results into X_vector, or put them into the train data?

Sep 21 '22 09:09 kamalabdul

Just put all your vectors into X_vector. There should be no need to do anything with Dataset or DataLoader.

One potential issue with my suggestion is that it might require a lot of memory, depending on your data. Let me know if you run into memory issues.
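
A rough back-of-the-envelope estimate can tell you in advance whether memory is likely to be a problem. The sketch below assumes float32 values, 300-dimensional vectors and about 500 tokens per essay; these numbers are assumptions for illustration, only the 3999 rows come from the dataset shown earlier in this thread.

```python
def estimate_bytes(n_samples, values_per_sample, bytes_per_value=4):
    """Memory needed to hold all precomputed vectors (float32 by default)."""
    return n_samples * values_per_sample * bytes_per_value


n_essays = 3999          # from the dataset length mentioned earlier in the thread
embedding_dim = 300      # assumption: 300-dimensional vectors
tokens_per_essay = 500   # assumption: padded/truncated essay length

# One vector per token: n_essays x tokens_per_essay x embedding_dim values.
per_token = estimate_bytes(n_essays, tokens_per_essay * embedding_dim)

# One averaged vector per essay: n_essays x embedding_dim values.
pooled = estimate_bytes(n_essays, embedding_dim)

print(f"{per_token / 1024**3:.2f} GiB")  # 2.23 GiB
print(f"{pooled / 1024**2:.2f} MiB")     # 4.58 MiB
```

Under these assumptions, keeping one vector per token is where memory could become tight, while one pooled vector per essay is negligible.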

Sep 21 '22 09:09 BenjaminBossan

@kamalabdul Any updates?

Oct 07 '22 11:10 BenjaminBossan

I have tried, but there was a problem. I created my project in a Paperspace Gradient Notebook. Is it alright if I invite you by email to see the code in the Gradient Notebook?

Oct 08 '22 07:10 kamalabdul

> Is it alright if I invite you to see the code through email to Gradient Notebook?

Can this code be made public? Then I can take a look. If that's not possible (e.g. because it contains proprietary code/data from work), then it's better not to share it.

Oct 10 '22 09:10 BenjaminBossan

@kamalabdul any updates?

Oct 26 '22 14:10 BenjaminBossan

Hi Benjamin. Sorry for the late response. I have a custom dataset that is similar in structure to the AG_NEWS dataset. I have attached a Jupyter notebook, PyTorch_AG_NEWS.ipynb. I would like a guide on how to convert a plain PyTorch implementation into a skorch implementation.

Pytorch_AGNEWS.zip

This is related to my queries earlier.

Nov 06 '22 14:11 kamalabdul

Thanks for providing the notebook. This gave me a much better idea of what you are trying to achieve.

Please take a look at the modified notebook that I attached. I hope that this provides a solution for you.

In your notebook, you use CountVectorizer. I switched to TfidfVectorizer, but I think it doesn't matter. What you tried with passing the vectorizer to the collate function is too complicated; I simplified this step by putting everything inside an sklearn Pipeline. I hope you agree that this is a cleaner approach.

Btw, I think if you want to go with a bag of words approach such as this, you can probably stick to pure sklearn, no need for skorch. Just replace the skorch NeuralNetClassifier in the example above with something like LogisticRegression (which works well with sparse data) or MLPClassifier from sklearn and you should get pretty good results :)

However, if you really want to tap into the power of neural nets, I added a section on how you can use skorch and Hugging Face transformers to train a BERT model on the same data. This is potentially a much more powerful model, but it also takes much longer to train. Try both and see which approach works better for you.

skorch_AGNEWS.zip

Nov 06 '22 15:11 BenjaminBossan

Thanks Benjamin for putting things together. I will study the script and adapt from there.

Nov 07 '22 01:11 kamalabdul

@kamalabdul any updates?

Nov 22 '22 09:11 BenjaminBossan

Since there is no more reply, I consider this solved. Feel free to re-open if there are still issues.

Mar 09 '23 11:03 BenjaminBossan
