skorch
NeuralNetClassifier with DataLoader
I have created DataLoaders for training and validation, but how can I use them with NeuralNetClassifier? NeuralNetClassifier.fit() needs a Dataset as parameter.
The DataLoader should take the dataset as the first argument when being initialized. If you follow that design, you can pass your dataloader class (not instance) to a skorch net with the iterator_train and iterator_valid arguments (see docs). E.g.:
class MyDataLoader:
    def __init__(self, dataset, ...):
        self.dataset = dataset
        ...

net = NeuralNet(..., iterator_train=MyDataLoader, iterator_valid=MyDataLoader)
net.fit(X, y)
Note how the net is initialized with the classes, this is important. If you need to pass more init arguments, use the double underscore notation:
net = NeuralNet(..., iterator_train=MyDataLoader, iterator_train__foo=bar)
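The double-underscore routing can be pictured as a prefix filter over the net's keyword arguments. The sketch below is a simplified stand-in, not skorch's actual implementation; `collect_prefixed` and the parameter values are hypothetical and chosen only for illustration.

```python
def collect_prefixed(prefix, params):
    """Return the params that start with '<prefix>__', with the prefix removed.

    Hypothetical helper that imitates the idea behind the double-underscore
    notation; it is not part of skorch.
    """
    marker = prefix + "__"
    return {
        key[len(marker):]: value
        for key, value in params.items()
        if key.startswith(marker)
    }


# Keyword arguments as they might be given to the net (illustrative values).
net_params = {
    "iterator_train__collate_fn": len,
    "iterator_train__num_workers": 0,
    "iterator_valid__num_workers": 2,
    "batch_size": 512,
}

# Only the arguments meant for the respective iterator are selected.
train_kwargs = collect_prefixed("iterator_train", net_params)
valid_kwargs = collect_prefixed("iterator_valid", net_params)
print(train_kwargs)
print(valid_kwargs)
```

Arguments without a matching prefix, such as `batch_size` here, are left untouched by the filter.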
This is my DataLoader:
train_loader = DataLoader(train_dataset, batch_size=512, collate_fn=vectorize_batch)
test_loader = DataLoader(test_dataset, batch_size=512, collate_fn=vectorize_batch)
These loaders are passed to the NeuralNetClassifier.
neuralNetModel.fit(train_dataset, y=y_train)
I got this error: TypeError: 'DataLoader' object is not callable
The error occurs because you pass an initialized DataLoader instance. skorch calls the iterator itself to create the loader, so it needs the class. Change your code like this:
net = NeuralNetClassifier(
    ...,
    iterator_train=DataLoader,  # <= uninitialized DataLoader
    iterator_train__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    iterator_valid=DataLoader,  # <= uninitialized DataLoader
    iterator_valid__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    batch_size=512,
)
net.fit(train_dataset, y=y_train)
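To see why an instance triggers "object is not callable", here is a minimal pure-Python sketch. `TinyLoader` and `build_iterator` are hypothetical stand-ins for a loader class and for the framework code that instantiates it; they only imitate the calling pattern and are not skorch or PyTorch code.

```python
class TinyLoader:
    """Hypothetical loader: takes the dataset first, then yields batches."""

    def __init__(self, dataset, batch_size=2):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[start:start + self.batch_size]


def build_iterator(iterator, dataset, **kwargs):
    # The framework calls whatever it was given, so it must be a class
    # (or another callable), not an already constructed object.
    return iterator(dataset, **kwargs)


data = [1, 2, 3, 4, 5]

# Passing the class works: the framework constructs the loader itself.
batches = list(build_iterator(TinyLoader, data, batch_size=2))

# Passing an instance fails, because an instance cannot be called.
try:
    build_iterator(TinyLoader(data), data, batch_size=2)
except TypeError as exc:
    error_message = str(exc)

print(batches)
print(error_message)
```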
Still no luck. I now get IndexError: invalid index to scalar variable when calling fit. The original code worked before I migrated to skorch.
from torch.utils.data import Dataset, DataLoader, random_split
from torchtext import datasets, transforms, models

class AESDataset(Dataset):
    def __init__(self, dataframe, transform):
        self.dataframe = dataframe
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)  # 3999

    def __getitem__(self, idx):
        essay = self.dataframe.iloc[idx]['essay']
        final_score = self.dataframe.iloc[idx]['final_score'] + 1
        sample = (final_score, essay)
        return sample

from torchtext import transforms

transform = transforms.ToTensor()  # not used yet
train_dataset = AESDataset(train_data, transform)
test_dataset = AESDataset(test_data, transform)
from skorch import NeuralNetClassifier
from torch import optim
from skorch.callbacks import EpochScoring, BatchScoring, EarlyStopping

early_stop = EarlyStopping(monitor="valid_loss", patience=3, threshold_mode="rel")
f1 = EpochScoring(scoring='f1_weighted', lower_is_better=False, name="F1-score")

max_epoch_val = 2
torch.manual_seed(360)
neuralNetModel = NeuralNetClassifier(
    neuralNetModule,
    max_epochs=max_epoch_val,  # ori 10
    iterator_train__num_workers=0,  # higher value may cause broken pipe
    iterator_valid__num_workers=0,  # higher value may cause broken pipe
    lr=1e-3,
    iterator_train=DataLoader,  # <= uninitialized DataLoader
    iterator_train__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    iterator_valid=DataLoader,  # <= uninitialized DataLoader
    iterator_valid__collate_fn=vectorize_batch,  # <= extra arguments for DataLoader
    batch_size=512,
    optimizer=optim.Adam,
    #callbacks =[f1, early_stop],
    criterion=nn.CrossEntropyLoss,
    device=device
)

y_train = [y for y, x in iter(train_dataset)]
neuralNetModel.fit(train_dataset, y=y_train)
I got this error when fitting:
File /usr/local/lib/python3.9/dist-packages/skorch/utils.py:352, in multi_indexing(data, i, indexing)
    349     return indexing(data, i)
    351 # If we don't know how to index, find out and apply
--> 352 return check_indexing(data)(data, i)

File /usr/local/lib/python3.9/dist-packages/skorch/utils.py:244, in _indexing_other(data, i)
    241 def _indexing_other(data, i):
    242     # sklearn's safe_indexing doesn't work with tuples since 0.22
    243     if isinstance(i, (int, np.integer, slice, tuple)):
--> 244         return data[i]
    245     return safe_indexing(data, i)

IndexError: invalid index to scalar variable.
Tip: For your code to show properly, please use triple backticks, like so:
```
x = 2 + 2
y = x * 3
```
becomes
x = 2 + 2
y = x * 3
Regarding your problem, I wonder if we can simplify this whole thing. Looking at these lines:
essay = self.dataframe.iloc[idx]['essay']
final_score = self.dataframe.iloc[idx]['final_score'] + 1
It looks to me like you're using a pandas DataFrame, where two columns are relevant, "essay" and "final_score". I assume that "essay" is a column of strings and "final_score" are integers. Could you try something like this:
X = df["essay"].values
y = df["final_score"].values

net = NeuralNetClassifier(
    neuralNetModule,
    max_epochs=max_epoch_val,
    lr=1e-3,
    batch_size=512,
    optimizer=optim.Adam,
    #callbacks =[f1, early_stop],
    criterion=nn.CrossEntropyLoss,
    device=device,
)
net.fit(X, y)
This should be much simpler than the way you posted above. I'm not quite sure how exactly your neuralNetModule deals with the "essay" data, but I assume you have figured something out.
I need to vectorize the essays (with GloVe) before training. The vectorization is done in the train_loader, which is supposed to be more efficient. Is there any way to achieve that?
Thanks Benjamin for your help.
I see, this is what the vectorize_batch does, right?
Would it be possible for you to call this function on your input data once before training the net and then pass the GloVe vectors to skorch? It would look something like this:
X_essay = df["essay"].values
X_vector = vectorize_batch(X_essay)
...
net.fit(X_vector, y)
Possibly, your existing vectorize_batch needs to be rewritten a bit to make it work, but that should be the better solution. It is better because you vectorize the data only once, before training starts. In your initial solution, you would vectorize your data once per epoch, which would slow down training considerably.
Since Glove vectorization should be deterministic (same result each time you call with the same text), there is no need to call it once per epoch (it's different if there are random augmentations, as is common for image data).
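The difference between the two strategies can be made concrete by counting vectorizer calls. This is a sketch under loud assumptions: `toy_vectorize` is a deterministic toy stand-in and NOT GloVe, and `CallCounter`, the essays, and the epoch count are hypothetical.

```python
def toy_vectorize(text, dim=4):
    """Deterministic toy stand-in for an embedding lookup (NOT GloVe)."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # Bucket each token by the sum of its character codes.
        vector[sum(map(ord, token)) % dim] += 1.0
    return vector


class CallCounter:
    """Wraps the vectorizer and counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return toy_vectorize(text)


essays = ["a short essay", "another essay about nets", "the third one"]
n_epochs = 5

# Strategy 1: vectorize inside the loading step, i.e. again in every epoch.
per_epoch = CallCounter()
for _ in range(n_epochs):
    batch = [per_epoch(essay) for essay in essays]

# Strategy 2: vectorize once up front and reuse the result in every epoch.
upfront = CallCounter()
X_vector = [upfront(essay) for essay in essays]
for _ in range(n_epochs):
    batch = X_vector

print(per_epoch.calls, upfront.calls)
```

Because the vectorizer returns the same result for the same text, both strategies feed identical inputs to the model; only the amount of repeated work differs.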
So, following your suggestion, I should vectorize each essay (instead of each batch of 64) and put the results into X_vector, or put them into the training data?
Just put all your vectors into X_vector. There should be no need to do anything with Dataset or DataLoader.
One potential issue with my suggestion is that it might require a lot of memory, depending on your data. Let me know if you run into memory issues.
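A rough back-of-the-envelope estimate can show whether precomputed vectors are likely to fit in memory. The sketch assumes dense float32 vectors (4 bytes per value); 3999 is the dataset length noted in the AESDataset earlier in the thread, while the sequence length and embedding dimension are assumptions picked for illustration.

```python
def estimate_bytes(n_samples, tokens_per_sample, embedding_dim, bytes_per_value=4):
    """Size of a dense (n_samples, tokens_per_sample, embedding_dim) array."""
    return n_samples * tokens_per_sample * embedding_dim * bytes_per_value


n_essays = 3999          # dataset length mentioned in the thread
tokens_per_essay = 500   # assumption: essays padded or truncated to 500 tokens
embedding_dim = 300      # assumption: 300-dimensional word vectors

total_bytes = estimate_bytes(n_essays, tokens_per_essay, embedding_dim)
print(f"{total_bytes / 1024**3:.2f} GiB")
```

If each essay is reduced to a single averaged vector instead of one vector per token, set `tokens_per_sample` to 1 and the estimate shrinks accordingly.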
@kamalabdul Any updates?
I have tried, but there was a problem. I created my project in a Paperspace Gradient Notebook. Is it alright if I invite you by email to see the code in the Gradient Notebook?
> Is it alright if I invite you to see the code through email to Gradient Notebook?
Can this code be made public? Then I can take a look. If that's not possible (e.g. because it contains proprietary code/data from work), then it's better not to share it.
@kamalabdul any updates?
Hi Benjamin. Sorry for the late response. I have a customized dataset which is similar in structure to the AG_NEWS dataset. I attach a Jupyter file, PyTorch_AG_NEWS.ipynb. I would like guidance on how to convert a plain PyTorch implementation into a skorch implementation. Pytorch_AGNEWS.zip
This is related to my queries earlier.
Thanks for providing the notebook, this gave me a much better idea of what you are trying to achieve.
Please take a look at the modified notebook that I attached. I hope that this provides a solution for you.
In your notebook, you use CountVectorizer. I switched to TfidfVectorizer, but I think it doesn't matter. What you tried with passing the vectorizer to the collate function is too complicated, so I simplified this step by putting everything inside an sklearn Pipeline. I hope you agree that this is a cleaner approach.
Btw, I think if you want to go with a bag of words approach such as this, you can probably stick to pure sklearn, no need for skorch. Just replace the skorch NeuralNetClassifier in the example above with something like LogisticRegression (which works well with sparse data) or MLPClassifier from sklearn and you should get pretty good results :)
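As a rough illustration of the pure sklearn route, the following sketch chains TfidfVectorizer and LogisticRegression in a Pipeline. It is not the code from the attached notebook; the toy texts, labels, and step names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for news-style text data.
texts = [
    "stocks fall as markets react to rate hike",
    "company shares rise after strong earnings",
    "team wins championship after dramatic final",
    "striker scores twice in league match",
]
labels = ["business", "business", "sports", "sports"]

# The vectorizer is fitted once inside the pipeline; its sparse output is
# passed straight to the classifier, so no custom collate function is needed.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)

predictions = pipe.predict([
    "markets rally as shares recover",
    "team loses the final match",
])
print(predictions)
```

Swapping the last step for another estimator leaves the rest of the pipeline unchanged, which is the main appeal of this design.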
However, if you really want to tap into the power of neural nets, I added a section on how you can use skorch and Hugging Face transformers to train a BERT model on the same data. This is potentially a much more powerful model, but it also takes much longer to train. Try and see which approach works better for you.
Thanks Benjamin for putting things together. I will study the script and adapt from there.
@kamalabdul any updates?
Since there is no more reply, I consider this solved. Feel free to re-open if there are still issues.