sdk TypeError: conv2d() received an invalid combination of arguments

Open kadirnar opened this issue 1 year ago • 15 comments

Hi,

I want to train a model using the Torchvision library and track the training results with the layer library. My training code worked in Colab (Marul_Notebook).

I get an error when I add the layer library.

import torch
from torch import nn
from torch import optim
from torchvision import transforms,models
from collections import OrderedDict
from layer.decorators import model
import layer
import torchvision

layer.login()
layer.init("marul-classification")


train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406],
                                                            [0.229, 0.224, 0.225])
])


train_data = torchvision.datasets.ImageFolder(root="./train/",transform=train_transforms)
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle = True) # try num_workers=2 later
dataiter = iter(train_data_loader)
images,labels = dataiter.next()

model = models.densenet121(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 512)),
    ('relu', nn.LeakyReLU()),
    ('fc2', nn.Linear(512, 3)),
    ('output', nn.LogSoftmax(dim=1))
]))

model.classifier = classifier

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

@model("my_first_model")
def train_model(model, optimizer, n_epochs, criterion):
    import time
    start_time = time.time()
    for epoch in range(1, n_epochs+1):
        epoch_time = time.time()
        epoch_loss = 0
        correct = 0
        total=0
        print("Epoch {} / {}".format(epoch, n_epochs))
        model.train()
        
        for inputs, labels in train_data_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad() # zeroed grads
            outputs = model(inputs) # forward pass
            loss = criterion(outputs, labels) # softmax + cross entropy
            loss.backward() # back pass
            optimizer.step() # updated params
            epoch_loss += loss.item() # train loss
            _, pred = torch.max(outputs, dim=1)
            correct += (pred.cpu() == labels.cpu()).sum().item()
            total += labels.shape[0]
        acc = correct / total
        
        model.eval()
        a=0
        pred_val=0
        correct_val=0
        total_val=0
        with torch.no_grad():
            for inp_val, lab_val in train_data_loader:
                inp_val = inp_val.to(device)
                lab_val = lab_val.to(device)
                out_val = model(inp_val)
                loss_val = criterion(out_val, lab_val)
                a += loss_val.item()
                _, pred_val = torch.max(out_val, dim=1)
                correct_val += (pred_val.cpu()==lab_val.cpu()).sum().item()
                total_val += lab_val.shape[0]
            acc_val = correct_val / total_val
        epoch_time2 = time.time()
        print("Duration: {:.0f}s, Train Loss: {:.4f}, Train Acc: {:.4f}, Val Loss: {:.4f}, Val Acc: {:.4f}"
              .format(epoch_time2-epoch_time, epoch_loss/len(labels), acc, a/len(lab_val), acc_val))
    end_time = time.time()
    print("Total Time:{:.0f}s".format(end_time-start_time))

layer.run([train_model(model, optimizer,50, criterion)])

Error Message:

TypeError: conv2d() received an invalid combination of arguments - got (str, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (str, Parameter, NoneType, tuple, tuple, tuple, int)
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (str, Parameter, NoneType, tuple, tuple, tuple, int)

Why am I getting this error? Could you help?

Jul 08 '22 08:07 kadirnar

Hi @kadirnar, I don't think the layer SDK supports this usage. layer.run expects a function, but what you are passing is a call (invocation) of a function. That can work only if your function, train_model, itself returns a callable. Can you try refactoring your function so that it either returns a callable or does not take explicit parameters?
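
To illustrate the difference between passing a function and passing the result of calling it, here is a minimal sketch. The `run` helper below is a hypothetical stand-in and not the layer SDK's `layer.run`; it only assumes, as described above, that the runner expects callables.

```python
from typing import Callable, List


def run(functions: List[Callable[[], object]]) -> List[object]:
    """Hypothetical stand-in for a runner that expects functions, not results."""
    results = []
    for fn in functions:
        if not callable(fn):
            raise TypeError(f"expected a function, got {type(fn).__name__}")
        results.append(fn())  # the runner decides when (and where) to call it
    return results


def train_model() -> str:
    return "trained"


# Passing the function object: the runner invokes it.
ok = run([train_model])

# Passing train_model() instead hands the runner a str, which is not callable.
try:
    run([train_model()])
    failed = False
except TypeError:
    failed = True
```

In this sketch, `run([train_model()])` fails because the function has already executed at the call site and only its return value reaches the runner.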

Jul 08 '22 11:07 yuranos

Maybe you can re-write it to something like:

@model("my_first_model")
def train_model():
    model = ...  # init model
    optimizer = ...  # init optimizer
    n_epochs = 50
    criterion = ...  # init criterion

    def train_inner(model, optimizer, n_epochs, criterion):
        ...  # actual training
        return model  # return the trained model

    return train_inner(model, optimizer, n_epochs, criterion)

layer.run([train_model])

Jul 08 '22 12:07 yuranos

Hi @yuranos ,

Thank you for solving the problem. What should I do to use my GPU? Is GPU support publicly available?

Jul 08 '22 14:07 kadirnar

When I add this code to the train function, I get a cuda error.

fabric("f-gpu-small")
@model("my_first_model")
def train_model():
...
layer.run([train_model()])

Error Message:

  File "/home/kadir/miniconda3/envs/layer/lib/python3.8/site-packages/torch/cuda/__init__.py", line 211, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
⠙  my_first_model       ━━━━━━╸━━━ TRAINING [0:00:20]

My code works on the CPU, but I couldn't get it to run on the GPU. Can you help me?

Jul 08 '22 16:07 kadirnar

Hey @kadirnar, instead of:

layer.run([train_model()])

can you try:

layer.run([train_model])

When you call train_model(), the function executes locally, and that seems to be where your code is failing, since the traceback refers to /home/kadir/.
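
To confirm that the failure comes from the local environment, one option is to inspect the locally installed PyTorch build. This is a minimal sketch, assuming only that PyTorch may or may not be installed; `describe_local_torch` is a hypothetical helper name.

```python
import importlib.util


def describe_local_torch() -> str:
    """Summarise whether the locally installed PyTorch can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed locally"
    import torch

    # torch.version.cuda is None for CPU-only builds.
    cuda_build = torch.version.cuda
    available = torch.cuda.is_available()
    return (
        f"torch {torch.__version__}, built for CUDA: {cuda_build}, "
        f"CUDA available: {available}"
    )


print(describe_local_torch())
```

A CPU-only build reporting `built for CUDA: None` would be consistent with the "Torch not compiled with CUDA enabled" assertion being raised on the local machine.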

Jul 11 '22 08:07 volkangurel

layer.init("marul-classification",pip_packages=['torchvision','torch','QuantStub'])

Error Message:

09:12:10 my_first_model: ModuleNotFoundError: No module named 'torchvision'

Jul 11 '22 09:07 kadirnar

Hmm, not sure why that's not working, but instead of using pip_packages on layer.init can you please try to put it on @model? Like so:

@fabric("f-gpu-small")
@model("my_first_model", pip_packages=['torchvision','torch','QuantStub'])
def train_model():
...
layer.run([train_model])

Jul 11 '22 15:07 volkangurel

Error Message:

@model("my_first_model", pip_packages=['torchvision','torch','QuantStub'])
TypeError: model() got an unexpected keyword argument 'pip_packages'

Solution:

layer.init("marul-classification",pip_packages=['torchvision'])

I want to use a dataset that is stored locally.

19:37:40 my_first_model: FileNotFoundError: [Errno 2] No such file or directory: 'train/'

Main:
- train (folder)
- train.py

Jul 11 '22 19:07 kadirnar

Cool, progress! Can you try adding a @resources decorator? This is documented here: https://docs.app.layer.ai/docs/sdk-library/resources-decorator.

In your case, I think this should work:

@fabric("f-gpu-small")
@model("my_first_model")
@resources("train/")
def train_model():
...
layer.run([train_model])

Jul 12 '22 09:07 volkangurel

Thank you, I fixed the error. But I am getting a new error.

⠧  my_first_model       ━╸━━━━━━━━ UPLOADING [94/757 files, 37 MB/234 MB, 1.8 MB/s, 0:01:47] 
....
.....
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

Can you create documentation for error messages? @mwitiderrick @volkangurel

Jul 13 '22 11:07 kadirnar

@kadirnar are you still experiencing this error? @mjbcopland can you take a look?

Jul 16 '22 07:07 mwitiderrick

Hi @kadirnar, does it reliably fail/disconnect in the same place when uploading? Does it work with a smaller subset of the training resources rather than all 757 files?
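
One way to try a smaller subset is to copy a few files per class into a separate folder and point the resources at that folder instead. This is a sketch under assumptions: `copy_subset` and the `train_small` folder name are hypothetical, and the layout is assumed to be one subfolder per class, as ImageFolder expects.

```python
import shutil
from pathlib import Path


def copy_subset(src: Path, dst: Path, per_class: int) -> int:
    """Copy up to `per_class` files from each class folder of `src` into `dst`."""
    copied = 0
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        target = dst / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        for f in files[:per_class]:
            shutil.copy2(f, target / f.name)
            copied += 1
    return copied
```

For example, `copy_subset(Path("train"), Path("train_small"), per_class=10)` would create a reduced copy that keeps the class folder structure, which makes it easier to see whether the disconnect depends on the upload size.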

Jul 18 '22 13:07 mjbcopland

I no longer use Linux. Does the layer library work on Windows? https://github.com/layerai/sdk/issues/97#issuecomment-1160262151

Jul 19 '22 15:07 kadirnar

from layer.decorators import model
...
model = models.densenet121(pretrained=True)
...
model.classifier = classifier
...
model.to(device)
...

@model("my_first_model")
...
layer.run([train_model(model, optimizer,50, criterion)])

@kadirnar there is a variable naming issue with model. You are mixing up model from layer.decorators with an nn.Module instance of the same name. That's the reason why you are getting these errors:

09:12:10 my_first_model: ModuleNotFoundError: No module named 'torchvision'

TypeError: model() got an unexpected keyword argument 'pip_packages'
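
The naming clash can be reproduced without either library. In this minimal sketch, `model` and `FakeNetwork` are hypothetical stand-ins for the layer decorator and an nn.Module, not the real APIs. The failure it shows also appears consistent with the original conv2d() error, where the first argument received was a str.

```python
def model(name):
    """Hypothetical stand-in for a decorator factory such as layer.decorators.model."""
    def decorator(fn):
        fn.model_name = name
        return fn
    return decorator


class FakeNetwork:
    """Stand-in for an nn.Module: calling it runs a forward pass on its input."""
    def __call__(self, inputs):
        if isinstance(inputs, str):
            raise TypeError(f"forward() got a str: {inputs!r}")
        return inputs


layer_model = model    # keep an alias to the decorator before the name is reused
model = FakeNetwork()  # rebinding `model` shadows the decorator

try:
    @model("my_first_model")  # now calls FakeNetwork()("my_first_model")
    def train_broken():
        return "trained"
    shadowed_error = None
except TypeError as exc:
    shadowed_error = str(exc)


@layer_model("my_first_model")  # the alias still refers to the decorator
def train_ok():
    return "trained"
```

Importing the decorator under a different name, or naming the network something other than `model`, avoids the clash.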

Jul 24 '22 19:07 fcakyon

Currently I only have Windows installed. I will reinstall Ubuntu to try it, but I don't know how to solve the error. At the moment I am not getting any errors about package installation. I am getting a connection-related error while uploading data.

Note: I no longer have the code file because of the OS change, so I am sending an image of the last saved version. The latest version of the code:

code

Jul 24 '22 20:07 kadirnar
