
Serialize TensorDataset

Open conceptofmind opened this issue 3 years ago • 10 comments

Describe the issue: I am getting an error when trying to run NNI with a TensorDataset and a parquet file. Is there a way to serialize this easily and properly?

Environment:

  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Ubuntu
  • Python version: 3.9
  • Is conda/virtualenv/venv used?: Yes
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

TypeError: <torch.utils.data.dataset.TensorDataset object at 0x7fb348f84d60> of type <class 'torch.utils.data.dataset.TensorDataset'> is not supported to be traced. File an issue at https://github.com/microsoft/nni/issues if you believe this is a mistake.

PayloadTooLarge: Pickle too large when trying to dump <torch.utils.data.dataset.Subset object at 0x7fdc80f7dd30>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.

ValueError: Serialization failed when trying to dump the model because payload too large (larger than 64 KB). This is usually caused by pickling large objects (like datasets) by mistake. See the full error traceback for details and https://nni.readthedocs.io/en/stable/NAS/Serialization.html for how to resolve such issue.

How to reproduce it?: Parquet file can be found here: https://www.kaggle.com/pythonash/end-to-end-simple-and-powerful-dnn-with-leakyrelu/data?select=train.parquet

@nni.trace
def preprocess_dataframe(parquet_file="../train.parquet"):
  df = pd.read_parquet(parquet_file)
  df_x = df.astype('float16').drop(['time_id'], axis=1)
  df_y = pd.DataFrame(df['target'])
  df_x = df_x.astype('float16').drop(['target'], axis=1)
  scaler = preprocessing.StandardScaler()
  df_x[[f'f_{i}' for i in range(300)]] = scaler.fit_transform(df_x[[f'f_{i}' for i in range(300)]])
  df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))
  return df_x, df_y

data_df, target = preprocess_dataframe("/train_low_mem.parquet")
inputs, targets = data_df.values, target.values
train, val = train_test_split(inputs, test_size=0.2)

dataset = serialize(TensorDataset(torch.tensor(inputs, dtype=torch.float32),
                       torch.tensor(targets, dtype=torch.float32)))

train_ds, val_ds = random_split(dataset, [train.shape[0], val.shape[0]])

conceptofmind avatar Feb 23 '22 02:02 conceptofmind

Could you try this:

@nni.trace
def preprocess_dataframe(parquet_file="../train.parquet"):
  df = pd.read_parquet(parquet_file)
  df_x = df.astype('float16').drop(['time_id'], axis=1)
  df_y = pd.DataFrame(df['target'])
  df_x = df_x.astype('float16').drop(['target'], axis=1)
  scaler = preprocessing.StandardScaler()
  df_x[[f'f_{i}' for i in range(300)]] = scaler.fit_transform(df_x[[f'f_{i}' for i in range(300)]])
  df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))

  data_df, target = df_x, df_y
  # data_df, target = preprocess_dataframe("/train_low_mem.parquet")
  inputs, targets = data_df.values, target.values
  train, val = train_test_split(inputs, test_size=0.2)

  dataset = TensorDataset(torch.tensor(inputs, dtype=torch.float32),
                       torch.tensor(targets, dtype=torch.float32))
  train_ds, val_ds = random_split(dataset, [train.shape[0], val.shape[0]])
  return train_ds, val_ds

Could you elaborate on your usage of serialize -- why are you using serialization?

ultmaster avatar Feb 23 '22 03:02 ultmaster

Hello!

Thank you for the response.

I was attempting to use serialize on the TensorDataset in order to fix one of the errors thrown. It is possible that serialization is not the real issue, since the nni.trace decorator is already added.

I made the changes you suggested but still receive these errors:

PayloadTooLarge: Pickle too large when trying to dump <torch.utils.data.dataset.Subset object at 0x7f64350625e0>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.

ValueError: Serialization failed when trying to dump the model because payload too large (larger than 64 KB). This is usually caused by pickling large objects (like datasets) by mistake. See the full error traceback for details and https://nni.readthedocs.io/en/stable/NAS/Serialization.html for how to resolve such issue.

Forgive my ignorance or incompetence as this is my first time attempting to use NNI.

Thank you again.

@nni.trace
def preprocess_dataframe(parquet_file="../train.parquet"):
  df = pd.read_parquet(parquet_file)
  df_x = df.astype('float16').drop(['time_id'], axis=1)
  df_y = pd.DataFrame(df['target'])
  df_x = df_x.astype('float16').drop(['target'], axis=1)
  scaler = preprocessing.StandardScaler()
  df_x[[f'f_{i}' for i in range(300)]] = scaler.fit_transform(df_x[[f'f_{i}' for i in range(300)]])
  df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))

  data_df, target = df_x, df_y
  inputs, targets = data_df.values, target.values
  train, val = train_test_split(inputs, test_size=0.2)

  dataset = TensorDataset(torch.tensor(inputs, dtype=torch.float32),
                       torch.tensor(targets, dtype=torch.float32))
  train_ds, val_ds = random_split(dataset, [train.shape[0], val.shape[0]])
  return train_ds, val_ds

train_ds, val_ds = preprocess_dataframe("/train_low_mem.parquet")

conceptofmind avatar Feb 23 '22 03:02 conceptofmind

As the error message suggests, you should probably write:

nni.trace(Subset)(blabla)

Since I don't see any usage of Subset in your code, and I'm still unsure why you are using serialization, that's all the help I can offer for now. There is likely a better way to satisfy your need without serialization.
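For reference, the wrapped constructor would look roughly like this (a minimal sketch only; `dataset` and `train_indices` are placeholders, and the arguments passed to a traced constructor still need to be small or themselves traced objects):

import nni
from torch.utils.data import Subset

# Sketch: build the Subset through nni.trace so that only its init
# arguments are recorded instead of pickling the whole Subset object.
# `dataset` and `train_indices` are placeholders for your own objects.
train_subset = nni.trace(Subset)(dataset, train_indices)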

ultmaster avatar Feb 24 '22 02:02 ultmaster

Hi @ultmaster,

I do not believe that I need serialization after adding nni.trace to a few different functions and classes while testing today. The errors seem to be thrown due to the size of the parquet file (~4 GB) itself. When attempting to load train_ds and val_ds into:

trainer = pl.Regression(train_dataloader=pl.DataLoader(trainds, batch_size=1024),
                            val_dataloaders=pl.DataLoader(valds, batch_size=1024*2),
                            max_epochs=30,
                            gpus=1)

And then run:

exp = RetiariiExperiment(model_space, trainer, [], simple_strategy)
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'QuantDNN_1'
exp_config.trial_concurrency = 1
exp_config.max_trial_number = 500
exp_config.trial_gpu_number = 1
exp_config.execution_engine = 'base'
exp_config.training_service.use_active_gpu = True
exp.run(exp_config, 8745)

The errors I listed above are thrown:

PayloadTooLarge: Pickle too large when trying to dump <torch.utils.data.dataset.Subset object at 0x7f64350625e0>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.

ValueError: Serialization failed when trying to dump the model because payload too large (larger than 64 KB). This is usually caused by pickling large objects (like datasets) by mistake. See the full error traceback for details and https://nni.readthedocs.io/en/stable/NAS/Serialization.html for how to resolve such issue.

If I create a PyTorch Dataset to load the data and output batches instead:

X_train, X_test, y_train, y_test = train_test_split(df, df_y, test_size=0.20, random_state=1)

@nni.trace
class RegressionDataset(Dataset):
    def __init__(self, df, df_y):
        self.df = df
        self.conts = torch.tensor(np.array([i for i in self.df.values]), dtype=torch.float32)
        self.df_y = torch.tensor(df_y.values, dtype=torch.float32)
        
    def __len__(self): return len(self.df_y)

    def __getitem__(self, idx):
        return [self.conts[idx], self.df_y[idx]]

trainds = RegressionDataset(X_train, y_train)
valds = RegressionDataset(X_test, y_test)

Then exp.run(exp_config, 8745) will run for a brief period and then crash.

I am still working through everything. I could provide the full code or notebook if needed.

Thank you again,

Eric

conceptofmind avatar Feb 24 '22 03:02 conceptofmind

I see your use scenario.

Maybe you should do this instead:

@nni.trace
def get_train_val_dataset(split):
    if split == 'train':
        ...
        return train_dataset
    elif split == 'val':
        ...
        return valid_dataset

You can write arbitrary code in get_train_val_dataset, including using TensorDataset and Subset.

Whatever @nni.trace wraps is treated as a black box, and NNI doesn't attempt to open it. So please feed the output of the function directly to pl.DataLoader and pl.Regression.

If you try to add something else in between, things are likely to go wrong.
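As a concrete sketch of that pattern, assembled from the snippets earlier in this thread (not tested; the 80/20 split computed for random_split, the combined drop of time_id/target, and the argument names of pl.Regression mirror the snippets above and are assumptions):

import nni
import pandas as pd
import torch
from sklearn import preprocessing
from torch.utils.data import TensorDataset, random_split
import nni.retiarii.evaluator.pytorch.lightning as pl

@nni.trace
def get_train_val_dataset(split, parquet_file="/train_low_mem.parquet"):
    # All heavy preprocessing lives inside the traced function, so only
    # the small arguments (split, parquet_file) are serialized.
    df = pd.read_parquet(parquet_file)
    df_y = pd.DataFrame(df['target'])
    df_x = df.astype('float16').drop(['time_id', 'target'], axis=1)
    scaler = preprocessing.StandardScaler()
    cols = [f'f_{i}' for i in range(300)]
    df_x[cols] = scaler.fit_transform(df_x[cols])
    df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))

    dataset = TensorDataset(torch.tensor(df_x.values, dtype=torch.float32),
                            torch.tensor(df_y.values, dtype=torch.float32))
    n_train = int(len(dataset) * 0.8)
    train_ds, val_ds = random_split(dataset, [n_train, len(dataset) - n_train])
    return train_ds if split == 'train' else val_ds

# Feed the traced function's output directly to DataLoader/Regression.
trainer = pl.Regression(train_dataloader=pl.DataLoader(get_train_val_dataset('train'), batch_size=1024),
                        val_dataloaders=pl.DataLoader(get_train_val_dataset('val'), batch_size=1024 * 2),
                        max_epochs=30,
                        gpus=1)

Note that this sketch reads the parquet file twice (once per split); caching the dataframe or using a DataModule could avoid that.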

ultmaster avatar Feb 24 '22 03:02 ultmaster

Hello @ultmaster ,

Thank you for the input!

I will make the recommended changes as you had listed as I continue to work on this.

After implementing everything, hopefully correctly this time, I will report back if errors still persist or close if resolved.

Thank you again,

Eric

conceptofmind avatar Feb 24 '22 03:02 conceptofmind

Hi Eric, ping to check the results, hope things worked out well.

scarlett2018 avatar Mar 18 '22 06:03 scarlett2018

@scarlett2018

Hi Scarlett,

I have made some progress but have yet to fully resolve the issue. I am still having a recurring problem with running out of memory during the training and search phases of neural architecture search. A few trials will complete before an OOM error crashes the program.

I will hopefully come up with a viable solution sooner rather than later. I will update this thread when I am able to resolve it or have further questions.

I appreciate you checking in.

Thank you,

Eric

conceptofmind avatar Mar 18 '22 14:03 conceptofmind

Hi, is there any update on this?

tanmay2798 avatar Jun 23 '22 20:06 tanmay2798

What kind of updates?

ultmaster avatar Jun 25 '22 00:06 ultmaster

Closing as there have been no updates for a long time.

If anyone is experiencing a new problem with the serializer, please see the solution in this issue, or open a new issue.

ultmaster avatar Sep 09 '22 03:09 ultmaster

nni.common.serializer.PayloadTooLarge: Pickle too large when trying to dump <nni.nas.nn.pytorch.mutator.ParameterChoiceMutator object at 0x7fa9b9e48be0>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit

I had the same problem

qw1319 avatar Nov 07 '22 07:11 qw1319

@qw1319 I will take a look at your suggestion as I was unable to resolve this issue previously.

conceptofmind avatar Nov 09 '22 20:11 conceptofmind

I would like to re-open this issue. I have very large TSV data, and there is no way to run NNI unless I increase the pickle size limit; otherwise NAS crashes. However, when I view results in the NNI browser app after some time, it takes very long to show anything. I suspect this is because pickled objects are printed to the logs; at least that's what I noticed: an unreasonably large log file with some non-ASCII data printed in it.

-rw-r----- 1 mateusz mateusz 1.6G Dec 11 08:21 nnimanager.log

Also, when I try to view the web app, it gets stuck for a very long time, and both the experiment and the app crash, so there is no way to view the metric curves.

This looks to me like a bug related to serialization.

mashu avatar Dec 11 '22 07:12 mashu

@mashu You shouldn't pickle the large TSV data in any case.

You should put the logic of reading data into a function and pickle that function instead, or use nni.trace to trace the initialization of an object so that only the init arguments are serialized instead of the whole object. Please avoid pickling the data directly. The pickle size limit exists for a reason.
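For example, the two options might look roughly like this (a minimal sketch with a placeholder dataset class and file path, not tied to any particular project):

import nni
from torch.utils.data import Dataset

class MyTSVDataset(Dataset):
    """Placeholder dataset whose init argument (a small path) is cheap to serialize."""
    def __init__(self, path):
        self.path = path  # the actual (large) data would be read here

    def __len__(self):
        return 0

    def __getitem__(self, idx):
        raise IndexError

# Option 1: wrap the constructing function; only the function itself
# (not the data it produces) needs to be serialized.
@nni.trace
def get_dataset():
    return MyTSVDataset("data/train.tsv")

# Option 2: trace the class init; only the small init argument
# ("data/train.tsv") is serialized instead of the loaded data.
dataset = nni.trace(MyTSVDataset)("data/train.tsv")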

matluster avatar Dec 12 '22 01:12 matluster

My use case is that the model needs to be fast. I define a PyTorch Dataset class whose constructor takes a pandas DataFrame that is already loaded into memory and assigns it to self.data, so that other methods like __getitem__ can access elements by index. This is standard, since TSV files are not random access and I don't want to suffer from disk IO latency. So I do have a Dataset class that must hold the big data after the object is initialized. It looks like this:

data = pd.read_csv("data/ERR_complete.tsv", sep='\t')
training_data = data.sample(frac=0.90, random_state=25)
testing_data = data.drop(training_data.index)
train_dataset = MiAIRRDataset(training_data, "data/M.fasta")
test_dataset = MiAIRRDataset(testing_data, "data/M.fasta")
train_dataloader = nni.trace(DataLoader)(train_dataset, batch_size=64, collate_fn=PadSequence())
test_dataloader = nni.trace(DataLoader)(test_dataset, batch_size=64, collate_fn=PadSequence())

This setup allows me to train the model with a Lightning module and DARTS; when I use that, the web GUI is not started, so there is no issue there. But I would still like to know how to use the web GUI properly so that it doesn't crash.

mashu avatar Dec 12 '22 06:12 mashu

It might look ugly, but my point here is that train_dataset and test_dataset shouldn't be pickled directly. Instead, their construction process should be pickled. You might be shooting for something like this:

@nni.trace
def get_train_dataset():
    data = pd.read_csv("data/ERR_complete.tsv", sep='\t')
    training_data = data.sample(frac=0.90, random_state=25)
    return MiAIRRDataset(training_data, "data/M.fasta")

@nni.trace
def get_test_dataset():
    data = pd.read_csv("data/ERR_complete.tsv", sep='\t')
    # reproduce the same split as in get_train_dataset, then take the remainder
    training_data = data.sample(frac=0.90, random_state=25)
    testing_data = data.drop(training_data.index)
    return MiAIRRDataset(testing_data, "data/M.fasta")

train_dataloader = nni.trace(DataLoader)(get_train_dataset(), batch_size=64, collate_fn=PadSequence())
test_dataloader = nni.trace(DataLoader)(get_test_dataset(), batch_size=64, collate_fn=PadSequence())

The idea here is: when the dataloader gets serialized, it resorts to serializing its init arguments, because DataLoader is wrapped by nni.trace. The first argument, get_train_dataset(), also remembers where it came from (i.e., the function get_train_dataset), so we don't need to pickle the dataset itself; we only need to serialize the function get_train_dataset (which is probably just a path or a small binary).

Notice that the data file is read twice, once for training and once for testing. To save that IO, you might cache the dataframe in a global variable. Or, more elegantly, you could use a DataModule and put it into fit_kwargs. (I'm not sure whether DataModule is compatible with one-shot strategies, though.)
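A sketch of the global-variable caching, reusing MiAIRRDataset and the file names from the snippet above:

import nni
import pandas as pd

_DATA_CACHE = None

def _load_data():
    # Read the TSV only once per process and reuse it for both splits.
    global _DATA_CACHE
    if _DATA_CACHE is None:
        _DATA_CACHE = pd.read_csv("data/ERR_complete.tsv", sep='\t')
    return _DATA_CACHE

@nni.trace
def get_train_dataset():
    data = _load_data()
    training_data = data.sample(frac=0.90, random_state=25)
    return MiAIRRDataset(training_data, "data/M.fasta")

@nni.trace
def get_test_dataset():
    data = _load_data()
    training_data = data.sample(frac=0.90, random_state=25)  # same split as above
    testing_data = data.drop(training_data.index)
    return MiAIRRDataset(testing_data, "data/M.fasta")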

matluster avatar Dec 12 '22 15:12 matluster

Hello, I'm trying out NNI NAS for the first time, and I've also encountered a similar issue.

Pickle too large when trying to dump <function get_train_val_data at 0x000001EB7EC24E50>. Please try to raise pickle_size_limit if you insist.

I'm reading a CSV, which I split into train and validation sets. Following some suggestions in this issue, my code right now is:

@nni.trace
class MyDataset(Dataset):
  def __init__(self,x,y):
    self.x_train=x.values.astype('float32')
    self.y_train=y.values.astype('float32')

  def __len__(self):
    return len(self.y_train)

  def __getitem__(self,idx):
    return self.x_train[idx],self.y_train[idx]

@nni.trace
def get_train_val_data(split):
    x_df = pd.read_csv("C:\\Users\\Leonardo\\Desktop\\x_train.csv", sep=",", header=0, index_col=0)
    y_df = pd.read_csv("C:\\Users\\Leonardo\\Desktop\\y_train.csv", sep=",", header=0, index_col=0)

    x_train, x_val, y_train, y_val = train_test_split(x_df, y_df, test_size=0.20, random_state=1)

    if split == 'train':
        return MyDataset(x_train,y_train)
    elif split == 'val':
        return MyDataset(x_val,y_val)

I've also tried splitting only at the DataLoader step using SubsetRandomSampler, similar to the DARTS tutorial; however, I got a similar error.

I use this function with the PyTorch Lightning Regression evaluator like so:

import nni.retiarii.evaluator.pytorch.lightning as pl

evaluator = pl.Regression(
    train_dataloaders=pl.DataLoader(get_train_val_data('train'), batch_size=64, num_workers=6),
    val_dataloaders=pl.DataLoader(get_train_val_data('val'), batch_size=64, num_workers=6),
    max_epochs=10, accelerator='gpu', devices=1)

Using a custom evaluator prevents this issue. If needed, I can provide my notebook.

sw33zy avatar Feb 28 '23 17:02 sw33zy

The point here is: whenever you put @nni.trace on a class or a function, its arguments should never be large objects.

For example,

@nni.trace
class MyDataset(Dataset):
  def __init__(self,x,y):

Then x and y need to be small.

For example,

@nni.trace
def get_train_val_data(split):
    x_df = pd.read_csv("C:\\Users\\Leonardo\\Desktop\\x_train.csv", sep=",", header=0, index_col=0)

Then split needs to be a small object.

One exception is when the argument itself is another object created by @nni.trace. For example, if the split here were created by create_split_function(a, b, c), then a, b, c would need to be small objects.

Please refer to this explanation if it's still not clear enough.
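A small sketch of that exception (create_split_function is hypothetical, used only to illustrate how traced objects can be nested):

import nni

@nni.trace
def create_split_function(a, b, c):
    # a, b, c must be small; they are what gets serialized when the
    # returned object is traced back to this call.
    ...

@nni.trace
def get_train_val_data(split):
    # `split` may be large/complex as long as it was itself produced by
    # an @nni.trace-d function or class: NNI then records how it was
    # constructed instead of pickling the object.
    ...

split_fn = create_split_function(0.8, 0.1, 0.1)
data = get_train_val_data(split_fn)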

ultmaster avatar Mar 01 '23 09:03 ultmaster

I've finally managed to understand! Thank you.

I changed my MyDataset class to a TensorDataset, thus avoiding the @nni.trace that received large objects.
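For anyone following along, the revised function might look roughly like this (a sketch only, reusing the CSV paths from the snippet above; imports of pandas, torch, train_test_split, and TensorDataset are omitted as in the original):

@nni.trace
def get_train_val_data(split):
    x_df = pd.read_csv("C:\\Users\\Leonardo\\Desktop\\x_train.csv", sep=",", header=0, index_col=0)
    y_df = pd.read_csv("C:\\Users\\Leonardo\\Desktop\\y_train.csv", sep=",", header=0, index_col=0)

    x_train, x_val, y_train, y_val = train_test_split(x_df, y_df, test_size=0.20, random_state=1)

    x, y = (x_train, y_train) if split == 'train' else (x_val, y_val)
    # The TensorDataset is built inside the traced function, so no large
    # arrays cross an @nni.trace boundary as arguments.
    return TensorDataset(torch.tensor(x.values.astype('float32')),
                         torch.tensor(y.values.astype('float32')))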

sw33zy avatar Mar 01 '23 10:03 sw33zy

Hi,

This is not working for me either. I have just started using NNI. I would be very grateful if you could please help.

Error:

File "C:\Users\ADMIN\anaconda3\lib\site-packages\nni\common\serializer.py", line 864, in _json_tricks_any_object_encode raise PayloadTooLarge(f'Pickle too large when trying to dump {obj}. This might be caused by classes that are '

PayloadTooLarge: Pickle too large when trying to dump tensor([[[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 1.], [0., 0., 0., ..., 0., 0., 1.], [0., 0., 0., ..., 0., 0., 1.]]], device='cuda:0'). This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.

What I did:

'''splitting train-test data'''

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=.2)

@nni.trace
def get_train_dataset():
    train_dataDataset = TensorDataset(train_data, train_labels)
    return train_dataDataset

@nni.trace
def get_test_dataset():
    test_dataDataset = TensorDataset(test_data, test_labels)
    return test_dataDataset

# create dataloader objects
train_loader = nni.trace(DataLoader)(get_train_dataset(), batch_size=15, shuffle=True, drop_last=True)
test_loader = nni.trace(DataLoader)(get_test_dataset(), batch_size=20, shuffle=True, drop_last=True)

# In evaluate_model:
def evaluate_model():  # arguments should be equal to search-space params
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model_space.to(device)
    numepochs = 10
    trainAcc = []
    SkAcc = []
    losses = []
    learning_rate = 0.00018
    L2lambda = 0
    momentum = 0
    lossfun, optimizer = OptandLoss(model_space, learning_rate, momentum, L2lambda, 'Adam')
    trainAcc, SkAcc, losses, BestModel = trainTheModel(model_space, numepochs, lossfun, optimizer,
                                                       train_loader, test_loader, trainAcc, SkAcc, losses)
    finalSkAcc = SkAcc[-1]
    nni.report_final_result(finalSkAcc)
    return finalSkAcc

Thanks for your help. Please let me know if I should send you the notebook.

Regards,
Shruti

ShrutiSarikaChakraborty avatar May 14 '23 20:05 ShrutiSarikaChakraborty

Hi, I am using the latest version of NNI (nni==v3.0rc1) and I am getting a pickle too large error on my evaluate function.

Error: ValueError: Pickle too large when trying to dump <function evaluate_model at 0x00000219CB486E50>. Please try to raise pickle_size_limit if you insist.

I am working with numpy data that I convert into a TensorDataset before passing to the DataLoader. I have reproduced my error and am adding the relevant code below (the notebook can be provided):

Importing DataLoader: from nni.nas.evaluator.pytorch import DataLoader

Function to convert input numpy data to TensorDataset, decorated with @nni.trace

@nni.trace
def get_dataset(features, labels):
    features_tensor = torch.tensor(features, dtype=torch.float32)
    labels_tensor = torch.tensor(labels, dtype=torch.long)
    return TensorDataset(features_tensor, labels_tensor)

Dataloader:

train_dataloader = nni.trace(DataLoader)(get_dataset(training_data, training_labels), batch_size=10, shuffle=True, drop_last=True)
val_dataloader = nni.trace(DataLoader)(get_dataset(validation_data, validation_labels), batch_size=10, shuffle=True, drop_last=True)

The evaluate, train, and test functions used are exactly the same as in the NAS tutorial. The ModelSpace is a custom model space. Where exactly is the problem stemming from? Which part of the code should I zoom in on to resolve it? Kindly provide some guidance.

Awizhey avatar Aug 03 '23 10:08 Awizhey

I also had similar issues when I started using NNI. I believe it is because the features and labels parameters are too large. When you use @nni.trace, its parameters cannot be large objects; check this comment for more detail. Instead of passing those arguments, you should read your data inside the get_dataset function, for example.
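A minimal sketch of that suggestion, assuming hypothetical .npy files on disk (replace the file names with however your numpy arrays are actually produced):

import numpy as np
import torch
import nni
from torch.utils.data import TensorDataset
from nni.nas.evaluator.pytorch import DataLoader

@nni.trace
def get_dataset(split):
    # Load arrays inside the traced function so that only the small
    # `split` string is serialized, never the data itself.
    features = np.load(f"{split}_features.npy")  # hypothetical file names
    labels = np.load(f"{split}_labels.npy")
    return TensorDataset(torch.tensor(features, dtype=torch.float32),
                         torch.tensor(labels, dtype=torch.long))

train_dataloader = nni.trace(DataLoader)(get_dataset('train'), batch_size=10, shuffle=True, drop_last=True)
val_dataloader = nni.trace(DataLoader)(get_dataset('validation'), batch_size=10, shuffle=True, drop_last=True)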

sw33zy avatar Aug 03 '23 17:08 sw33zy