sklearn-compatible interface

Open 34j opened this issue 2 years ago • 22 comments
It would be great to have this feature, since sklearn is often used for tabular data. I tried to use skorch, but skorch does not accept TensorFrames and fails with the error below.

(examples/tutorial.py)

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(module=model, max_epochs=args.epochs, lr=args.lr, 
                            device=device, batch_size=args.batch_size, 
                            classes=dataset.num_classes, iterator_train=DataLoader,
                            iterator_valid=DataLoader, train_split=None)
net.fit(train_dataset, y=None)

Traceback (most recent call last):
  File "\examples\tutorial.py", line 346, in <module>
    net.fit(train_dataset, y=None)
  File "\site-packages\skorch\classifier.py", line 165, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1319, in fit
    self.partial_fit(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1278, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1190, in fit_loop
    self.run_single_epoch(iterator_train, training=True, prefix="train",
  File "\site-packages\skorch\net.py", line 1226, in run_single_epoch
    step = step_fn(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1105, in train_step
    self._step_optimizer(step_fn)
  File "\site-packages\skorch\net.py", line 1060, in _step_optimizer
    optimizer.step(step_fn)
  File "\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\sgd.py", line 66, in step
    loss = closure()
           ^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1094, in step_fn
    step = self.train_step_single(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 993, in train_step_single
    y_pred = self.infer(Xi, **fit_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1517, in infer
    x = to_tensor(x, device=self.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in to_tensor
    return [to_tensor_(x) for x in X]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in <listcomp>
    return [to_tensor_(x) for x in X]
            ^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 118, in to_tensor
    raise TypeError("Cannot convert this data type to a torch tensor.")
TypeError: Cannot convert this data type to a torch tensor.

I think the following changes are needed:

  • Add an ability to convert from DataFrame to TensorFrame without much prior information.
  • Create a wrapper that passes Tensor to skorch or create a scikit-learn compatible estimator specifically for this package.

I am sorry, but I cannot take much time to assist in creating this feature, so if it is not possible, please close this.

34j avatar Oct 24 '23 14:10 34j

You can convert a DataFrame to TensorFrame easily with

dataset = Dataset(df, col_to_stype=col_to_stype, target_col="y")
dataset.tensor_frame

See tutorial.

yiweny avatar Oct 25 '23 05:10 yiweny

Thanks for your suggestion! I think this would be great to add. Setting this as a P2 feature, as we first want to prioritize more stype support: https://github.com/pyg-team/pytorch-frame/issues/88.

weihua916 avatar Oct 25 '23 05:10 weihua916

Is someone already working on that?

MacOS avatar Dec 18 '23 21:12 MacOS

No, as far as I know. Let us know if you are interested!

weihua916 avatar Dec 18 '23 23:12 weihua916

Yes, I'm interested. Hence, you can assign this to me. How fast should this task be completed?

MacOS avatar Dec 22 '23 20:12 MacOS

@MacOS Great, thank you! It'd be good to complete this feature by the end of January. Would that be possible?

weihua916 avatar Dec 29 '23 11:12 weihua916

@weihua916 As of now, yes.

MacOS avatar Jan 01 '24 19:01 MacOS

I have tried this and it seems to be very difficult. As a quick fix that isn't pretty, the following seems necessary:

  • Patch skorch.utils.to_tensor_ to bypass TensorFrame.
  • Add index = torch.tensor(index) to torch_frame.DataLoader.collate_fn to make it return TensorFrame instead of list[TensorFrame].
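
The first bullet can be illustrated without skorch or pytorch-frame installed. The sketch below only shows the general pattern of wrapping a strict converter so that objects of one chosen type are returned unchanged instead of being rejected. All names in it (FakeTensorFrame, strict_to_tensor, allow_passthrough) are hypothetical stand-ins, not part of either library, and how such a wrapper would actually be hooked into skorch is not shown.

```python
from typing import Any, Callable


class FakeTensorFrame:
    """Hypothetical stand-in for torch_frame.TensorFrame."""


def strict_to_tensor(x: Any) -> list:
    """Hypothetical stand-in for a converter that rejects unknown types."""
    if isinstance(x, (list, tuple)):
        return list(x)
    raise TypeError("Cannot convert this data type to a torch tensor.")


def allow_passthrough(
    convert: Callable[[Any], Any], passthrough_type: type
) -> Callable[[Any], Any]:
    """Wrap `convert` so instances of `passthrough_type` are returned as-is."""

    def wrapped(x: Any) -> Any:
        if isinstance(x, passthrough_type):
            # Bypass: hand the object through without converting it.
            return x
        return convert(x)

    return wrapped


# The wrapped converter accepts FakeTensorFrame and still converts sequences.
patched_to_tensor = allow_passthrough(strict_to_tensor, FakeTensorFrame)
```

With this wrapper, patched_to_tensor(FakeTensorFrame()) returns the same object, while strict_to_tensor still raises TypeError for it.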

~However, I don't know how to pass the validation dataset.~

Next, we want to pass a validation dataset as well, but passing both as a tuple, as in skorch.NeuralNet.fit((train_dataset.tensor_frame, val_dataset.tensor_frame), None), makes skorch raise many errors. Therefore, I tried to split them inside skorch.

  • ~Pass col_to_stype as y, as in skorch.NeuralNet.fit(dataset.df, dataset.col_to_stype), utilizing the internal structure.~
  • Remove self.check_data(X, y) in skorch.NeuralNet.fit_loop().
  • ~Modify TensorFrame to call self.materialize() in the constructor.~
  • To avoid an error in torch_frame.Dataset.split(), set split_col like skorch.NeuralNet(... , dataset=lambda d, c: Dataset(d, c, split_col='split_col')).

34j avatar Mar 11 '24 03:03 34j

:thinking:

Thank you for looking into this, @34j! I was about to start working on it.

Add an ability to convert from DataFrame to TensorFrame without much prior information.

I would have simply converted the DataFrame to a TensorFrame internally, worked with it, and, if requested, returned a DataFrame again. This means, of course, that one has to track what was given. Or am I missing something?

Create a wrapper that passes Tensor to skorch or create a scikit-learn compatible estimator specifically for this package.

This seems to be very big and unrealistic because we would have to make all estimators compatible with scikit-learn, which is a lot to ask for. At the moment, scikit-learn is an optional dependency.

May I ask you, @34j, to post a self-contained example (or examples) of what would qualify pytorch-frame as being sklearn-compatible?

PS: I would submit one PR today, but maybe only as a draft.

MacOS avatar Mar 11 '24 11:03 MacOS

Add an ability to convert from DataFrame to TensorFrame without much prior information.

This was implicitly a request for what the recently implemented infer_df_stype provides, so it has thankfully already been resolved.
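
To make the idea of inferring column types "without much prior information" concrete, here is a deliberately simplified, stdlib-only sketch. It is not how infer_df_stype works internally; the function name infer_column_kinds, the max_categories threshold, and the three labels are assumptions made up for illustration.

```python
from typing import Any, Dict, List


def infer_column_kinds(
    columns: Dict[str, List[Any]], max_categories: int = 10
) -> Dict[str, str]:
    """Guess a coarse kind for each column from its values alone."""
    kinds: Dict[str, str] = {}
    for name, values in columns.items():
        # Ignore missing values when inspecting a column.
        present = [v for v in values if v is not None]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if is_numeric:
            kinds[name] = "numerical"
        elif len(set(present)) <= max_categories:
            # Few distinct non-numeric values: treat as categories.
            kinds[name] = "categorical"
        else:
            kinds[name] = "text"
    return kinds


example = {
    "age": [22, 35, 58, None],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
    "review": ["good", "bad service", "fine", "would return"],
}
kinds = infer_column_kinds(example, max_categories=2)
```

A real implementation has to handle many more cases (timestamps, embeddings, multi-valued categories), which is why a library-provided helper is preferable to a hand-rolled rule like this.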

Create a wrapper that passes Tensor to skorch

I feel like this could probably be done. I'll send a draft PR in an hour, and I'd like to ask @MacOS to take it over and do the documentation, testing, and tutorial work.

dirty prototype code

examples/tutorial.py:

import torch
import torch.nn as nn
from skorch import NeuralNetClassifier
from torch import Tensor
from torch_frame import TensorFrame
from torch_frame.data import DataLoader
from torch_frame.data.dataset import Dataset

# `dataset`, `model`, `args`, and `device` are assumed to be defined
# earlier in the tutorial script.


def create_dataset(df, _) -> Dataset:
    # skorch calls this as dataset(X, y); y is unused because the target
    # is a column of df.
    dataset_ = Dataset(
        df, dataset.col_to_stype, split_col="split_col", target_col="target_col"
    )
    dataset_.materialize()
    return dataset_


def split_dataset(dataset: Dataset) -> tuple[TensorFrame, TensorFrame]:
    # Keep only the first two parts of the split (train and validation).
    datasets = dataset.split()[:2]
    return datasets[0].tensor_frame, datasets[1].tensor_frame


def get_iterator(dataset: TensorFrame, **kwargs) -> DataLoader:
    return DataLoader2(dataset, **kwargs)


class DataLoader2(DataLoader):
    def collate_fn(
        self, index: int | list[int] | range | slice | Tensor
    ) -> tuple[TensorFrame, Tensor | None]:
        # Index with a tensor so that a single TensorFrame is returned.
        index = torch.tensor(index)
        res = super().collate_fn(index).to(device)
        # skorch expects each batch as an (X, y) pair.
        return res, res.y

net = NeuralNetClassifier(
    module=model,
    max_epochs=args.epochs,
    lr=args.lr,
    device=device,
    batch_size=6,
    iterator_train=get_iterator,
    dataset=create_dataset,
    iterator_valid=get_iterator,
    train_split=split_dataset,
    classes=dataset.df["target_col"].unique(),
    verbose=1,
    criterion=nn.CrossEntropyLoss,
)
net.fit(dataset.df, None)

34j avatar Mar 11 '24 12:03 34j

@34j Is fine with me!

So we drop the second part of your request then, correct?

MacOS avatar Mar 11 '24 12:03 MacOS

Heads up everyone, I started working on it. I have already merged the PR draft of @34j into my fork.

Would be nice if you guys would be available in case I have questions. :)

MacOS avatar Mar 13 '24 09:03 MacOS

Heads up everyone, I started working on it. I have already merged the PR draft of @34j into my fork.

Would be nice if you guys would be available in case I have questions. :)

~May I ask you what is your question~ nvm plz, sorry for my terrible English comprehension

34j avatar Mar 14 '24 06:03 34j

So far none. I meant just in case.

Sorry for the delay at all, but I had personal matters to deal with. I'm confident that I can submit a PR this month.

MacOS avatar Mar 19 '24 07:03 MacOS

Hi all,

short update, unfortunately, I got sick, hence again a delay. Should I still work on it?

MacOS avatar Apr 06 '24 16:04 MacOS

Hi all,

short update, unfortunately, I got sick, hence again a delay. Should I still work on it?

I think it should continue. Are you still working on this part? Otherwise I can take over.

qychen2001 avatar Jul 10 '24 10:07 qychen2001

Yes, still working on it @qychen2001!

MacOS avatar Jul 10 '24 15:07 MacOS

Yes, still working on it @qychen2001!

That's great! This feature is really important, looking forward to your PR.

qychen2001 avatar Jul 11 '24 00:07 qychen2001

Sorry, but I have almost completed this feature by myself in #375 (as MacOS seemed to be sick) and am just waiting for @weihua916's review. However, the pre-commit styling work by MacOS that I referred to certainly helped.

34j avatar Jul 11 '24 03:07 34j

Sorry, but I have almost completed this feature by myself in #375 (as MacOS seemed to be sick) and am just waiting for @weihua916's review. However, the pre-commit styling work by MacOS that I referred to certainly helped.

That's fantastic! But I'm still concerned about the relationship between skorch and sklearn. Can your PR directly support models in sklearn such as svm?

qychen2001 avatar Jul 11 '24 03:07 qychen2001

Excuse me but what do you mean by relationship? skorch works perfectly, trust me plz 🫠

34j avatar Jul 11 '24 03:07 34j

can your PR directly support models in sklearn such as svm?

sklearn models already have sklearn-compatible interface apparently
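
For readers wondering what "sklearn-compatible interface" usually refers to: scikit-learn estimators follow a duck-typed convention in which hyperparameters are stored unchanged in __init__, get_params and set_params expose them, fit returns self, and learned attributes end in an underscore. The stdlib-only toy below sketches that convention. MajorityClassifier and its fallback parameter are hypothetical, and this is not the interface implemented in #375.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List, Sequence


class MajorityClassifier:
    """Toy estimator following scikit-learn's duck-typed conventions."""

    def __init__(self, fallback: Hashable = None):
        # Hyperparameters are stored as-is, without validation or copying.
        self.fallback = fallback

    def get_params(self, deep: bool = True) -> Dict[str, Any]:
        return {"fallback": self.fallback}

    def set_params(self, **params: Any) -> "MajorityClassifier":
        for key, value in params.items():
            if key not in self.get_params():
                raise ValueError(f"Invalid parameter {key!r}")
            setattr(self, key, value)
        return self

    def fit(self, X: Sequence[Any], y: Sequence[Hashable]) -> "MajorityClassifier":
        # Learned state uses a trailing underscore by convention.
        self.majority_ = Counter(y).most_common(1)[0][0] if len(y) else self.fallback
        return self

    def predict(self, X: Sequence[Any]) -> List[Hashable]:
        # Always predict the most frequent training label.
        return [self.majority_ for _ in X]


clf = MajorityClassifier().fit([[0], [1], [2]], ["a", "b", "a"])
predictions = clf.predict([[5], [6]])
```

Tools that accept sklearn estimators generally rely on these methods, which is why a wrapper such as skorch can make a PyTorch model usable in that ecosystem.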

34j avatar Jul 11 '24 03:07 34j