handson-ml
Chapter 2 full_pipeline.fit error
I'm having trouble with the full_pipeline section of code below. I'm confused by the code in the book and the adjustments on the website. When I run the code below I get the error:
fit_transform() takes 2 positional arguments but 3 were given
on the final line. Not sure what the problem is because it was working fine last night. Where have I gone astray?
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer

# DataFrameSelector and CombinedAttributesAdder are the custom classes
# defined earlier in the chapter.
num_attribs = list(housing_num)    # names of the numerical columns
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
Hi @cmcgrath1982 ,
The problem may come from the definition of the DataFrameSelector class, or the CombinedAttributesAdder class. Make sure their fit() and transform() methods both take two arguments: X and y=None. Even if you don't need the y, you still need to have it, because the Pipeline object will call all the transformers with both X and y, even if there is no y (it will be called with y=None).
Hope this helps.
Thanks, I was also using a deprecated version of Scikit-Learn, and that wasn't helping. Think I got this sorted out for now.
Hi @cmcgrath1982, I am also getting the same error. Which version of Scikit-Learn are you using?
You need the latest version of Scikit-Learn: 0.20.3
pip3 install -U scikit-learn
The book uses LabelBinarizer() whereas the Jupyter Notebook uses OneHotEncoder(). It seems that using LabelBinarizer() gives this error while using OneHotEncoder() doesn't. Not sure why?
Hi @229539687 , thanks for your feedback. You probably have an older revision of the book. You can check which release you have on the page immediately before the table of contents. The latest is the 12th release. The first releases used the LabelBinarizer because I think there was no OneHotEncoder at the time, but the LabelBinarizer solution I used was really a hack, and it stopped working in a later version of Scikit-Learn because it's not designed to work in pipelines (since it's meant for the labels, not the input features). In particular, the LabelBinarizer can only handle one column at a time, and it only handles one argument (y), so it cannot be used in pipelines (which expect both X and y). So I switched to the OneHotEncoder when it started working really well, in Scikit-Learn 0.20. So you should definitely use OneHotEncoder and Scikit-Learn ≥0.20.
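A quick example of the new approach, using the chapter's housing DataFrame:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
# Since Scikit-Learn 0.20, string categories work directly; note the
# double brackets: the encoder expects a 2-D input, not a 1-D Series.
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])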
Hope this helps.
Hi,
while running the full pipeline code I get the following error:
name 'DataFrameSelector' is not defined.
Please suggest a solution.
Hi @yashGuleria ,
Thanks for your question. This DataFrameSelector class was a custom class. It has to be defined as indicated in the book, if you have one of the earlier releases. However, if you have one of the newer releases of the book, then you'll see that it uses a new approach, using the ColumnTransformer class. This class was added in Scikit-Learn 0.20, and it's preferable to use it.
I wrote several comments in the Jupyter notebook for chapter 2, to explain what changed, please check it out.
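For reference, the definition used in the earlier releases looks like this:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of the DataFrame's columns as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values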
Hope this helps.
I, on the other hand, am getting a whole other issue from this line:
housing_prepared = full_pipeline.fit_transform(train)
I am using the Ames, Iowa dataset. However, I keep getting the error:
\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
54 if not _get_config()['assume_finite']:
55 if _object_dtype_isnan(X).any():
---> 56 raise ValueError("Input contains NaN")
57
58 return X
ValueError: Input contains NaN
I checked, and train_num and train_cat have no NaN.
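Roughly how I checked (a sketch, with train_num and train_cat being the frames I feed in):

# count missing values per column in the frames fed to the pipeline
print(train_num.isna().sum())
print(train_cat.isna().sum())
# placeholders such as empty strings would not be counted as NaN above
print((train_cat == "").sum())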
What am I doing wrong? Someone, anyone, please help. I am new to this.
link to full code
Hi @ML-magazine , thanks for your question. The link to the full code does not work. Could you please paste the full URL here? Also, could you please make sure the code is exactly identical to the code in the notebook? Thanks
Thank you so much for your response. Wow! Found the error. Thank you so much.
Hi @ageron, as mentioned in the book, for OneHotEncoder to work it first needs the categorical variables to be converted into numerical form using LabelEncoder, so it is a two-step process: label encoding and then one-hot encoding.
But this contradicts the new version on GitHub, where the full pipeline uses only OneHotEncoder:
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
Since you mentioned in the notebook that OneHotEncoder now handles strings:
Warning: earlier versions of the book used the LabelBinarizer or CategoricalEncoder classes to convert each categorical value to a one-hot vector. It is now preferable to use the OneHotEncoder class. Since Scikit-Learn 0.20 it can handle string categorical inputs (see PR #10521).
I tried the same directly, but I cannot feed the Series to the encoder; I still need to convert it into a NumPy array, as shown below:
from sklearn.preprocessing import OneHotEncoder

x = housing["ocean_proximity"]
x = x.to_numpy()
a = OneHotEncoder()
a.fit_transform(x.reshape(-1, 1))  # one column, one row per sample
So does ColumnTransformer() pick the features from the DataFrame directly and pass them to the encoder as a NumPy array?
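From a quick test, ColumnTransformer does seem to do the column selection itself and hand the encoder a 2-D block, so no manual reshape is needed (a sketch, not the notebook's exact code):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer pulls the listed columns out of the DataFrame and
# passes them to the encoder as a 2-D block (here a one-column frame)
ct = ColumnTransformer([("cat", OneHotEncoder(), ["ocean_proximity"])])
housing_cat_1hot = ct.fit_transform(housing)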
Hi @ageron,
How are we fitting the model with two different datatypes here? How does this work?
I'm also having this error.