handson-ml
Chapter 2 full_pipeline.fit error
I'm having trouble with the full_pipeline section of code below. I'm confused by the code in the book and the adjustments on the website. When I run the code below I get the error:
fit_transform() takes 2 positional arguments but 3 were given
on the final line. Not sure what the problem is because it was working fine last night. Where have I gone astray?
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer

# DataFrameSelector and CombinedAttributesAdder are the custom classes
# defined earlier in the chapter.
num_attribs = list(housing_num)    # names of the numerical columns
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
Hi @cmcgrath1982 ,
The problem may come from the definition of the DataFrameSelector class, or the CombinedAttributesAdder class. Make sure their fit() and transform() methods both take two arguments: X and y=None. Even if you don't need the y, you still need to have it, because the Pipeline object will call all the transformers with both X and y, even if there is no y (it will be called with y=None).
Hope this helps.
Thanks, I was also using a deprecated version of Scikit-Learn, and that wasn't helping. Think I got this sorted out for now.
Hi @cmcgrath1982, I am also getting the same error. Which version of Scikit-Learn are you using?
You need the latest version of Scikit-Learn: 0.20.3
pip3 install -U scikit-learn
The book uses LabelBinarizer() whereas the Jupyter Notebook uses OneHotEncoder(). It seems that using LabelBinarizer() gives this error while using OneHotEncoder() doesn't. Not sure why?
Hi @229539687 , thanks for your feedback. You probably have an older revision of the book. You can check which release you have on the page immediately before the table of contents. The latest is the 12th release. The first releases used the LabelBinarizer because I think there was no OneHotEncoder at the time, but the LabelBinarizer solution I used was really a hack, and it stopped working in a later version of Scikit-Learn because it's not designed to work in pipelines (since it's meant for the labels, not the input features). In particular, the LabelBinarizer can only handle one column at a time, and it only handles one argument (y), so it cannot be used in pipelines (which expect both X and y). So I switched to the OneHotEncoder when it started working really well, in Scikit-Learn 0.20. So you should definitely use OneHotEncoder and Scikit-Learn ≥0.20.
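A quick example of the new approach, using the chapter's housing DataFrame:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
# Since Scikit-Learn 0.20, string categories work directly; note the
# double brackets: the encoder expects a 2-D input, not a 1-D Series.
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])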
Hope this helps.
Hi,
while running the full pipeline code I get the following error:
name 'DataFrameSelector' is not defined.
Please suggest a solution.
Hi @yashGuleria ,
Thanks for your question. This DataFrameSelector class was a custom class. It has to be defined as indicated in the book, if you have one of the earlier releases. However, if you have one of the newer releases of the book, then you'll see that it uses a new approach, using the ColumnTransformer class. This class was added in Scikit-Learn 0.20, and it's preferable to use it.
I wrote several comments in the Jupyter notebook for chapter 2, to explain what changed, please check it out.
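For reference, the definition used in the earlier releases looks like this:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of the DataFrame's columns as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values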
Hope this helps.
I, on the other hand, am getting a whole other issue from this line:
housing_prepared = full_pipeline.fit_transform(train)
I am using the Ames, Iowa dataset. However, I keep getting the error:
\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _check_X(self, X)
54 if not _get_config()['assume_finite']:
55 if _object_dtype_isnan(X).any():
---> 56 raise ValueError("Input contains NaN")
57
58 return X
ValueError: Input contains NaN
I checked, and train_num and train_cat have no NaN.
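Roughly how I checked (a sketch, with train_num and train_cat being the frames I feed in):

# count missing values per column in the frames fed to the pipeline
print(train_num.isna().sum())
print(train_cat.isna().sum())
# placeholders such as empty strings would not be counted as NaN above
print((train_cat == "").sum())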
What am I doing wrong? Someone, anyone, please help. I am new to this.
link to full code
Hi @ML-magazine , thanks for your question. The link to the full code does not work. Could you please paste the full URL here? Also, could you please make sure the code is exactly identical to the code in the notebook? Thanks
Thank you so much for your response. Wow! Found the error. Thank you so much.
Hi @ageron, as mentioned in the book, for OneHotEncoder to work it first needs the categorical variables to be converted into numerical form using LabelEncoder, so it is a two-step process: label encoding and then one-hot encoding.
But this contradicts the new version on GitHub, where the full pipeline uses only OneHotEncoder:
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
Since you mentioned in the notebook that OneHotEncoder now handles strings:
Warning: earlier versions of the book used the LabelBinarizer or CategoricalEncoder classes to convert each categorical value to a one-hot vector. It is now preferable to use the OneHotEncoder class. Since Scikit-Learn 0.20 it can handle string categorical inputs (see PR #10521).
I tried the same directly, but I cannot feed the Series to the encoder; I still need to convert it into a NumPy array, as shown below:
from sklearn.preprocessing import OneHotEncoder

x = housing["ocean_proximity"]
x = x.to_numpy()
a = OneHotEncoder()
a.fit_transform(x.reshape(-1, 1))  # one column, one row per sample
So does ColumnTransformer() pick the features from the DataFrame directly and pass them to the encoder as a NumPy array?
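From a quick test, ColumnTransformer does seem to do the column selection itself and hand the encoder a 2-D block, so no manual reshape is needed (a sketch, not the notebook's exact code):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer pulls the listed columns out of the DataFrame and
# passes them to the encoder as a 2-D block (here a one-column frame)
ct = ColumnTransformer([("cat", OneHotEncoder(), ["ocean_proximity"])])
housing_cat_1hot = ct.fit_transform(housing)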
Hi @ageron,
How are we fitting the model with two different datatypes here? How does this work?
I'm also having this error.