[Chapter 2] error when trying to fit linear regression model on dataset after pipeline
Hi @ageron, I was going through chapter two, but applying the material to a separate dataset from the one used in the book to supplement my learning a bit, and it's proving quite useful for understanding the concepts. However, I came across a bug when trying to fit my linear regression model to the transformed data, and I'm not sure how to debug it; I was wondering if you would be able to help me. I am using the Ames, Iowa housing dataset, and here is the code I am trying to run:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(training_data, test_size=0.2, random_state=42)
X_train = train_set.drop('SalePrice', axis=1)
y_train = train_set['SalePrice'].copy()
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# here we have to distinguish between the numerical columns and the categorical columns, because the transformations applied to each are different
num_attributes = list(X_train.select_dtypes(exclude=['object'])) # to select all numeric columns, we exclude any column with object dtype
cat_attributes = list(X_train.select_dtypes(include=['object'])) # here we select all columns with object dtype
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder())
])
full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num_attributes),
    ('cat', cat_pipeline, cat_attributes)
])
X_train_prepared = full_pipeline.fit_transform(X_train, y_train)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)
This is the error I am getting:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-ffdac2dfbcea> in <module>
4 lin_reg = LinearRegression()
5
----> 6 lin_reg.fit(X_train_prepared, y_train)
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
461 n_jobs_ = self.n_jobs
462 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 463 y_numeric=True, multi_output=True)
464
465 if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
717 ensure_min_features=ensure_min_features,
718 warn_on_dtype=warn_on_dtype,
--> 719 estimator=estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
484 dtype=dtype, copy=copy,
485 force_all_finite=force_all_finite,
--> 486 accept_large_sparse=accept_large_sparse)
487 else:
488 # If np.array(..) gives ComplexWarning, then we convert the warning
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
318 else:
319 _assert_all_finite(spmatrix.data,
--> 320 allow_nan=force_all_finite == 'allow-nan')
321
322 return spmatrix
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could this have something to do with the fact that my pipeline is returning a sparse matrix rather than a dense matrix? Any help would be greatly appreciated. Thank you.
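As a follow-up: the numeric columns in this dataset also contain missing values (e.g. lot frontage), and StandardScaler passes NaNs through untouched, so I suspect the NaNs are coming from there rather than from the sparse format. Here is a sketch on a tiny made-up frame (the column names and values below are just placeholders, not the real data) where giving the numeric columns their own imputer lets the same pipeline structure fit without the error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame with a NaN in the numeric column and one in the categorical column
X = pd.DataFrame({
    'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
    'Neighborhood': ['CollgCr', 'Veenker', np.nan, 'Crawfor'],
})
y = pd.Series([208500, 181500, 223500, 140000])

# numeric columns now get an imputer before scaling, so no NaN survives
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder()),
])
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, ['LotArea']),
    ('cat', cat_pipeline, ['Neighborhood']),
])

X_prepared = full_pipeline.fit_transform(X)
lin_reg = LinearRegression().fit(X_prepared, y)  # fits without the NaN error
```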
I am also getting a similar error after passing data to the predict API after pipelining: ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 16 is different from 14)
I think you might be doing what I just did (o: When preparing data to make predictions, make sure you call transform on the pipeline rather than fit_transform.
e.g. some_data_prepared = full_pipeline.transform(some_data)
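To see why re-fitting breaks the shapes, here's a toy illustration (made-up column and values): fit_transform learns the categories from whatever data you give it, so re-fitting on new data that contains fewer categories produces a different number of one-hot columns, while transform reuses the columns learned at training time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

train = pd.DataFrame({'quality': ['good', 'bad', 'average']})
new = pd.DataFrame({'quality': ['good', 'good']})

train_prepared = enc.fit_transform(train)       # learns 3 categories -> 3 columns
new_prepared = enc.transform(new)               # reuses the 3 learned columns
new_refit = OneHotEncoder().fit_transform(new)  # re-fitting sees only 1 category -> 1 column

print(train_prepared.shape[1], new_prepared.shape[1], new_refit.shape[1])  # 3 3 1
```

The same applies to the whole ColumnTransformer: a model trained on the fit_transform output expects exactly the column count learned from the training set.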