[Chapter 2] error when trying to fit linear regression model on dataset after pipeline
Hi @ageron, I was going through chapter two, but applying the material to a separate dataset from the one used in the book to supplement my learning a bit, and it's proving quite useful for understanding the concepts. However, I came across a bug when trying to fit my linear regression model to the transformed data, and I'm not sure how to debug it; I was wondering if you would be able to help me. I am using the Ames, Iowa housing dataset, and here is the code I am trying to run:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(training_data, test_size=0.2, random_state=42)
X_train = train_set.drop('SalePrice', axis=1)
y_train = train_set['SalePrice'].copy()
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# here we have to distinguish between the numerical columns and the categorical columns, because the transformations applied to each are different
num_attributes = list(X_train.select_dtypes(exclude=['object'])) # to select all numeric columns, we exclude any column with object dtype
cat_attributes = list(X_train.select_dtypes(include=['object'])) # here we select all columns with object dtype
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder())
])
full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num_attributes),
    ('cat', cat_pipeline, cat_attributes)
])
X_train_prepared = full_pipeline.fit_transform(X_train, y_train)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)
This is the error I am getting:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-ffdac2dfbcea> in <module>
4 lin_reg = LinearRegression()
5
----> 6 lin_reg.fit(X_train_prepared, y_train)
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
461 n_jobs_ = self.n_jobs
462 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 463 y_numeric=True, multi_output=True)
464
465 if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
717 ensure_min_features=ensure_min_features,
718 warn_on_dtype=warn_on_dtype,
--> 719 estimator=estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
484 dtype=dtype, copy=copy,
485 force_all_finite=force_all_finite,
--> 486 accept_large_sparse=accept_large_sparse)
487 else:
488 # If np.array(..) gives ComplexWarning, then we convert the warning
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse)
318 else:
319 _assert_all_finite(spmatrix.data,
--> 320 allow_nan=force_all_finite == 'allow-nan')
321
322 return spmatrix
~\Anaconda3\envs\ml_book\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could this have something to do with the fact that my pipeline is returning a sparse matrix rather than a dense matrix? Any help would be greatly appreciated. Thank you.
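As a follow-up: the numeric columns in this dataset also contain missing values (e.g. lot frontage), and StandardScaler passes NaNs through untouched, so I suspect the NaNs are coming from there rather than from the sparse format. Here is a sketch on a tiny made-up frame (the column names and values below are just placeholders, not the real data) where giving the numeric columns their own imputer lets the same pipeline structure fit without the error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame with a NaN in the numeric column and one in the categorical column
X = pd.DataFrame({
    'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
    'Neighborhood': ['CollgCr', 'Veenker', np.nan, 'Crawfor'],
})
y = pd.Series([208500, 181500, 223500, 140000])

# numeric columns now get an imputer before scaling, so no NaN survives
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder()),
])
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, ['LotArea']),
    ('cat', cat_pipeline, ['Neighborhood']),
])

X_prepared = full_pipeline.fit_transform(X)
lin_reg = LinearRegression().fit(X_prepared, y)  # fits without the NaN error
```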
I am also getting a similar error after passing data to the predict API after pipelining: ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 16 is different from 14)
I think you might be doing what I just did (o: When preparing data to make predictions, make sure you call transform on the pipeline rather than fit_transform.
e.g. some_data_prepared = full_pipeline.transform(some_data)
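To see why re-fitting breaks the shapes, here's a toy illustration (made-up column and values): fit_transform learns the categories from whatever data you give it, so re-fitting on new data that contains fewer categories produces a different number of one-hot columns, while transform reuses the columns learned at training time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

train = pd.DataFrame({'quality': ['good', 'bad', 'average']})
new = pd.DataFrame({'quality': ['good', 'good']})

train_prepared = enc.fit_transform(train)       # learns 3 categories -> 3 columns
new_prepared = enc.transform(new)               # reuses the 3 learned columns
new_refit = OneHotEncoder().fit_transform(new)  # re-fitting sees only 1 category -> 1 column

print(train_prepared.shape[1], new_prepared.shape[1], new_refit.shape[1])  # 3 3 1
```

The same applies to the whole ColumnTransformer: a model trained on the fit_transform output expects exactly the column count learned from the training set.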