
Specify data type in CountVectorizer

Open zkid18 opened this issue 2 years ago • 1 comment

What happened:

I have the data in the following format:

0    Satellite TV|Golf Course|Airport Shuttle|Cosme...
1    Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
2    Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
3    Satellite TV|Sailing|Cosmetic Mirror|Telephone...
4    Satellite TV|Sailing|Diving|Cosmetic Mirror|Sa...
vect = CountVectorizer(tokenizer=lambda x: x.split("|"))
tf_df = vect.fit_transform(item_data['properties'])

What you expected to happen:

I want to use CountVectorizer to get the dataframe with the corresponding columns
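To make the expected result concrete, here is a minimal pure-Python sketch of the count table described above: one column per distinct property, one row per record. It is not dask-ml's implementation. The sample rows are made up for illustration (the real values are truncated in the output above), the names `rows`, `tokenize`, `columns`, and `counts` are hypothetical, and it skips the lowercasing that scikit-learn's CountVectorizer applies by default.

```python
from collections import Counter

# Illustrative rows shaped like the 'properties' column (made-up values).
rows = [
    "Satellite TV|Golf Course|Airport Shuttle",
    "Satellite TV|Cosmetic Mirror|Safe (Hotel)",
    "Satellite TV|Sailing|Diving",
]


def tokenize(text):
    # Same tokenizer as in the report: split on the pipe character.
    return text.split("|")


# One column per distinct token, in sorted order.
columns = sorted({tok for row in rows for tok in tokenize(row)})

# One row of counts per record, aligned with `columns`.
counts = []
for row in rows:
    token_counts = Counter(tokenize(row))
    counts.append([token_counts[col] for col in columns])

for row_counts in counts:
    print(dict(zip(columns, row_counts)))
```

Each printed dict maps a property name to how often it occurs in that record, which is the shape of the dataframe I am after.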

It seems I need to specify the output type somehow, but I don't see an interface on CountVectorizer for specifying the metadata. Instead, I get this error:

ValueError: Metadata inference failed in `_count_vectorizer_transform`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version:
dask                      2021.1.1           pyhd3eb1b0_0  
dask-core                 2021.1.1           pyhd3eb1b0_0  
dask-glm                  0.2.0                    py38_0  
dask-ml                   1.8.0              pyhd3eb1b0_0  
  • Python version: Python 3.8.13
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): conda
Cluster Dump State:

zkid18, Apr 23 '22 16:04

Sir/Ma'am, may I work on this issue? Can you assign it to me?

prafull904434, Oct 03 '23 08:10