dask-ml
dask-ml copied to clipboard
HashingVectorizer behaves differently from FeatureHasher
Describe the issue: HashingVectorizer behaves differently from FeatureHasher, HashingVectorizer can work off a Sting like
JUNK_FOOD_DOCS = (
"the pizza pizza beer copyright",
"the pizza burger beer copyright",
"the the pizza beer beer copyright",
"the burger beer beer copyright",
"the coke burger coke copyright",
"the coke burger burger",
)
but FeatureHasher expects an iterable of strings like:
JUNK_FOOD_DOCS = [["the", "pizza", "pizza", "beer", "copyright"],
["the", "coke", "burger", "burger"] ]
Which is the correct behavior:
- expect the hasher to parse strings into vectors
- fix the test by sending list of lists of strings to FeatureHasher instead list of strings, like the other hasher expects
Minimal Complete Verifiable Example:
See dask-ml/tests/feature_extraction/test_text.py: test_basic()
Anything else we need to know?:
This is illustrated in the failing test
Environment:
- Dask version: dask-ml-3.8 conda env dask 2023.1.1
- Python version: 3.8, 3.9, 3.10
- Operating System: ubuntu-latest
- Install method (conda, pip, source): conda