dask-ml icon indicating copy to clipboard operation
dask-ml copied to clipboard

HashingVectorizer behaves differently from FeatureHasher

Open brian-methodical opened this issue 2 years ago • 1 comments

Describe the issue: HashingVectorizer behaves differently from FeatureHasher, HashingVectorizer can work off a Sting like

JUNK_FOOD_DOCS = (
    "the pizza pizza beer copyright",
    "the pizza burger beer copyright",
    "the the pizza beer beer copyright",
    "the burger beer beer copyright",
    "the coke burger coke copyright",
    "the coke burger burger",
)

but FeatureHasher expects an iterable of strings like:

JUNK_FOOD_DOCS = [["the", "pizza", "pizza", "beer", "copyright"],
   ["the", "coke", "burger", "burger"] ]

Which is the correct behavior:

  1. expect the hasher to parse strings into vectors
  2. fix the test by sending list of lists of strings to FeatureHasher instead list of strings, like the other hasher expects

Minimal Complete Verifiable Example:

See dask-ml/tests/feature_extraction/test_text.py: test_basic()

Anything else we need to know?:

This is illustrated in the failing test

Environment:

  • Dask version: dask-ml-3.8 conda env dask 2023.1.1
  • Python version: 3.8, 3.9, 3.10
  • Operating System: ubuntu-latest
  • Install method (conda, pip, source): conda

brian-methodical avatar Feb 08 '23 02:02 brian-methodical