if a keyword is not there (in any of the datapoints) - exception

Open zbenmo opened this issue 1 year ago • 7 comments

I looked for "GPU" in a text corpus (just playing with 20newsgroups). Apparently it does not appear in any of the texts, and the result is an index-not-found error (it seems to happen while looking up a color or so).

zbenmo avatar Jul 27 '23 13:07 zbenmo

I may need to have a bit more context ... what is the specific thing you tried and what went wrong on the bulk-side?

koaning avatar Jul 27 '23 14:07 koaning

python -m bulk text file.csv --keywords "cannot_find_me_in_any_sample"

On the bulk side I see an exception raised, and the browser shows a Server Error.

zbenmo avatar Jul 27 '23 14:07 zbenmo

Just to double check, you're aware that the --keywords param needs two dashes? Just want to make sure that it's not that.

If it's not that, got a link to your dataset?

koaning avatar Jul 27 '23 14:07 koaning

I fixed it in the command line above. In my console I used it right, with two dashes.

..
from sklearn.datasets import fetch_20newsgroups
..

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)

zbenmo avatar Jul 27 '23 14:07 zbenmo

And I follow your videos with UMAP:

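import pandas as pd
from umap import UMAP

# `model` is a text encoder defined elsewhere (not shown in this comment).
X = model.encode(twenty_train.data)
umap = UMAP()
X_tfm = umap.fit_transform(X)
df = (
    pd.DataFrame(X_tfm, columns=['x', 'y'])
    .assign(
        category=twenty_train.target
    )
)
# Drop the label column and attach the raw text before exporting the CSV.
(
    df
    .drop(['category'], axis=1)
    .assign(
        text=twenty_train.data
    )
).to_csv("ready.csv", index=False)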

zbenmo avatar Jul 27 '23 14:07 zbenmo

Does the error have anything to do with the fact that "GPU" does not appear in your subset?

import pandas as pd 
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)

print([ex for ex in twenty_train.data if "GPU" in ex])

This prints an empty list.

koaning avatar Jul 27 '23 15:07 koaning
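
The check above can be generalized into a small pre-flight helper that reports which keywords never occur in the texts before they are passed to `--keywords`. This is only a sketch: the name `missing_keywords` is made up for illustration, and it assumes a case-sensitive substring match like the `"GPU" in ex` test above. The thread does not say how bulk itself matches keywords.

```python
from typing import Iterable, List


def missing_keywords(texts: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the keywords that do not occur in any of the texts.

    Hypothetical helper, not part of bulk. Uses a case-sensitive
    substring test, mirroring the `"GPU" in ex` check in the thread.
    """
    texts = list(texts)  # allow generators; we iterate once per keyword
    return [kw for kw in keywords if not any(kw in text for text in texts)]


# Small made-up sample standing in for the `text` column of the CSV.
sample_texts = [
    "A question about rendering graphics on older hardware.",
    "Discussion of a new medical study.",
]
absent = missing_keywords(sample_texts, ["graphics", "GPU"])
print("Keywords with no match:", absent)
```

Running the same function over the `text` column of `ready.csv` would flag "GPU" for the subset used in this issue, since the list comprehension above came back empty.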

Yes, this is the issue. It then seems to access an invalid index in a list of colors, or something along those lines.

zbenmo avatar Jul 27 '23 16:07 zbenmo
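
No traceback is included in the thread, so the exact failing line is unknown. Purely as an illustration of how this class of failure can arise, the sketch below picks a color palette based on the number of distinct groups and breaks when a keyword matches nothing, so that only one group remains. Every name here (`PALETTES`, `label_rows`, `pick_palette`, `pick_palette_safe`) is hypothetical and is not taken from bulk's code.

```python
# Hypothetical illustration only; none of these names come from bulk.
PALETTES = {
    2: ["#1f77b4", "#ff7f0e"],
    3: ["#1f77b4", "#ff7f0e", "#2ca02c"],
}
FALLBACK = "#7f7f7f"


def label_rows(texts, keywords):
    """Label each text with the first keyword it contains, or 'none'."""
    return [next((kw for kw in keywords if kw in t), "none") for t in texts]


def pick_palette(n_groups):
    """Naive lookup: raises KeyError when n_groups has no entry (e.g. 1)."""
    return PALETTES[n_groups]


def pick_palette_safe(n_groups):
    """Guarded lookup: repeat a neutral color for sizes with no palette."""
    return PALETTES.get(n_groups, [FALLBACK] * n_groups)


texts = ["a post about religion", "a post about medicine"]
labels = label_rows(texts, ["GPU"])  # no text contains "GPU"
n_groups = len(set(labels))          # only the "none" group remains
print(labels, pick_palette_safe(n_groups))
```

With a single group, `pick_palette` fails while `pick_palette_safe` still returns a usable list. Whether bulk's actual error follows this pattern would need to be confirmed from the traceback.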