bulk
Exception if a keyword is not found in any of the datapoints
I searched for "GPU" in some text bodies (just playing with 20newsgroups). Apparently it does not appear in any of the texts. The result is an index-not-found error (apparently while looking up a color, or something like that).
I may need to have a bit more context ... what is the specific thing you tried and what went wrong on the bulk-side?
python -m bulk text file.csv --keywords "cannot_find_me_in_any_sample"
On the bulk side I see an exception raised. And the browser is not happy (Server Error).
Just to double check, you're aware that the --keywords param needs two dashes? Just want to make sure that it's not that.
If it's not that, got a link to your dataset?
I fixed it in the command line above. In my console I used it right, with two dashes.
..
from sklearn.datasets import fetch_20newsgroups
..
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
Then I followed your videos and used UMAP:
X = model.encode(twenty_train.data)
umap = UMAP()
X_tfm = umap.fit_transform(X)
df = (
    pd.DataFrame(X_tfm, columns=['x', 'y'])
    .assign(
        category=twenty_train.target
    )
)
(
    df
    .drop(['category'], axis=1)
    .assign(
        text=twenty_train.data
    )
).to_csv("ready.csv", index=False)
Does the error have anything to do with the fact that "GPU" does not appear in your subset?
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)
print([ex for ex in twenty_train.data if "GPU" in ex])
This prints an empty list.
Yes, this is the issue. After that, something seems to access an invalid index in a list of colors, or something along those lines.
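Until this is handled on the bulk side, one possible workaround is to check the keywords against the texts before starting the server. The sketch below is not part of bulk; `missing_keywords` is a hypothetical helper, and it assumes plain substring matching (case-sensitive by default, like the `"GPU" in ex` check above), which may differ from how bulk matches keywords internally.

```python
from typing import Iterable, List


def missing_keywords(texts: Iterable[str], keywords: Iterable[str],
                     case_sensitive: bool = True) -> List[str]:
    """Return the keywords that occur in none of the texts (substring match)."""
    texts = list(texts)
    if not case_sensitive:
        texts = [t.lower() for t in texts]
    missing = []
    for kw in keywords:
        needle = kw if case_sensitive else kw.lower()
        # A keyword is "missing" if no text contains it as a substring.
        if not any(needle in t for t in texts):
            missing.append(kw)
    return missing


# Tiny stand-in for the "text" column of ready.csv (made-up sample data).
sample_texts = ["OpenGL rendering question", "Is there a cure for migraines?"]
print(missing_keywords(sample_texts, ["GPU", "OpenGL"]))  # ['GPU']
```

With the real file, the texts could be loaded with something like `pd.read_csv("ready.csv")["text"]`, and any keyword this reports as missing could be dropped from `--keywords` before launching.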