Code to cluster labels on new datasets?

Open kunaldahiya opened this issue 4 years ago • 7 comments

Hi

Thanks for releasing the code for your paper. I was trying out your code on a new dataset. Things are clear to me except the clustering part: can you please release the code to cluster your labels for a new dataset or describe the exact steps?

kunaldahiya • May 12 '21 12:05

The code for clustering labels is already in src/cluster.py. For example: python ./src/cluster.py --dataset amazon670k --id $1. For more information, please refer to #3 and #4.

kongds • May 12 '21 12:05

Got it. Thanks!

kunaldahiya • May 12 '21 12:05

Actually, the given clustering code is not running on any new dataset except for Wikipedia-500K and Amazon-670K. How should I run it on an entirely new dataset?

kunaldahiya • May 12 '21 12:05

You need TF-IDF features first. Please refer to https://github.com/kongds/LightXML/issues/3#issuecomment-763393334. The cluster.py we use is taken directly from AttentionXML.
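A rough sketch of what "TF-IDF first" could look like for a new dataset, assuming scikit-learn is available. The toy texts and labels, the variable names, and the way label vectors are aggregated (summing the TF-IDF vectors of each label's samples, then L2-normalising) are all illustrative assumptions. The input files and format that cluster.py actually expects are not described in this thread, so check the linked comment before relying on this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, normalize

# Hypothetical toy data; a real dataset would load its own texts and labels.
train_texts = ["red cotton shirt", "blue denim jeans", "red denim jacket"]
train_labels = [["shirt", "red"], ["jeans"], ["jacket", "red"]]

# Sparse TF-IDF features, shape (n_samples, n_terms).
X = TfidfVectorizer().fit_transform(train_texts)

# Sparse label indicator matrix, shape (n_samples, n_labels).
Y = MultiLabelBinarizer(sparse_output=True).fit_transform(train_labels)

# One vector per label (an assumed aggregation): the sum of the TF-IDF
# vectors of the samples tagged with that label, then L2-normalised.
label_features = normalize(Y.T @ X)

print(label_features.shape)  # (n_labels, n_terms)
```

Vectors like `label_features` are the kind of input a label clustering step can work on; both matrices stay sparse throughout.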

kongds • May 12 '21 13:05

I tried to run src/cluster.py on the amazon670k dataset, but it fails with: MemoryError: Unable to allocate 2.38 TiB for an array with shape (490449, 667317) and data type int64. My Python version is 3.6 and my numpy version is 1.18.5 (the same as in requirements.txt). Declaring an array with this shape directly gives the same error. May I ask how you dealt with this error during label clustering?
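The 2.38 TiB figure follows directly from the shape and dtype in the error message, as this small calculation shows. Only the numbers quoted in the error are used; the variable names are illustrative.

```python
# Memory needed for a dense int64 matrix with the shape from the error message.
n_samples, n_labels = 490449, 667317
bytes_per_entry = 8  # int64

total_bytes = n_samples * n_labels * bytes_per_entry
tib = total_bytes / 2**40  # bytes -> TiB

print(f"{tib:.2f} TiB")  # 2.38 TiB
```

Almost all of those entries are zeros, since each sample carries only a handful of labels, which is why a dense array is the wrong storage for this matrix.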

Btw, I wonder whether I can skip the label clustering step if the size of my label set is between 1k and 2k. It seems you only do this step on datasets with hundreds of thousands of labels (https://github.com/kongds/LightXML/issues/3#issuecomment-763340054).

Thanks!

royckchan • Oct 06 '21 07:10

Hello. The OOM error can be solved by the change in https://github.com/kongds/LightXML/commit/0a04646535053f24608bf3ca88bc631d18f4d91c: replace mlb = MultiLabelBinarizer() with mlb = MultiLabelBinarizer(sparse_output=True).
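A minimal sketch of the difference, assuming scikit-learn's MultiLabelBinarizer and a toy label list standing in for the real training labels (the data and variable names are hypothetical):

```python
from scipy.sparse import issparse
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists standing in for the real training labels (hypothetical data).
train_labels = [["a", "b"], ["c"], ["a", "c", "d"]]

# Default: a dense NumPy array with one cell per (sample, label) pair.
dense = MultiLabelBinarizer().fit_transform(train_labels)

# sparse_output=True: a SciPy sparse matrix that stores only the nonzero entries.
sparse = MultiLabelBinarizer(sparse_output=True).fit_transform(train_labels)

# Both hold the same indicator matrix; only the storage differs.
assert issparse(sparse)
assert (sparse.toarray() == dense).all()

print("dense cells:", dense.size, "| stored sparse entries:", sparse.nnz)
```

With the sparse output, memory grows with the number of (sample, label) pairs that actually occur rather than with n_samples × n_labels.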

For a label set of between 1k and 2k labels, I think the clustering step can be skipped.

kongds • Oct 06 '21 10:10

Thanks!

royckchan • Oct 07 '21 02:10