cnn-text-classification-tf

Support for multiclass, word embeddings, configuration file and new datasets

Open cahya-wirawan opened this issue 7 years ago • 4 comments

Hi,

I added the following functionality:

  • multiclass classification
  • pre-trained word embeddings using word2vec and GloVe
  • configuration file in YAML format
  • new dataset: 20 Newsgroups (loaded using sklearn.datasets)
  • loading a multiclass text dataset from a local directory

The path to the movie rating dataset has also been moved to the configuration file. Thanks.

cahya-wirawan avatar Mar 08 '17 14:03 cahya-wirawan

Hi @cahya-wirawan, thank you so much for adding multiclass classification. I still have issues when loading my own local data. Here is what I did:

1. Saved the text files in /data/bbcdata, with the categories as subfolder names. The bbcdata folder contains 5 subfolders with the corresponding txt files: "business", "entertainment", "politics", "sport", "tech".
2. Updated the config.yml file as follows:

line 16: default: localdata
line 52: container_path: "/data/bbcdata"

Did I miss something that is needed to run ./train.py? Could you help me with that? Thank you so much!

Aven

@cahya-wirawan The following is the error I get when using local data for multiclass classification. Could you help me with this? Thanks a lot!

Loading data...
Traceback (most recent call last):
  File "./train.py", line 72, in <module>
    random_state=cfg["datasets"][dataset_name]["random_state"])
  File "/home/xxliu10/repo/cahya_cnn/cnn-text-classification-tf/data_helpers.py", line 93, in get_datasets_localdata
    random_state=random_state)
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in load_files
    data = [d.decode(encoding, decode_error) for d in data]
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in <listcomp>
    data = [d.decode(encoding, decode_error) for d in data]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 257: invalid start byte

CMWENLIU avatar Jun 14 '17 20:06 CMWENLIU
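
A note on the traceback above: byte 0xa3 cannot start a UTF-8 sequence, which usually suggests the file was saved in a single-byte encoding such as Latin-1 or Windows-1252, where 0xa3 is the pound sign "£". The sketch below reproduces the same error using only the standard library. The sample string and the helper name `try_decode` are made up for illustration, since the actual file contents are not shown in this thread.

```python
# Hypothetical sample text; the real file contents from the thread are unknown.
# In Latin-1 the pound sign is stored as the single byte 0xa3.
sample = "Shares rose by £5".encode("latin-1")


def try_decode(raw: bytes, encoding: str, errors: str = "strict") -> str:
    """Decode raw bytes, returning a readable message instead of raising."""
    try:
        return raw.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        bad_byte = raw[exc.start]
        return f"failed: {exc.reason} (byte {bad_byte:#x} at position {exc.start})"


# Strict UTF-8 decoding fails on the 0xa3 byte, as in the traceback.
strict_utf8 = try_decode(sample, "utf-8")

# Decoding with the encoding the file was actually saved in recovers the text.
as_latin1 = try_decode(sample, "latin-1")

# Keeping UTF-8 but replacing undecodable bytes avoids the crash,
# at the cost of losing the original character.
lenient_utf8 = try_decode(sample, "utf-8", errors="replace")

print(strict_utf8)
print(as_latin1)
print(lenient_utf8)
```

The traceback shows that `load_files` passes `encoding` and `decode_error` straight to `bytes.decode`, so the same two choices (a different encoding, or a lenient error mode) should apply there. Whether `get_datasets_localdata` in data_helpers.py exposes those arguments is not confirmed in this thread.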

What is the expected training time, and how many steps are needed to get good accuracy?

usmaann avatar Nov 07 '18 06:11 usmaann

Hi @CMWENLIU, were you able to fix the UnicodeDecodeError above? I am facing the same issue.

usmaann avatar Nov 13 '18 00:11 usmaann

Can anybody give a solution to this problem?

usmaann avatar Nov 19 '18 02:11 usmaann
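
One possible way to approach the UnicodeDecodeError reported in this thread, offered as a sketch and not as a confirmed fix: find which files under the local data directory are not valid UTF-8, then rewrite them as UTF-8. The function names below are hypothetical, and `source_encoding="latin-1"` is an assumption based on the 0xa3 byte in the traceback (the pound sign in Latin-1 and Windows-1252); check a sample file before converting.

```python
from pathlib import Path


def find_non_utf8_files(container_path):
    """Return (path, byte_offset) pairs for .txt files that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(container_path).rglob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte
            bad.append((path, exc.start))
    return bad


def reencode_to_utf8(path, source_encoding="latin-1"):
    """Rewrite one file as UTF-8, assuming it was saved in source_encoding.

    This overwrites the file in place, so work on a copy of the data.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

For the layout described earlier in the thread, this would mean calling `find_non_utf8_files("/data/bbcdata")`, inspecting the reported files, and passing each one to `reencode_to_utf8`. An alternative that leaves the files untouched is to pass a different `encoding` or a lenient `decode_error` to scikit-learn's `load_files`, which the traceback shows forwarding both arguments to `bytes.decode`.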