cnn-text-classification-tf Support for multiclass, word embeddings, configuration file and new datasets

Hi,

I added the following functionality:

multiclass classification
pre-trained word embedding using word2vec and GloVe
configuration file in yaml format
new dataset 20newsgroup (loaded using sklearn.datasets)
loading multiclass text based dataset from local directory

The path to the movie rating dataset has also been moved to the configuration file. Thanks.

Mar 08 '17 14:03 cahya-wirawan
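
For readers wondering what layout local multiclass loading expects: the sketch below assumes one subfolder per category, which is the convention used by `sklearn.datasets.load_files`. The function name `read_labeled_dir` and the demo data are hypothetical and not part of the repository; this is only an illustration of the layout, not the project's loader.

```python
import os
import tempfile


def read_labeled_dir(container_path):
    """Return (raw_docs, labels, label_names) for a folder-per-category layout.

    Hypothetical helper for illustration only.
    """
    # Category names are the sorted subfolder names; labels are their indices.
    label_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    raw_docs, labels = [], []
    for label, name in enumerate(label_names):
        folder = os.path.join(container_path, name)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            if not os.path.isfile(path):
                continue
            # Read raw bytes; decoding is a separate step because the files
            # are not guaranteed to be UTF-8.
            with open(path, "rb") as handle:
                raw_docs.append(handle.read())
            labels.append(label)
    return raw_docs, labels, label_names


def demo():
    """Build a tiny two-category dataset in a temp folder and load it."""
    with tempfile.TemporaryDirectory() as root:
        for category, text in [("sport", "match report"), ("tech", "chip review")]:
            os.makedirs(os.path.join(root, category))
            file_path = os.path.join(root, category, "001.txt")
            with open(file_path, "w", encoding="utf-8") as handle:
                handle.write(text)
        return read_labeled_dir(root)
```

Each document gets the index of its subfolder as its label, so a folder with five category subfolders yields labels 0 through 4.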

Hi @cahya-wirawan, thank you so much for the multiclass classification functionality. I still have issues loading my own local data. Here is what I did:

1. Saved the text files in the folder /data/bbcdata, with the categories as subfolder names. The bbcdata folder contains 5 folders with the corresponding txt files: "business", "entertainment", "politics", "sport", "tech".
2. Updated the config.yml file as follows:

line 16: default: localdata
line 52: container_path: "/data/bbcdata"

Did I miss something needed to run ./train.py? Could you help me with that? Thank you so much!

Aven

@cahya-wirawan The following is the error I get when using local multi-class data. Could you help me with this? Thanks a lot!

Loading data...
Traceback (most recent call last):
  File "./train.py", line 72, in <module>
    random_state=cfg["datasets"][dataset_name]["random_state"])
  File "/home/xxliu10/repo/cahya_cnn/cnn-text-classification-tf/data_helpers.py", line 93, in get_datasets_localdata
    random_state=random_state)
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in load_files
    data = [d.decode(encoding, decode_error) for d in data]
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in <listcomp>
    data = [d.decode(encoding, decode_error) for d in data]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 257: invalid start byte

Jun 14 '17 20:06 CMWENLIU

What is the expected training time, and how many steps are needed to get good accuracy?

Nov 07 '18 06:11 usmaann

Hi, were you able to fix this issue? I am facing the same issue.

Nov 13 '18 00:11 usmaann

Can anybody give a solution to this problem?

Nov 19 '18 02:11 usmaann
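
A possible explanation for the UnicodeDecodeError in the traceback above: byte 0xa3 cannot start a UTF-8 sequence, but in Latin-1 (and Windows-1252) it encodes the pound sign "£". That the files are in one of those encodings is an assumption; the thread does not confirm it. Under that assumption, a minimal sketch of decoding with a fallback looks like this (the helper name `decode_with_fallback` is hypothetical):

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# A Latin-1 encoded sentence containing the pound sign (byte 0xa3).
sample = "Profits rose to £5m".encode("latin-1")
text = decode_with_fallback(sample)
```

`sklearn.datasets.load_files` itself accepts `encoding` and `decode_error` keyword arguments, so passing `encoding="latin-1"` or `decode_error="replace"` in the `load_files` call in data_helpers.py may be enough. Whether config.yml exposes these options is not shown in this thread.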
