Bugfix unicode compatibility

Open • wradstok opened this issue 5 years ago • 1 comment

Hi,

When operating on my data set I was getting the following error:

> Traceback (most recent call last):
>   File "codes/run.py", line 361, in <module>
>     main(parse_args())
>   File "codes/run.py", line 211, in main
>     train_triples = read_triple(os.path.join(args.data_path, 'train.txt'), entity2id, relation2id)
>   File "codes/run.py", line 127, in read_triple
>     triples.append((entity2id[h], relation2id[r], entity2id[t]))
> KeyError: 'Găgăuzia\xa0'

To fix the issue, I added calls to unicodedata.normalize() before loading triples so that unusual characters such as the non-breaking space (\xa0) in the key above are normalized. I also fixed two minor grammatical mistakes in error messages.
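
As a rough illustration of what such a normalization call does (a sketch only, not the actual diff: the NFKC form and the helper name `normalize_name` are assumptions, since the form used is not stated here):

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical helper: apply Unicode compatibility normalization.

    NFKC maps the no-break space U+00A0 to an ordinary space U+0020.
    It does not remove that space.
    """
    return unicodedata.normalize("NFKC", name)


raw = "Găgăuzia\xa0"            # the key from the KeyError above
normalized = normalize_name(raw)

print(repr(raw))                # still contains \xa0
print(repr(normalized))         # \xa0 replaced by a plain trailing space
```

Note that the trailing character becomes a plain space but is still there, so a dictionary lookup only succeeds if the keys in entity2id went through the same treatment.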

Jul 20 '20 18:07 wradstok

I mistakenly thought I was dealing with a Unicode issue because of the error I received. On closer investigation I realized that when the entities/relations are loaded, the entire line is stripped before it is split, so trailing spaces are removed because the names are at the end of the line. When loading triples, however, this only happens for the tail entities; head entities and relations keep any trailing whitespace. Fixed by mapping str.strip() over all components when loading triples.
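
A minimal sketch of that fix, assuming tab-separated `head<TAB>relation<TAB>tail` lines (the separator and the helper `parse_triple_line` are assumptions for illustration; only the `read_triple` signature and the `triples.append(...)` line come from the traceback):

```python
def parse_triple_line(line, sep="\t"):
    """Hypothetical helper: split one line into (head, relation, tail).

    str.strip is mapped over every field, so whitespace around the head and
    relation is removed too, not just the whitespace at the end of the line.
    In Python 3, str.strip() also removes the no-break space U+00A0.
    """
    h, r, t = map(str.strip, line.split(sep))
    return h, r, t


def read_triple(file_path, entity2id, relation2id):
    """Read triples from a file and map names to ids."""
    triples = []
    with open(file_path, encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            h, r, t = parse_triple_line(line)
            triples.append((entity2id[h], relation2id[r], entity2id[t]))
    return triples


# The head entity carries a trailing no-break space, as in the KeyError above.
print(parse_triple_line("Găgăuzia\xa0\tlocated_in\tMoldova\n"))
```

Stripping each field separately matters because `line.strip().split(sep)` only cleans the two ends of the line, which leaves whitespace next to the separators untouched.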

Jul 23 '20 08:07 wradstok