dgl-ke icon indicating copy to clipboard operation
dgl-ke copied to clipboard

Invalid separator for raw triplet data

Open ugocottin opened this issue 4 years ago • 4 comments

Using version 0.1.2 Following the documentation here,

The recommended delimiter includes \t, |, , and ;

When I give the following mapping to DGLKE using --raw_data flag:

0,"Beijing"
1,"China"
2,"Paris"
3,"France"
…

I get an error invalid literal for int() with base 10: '0,"Beijing"'

At line 71 of file utils.py, the function load_taw_triplet_data(…):

if emap_f is not None:
        eid_map = {}
        id2e_map = {}
        with open(emap_f, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            for row in reader:
                eid_map[row[1]] = int(row[0])
                id2e_map[int(row[0])] = row[1]

    if rmap_f is not None:
        rid_map = {}
        id2r_map = {}
        with open(rmap_f, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            for row in reader:
                rid_map[row[1]] = int(row[0])
                id2r_map[int(row[0])] = row[1]

Why the documentation say that we can use different separators, and only the \t character is allowed here ?

ugocottin avatar Apr 23 '21 12:04 ugocottin

emap_f and rmap_f is the entity id mapping file and relation id mapping file. Are you using raw_udd_{htr} ?

classicsong avatar Apr 24 '21 15:04 classicsong

Yes I used raw_udd_{hrt}.

I just misread the documentation, the recommended delimiters are for the knowledge graph files.

I just don't understand why the given example mapping for entities and relations are separated with , and the only allowed separator is \t

ugocottin avatar May 04 '21 11:05 ugocottin

Yes I used raw_udd_{hrt}.

I just misread the documentation, the recommended delimiters are for the knowledge graph files.

I just don't understand why the given example mapping for entities and relations are separated with , and the only allowed separator is \t

This should a bug in the raw_udd_ parser. Need to fix it.

classicsong avatar May 05 '21 07:05 classicsong

@classicsong I'm seeing this issue and I used the custom separator: |

example mapping file

$ head  entities.tsv 
0|XXXX
1|Ton
2|Convertible
..

Is this a matter of adding the separator argument to the signature and passing it along?

Places where this proposal would change the code:

add in entrypoint

  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/infer_emb_sim.py#L27 (add to arg parser) and pass it as an arg to necessary load_* methods in the infer_emb_sim.py module main function

  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/infer_score.py#L27

  • and pass along (where necessary to load_* methods in main.

add in signatures

  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L60
  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L150
  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L161

pass separator values

  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L65
  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L74
  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L154
  • https://github.com/awslabs/dgl-ke/blob/master/python/dglke/utils.py#L166

@classicsong if you can give an indication of whether this general approach looks correct, I can look into making a change for the separator for these entrypoints

lroberts7 avatar Apr 13 '23 14:04 lroberts7