ELMoForManyLangs

How to get the embedding for each word in the sentence?

Open ghost opened this issue 6 years ago • 3 comments

Hi,

I am struggling to get the embedding for individual words. I used this command:

python -m elmoformanylangs test --input_format conll --input input.conllu --model ar.model --output_prefix ./output/ --output_format hdf5 --output_layer -1

And it dumps an HDF5-encoded file to disk, as documented. However, as far as I understand, the file encodes a dict where the key is the tab-separated sentence and the value is its representation.

But when I print the key:


import h5py

# Open the HDF5 file written by the command above (read-only) and list its keys.
f = h5py.File(filename, 'r')

for key in list(f.keys()):
    print(key)

I can see that f.keys() contains only a single string key covering all sentences in the input file. 1) Why, and how do I get the representation of an individual sentence? 2) How do I get the representation of an individual word?

This is an example of my input with two sentences:

1	ik	ik	PRON	VNW|pers|pron|nomin|vol|1|ev	Case=Nom|Person=1|PronType=Prs	2	nsubj	2:nsubj	_
2	zie	zien	VERB	WW|pv|tgw|ev	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	hem	hem	PRON	VNW|pers|pron|obl|vol|3|ev|masc	Case=Acc|Person=3|PronType=Prs	2	obj	2:obj|4:nsubj:xsubj	_
4	fietsen	fietsen	VERB	WW|inf|vrij|zonder	VerbForm=Inf	2	xcomp	2:xcomp	_
1	Jan	Jan	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	nsubj	2:nsubj	_
2	komt	komen	VERB	WW|pv|tgw|met-t	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	vandaag	vandaag	ADV	BW	_	2	advmod	2:advmod	_
4	en	en	CCONJ	VG|neven	_	5	cc	5.1:cc	_
5	Piet	Piet	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	conj	5.1:nsubj	_ 

ghost avatar Jan 23 '19 15:01 ghost

Hi, the value of f[key] is a numpy array of shape (seq_len, dim). (If you use the recent patch and output all the layers, it will be (n_layer, seq_len, dim).) You can get the embedding for each word by using numpy.split along the seq_len dimension.
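A minimal sketch of that splitting step, under two assumptions: the key is the tab-separated sentence (as described in the question), and the value has one of the two shapes mentioned above. The helper names `word_vectors` and `load_word_vectors` are hypothetical and not part of ELMoForManyLangs.

```python
import numpy as np


def word_vectors(key, value, layer=-1):
    """Pair each token of a tab-separated key with its embedding vector.

    value is assumed to have shape (seq_len, dim), or (n_layer, seq_len, dim)
    when all layers were written; in the latter case one layer is selected.
    """
    tokens = key.split("\t")
    arr = np.asarray(value)
    if arr.ndim == 3:
        arr = arr[layer]  # keep a single layer: (seq_len, dim)
    if arr.shape[0] != len(tokens):
        raise ValueError(
            f"{len(tokens)} tokens in key but {arr.shape[0]} rows in value"
        )
    # numpy.split along the seq_len axis gives one (1, dim) piece per word.
    pieces = np.split(arr, len(tokens), axis=0)
    return [(token, piece[0]) for token, piece in zip(tokens, pieces)]


def load_word_vectors(path):
    """Read every sentence in the HDF5 file and split it into word vectors."""
    import h5py  # imported here so word_vectors() works without h5py

    with h5py.File(path, "r") as f:
        # f[key][()] reads the whole dataset into a numpy array.
        return {key: word_vectors(key, f[key][()]) for key in f.keys()}
```

The length check is there on purpose: if the file holds a single key covering the whole input, the mismatch (or an unexpectedly long token list) shows up immediately instead of producing misaligned word vectors.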

Oneplus avatar Jan 24 '19 09:01 Oneplus

So the sentence embedding is not averaged, I understand now. However, in f[key], the key should be the sentence itself, right?

Another problem I mentioned in my issue concerns the input format. I suspect I am doing something wrong, because when I print the length of f.keys() it returns 1 even though my input contains more than one sentence. So this loop executes only once and treats all my sentences as a single one.

for key in list(f.keys()):
    print(key)

Am I doing something wrong?

ghost avatar Jan 24 '19 23:01 ghost

So the sentence embedding is not averaged, I understand now. However, in f[key], the key should be the sentence itself, right?

Yes, the key should be the sentence itself.

Am I doing something wrong?

Please check whether your input file follows the conll format (https://github.com/HIT-SCIR/ELMoForManyLangs#use-elmoformanylangs-in-command-line) and specify the input format as conll.
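One way to run that check, as a rough sketch: CoNLL-U separates sentences with a blank line, so a file whose token IDs restart at 1 without a blank line in between would be read as one long sentence by any reader that splits on blank lines (assumed here to include this tool's reader). The function names below are hypothetical.

```python
def split_conll_sentences(text):
    """Split CoNLL-style text into sentences on blank lines.

    Returns a list of sentences, each a list of word forms (column 2).
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a token line
        current.append(fields[1])
    if current:
        sentences.append(current)
    return sentences


def find_missing_breaks(text):
    """Return 1-based line numbers where the token ID restarts at 1
    without a blank line before it (a likely missing sentence break)."""
    suspects = []
    previous_blank = True  # the start of the file counts as a boundary
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            previous_blank = True
            continue
        if stripped.startswith("#"):
            continue
        if stripped.split("\t")[0] == "1" and not previous_blank:
            suspects.append(number)
        previous_blank = False
    return suspects
```

If `find_missing_breaks` reports any line numbers, inserting a blank line before each of them should make the number of sentences, and therefore the number of keys, match the input.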

Oneplus avatar Jan 25 '19 00:01 Oneplus