AdaGram.jl

Dirichlet process gone bad: stick is broken in wrong place

Open rafis opened this issue 7 years ago • 3 comments

I trained a model on the text8 corpus with the following config. (Please note that this example sometimes works and gives accurate results with other configs.)

./run.sh train.jl --epochs 5 --alpha 0.05 --prototypes 10 --min-freq 20 --remove-top-k 70 --window 5 text8 text8.dic text8.model

When I inspect the word "apple", I first check the number of senses (meanings):

julia> expected_pi(vm, dict.word2id["apple"])
10-element Array{Float64,1}:
 0.197259
 0.216447
 0.58626
 3.24536e-5
 1.54719e-6
 7.37607e-8
 3.51647e-9
 1.67644e-10
 7.99224e-12
 4.00096e-13
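
One way to read the number of senses in use off this vector is to threshold it. This is a minimal sketch in current Julia syntax, with the probabilities copied from the output above; the cutoff `min_prob` is a hypothetical value chosen for illustration, not an AdaGram parameter.

```julia
# Sense probabilities for "apple", copied from the expected_pi output above.
probs = [0.197259, 0.216447, 0.58626, 3.24536e-5, 1.54719e-6,
         7.37607e-8, 3.51647e-9, 1.67644e-10, 7.99224e-12, 4.00096e-13]

# Hypothetical cutoff (an assumption): a sense counts as "in use"
# if its expected probability exceeds this value.
min_prob = 1e-3

# Indices of the senses whose probability is above the cutoff.
active = findall(p -> p > min_prob, probs)

println("active senses: ", active)
println("probability mass on active senses: ", sum(probs[active]))
```

With these numbers only the first three slots pass the cutoff, and they carry almost all of the probability mass.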

We have 3 senses and 7 free slots, which is nothing unusual. Then I ask it to describe each sense:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",2,0.6276491f0)
 ("intel",2,0.5980226f0)
 ("ibm",2,0.59220535f0)
 ("compaq",1,0.5730073f0)
 ("inc",2,0.572671f0)
 ("store",2,0.56161773f0)
 ("raskin",1,0.56127656f0)
 ("corp",1,0.55665475f0)
 ("ceo",1,0.54154074f0)
 ("ceo",2,0.54141444f0)

julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("apples",1,0.76360685f0)
 ("sweet",1,0.70247304f0)
 ("juice",1,0.6916403f0)
 ("cakes",1,0.6847711f0)
 ("fermented",1,0.681853f0)
 ("olive",1,0.6792287f0)
 ("fruit",1,0.6718393f0)
 ("peas",1,0.6700381f0)
 ("berries",1,0.66832954f0)
 ("roasted",1,0.66814494f0)

julia> nearest_neighbors(vm, dict, "apple", 3, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",1,0.9284175f0)
 ("computers",1,0.8870821f0)
 ("pc",1,0.88180965f0)
 ("compatible",1,0.8577318f0)
 ("amiga",1,0.83944887f0)
 ("ibm",1,0.8265453f0)
 ("desktop",1,0.8234609f0)
 ("portable",1,0.81334895f0)
 ("pcs",1,0.8022719f0)
 ("dos",1,0.8022494f0)

As you can see, the first and the third senses are actually the same. Why did AdaGram break it into 2 different senses?

rafis avatar Mar 10 '18 17:03 rafis

Those are two quite different senses, aren't they? Apple Inc. (the company) vs. Apple computers (the product). (Although 'ibm' appears in the nearest neighbour list for both senses, I think those also differ: one relates to IBM the company and the other to IBM PCs.)

When this "worked" for you, what senses did you get?

rversteegen avatar Mar 11 '18 09:03 rversteegen

Oh, and I see two different senses of 'macintosh' also appear in the nearest neighbour lists. The model seems to have mistakenly split 'macintosh' into two senses (in addition to Macintosh apples).
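
The overlap between the two lists can be checked directly. This is a rough sketch in current Julia syntax, using only the (word, sense) pairs copied from the outputs for senses 1 and 3 of "apple" above. It compares the lists once by word alone and once by word together with the neighbour's sense index; the variable names are made up for illustration.

```julia
# Neighbours of "apple" sense 1 and sense 3 as (word, sense) pairs,
# copied from the nearest_neighbors output in the issue (scores omitted).
sense1 = [("macintosh", 2), ("intel", 2), ("ibm", 2), ("compaq", 1), ("inc", 2),
          ("store", 2), ("raskin", 1), ("corp", 1), ("ceo", 1), ("ceo", 2)]
sense3 = [("macintosh", 1), ("computers", 1), ("pc", 1), ("compatible", 1), ("amiga", 1),
          ("ibm", 1), ("desktop", 1), ("portable", 1), ("pcs", 1), ("dos", 1)]

# Overlap when only the word is compared.
shared_words = sort(collect(intersect(Set(first.(sense1)), Set(first.(sense3)))))

# Overlap when the neighbour's sense index has to match as well.
shared_pairs = intersect(Set(sense1), Set(sense3))

println("shared words: ", shared_words)
println("shared (word, sense) pairs: ", length(shared_pairs))
```

By word, the lists share only "ibm" and "macintosh", and in both cases the sense index differs, so no (word, sense) pair is common to the two lists.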

rversteegen avatar Mar 11 '18 09:03 rversteegen

I have seen this behavior before as well, and was wondering if my corpus is not large enough or something else is wrong. Actually, sometimes I find that two senses of a word are near enough that they appear in each other's nearest neighbors list.

glicerico avatar May 25 '18 04:05 glicerico