AdaGram.jl
Dirichlet process gone bad: stick is broken in wrong place
I trained a model on the text8 corpus with the following config. (Please note that this example sometimes works and gives accurate results with other configs.)
./run.sh train.jl --epochs 5 --alpha 0.05 --prototypes 10 --min-freq 20 --remove-top-k 70 --window 5 text8 text8.dic text8.model
When I check the word "apple", I first look at the number of senses (meanings):
julia> expected_pi(vm, dict.word2id["apple"])
10-element Array{Float64,1}:
0.197259
0.216447
0.58626
3.24536e-5
1.54719e-6
7.37607e-8
3.51647e-9
1.67644e-10
7.99224e-12
4.00096e-13
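This is how I read that vector: each entry is the expected prior probability of one sense prototype, and entries close to zero are prototypes that were never really used. A minimal sketch of the counting, with the values copied from the output above. The cutoff `min_prob = 1e-3` and the variable names are my own assumptions, not something the model reported:

```julia
# Values copied from the expected_pi output above.
pi_apple = [0.197259, 0.216447, 0.58626, 3.24536e-5, 1.54719e-6,
            7.37607e-8, 3.51647e-9, 1.67644e-10, 7.99224e-12, 4.00096e-13]

# Assumed cutoff: prototypes with less prior mass are treated as unused.
min_prob = 1e-3

# Indices of the prototypes that carry non-negligible probability.
used_senses = filter(k -> pi_apple[k] > min_prob, 1:length(pi_apple))

n_used = length(used_senses)             # senses actually in use
n_free = length(pi_apple) - n_used       # remaining free slots
used_mass = sum(pi_apple[used_senses])   # probability mass they cover

println("used senses: ", used_senses, ", free slots: ", n_free)
```

With this cutoff the first three prototypes are the used ones and they cover nearly all of the probability mass.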
We have 3 senses and 7 free slots, nothing unusual. Then I ask it to describe each sense:
julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
("macintosh",2,0.6276491f0)
("intel",2,0.5980226f0)
("ibm",2,0.59220535f0)
("compaq",1,0.5730073f0)
("inc",2,0.572671f0)
("store",2,0.56161773f0)
("raskin",1,0.56127656f0)
("corp",1,0.55665475f0)
("ceo",1,0.54154074f0)
("ceo",2,0.54141444f0)
julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
("apples",1,0.76360685f0)
("sweet",1,0.70247304f0)
("juice",1,0.6916403f0)
("cakes",1,0.6847711f0)
("fermented",1,0.681853f0)
("olive",1,0.6792287f0)
("fruit",1,0.6718393f0)
("peas",1,0.6700381f0)
("berries",1,0.66832954f0)
("roasted",1,0.66814494f0)
julia> nearest_neighbors(vm, dict, "apple", 3, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
("macintosh",1,0.9284175f0)
("computers",1,0.8870821f0)
("pc",1,0.88180965f0)
("compatible",1,0.8577318f0)
("amiga",1,0.83944887f0)
("ibm",1,0.8265453f0)
("desktop",1,0.8234609f0)
("portable",1,0.81334895f0)
("pcs",1,0.8022719f0)
("dos",1,0.8022494f0)
As you can see, the first and the third senses are actually the same. Why did AdaGram break it into 2 different senses?
Those are two quite different senses, aren't they? Apple Inc (the company) vs Apple computers (the product). (Although 'ibm' appears in the nearest neighbour list for both senses, I think those also differ by being related to IBM the company and IBM PCs)
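For what it's worth, the overlap between the two lists can be checked directly. A rough sketch, with the (word, sense) pairs copied from the output in the original post; the variable and function names are mine, and this only compares the printed top-10 lists, nothing inside the model:

```julia
# (word, sense) pairs copied from the neighbour lists of "apple" senses 1 and 3.
sense1 = [("macintosh", 2), ("intel", 2), ("ibm", 2), ("compaq", 1), ("inc", 2),
          ("store", 2), ("raskin", 1), ("corp", 1), ("ceo", 1), ("ceo", 2)]
sense3 = [("macintosh", 1), ("computers", 1), ("pc", 1), ("compatible", 1),
          ("amiga", 1), ("ibm", 1), ("desktop", 1), ("portable", 1),
          ("pcs", 1), ("dos", 1)]

# Distinct words in a neighbour list, ignoring which sense of the word matched.
words(nn) = unique([w for (w, s) in nn])

# Words that appear in both lists, regardless of sense.
shared_words = intersect(words(sense1), words(sense3))

# Exact (word, sense) pairs that appear in both lists.
shared_pairs = intersect(sense1, sense3)

println("shared words: ", shared_words)
println("shared (word, sense) pairs: ", length(shared_pairs))
```

Only 'macintosh' and 'ibm' are shared as words, and no (word, sense) pair is shared at all: each list picks a different sense of those two words.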
When this "worked" for you, what senses did you get?
Oh, and I see two different senses of 'macintosh' also appear in the nearest neighbour lists. The model seems to be mistakenly splitting 'macintosh' into two senses (in addition to Macintosh apples).
I have seen this behavior before as well, and was wondering if my corpus is not large enough or something else is wrong. Actually, sometimes I find that two senses of a word are near enough that they appear in each other's nearest neighbors list.
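One way to put a number on "near enough" is the cosine similarity between the two sense vectors. A minimal sketch, assuming the two vectors have already been pulled out of the model as plain arrays (how to do that is not shown here). The toy vectors and the `0.9` threshold are made up purely for illustration:

```julia
# Cosine similarity between two vectors, written without extra packages.
cosine(a, b) = sum(a .* b) / sqrt(sum(abs2, a) * sum(abs2, b))

# Toy stand-ins for two sense vectors of the same word (NOT real model output).
sense_a = [1.0, 0.0]
sense_b = [1.0, 1.0]

# Assumed threshold above which two senses would be flagged as near-duplicates.
dup_threshold = 0.9

sim = cosine(sense_a, sense_b)
is_near_duplicate = sim > dup_threshold

println("cosine similarity: ", sim, ", near-duplicate: ", is_near_duplicate)
```

A high similarity between two senses of one word would suggest a redundant split, though a sensible threshold probably depends on the corpus and the training settings.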