Embeddings.jl

Basic example in readme fails (Word2Vec download 404s)

Open SebastianCallh opened this issue 3 years ago • 4 comments

Running the example in the readme

using Embeddings
const embtable = load_embeddings(Word2Vec) # or load_embeddings(FastText_Text) or ...

fails with

ERROR: HTTP.ExceptionRequest.StatusError(404, "GET", "/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
x-amz-request-id: 7CJ4RS3EZ3VHMSR4
x-amz-id-2: JQ2JTqHhFeLJ7JtP5pJM+AzcR3Kq8kKB4Hy5Tars31NaRlk3Xo++mRiLVYHArclGUSZQm5Ztv/o=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 27 Oct 2022 15:01:28 GMT
Server: AmazonS3

""")

Are the word2vec embeddings available elsewhere? Otherwise this should probably be addressed in the readme.

Oct 27 '22 15:10 SebastianCallh

Good question. I suspect they must be available somewhere else; they are still very widely used, even though they are old now.

Oct 28 '22 17:10 oxinabox

I just added the weights to Hugging Face: https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/blob/main/GoogleNews-vectors-negative300.bin.gz

Jan 03 '23 21:01 logankilpatrick

Thanks for opening this issue and for the replies so far. I copied the new URL and inserted it at this line: https://github.com/JuliaText/Embeddings.jl/blob/306c04bead62b32873dedbc2609c74c4ca34306b/src/word2vec.jl#L17

Unfortunately, when I try load_embeddings(Word2Vec), I get the following error message:

7-Zip (a) [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=en_GB.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz (A0652),ASM,AES-NI)

Scanning the drive for archives:
1 file, 36239 bytes (36 KiB)                        

Extracting archive: /home/nikos/.julia/datadeps/word2vec 300d/GoogleNews-vectors-negative300.bin.gz
ERROR: /home/nikos/.julia/datadeps/word2vec 300d/GoogleNews-vectors-negative300.bin.gz
/home/nikos/.julia/datadeps/word2vec 300d/GoogleNews-vectors-negative300.bin.gz
Open ERROR: Can not open the file as [gzip] archive


ERRORS:
Is not archive
    
Can't open as archive: 1
Files: 0
Size:       0
Compressed: 0

I downloaded the file manually from the new URL and this worked. Once downloaded, I opened the file with Archive Manager in Ubuntu, and this worked too.

Feb 06 '23 11:02 ngiann
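
One detail in the log above may be worth checking: 7-Zip reports the fetched archive as 36239 bytes (36 KiB), which is far too small for the GoogleNews vectors. So the automatic download may have saved something other than the gzip file, for instance an HTML page. This is a guess, not something confirmed in the thread. A minimal sketch for testing it, using only the Julia standard library; `is_gzip` is a hypothetical helper and is not part of Embeddings.jl or DataDeps.jl:

```julia
# Hypothetical helper: report whether a file starts with the gzip magic bytes.
# Every gzip stream begins with the two bytes 0x1f 0x8b (RFC 1952).
function is_gzip(path::AbstractString)
    open(path, "r") do io
        magic = read(io, 2)  # reads at most 2 bytes; shorter files give fewer
        return magic == UInt8[0x1f, 0x8b]
    end
end

# Path taken from the 7-Zip output above; adjust it for your own machine.
path = joinpath(homedir(), ".julia", "datadeps", "word2vec 300d",
                "GoogleNews-vectors-negative300.bin.gz")
if isfile(path)
    println("size: ", filesize(path), " bytes, gzip: ", is_gzip(path))
end
```

If this prints `gzip: false` for the file that DataDeps fetched but `gzip: true` for the manually downloaded copy, the problem lies in what was downloaded rather than in 7-Zip.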

hmm that's weird, 7zip is normally very reliable

Feb 28 '23 16:02 oxinabox
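
7-Zip may well be behaving correctly here. On Hugging Face, a `/blob/` URL like the one posted earlier in this thread points to the web page for a file, while the matching `/resolve/` URL serves the file contents. If the `/blob/` URL was the one pasted into src/word2vec.jl, the automatic download would have saved an HTML page, which would fit the 36 KiB size and the "Is not archive" error. This is an assumption and has not been verified against the package. A sketch of the rewrite; `to_resolve_url` is a hypothetical helper:

```julia
# Hypothetical helper: turn a Hugging Face "blob" (viewer page) URL into the
# corresponding "resolve" (direct download) URL. Only the first "/blob/"
# segment is replaced; URLs without it are returned unchanged.
function to_resolve_url(url::AbstractString)
    return replace(url, "/blob/" => "/resolve/"; count=1)
end

blob_url = "https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/blob/main/GoogleNews-vectors-negative300.bin.gz"
println(to_resolve_url(blob_url))
```

The printed URL is the candidate to try in place of the one inserted at src/word2vec.jl#L17.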