Word2Vec.jl

Reading a binary file throws an error as reading from unicode is not handled.

Open jayend-manika opened this issue 8 years ago • 5 comments

When asked to read a binary file that contains Unicode words, the loader fails with ERROR: UnicodeError: invalid character index.

To reproduce, load the test file from Google: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

jayend-manika Feb 25 '17 18:02

The Python version has an encoding attribute. It may not be exposed here; this needs checking.

sambitdash Aug 22 '17 01:08

I think there is a different reason for this. The original Google files seem to have a slightly different format, and the parser for the binary file reads one byte too far.

Removing the line read(f, UInt8) # new line solves the issue (but presumably files created with this package can then no longer be loaded).

I solved it by adding a loading option :google, alongside the existing :text and :binary, in which this read is removed.
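To make the one-byte difference concrete, here is a minimal sketch in Python (used only for illustration; it is not the package's Julia code, and it is not the :google option from the PR). It assumes, following the comment above, that the two layouts differ only in whether a newline byte follows each vector. The helper names write_vectors and read_vectors are hypothetical. Instead of unconditionally consuming one byte after each vector, the reader skips a newline only if one is actually present, so it accepts both layouts.

```python
import io
import struct


def write_vectors(words, vectors, newline_after_vector):
    """Serialize words and vectors in a word2vec-style binary layout.

    Hypothetical helper for this sketch. The flag selects whether a
    newline byte is written after each vector (assumed to be the only
    difference between the two layouts).
    """
    dim = len(vectors[0])
    out = io.BytesIO()
    out.write(f"{len(words)} {dim}\n".encode("utf-8"))  # header: vocab size and dimension
    for word, vec in zip(words, vectors):
        out.write(word.encode("utf-8") + b" ")          # word, terminated by a space
        out.write(struct.pack(f"<{dim}f", *vec))        # dim little-endian float32 values
        if newline_after_vector:
            out.write(b"\n")
    return out.getvalue()


def read_vectors(data):
    """Read either layout by skipping an optional newline before each word."""
    f = io.BytesIO(data)
    vocab_size, dim = (int(x) for x in f.readline().split())
    result = {}
    for _ in range(vocab_size):
        raw = bytearray()
        while True:
            ch = f.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # ignore a separator left over from the previous entry
                raw += ch
        # Decode only after the whole word is collected, so multi-byte
        # UTF-8 characters are never split.
        word = raw.decode("utf-8")
        result[word] = struct.unpack(f"<{dim}f", f.read(4 * dim))
    return result


words = ["café", "日本語"]
vectors = [(0.5, -1.25), (2.0, 0.0)]
for flag in (True, False):
    loaded = read_vectors(write_vectors(words, vectors, flag))
    print(flag, list(loaded))
```

A reader that always discards one byte after each vector would, on a file without the newline, swallow the first byte of the next word. Every later word and vector is then misaligned, which would be consistent with the Unicode error reported above.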

Paethon Jun 07 '18 12:06

PR #8 fixes this

Paethon Sep 18 '18 09:09

Would you please post the code for what you are describing? Honestly, I got confused.

alabrashJr Mar 18 '19 15:03

So I did the implementation myself, and I am sharing it with you:

https://gist.github.com/alabrashJr/d71cf74bc9713bb0a5bb12ccd331a405

alabrashJr Mar 21 '19 13:03