wordvectors

fasttext file format seems wrong

Open adodge opened this issue 6 years ago • 2 comments

Thank you very much for this project. It seems very useful.

I don't seem to be able to use the fasttext files, at least not the Russian or Turkish ones. When attempting to load them with fasttext, I get this error:

$ fasttext print-word-vectors ru.bin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  ru.bin has wrong file format!
Aborted

On closer inspection, the files are missing the fasttext magic number in their header. Fasttext binary files are expected to start with 0x2F4F16BA, and this one doesn't.
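For anyone who wants to repeat the check, here is a minimal sketch that reads the first four bytes and compares them with the magic number quoted above. The function names are made up for illustration, and it assumes the value is stored as a little-endian 32-bit integer, which is what you would typically see on x86.

```python
import struct

# Magic number quoted above. Assumption: stored as a little-endian int32.
FASTTEXT_MAGIC = 0x2F4F16BA


def has_fasttext_magic(header: bytes) -> bool:
    """Return True if the first four bytes decode to the fastText magic number."""
    if len(header) < 4:
        return False  # too short to contain the magic number at all
    (value,) = struct.unpack("<i", header[:4])
    return value == FASTTEXT_MAGIC


def file_has_fasttext_magic(path: str) -> bool:
    """Hypothetical convenience wrapper: check the first four bytes of a file."""
    with open(path, "rb") as f:
        return has_fasttext_magic(f.read(4))
```

Calling `file_has_fasttext_magic("ru.bin")` on one of the affected files should return False if the header really is missing.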

Were they created by some other software, or perhaps an older version of fasttext that had a different file format?

Thank you.

adodge, Mar 01 '18 18:03

I did a little poking around in the fasttext history, and, yes, they had a different file format a year ago.

  • There's no magic number or version at the top of the file.
  • There's no "pruneidx_size" value in the header for the dictionary object.
  • There's no "quant" boolean before each of the two matrix objects.

This is a script that will convert one of the old fasttext files to something the current version can read:

fasttext_file_update.py.txt
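As a rough illustration of the three differences listed above (this is not the attached script), the update amounts to inserting a few fields into the old byte stream. The sketch below takes the three offsets as parameters, because finding them requires parsing the args and dictionary sections first. The function name, the version value 12, the int64 width and the -1 value for "pruneidx_size", and the one-byte false "quant" flags are all assumptions on my part, so check them against the fasttext source before relying on them.

```python
import struct

MAGIC = 0x2F4F16BA    # magic number quoted earlier in this thread
ASSUMED_VERSION = 12  # assumption: version written right after the magic number


def insert_new_fields(old: bytes,
                      dict_header_end: int,
                      input_matrix_start: int,
                      output_matrix_start: int,
                      version: int = ASSUMED_VERSION) -> bytes:
    """Return old-format bytes with the three missing pieces inserted.

    The three offsets index into the OLD file and must be found by the caller:
      dict_header_end     - end of the dictionary's fixed-size header
      input_matrix_start  - first byte of the input matrix
      output_matrix_start - first byte of the output matrix
    """
    if not 0 <= dict_header_end <= input_matrix_start <= output_matrix_start <= len(old):
        raise ValueError("offsets must be non-decreasing and lie inside the file")
    return b"".join([
        struct.pack("<ii", MAGIC, version),           # 1. magic number and version
        old[:dict_header_end],
        struct.pack("<q", -1),                        # 2. pruneidx_size (assumed -1 = not pruned)
        old[dict_header_end:input_matrix_start],
        struct.pack("<?", False),                     # 3a. quant flag before the input matrix
        old[input_matrix_start:output_matrix_start],
        struct.pack("<?", False),                     # 3b. quant flag before the output matrix
        old[output_matrix_start:],
    ])
```

The result is always 18 bytes longer than the input: 8 for the magic number and version, 8 for "pruneidx_size", and 1 for each "quant" flag.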

$ echo merhaba | fasttext print-word-vectors tr.bin2
merhaba 0.12206 0.066014 0.093112 -0.043492 0.5207 0.057019 0.20127 0.20933 0.057977 -0.29209 0.087561 0.05825 0.50264 -0.17409 0.19332 -0.08724 0.35125 0.045985 0.21882 0.1872 0.16603 0.21172 0.17046 0.062976 -0.022134 -0.50327 -0.064927 0.1336 0.10681 -0.1902 0.030359 -0.075208 -0.19389 0.40742 0.078176 0.11845 -0.057126 0.52497 0.11417 0.36205 -0.055332 -0.2492 0.46497 0.72146 0.42214 0.082853 0.035755 -0.1644 -0.23566 0.1037 -0.079192 0.15678 -0.14464 -0.023746 0.11418 0.21951 -0.20679 -0.11682 -0.020332 -0.07834 0.27913 -0.59613 -0.15867 0.15623 0.066335 0.078509 -0.0045359 -0.15227 -0.025417 -0.14899 -0.25298 0.2158 -0.26728 0.071114 -0.86768 -0.39044 -0.36575 0.053666 0.38771 0.3328 0.085293 -0.12563 0.13022 -0.21437 0.31115 0.013396 0.02462 -0.25962 -0.51704 -0.55816 0.43276 0.25894 -0.55603 0.3785 -0.13968 0.0031102 0.23232 0.11755 0.17286 -0.14933 0.19528 0.36565 -0.19717 0.066704 -0.20812 -0.32329 -0.09979 -0.34596 0.12763 -0.26259 -0.13747 -0.056275 0.47636 -0.068787 0.05284 -0.16213 -0.57922 -0.15148 0.31464 0.23883 -0.43305 0.21852 -0.082744 0.26875 -0.28505 -0.379 -0.24597 -0.11538 0.22466 -0.17107 0.047522 0.31911 0.15056 0.21347 0.16531 -0.078537 0.14234 0.090975 -0.4294 0.067041 0.085503 0.41908 0.18248 0.18221 0.10699 -0.21135 0.1343 -0.05573 -0.16256 -0.39946 0.086395 -0.030858 -0.66857 0.58846 0.17388 0.56812 -0.088791 -0.024312 -0.054497 -0.075219 -0.0048822 -0.17311 0.070715 0.080788 0.14496 0.45174 0.071725 -0.14704 0.56277 0.058342 0.67329 0.22379 -0.13657 -0.11677 0.31955 0.21028 -0.24803 -0.34743 0.0019436 0.26037 0.49244 0.2648 -0.07083 -0.26863 -0.24654 -0.025958 -0.27783 -0.045067 -0.068344 0.16087 0.11595 -0.044365 0.029121 0.12629 0.28304 0.23161 -0.17879 -0.092399 -0.38922 -0.24235

adodge, Mar 01 '18 22:03

Somehow it does not work for me either.


Traceback (most recent call last):
  File "fast_convert.py", line 57, in <module>
    m,n = struct.unpack("@qq", M[offset:offset+span])
struct.error: unpack requires a string argument of length 16

yaziciemre, Jan 31 '19 12:01
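A note on the traceback above: that struct.error means the slice passed to struct.unpack was shorter than the 16 bytes that two 64-bit integers need, so the parser had reached or passed the end of the data before it got to the matrix header. One possible cause (a guess, not something confirmed in this thread) is that the input file is not in the old layout the script expects, so the offset drifts. A minimal sketch of a bounds-checked read that reports where things went wrong is below. The helper name is hypothetical, and it uses an explicit little-endian "<qq" rather than the script's native "@qq".

```python
import struct


def read_matrix_shape(buf: bytes, offset: int):
    """Read two int64 values (rows, cols) at `offset`, failing with a clear message."""
    fmt = "<qq"                   # assumption: little-endian, two 8-byte integers
    span = struct.calcsize(fmt)   # 16 bytes
    chunk = buf[offset:offset + span]
    if len(chunk) != span:
        raise ValueError(
            f"need {span} bytes at offset {offset}, but only {len(chunk)} "
            f"are available (buffer is {len(buf)} bytes long)"
        )
    return struct.unpack(fmt, chunk)
```

If the reported offset is at or beyond the file size, the earlier parsing steps are the place to look, not the matrix read itself.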