flyvec
input data format
I am confused about the input data format, i.e., encodings.npy and offsets.npy.
Is each element in encodings.npy a one-hot vector?
Could you provide a detailed example of them?
Until I can get around to posting an example about the training, let me just clarify things below:
Say you start with a single txt file where each line is a phrase you want to use to train the model. Each phrase is disjoint and unrelated, so you don't want the sliding window to learn to associate words at the end of one phrase with the beginning of another.
You also have a tokenizer that can break each line up into an array of ints (specifically, np.int32s).
encodings.npy is the tokenized version of the entire text file concatenated into a single array. However, this array discards information about disjoint phrases. offsets.npy is a (much smaller) array recording the token index where each new phrase starts.
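As a rough illustration of the description above, the following sketch builds the two arrays from a list of phrases and then splits them back apart. The helper names (build_arrays, split_phrases, toy_tokenize) and the whitespace tokenizer are hypothetical and not part of flyvec. The int64 dtype for the offsets and the choice to store no separator token between phrases are assumptions, so check them against the tokenizer.py shipped with the package.

```python
import numpy as np


def build_arrays(lines, tokenize):
    """Concatenate per-line token ids and record where each line starts."""
    chunks, offsets, position = [], [], 0
    for line in lines:
        ids = np.asarray(tokenize(line), dtype=np.int32)
        if ids.size == 0:
            continue  # skip empty lines so no offset is repeated
        offsets.append(position)  # index of this phrase's first token
        chunks.append(ids)
        position += ids.size
    encodings = np.concatenate(chunks) if chunks else np.empty(0, dtype=np.int32)
    # Assumption: int64 offsets; the dtype flyvec expects is not stated in the thread.
    return encodings, np.asarray(offsets, dtype=np.int64)


def split_phrases(encodings, offsets):
    """Recover the per-phrase token arrays from the flat array and the offsets."""
    return np.split(encodings, offsets[1:])


# Hypothetical whitespace tokenizer with a growing vocabulary (ids start at 1).
vocab = {}


def toy_tokenize(line):
    return [vocab.setdefault(word, len(vocab) + 1) for word in line.split()]


encodings, offsets = build_arrays(["the cat sat", "dogs bark loudly"], toy_tokenize)
phrases = split_phrases(encodings, offsets)
print(encodings.tolist(), offsets.tolist())  # [1, 2, 3, 4, 5, 6] [0, 3]
```

The arrays could then be written with np.save("encodings.npy", encodings) and np.save("offsets.npy", offsets). Note that each element of encodings is a plain integer token id in this sketch, not a one-hot vector.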
Thanks very much for your kind reply.
Assume that my raw training corpus contains the following lines:
i am a student <\n> i like to eat apple
So I can obtain a vocabulary like this:
i:1 ; am:2 …
According to your description, I processed the raw data into two input files, i.e.,
encodings.npy as follows:
1 2 3 4 <\n> 1 5 6 7 8
offsets.npy as follows:
0 5
Is my understanding correct?
Besides, your ICLR 2021 paper also reports a CPU implementation. Can I run this code in CPU mode?
Your understanding is correct!
Re: the CPU version -- the training code was developed by a person I no longer have contact with, but I did make an effort to wrap it up cleanly in Python. Because the GPU usage depends on environment variables, I expect the code will work on CPU but I honestly haven't tested it myself yet. Please post here if you run into issues and I can debug.
Thanks to your tokenizer.py, I prepared my input file successfully.
However, when I started training, the following problem appeared:
OSError: /usr/local/lib/python3.6/site-packages/flyvec/src/model_descriptor.so: cannot open shared object file: No such file or directory
The training code requires a *.so file in the BIN directory. I then checked the installation directory and found only .cu files there.
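A quick way to confirm this state is to list the CUDA sources and compiled libraries in the package's source directory. The sketch below is only an illustration: summarize_build_state is a hypothetical helper, not part of flyvec, and the demo runs against a temporary directory populated with file names taken from this thread rather than a real installation.

```python
import tempfile
from pathlib import Path


def summarize_build_state(src_dir):
    """Report CUDA sources and compiled shared libraries found in src_dir."""
    src = Path(src_dir)
    sources = sorted(p.name for p in src.glob("*.cu"))
    libraries = sorted(p.name for p in src.glob("*.so"))
    return {"sources": sources, "libraries": libraries, "compiled": bool(libraries)}


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cu_special_reduction.cu").touch()
    before = summarize_build_state(tmp)  # only sources present
    Path(tmp, "model_descriptor.so").touch()
    after = summarize_build_state(tmp)   # a compiled library now exists

print(before["compiled"], after["compiled"])  # False True
```

On a real system the argument would be the directory named in the error message, e.g. /usr/local/lib/python3.6/site-packages/flyvec/src. If "compiled" is False there, the shared libraries have not been built yet.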
And I also checked my system environment. The detailed version information is listed as follows:
# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
# g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-5) (GCC)
Whelp, looks like you might need CUDA (specifically, nvcc) to build the code after all. Did you run flyvec_compile?
I had overlooked this command. I then ran flyvec_compile, and it threw several errors like this:
cu_special_reduction.cu(249): error: initialization with "{...}" is not allowed for object of type "dim3"
I consulted the relevant documentation. It seems that a dim3 object is usually declared with constructor arguments (under names such as dimBlock or dimGrid) rather than initialized with the {...} form. I am not sure of the cause of the error, as I am not familiar with CUDA programming.
Maybe it's because I didn't configure the environment variables properly?
What is the output of nvcc --version?
[root@wizare/]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Are you able to bump nvcc to 11.0? That is the version running on the system I have. If not, I'm going to have to become a bit more intimate with CUDA...
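To compare a local toolkit against that version, the release number can be pulled out of the nvcc --version text. This is a minimal sketch: parse_nvcc_release is a hypothetical helper, and 11.0 is simply the version reported as working above, not a confirmed minimum requirement.

```python
import re


def parse_nvcc_release(version_text):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Last line of the output posted in this thread.
reported = "Cuda compilation tools, release 10.0, V10.0.130"
release = parse_nvcc_release(reported)

# Assumption: treat 11.0 (the maintainer's version) as the target to match.
print(release, release >= (11, 0))  # (10, 0) False
```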
Okay, I'll give it a try. And what's your g++ version?
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Hi @bhoov, can you share your preprocessing code? It's hard to understand from the instructions what needs to be done and when.
I had the same problem (the missing model_descriptor.so file). Have you found a solution? How do I get the *.so file?
Have you run bash short_make? (located inside flyvec/src)
Thank you for your help. I used another method (CMake) instead of short_make, but your approach is much simpler.