flyvec input data format

I am confused about the input data format, i.e., encodings.npy and offsets.npy.

Is each element in encodings.npy a one-hot vector?

Can you provide a detailed demo of them?

May 17 '21 12:05 wizare

Until I can get around to posting an example about the training, let me just clarify things below:

Say you start with a single txt file where each line is a phrase you want to use to train the model. Each phrase is disjoint and unrelated, so you don't want the sliding window to learn to associate words at the end of one phrase with the beginning of another.

You also have a tokenizer that can break each line up into an array of ints (specifically, np.int32s).

encodings.npy is the tokenized version of the entire text file concatenated into a single array. On its own, this array loses the boundaries between the disjoint phrases. offsets.npy is a (much smaller) array recording the token index where each new phrase starts.
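As a rough illustration of the description above, here is a minimal sketch that builds both arrays from a list of lines. The whitespace tokenizer, the function names, and the int64 dtype for the offsets are assumptions made for the example; only the np.int32 token type comes from the description above, and this is not the tokenizer shipped with flyvec.

```python
import numpy as np

def toy_tokenize(line, vocab):
    """Hypothetical whitespace tokenizer: ids are assigned in order of
    first appearance, starting at 1. Stand-in for a real tokenizer."""
    ids = []
    for word in line.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
        ids.append(vocab[word])
    return ids

def build_arrays(lines):
    """Concatenate all tokenized lines and record where each phrase starts."""
    vocab = {}
    encodings = []
    offsets = []
    for line in lines:
        ids = toy_tokenize(line, vocab)
        if not ids:
            continue  # skip blank lines so no offset points at an empty phrase
        offsets.append(len(encodings))  # index of this phrase's first token
        encodings.extend(ids)
    return (
        np.array(encodings, dtype=np.int32),  # token ids, as described above
        np.array(offsets, dtype=np.int64),    # assumption: dtype not stated in the thread
        vocab,
    )

def save_arrays(encodings, offsets, enc_path="encodings.npy", off_path="offsets.npy"):
    """Write the two arrays to disk with np.save."""
    np.save(enc_path, encodings)
    np.save(off_path, offsets)
```

For the two lines "the cat sat" and "the dog ran away", build_arrays gives encodings [1, 2, 3, 1, 4, 5, 6] and offsets [0, 3], so each element of encodings.npy is a plain integer id rather than a one-hot vector.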

May 17 '21 13:05 bhoov

Thanks very much for your kind reply. Assume that my raw training corpus contains the following lines: i am a student <\n> i like to eat apple

So I can obtain a vocabulary like this: i: 1; am: 2; …

According to your description, I processed the raw data into two input files, i.e., encodings.npy as follows: 1 2 3 4 <\n> 1 5 6 7 8

offsets.npy as follows: 0 5

Is my understanding correct?

Besides, your ICLR 2021 paper also reports a CPU implementation. Can I run this code in CPU mode?

May 17 '21 15:05 wizare

Your understanding is correct!
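To make the role of the offsets concrete, here is a small sketch that splits a concatenated array back into phrases and only forms sliding windows inside each phrase. The function names and the window width are hypothetical, and this is not how the flyvec training code is known to consume the files. Whether a separator occupies a slot in the array depends on the tokenizer, so the toy data below has none.

```python
import numpy as np

def split_phrases(encodings, offsets):
    """Recover the individual phrases. offsets[0] is expected to be 0;
    each later entry marks the index where a new phrase begins."""
    return np.split(encodings, offsets[1:])

def windows_within_phrases(encodings, offsets, width):
    """Collect sliding windows of `width` tokens that never cross a phrase boundary."""
    windows = []
    for phrase in split_phrases(encodings, offsets):
        for start in range(len(phrase) - width + 1):
            windows.append(phrase[start:start + width].tolist())
    return windows

# Toy data: two phrases of lengths 3 and 4, concatenated.
encodings = np.array([1, 2, 3, 1, 4, 5, 6], dtype=np.int32)
offsets = np.array([0, 3])

phrases = split_phrases(encodings, offsets)
windows = windows_within_phrases(encodings, offsets, width=3)
```

With these inputs the windows are [1, 2, 3], [1, 4, 5] and [4, 5, 6]. A window such as [2, 3, 1], which would mix the end of the first phrase with the start of the second, is never produced.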

Re: the CPU version -- the training code was developed by a person I no longer have contact with, but I did make an effort to wrap it up cleanly in Python. Because the GPU usage depends on environment variables, I expect the code will work on CPU but I honestly haven't tested it myself yet. Please post here if you run into issues and I can debug.

May 17 '21 15:05 bhoov

Thanks to your tokenizer.py, I prepared my input files successfully. However, when I started training, the following error appeared:

OSError: /usr/local/lib/python3.6/site-packages/flyvec/src/model_descriptor.so: cannot open shared object file: No such file or directory

The training code requires a *.so file in the BIN directory. I checked the installation directory and found only .cu files there.
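A quick way to repeat this check is a sketch like the following, which lists the compiled libraries and CUDA sources in a directory. The function name is hypothetical, and the directory to inspect (for example the flyvec/src path from the error message) is something you have to supply yourself.

```python
from pathlib import Path

def summarize_build_dir(src_dir):
    """List compiled shared objects and CUDA source files found directly in src_dir."""
    src = Path(src_dir)
    return {
        "shared_objects": sorted(p.name for p in src.glob("*.so")),
        "cuda_sources": sorted(p.name for p in src.glob("*.cu")),
    }
```

An empty "shared_objects" list next to a non-empty "cuda_sources" list matches the situation described here: the sources are installed but have not been compiled yet.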

I also checked my system environment. The detailed version information is as follows:

# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

# g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-5) (GCC)

May 18 '21 06:05 wizare

Whelp, looks like you might need CUDA (specifically, nvcc) to build the code after all. Did you run flyvec_compile?

May 18 '21 12:05 bhoov

I had overlooked this command. I then ran flyvec_compile, and it threw several errors like this:

cu_special_reduction.cu(249): error: initialization with "{...}" is not allowed for object of type "dim3"

I consulted the relevant documentation. It seems that a dim3 object is usually initialized with dimBlock or dimGrid, rather than the {...} form. I am not sure of the cause of the error, as I am not familiar with CUDA programming. Maybe it's because I didn't configure the environment variables properly?

May 19 '21 02:05 wizare

What is the output of nvcc --version ?

May 19 '21 02:05 bhoov

May 19 '21 02:05 wizare

Are you able to bump nvcc to 11.0? That is the version running on the system I have. If not, I'm going to have to become a bit more intimate with CUDA...
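For comparing toolchains, a small sketch like this pulls the release number out of the text that nvcc --version prints. The function names are hypothetical, and the (11, 0) default only reflects the version mentioned here as known to work; it is not a documented minimum.

```python
import re

def parse_nvcc_release(version_text):
    """Return (major, minor) from a line such as
    'Cuda compilation tools, release 10.1, V10.1.243', or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def meets_minimum(version_text, minimum=(11, 0)):
    """True if the parsed release is at least `minimum` (assumed threshold)."""
    release = parse_nvcc_release(version_text)
    return release is not None and release >= minimum
```

Applied to the output posted earlier in this thread, parse_nvcc_release returns (10, 1), which is below the 11.0 toolchain.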

May 19 '21 02:05 bhoov

Okay, I'll give it a try. And what's your g++ version?

May 19 '21 02:05 wizare

g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

May 19 '21 02:05 bhoov

Hi @bhoov, can you share your preprocessing code? It's hard to understand from the instructions what needs to be done and when.

Jan 18 '22 06:01 theartpiece

Thanks to your tokenizer.py, I prepared my input file successfully. However, when I started training, the following problem appeared: OSError: /usr/local/lib/python3.6/site-packages/flyvec/src/model_descriptor.so: cannot open shared object file: No such file or directory

The training code requires *.so file in the BIN directory. Then I checked the installation directory and found only .cu files there.

I had the same problem. Have you found a solution? How do I get the *.so file?

Aug 26 '22 05:08 SizhaoXu

Have you run bash short_make? (located inside flyvec/src)

Aug 26 '22 14:08 bhoov

Thank you for your help. I used another method (CMake) instead of short_make, but your approach is much simpler.

Aug 29 '22 03:08 SizhaoXu
