input data format

Open wizare opened this issue 3 years ago • 15 comments

I am confused about the input data format, i.e., encodings.npy and offsets.npy

Is each element in encodings.npy a one-hot vector?

Can you provide a detailed demo of them?

wizare • May 17 '21 12:05

Until I can get around to posting an example about the training, let me just clarify things below:

Say you start with a single txt file where each line is a phrase you want to use to train the model. Each phrase is disjoint and unrelated, so you don't want the sliding window to learn to associate words at the end of one phrase with the beginning of another.

You also have a tokenizer that can break each line up into an array of ints (specifically, np.int32s).

encodings.npy is the tokenized version of the entire text file concatenated into a single array. However, this array discards information about disjoint phrases. offsets.npy is a (much smaller) array recording the token-index where each new phrase will start.

bhoov • May 17 '21 13:05
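
A minimal sketch of the layout described above, not flyvec's own preprocessing code. The whitespace tokenizer, the `build_inputs` and `windows` helpers, the 1-based ids, and the int64 dtype for the offsets are assumptions made for illustration; only the int32 token dtype and the roles of the two arrays come from the description.

```python
import numpy as np

def build_inputs(lines):
    """Flatten tokenized lines into one id array plus phrase start offsets."""
    vocab = {}    # word -> integer id (hypothetical toy vocabulary, ids start at 1)
    tokens = []   # all token ids, concatenated across phrases
    offsets = []  # index in `tokens` where each phrase begins
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines so no empty phrase is recorded
        offsets.append(len(tokens))
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
            tokens.append(vocab[word])
    encodings = np.array(tokens, dtype=np.int32)
    return encodings, np.array(offsets, dtype=np.int64), vocab

def windows(encodings, offsets, width):
    """Yield fixed-width windows that never span two phrases."""
    bounds = list(offsets) + [len(encodings)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        for i in range(start, end - width + 1):
            yield encodings[i:i + width]

encodings, offsets, vocab = build_inputs(["i am a student", "i like to eat apple"])
print(encodings)  # [1 2 3 4 1 5 6 7 8]
print(offsets)    # [0 4]

# No window mixes the end of the first phrase with the start of the second.
for window in windows(encodings, offsets, width=3):
    print(window.tolist())
```

Because nothing marks the line break inside the flattened array in this sketch, the second phrase begins at index 4; an offset of 5 would apply only if a separator occupied a slot of its own. The arrays can then be written with `np.save("encodings.npy", encodings)` and `np.save("offsets.npy", offsets)`.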

Thanks very much for your kind reply. Assume that my raw training corpus contains the following lines: i am a student <\n> i like to eat apple

So I can obtain a vocabulary like this: i:1 ; am:2 …

According to your description, I processed the raw data into two input files, i.e., encodings.npy as follows: 1 2 3 4 <\n> 1 5 6 7 8

offsets.npy as follows: 0 5

Is my understanding correct?

Besides, your ICLR 2021 paper also reports a CPU implementation. Can I run this code in CPU mode?

wizare • May 17 '21 15:05

Your understanding is correct!

Re: the CPU version -- the training code was developed by a person I no longer have contact with, but I did make an effort to wrap it up cleanly in Python. Because the GPU usage depends on environment variables, I expect the code will work on CPU but I honestly haven't tested it myself yet. Please post here if you run into issues and I can debug.

bhoov • May 17 '21 15:05

Thanks to your tokenizer.py, I prepared my input file successfully. However, when I started training, the following problem appeared:

OSError: /usr/local/lib/python3.6/site-packages/flyvec/src/model_descriptor.so: cannot open shared object file: No such file or directory

The training code requires a *.so file in the BIN directory. I checked the installation directory and found only .cu files there.

I also checked my system environment. The detailed version information is as follows:

# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

# g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/root/rpmbuild/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-5) (GCC)

wizare • May 18 '21 06:05
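
The OSError above means the loader could not find a compiled library at the path it expected. A small pre-flight check along these lines can confirm that before training starts. `missing_shared_objects` is a hypothetical helper, not part of flyvec, and the directory is the one from the traceback, which will differ between installations.

```python
from pathlib import Path

def missing_shared_objects(src_dir, names=("model_descriptor.so",)):
    """Return the expected .so files that are absent from src_dir."""
    src_dir = Path(src_dir)
    return [name for name in names if not (src_dir / name).is_file()]

# Directory taken from the traceback in this thread; adjust for your install.
src_dir = "/usr/local/lib/python3.6/site-packages/flyvec/src"
missing = missing_shared_objects(src_dir)
if missing:
    print("Not compiled yet, missing:", ", ".join(missing))
else:
    print("All expected shared objects are present.")
```

If the list is non-empty and the directory holds only .cu sources, the CUDA code has not been built yet.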

Whelp, looks like you might need CUDA (specifically, nvcc) to build the code after all. Did you run flyvec_compile?

bhoov • May 18 '21 12:05

I had overlooked this command. I then ran flyvec_compile, and it threw several errors like this:

cu_special_reduction.cu(249): error: initialization with "{...}" is not allowed for object of type "dim3"

I consulted the relevant documentation. It seems that a dim3 object is usually initialized with dimBlock or dimGrid, rather than the {...} form. I am not sure of the cause of the error, as I am not familiar with CUDA programming. Maybe it's because I didn't configure the environment variables properly?

wizare • May 19 '21 02:05

What is the output of nvcc --version?

bhoov • May 19 '21 02:05

[root@wizare/]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

wizare • May 19 '21 02:05

Are you able to bump nvcc to 11.0? That is the version running on the system I have. If not, I'm going to have to become a bit more intimate with CUDA...

bhoov • May 19 '21 02:05
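
For comparing toolkit versions like the ones exchanged above, a rough sketch that pulls the release number out of `nvcc --version` output. `parse_nvcc_release` is a hypothetical helper, and the 11.0 threshold is only the version reported to work on the maintainer's machine in this thread, not a documented minimum.

```python
import re

def parse_nvcc_release(version_text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Last line of the output posted in this thread.
reported = "Cuda compilation tools, release 10.0, V10.0.130"
release = parse_nvcc_release(reported)
print(release)             # (10, 0)
print(release >= (11, 0))  # False
```

To check a live system, the same function can be fed the text captured from running `nvcc --version`.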

Okay, I'll have a try ~ And what's your g++ version?

wizare • May 19 '21 02:05

g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

bhoov • May 19 '21 02:05

Hi @bhoov, can you share your preprocessing code? It's hard to understand from the instructions what needs to be done and when.

theartpiece • Jan 18 '22 06:01

I had the same problem. Have you found a solution? How do I get the *.so file?

SizhaoXu • Aug 26 '22 05:08

Have you run bash short_make? (located inside flyvec/src)

bhoov • Aug 26 '22 14:08

Thank you for your help. I used another method (CMake) instead of short_make, but your approach is much simpler.

SizhaoXu • Aug 29 '22 03:08