Extract Contextual Word Embeddings

Open Hazoom opened this issue 5 years ago • 17 comments

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, as in BERT. The script extracts a fixed-length vector for each token in the sentence.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used as follows. It will produce a contextual vector for each token in the sentence (each vector's length is the model's hidden size, 1024 here; max_seq_length=64 caps the number of tokens per sequence):

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet.
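The exact output schema isn't spelled out here, but assuming each output line holds a "features" list of {"token": ..., "values": [...]} objects (the shape suggested elsewhere in this thread), a minimal Python sketch for loading the file could look like:

```python
import json

def load_embeddings(path):
    """Parse extract_features.py output (one JSON object per input line).

    Assumes each line holds a "features" list of
    {"token": ..., "values": [...]} entries, one per (sub)word token.
    Adjust the keys if the actual schema differs.
    """
    sentences = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            tokens = [feat["token"] for feat in record["features"]]
            vectors = [feat["values"] for feat in record["features"]]
            sentences.append((tokens, vectors))
    return sentences
```

Each element of the returned list pairs a sentence's token strings with their per-token vectors, in input order.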

#39

Hi @zihangdai @kimiyoung, can you please take a look?

Hazoom avatar Jul 11 '19 07:07 Hazoom

Add the ability to extract contextual word embeddings from a given list of sentences using XLNet, as in BERT. The script extracts a fixed-length vector for each token in the sentence, plus one vector pooled from all the word embeddings according to the given pooling-strategy parameter.

First, one needs to create an input text file as follows:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'I love New York. ||| New York is a city' > data/corpus.txt

After that, the script extract_features.py can be used as follows. It will produce a contextual vector for each token (of length 1024, the model's hidden size; max_seq_length=64 caps the tokens per sequence) and one pooled vector computed with the mean strategy:

INIT_CKPT_DIR=models/xlnet_cased_L-24_H-1024_A-16
OUTPUT_DIR=data
MODEL_DIR=experiment/extract_features

python extract_features.py \
    --input_file=data/corpus.txt \
    --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
    --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
    --use_tpu=False \
    --num_core_per_host=1 \
    --output_file=${OUTPUT_DIR}/output.json \
    --model_dir=${MODEL_DIR} \
    --num_hosts=1 \
    --max_seq_length=64 \
    --eval_batch_size=8 \
    --predict_batch_size=8 \
    --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
    --summary_type=mean

Alternatively, use the scripts/gpu_extract_features.sh script to run it easily.

This will create a JSON file (one line per line of input) containing the contextual word embeddings from XLNet, including one pooled vector.

#39

Hello, I wrote one sentence in corpus.txt. When I run the extract_features.py command above multiple times, the sentence features come out different each time. I think they should be the same.

3NFBAGDU avatar Jul 11 '19 10:07 3NFBAGDU

@3NFBAGDU Thanks for pointing this out. From what I saw, it happens only in the pooled vector, and I only use XLNet's original pooling code, so I think it's caused by dropout, which is random of course. I will fix the script to disable dropout, which is the expected behavior in prediction mode, and in addition remove the pooled vector from the output. I think it's better for the client to perform pooling on their side.

Hazoom avatar Jul 11 '19 11:07 Hazoom

Hi, thank you for answering. In my tests, Euclidean distance works better than cosine distance for words; cosine distance is always > 0.89. I trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

3NFBAGDU avatar Jul 11 '19 13:07 3NFBAGDU
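For reference, the two distance measures compared above can be computed with NumPy; a minimal sketch, where u and v stand in for any two embedding vectors:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; 0 for perfectly aligned vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    # straight-line distance between the two points
    return np.linalg.norm(u - v)
```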

Hi, thank you for answering. In my tests, Euclidean distance works better than cosine distance for words; cosine distance is always > 0.89. I trained my model on 1.6M sentences.

And do you have any idea how to get a sentence embedding vector from here?

Thanks for sharing the results. To get a sentence embedding, you can pool the token vectors with one of the usual strategies: max pooling, mean pooling, max-mean pooling, attention pooling, etc. For example, for mean pooling you average all the word vectors into a single 1024-dimensional vector, and for max pooling you take the element-wise maximum over them.

Please note that some of the tokens are padding tokens (actually most of them), so you should ignore those and perform the pooling over only the real tokens.

Hazoom avatar Jul 11 '19 13:07 Hazoom
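The pooling described above can be sketched as follows. This is a minimal illustration, and pad_token is a hypothetical placeholder for whatever marker your output uses on padded positions:

```python
import numpy as np

def pool_sentence(token_vectors, tokens, strategy="mean", pad_token="<pad>"):
    """Collapse per-token vectors (an (n_tokens, hidden_size) array)
    into one sentence vector, ignoring padded positions."""
    mask = np.array([tok != pad_token for tok in tokens])
    real = token_vectors[mask]          # keep only real tokens
    if strategy == "mean":
        return real.mean(axis=0)
    if strategy == "max":
        return real.max(axis=0)         # element-wise maximum
    if strategy == "max_mean":
        return np.concatenate([real.max(axis=0), real.mean(axis=0)])
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that max-mean pooling doubles the dimension, since it concatenates the two pooled vectors.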

Hi, if I give 'Hello, how are you', the output should look like {"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, and so on, but I get {"token": "", "values": []}; the token is always empty. Is this my SentencePiece model's fault?

3NFBAGDU avatar Jul 12 '19 09:07 3NFBAGDU

Hi, if I give 'Hello, how are you', the output should look like {"token": "Hello", "values": [...]}, {"token": "how", "values": [...]}, and so on, but I get {"token": "", "values": []}; the token is always empty. Is this my SentencePiece model's fault?

Apparently, when given the sentence Hello, how are you?, the SentencePiece model tokenizes it such that the first token is empty. I added code that ignores those empty tokens. Thanks for noticing.

Hazoom avatar Jul 12 '19 21:07 Hazoom
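The filtering mentioned above could look like this minimal sketch. It assumes each feature is a dict with a "token" key; '▁' is SentencePiece's word-boundary marker, and a piece that is only '▁' can render as an empty token once the prefix is stripped:

```python
def drop_empty_tokens(features):
    """Filter out entries whose token is empty, or empty after removing
    SentencePiece's '▁' word-boundary prefix."""
    return [feat for feat in features if feat["token"].strip("▁").strip()]
```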

estimator.predict() works too slowly. I want to run prediction on some text every 2 seconds, but every time I call estimator.predict(), it loads the model all over again. I want to load the model just once and then call estimator.predict() on the same loaded model every 2 seconds to get faster predictions. Can you help me?

3NFBAGDU avatar Jul 25 '19 12:07 3NFBAGDU
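Not part of this PR, but a common workaround for the reloading issue is to feed estimator.predict() from a never-ending generator, so its input is never exhausted and the graph stays loaded. A minimal sketch, where input_fn_builder is a hypothetical helper you supply that wraps the generator in whatever input format the estimator expects (e.g. via tf.data.Dataset.from_generator):

```python
import queue

class FastPredictor:
    """Keep an estimator's model loaded across repeated predict calls.

    tf.estimator reloads the graph each time predict()'s input is
    exhausted, so we feed it from a queue-backed generator that never
    ends; the model then stays in memory between calls.
    """

    def __init__(self, estimator, input_fn_builder):
        self._queue = queue.Queue()
        # predict() consumes the generator lazily, one item per next().
        self._predictions = estimator.predict(
            input_fn=input_fn_builder(self._generate))

    def _generate(self):
        while True:  # never stops, so the model is never unloaded
            yield self._queue.get()

    def predict(self, features):
        self._queue.put(features)
        return next(self._predictions)
```

Each predict() call then costs one forward pass instead of a full model load.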

Hi @zihangdai @kimiyoung , since issue #39 was closed, can you please merge this to master? Thanks.

Hazoom avatar Sep 15 '19 14:09 Hazoom

@Hazoom Hi sir, how can I modify the vector dimensions?

JxuHenry avatar Oct 25 '19 06:10 JxuHenry

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think it's possible; the dimension is set by the network's architecture.

Hazoom avatar Oct 25 '19 07:10 Hazoom

@Hazoom Hi sir, how can I modify the vector dimensions?

@JxuHenry I don't think it's possible; the dimension is set by the network's architecture.

OK, thank you very much.

JxuHenry avatar Oct 26 '19 07:10 JxuHenry

Hi Hazoom, I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

frank-lin-liu avatar Feb 16 '20 15:02 frank-lin-liu

Hi Hazoom, I followed your instructions and ran extract_features.py. Does this program need a GPU to run?

No, it can run on a CPU as well, just a bit slower than on a GPU.

Hazoom avatar Feb 16 '20 15:02 Hazoom

Thank you, Hazoom. I'm using TensorFlow v1.15. Is that the TensorFlow version you used?

frank-lin-liu avatar Feb 16 '20 15:02 frank-lin-liu

Thank you, Hazoom. I'm using TensorFlow v1.15. Is that the TensorFlow version you used?

I used TensorFlow v1.14, but it should behave the same, I hope.

Hazoom avatar Feb 16 '20 16:02 Hazoom

It seems that I don't get the expected results. I copied some log messages below. Could you please take a look and let me know what the problem is?


2020-02-16 15:03:13.502591: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-16 15:03:13.523791: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2020-02-16 15:03:13.524337: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c394737b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-16 15:03:13.524391: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-16 15:03:13.527100: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-16 15:03:13.527146: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-16 15:03:13.527179: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (pl-00193583): /proc/driver/nvidia/version does not exist
2020-02-16 15:03:16.775053: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 131072000 exceeds 10% of system memory.
INFO:tensorflow:Running local_init_op.
I0216 15:03:18.755628 140438614927168 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0216 15:03:18.981006 140438614927168 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Predicting submission for example_cnt: 0
I0216 15:03:26.952376 140438614927168 extract_features.py:427] Predicting submission for example_cnt: 0

frank-lin-liu avatar Feb 16 '20 16:02 frank-lin-liu

Hi Hazoom, I succeeded in using the scripts to get an output.json for the sentence "Hello World". I observed that the embedding has 6 tokens: "he", "ll", "o", "world", and the last two tokens are <sep> and <cls>. Is this tokenization normal? If we use a pooling strategy to calculate the sentence embedding, do we need to remove the <sep> and <cls> embeddings?

mqhe avatar Jun 19 '20 02:06 mqhe