
Question about tokenizing table values

Open NielsRogge opened this issue 3 years ago • 1 comment

Hi,

I'm currently testing my implementation of TapasTokenizer (in the Transformers library, each model has a corresponding tokenizer that can be used to prepare data for the model). When testing on a batch of SQA data (from the dev set), I spotted a misalignment: there's a column with values such as 1.0, 2.0, ... In the original implementation, 1.0 is tokenized into ["1", "."]; in my implementation, it is tokenized into ["1", ".", "0"]. Is there a reason why this is the case? Here's the table:

[screenshot of the table: sqa_dev]

And here are the tokens + ids of the original implementation (left) and those of my implementation (right):

[CLS] 101 [CLS] 101
what 2054 what 2054
tracks 3162 tracks 3162
appear 3711 appear 3711
on 2006 on 2006
the 1996 the 1996
album 2201 album 2201
life 2166 life 2166
goes 3632 goes 3632
on 2006 on 2006
( 1006 ( 1006
tr 19817 tr 19817
##ae 6679 ##ae 6679
album 2201 album 2201
) 1007 ) 1007
? 1029 ? 1029
[SEP] 102 [SEP] 102
[EMPTY] 1 [EMPTY] 1
title 2516 title 2516
producers 6443 producers 6443
guest 4113 guest 4113
performers 9567 performers 9567
length 3091 length 3091
1 1015 1 1015
. 1012 . 1012
throw 5466 0 1014 ===> misalignment starts here
away 2185 throw 5466
##s 2015 away 2185
maj 16686 ##s 2015
& 1004 maj 16686
so 2061 & 1004
##sa 3736 so 2061
gorilla 23526 ##sa 3736
zoe 11199 gorilla 23526
& 1004 zoe 11199
yun 22854 & 1004
##g 2290 yun 22854
jo 8183 ##g 2290
##c 2278 jo 8183
3 1017 ##c 2278
: 1024 3 1017
11 2340 : 1024
2 1016 11 2340
. 1012 2 1016
i 1045 . 1012
' 1005 0 1014
m 1049 i 1045
a 1037 ' 1005
gangs 18542 m 1049
##ta 2696 a 1037
drew 3881 gangs 18542
[EMPTY] 1 ##ta 2696
4 1018 drew 3881
: 1024 [EMPTY] 1
16 2385 4 1018
3 1017 : 1024
. 1012 16 2385
life 2166 3 1017
(and so on)

FYI: I'm reading in the table as a Pandas dataframe before tokenizing.
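
To make this concrete, here is roughly how such a cell value ends up as a string on my side before tokenization (a simplified sketch, not the actual TapasTokenizer code; the astype(str) step and the use of bert-base-uncased as the wordpiece vocabulary are illustrative assumptions):

import pandas as pd
from transformers import BertTokenizer

# A small numeric column like the one in the screenshot; pandas reads the
# values as floats.
table = pd.DataFrame({"col": [1.0, 2.0, 3.0]})

# Cell values are converted to strings before they are tokenized.
cells = table.astype(str)["col"].tolist()  # ['1.0', '2.0', '3.0']

# Stand-in for my tokenizer, which builds on a BERT-style wordpiece vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(cells[0]))  # this is where I get ['1', '.', '0']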

NielsRogge avatar Nov 20 '20 08:11 NielsRogge

There are certainly many ways to do tokenization, and we wanted to be consistent with BERT. Therefore we rely on the BERT tokenization code:

from official.nlp.bert import tokenization

self._basic_tokenizer = tokenization.BasicTokenizer(do_lower_case=True)
self._wp_tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

  def tokenize(self, text):
    if text_utils.format_text(text) == constants.EMPTY_TEXT:
      return [Token(_EMPTY, _EMPTY)]
    tokens = []
    for token in self._basic_tokenizer.tokenize(text):
      for piece in self._wp_tokenizer.tokenize(token):
        tokens.append(Token(token, piece))
    return tokens
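
As a rough illustration of what that loop produces, here is a sketch that uses the Hugging Face bert-base-uncased tokenizer as a stand-in for official.nlp.bert (this is not our actual code, and it leaves out the [EMPTY] handling):

from collections import namedtuple
from transformers import BertTokenizer

# Same (original word, wordpiece) pairing as in the snippet above.
Token = namedtuple("Token", ["original_text", "piece"])

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(text):
  # Basic-tokenize into words, then wordpiece-tokenize each word, keeping
  # track of which word every piece came from.
  tokens = []
  for word in bert_tokenizer.basic_tokenizer.tokenize(text):
    for piece in bert_tokenizer.wordpiece_tokenizer.tokenize(word):
      tokens.append(Token(word, piece))
  return tokens

print(tokenize("trae album"))
# e.g. [Token('trae', 'tr'), Token('trae', '##ae'), Token('album', 'album')]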

I understand the BERT tokenizer purposefully doesn't split on punctuation within a word. So, I think, it would split 1.0 into ["1.0"] but 1. into ["1", "."]. I think this is working as intended (WAI) from our side.

So I assume your tokenizer is implemented to split on every punctuation mark, and that's where the discrepancy is coming from?
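
For what it's worth, one quick way to check what the basic tokenizer does with these values (again with the Hugging Face BasicTokenizer standing in for the original BERT code):

from transformers import BasicTokenizer

basic = BasicTokenizer(do_lower_case=True)

# Compare the value with and without the trailing zero; this is the step
# where the two implementations would diverge.
for value in ["1.0", "1."]:
  print(value, "->", basic.tokenize(value))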

ghost avatar Nov 30 '20 11:11 ghost