
[ALBERT] question: unnecessary (or code error) in tokenization.py?

Open prokok opened this issue 5 years ago • 3 comments

Problem description

In the encode_pieces function of the tokenization script (specifically, lines 122 ~ 126), I cannot identify a single case that executes line 122, marked with '>' in the code below. (I ran it over my sample corpus (1GB) but cannot find any.)

  if not sample:
    pieces = sp_model.EncodeAsPieces(text)
  else:
    pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
  new_pieces = []
  for piece in pieces:
    piece = printable_text(piece)
    if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
      cur_pieces = sp_model.EncodeAsPieces(
          six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
>     if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
        if len(cur_pieces[0]) == 1:
          cur_pieces = cur_pieces[1:]
        else:
          cur_pieces[0] = cur_pieces[0][1:]
      cur_pieces.append(piece[-1])
      new_pieces.extend(cur_pieces)
    else:
      new_pieces.append(piece)
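
(For context, my rough reading of what this block seems to intend, written as a self-contained, SentencePiece-free sketch; the real code re-encodes the number part with sp_model instead of this plain split:)

def split_trailing_comma(piece):
    # piece is a str such as '▁2011,' produced by SentencePiece.
    # When a piece ends in <digit><comma>, split the comma off so the
    # number and the comma become separate pieces.
    if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
        return [piece[:-1], piece[-1]]
    return [piece]

print(split_trailing_comma("▁2011,"))  # ['▁2011', ',']
print(split_trailing_comma("▁hello"))  # ['▁hello']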

Steps/code/corpus to reproduce

I was able to find cases where line 119 executes while running my sample corpus (1GB), e.g. piece = '▁2011,' or '▁11,' etc., but none of them executes line 122.

Info I found

1) 'piece[0] != SPIECE_UNDERLINE' in line 122 below always has to be TRUE (?)

> if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:

Since piece is a string, piece[0] is a single character (including the case when piece = '▁'), whereas SPIECE_UNDERLINE is a UTF-8 encoded bytes object representing '▁'. In Python 3 a str character never compares equal to a bytes object, so:

piece[0] != SPIECE_UNDERLINE
# always True
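
A minimal, self-contained check of this (assuming SPIECE_UNDERLINE is defined as in tokenization.py, i.e. the UTF-8 bytes for '▁'):

SPIECE_UNDERLINE = u"▁".encode("utf-8")  # b'\xe2\x96\x81', as in tokenization.py

piece = "▁2011,"  # a typical str piece after printable_text
# In Python 3, a one-character str never compares equal to a bytes object,
# so this is True for every possible value of piece[0].
print(piece[0] != SPIECE_UNDERLINE)  # True
print("▁" != SPIECE_UNDERLINE)       # True as well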

2) cur_pieces[0][0] refers to b'\xe2' in b'\xe2\x96\x81'. Since cur_pieces is the output of sp_model.EncodeAsPieces, cur_pieces[0] is the first token and equals b'\xe2\x96\x81' + some characters (the SentencePiece model always prepends '▁' to the first token). So cur_pieces[0][0] refers to b'\xe2', which as an int is 226.

cur_pieces[0][0]
#226
ord(b'\xe2')
#226

So cur_pieces[0][0] == SPIECE_UNDERLINE is always false in my cases.
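
A short sketch of the same mismatch, assuming EncodeAsPieces returns UTF-8 bytes as it does in my environment:

SPIECE_UNDERLINE = u"▁".encode("utf-8")        # b'\xe2\x96\x81'

cur_pieces = [b"\xe2\x96\x812011"]             # what EncodeAsPieces gives me for '2011'
# Indexing a bytes object in Python 3 yields an int, not a length-1 bytes,
# so it can never equal the 3-byte SPIECE_UNDERLINE object.
print(cur_pieces[0][0])                        # 226
print(cur_pieces[0][0] == SPIECE_UNDERLINE)    # False
# Comparing the first three bytes instead behaves as (presumably) intended:
print(cur_pieces[0][:3] == SPIECE_UNDERLINE)   # True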

Question

Is there any explanation or reason for implementing lines 122 ~ 126? Can you share sample sentences or words that execute that code block?

Versions

Darwin-18.5.0-x86_64-i386-64bit
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
tensorflow 1.13.1

prokok avatar Nov 13 '19 09:11 prokok

To be frank, the whole section starting from the line if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit(): looks confusing to me. Why do we need special treatment for a piece which ends with a digit and a comma? I really think a comment is needed here.

xwk avatar Jan 28 '20 00:01 xwk

We are also running into this peculiarity around the comma and are wondering whether it is unintentional or intentional. We noticed that SentencePiece has a function EncodeAsIds() which encodes a string directly to token IDs. We were hoping to use this single function for our string-to-token-ID transformation in our training and inference pipelines. However, the token IDs from this method differ from Albert's. The result is that SentencePiece cannot be used directly without further pre-processing. Was this intentional? If so, why? And are there other deviations that we should be aware of?

Example of "20,":

Albert:

encode_ids(spm_tokenizer, "2,")
[172, 15]

SentencePiece:

spm_tokenizer.EncodeAsIds("2,")
[1604]
spm_tokenizer.EncodeAsIds("2 ,")
[172, 13, 15]

Note: The 13 token is the SPIECE_UNDERLINE.

You will notice that SentencePiece has the vocab for the piece "20," and directly encodes it to ID 1604. However, Albert first splits and then removes the spiece token.
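
For anyone checking the same thing, this is the kind of lookup I used to see what those IDs map back to (IdToPiece is a standard SentencePiece call; the model path and the exact pieces are from my setup):

import sentencepiece as spm

spm_tokenizer = spm.SentencePieceProcessor()
spm_tokenizer.Load("30k-clean.model")  # the ALBERT SentencePiece model file (path is just an example)

# 1604 is the single fused piece SentencePiece used for "2,"
print(spm_tokenizer.IdToPiece(1604))
# [172, 13, 15] are the pieces from "2 ," -- number, SPIECE_UNDERLINE, comma
print([spm_tokenizer.IdToPiece(i) for i in [172, 13, 15]])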

I am mostly worried that I am missing a broader idea behind this byte encoding. Any thoughts would be much appreciated!

jeisinge avatar Mar 07 '20 18:03 jeisinge

FYI - the reason for the tokenization is listed at https://github.com/google-research/bert/blob/master/README.md#tokenization . Please disregard my previous comment.

jeisinge avatar Mar 10 '20 17:03 jeisinge