
TrainingWrapper does not support line breaks

Open 0x7o opened this issue 3 years ago • 8 comments

Notebook: when training RETRO with the standard methods, TrainingWrapper does not preserve line breaks in the dataset. This can hurt performance on many NLP tasks.

Input *.txt:

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

Second Citizen:
Would you proceed especially against Caius Marcius?

All:
Against him first: he's a very dog to the commonalty.

Model output after training:

some - - on my head, were even so salts to death strike That which may bet with tears I have found to life, which sweeter than now to dony : be known betwixcombed oaths ring yet in Corioli turnseth from him Dear life redeems doth thinkment for faith ; Or shall be slack than death within this face, PETRUCHIO : Now, wind and house or free thee better now. KATHARINA : Now, in mine honourable fellow : in your chat with me to be it, alive, I think, If to use than my wife, if this rebellious earth Have you will break out The strange s of yours cro
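
The loss is easy to reproduce with the tokenizer alone. A minimal sketch (assuming the wrapper's default tokenizer is HuggingFace's bert-base-cased, as the symptoms suggest; the exact decoded spacing may differ):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

text = "First Citizen:\nWe are accounted poor citizens, the patricians good."
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(tokenizer.decode(ids))
# prints something like:
#   First Citizen : We are accounted poor citizens, the patricians good.
# the "\n" was normalized to a plain space during tokenization, so a
# decode round trip cannot restore it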

0x7o avatar May 15 '22 08:05 0x7o

@0x7o ohh interesting, this must be some issue with the default BERT tokenizer, i'll take a look next week

lucidrains avatar May 15 '22 19:05 lucidrains

Judging by the warning below, the script tokenizes the file as a whole, not in batches:

Token indices sequence length is longer than the specified maximum sequence length for this model (3449121 > 512). Running this sequence through the model will result in indexing errors
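
A sketch of what chunked processing could look like (a hypothetical helper, not the library's current code):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

def tokenize_file_in_chunks(path, chunk_chars=2048):
    # Tokenize a large corpus file piece by piece so no single call
    # feeds the tokenizer millions of characters at once.
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            # note: a fixed character window can split a word at the
            # boundary; a real implementation would back off to whitespace
            yield tokenizer(chunk, add_special_tokens=False)["input_ids"]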

0x7o avatar May 16 '22 08:05 0x7o

@lucidrains, the tokenizer distorts the text. I think the problem is the difference between bert-cased and bert-uncased. Here is an example from a dataset built from program code:

} \ n else if ( GameManager. _ instance. won & &! GameManager. _ instance. keepPlaying ) { " won " } \ n else { " running " } \ n''') \ n \ n def get _ score ( self ) : \ n return self. execute ('GameManager. _ instance. score') \ n \ n def get _ board ( self ) : \ n # Chrome refuses to serialize the Grid object directly through the debugger. \ n grid = json. loads ( self. execute ('JSON. stringify ( GameManager. _ instance. grid )') ) \ n \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for row in grid ['cells'] : \ n for cell in row : \ n if cell is None : \ n continue \ n pos = cell ['x'], cell ['y'] \ n tval = cell ['value'] \ n board [ pos [ 1 ] ] [ pos [ 0 ] ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n # We use UDLR ordering ; 2048 uses URDL ordering \ n move = [ 0, 2, 3, 1 ] [ move ] \ n self. execute ('GameManager. _ instance. move ( % d )'% move ) \ n \ nclass Keyboard2048Control ( Generic2048Control ) : \ n'''Control 2048 by accessing the DOM and using key events. \ n \ n This is relatively slow, and may be prone to race conditions if your \ n browser is slow. However, it is more generally compatible with various \ n clones of 2048.'''\ n \ n def setup ( self ) : \ n self. execute ( \ n'''\ n var elems = document. getElementsByTagName ('div') ; \ n for ( var i in elems ) \ n if ( elems [ i ]. className = ='tile - container') { \ n tileContainer = elems [ i ] ; \ n break ; \ n } \ n''') \ n \ n def get _ score ( self ) : \ n score = self. execute ('''\ n var scoreContainer = document. querySelector ( ". score - container " ) ; \ n var scoreText ='' ; \ n var scoreChildren = scoreContainer. childNodes ; \ n for ( var i = 0 ; i < scoreChildren. length ; + + i ) { \ n if ( scoreChildren [ i ]. nodeType = = Node. TEXT _ NODE ) { \ n scoreText + = scoreChildren [ i ]. textContent ; \ n } \ n } \ n scoreText ; \ n''') \ n \ n return int ( score ) \ n \ n def get _ board ( self ) : \ n res = self. execute ( \ n'''\ n var res = [ ] ; \ n var tiles = tileContainer. children ; \ n for ( var i = 0 ; i < tiles. length ; i + + ) \ n res. push ( tiles [ i ]. className ) ; \ n res \ n''') \ n board = [ [ 0 ] * 4 for _ in range ( 4 ) ] \ n for tile in res : \ n tval = pos = None \ n for k in tile. split ( ) : \ n m = re. match ( r'^ tile - ( \ d + ) $ ', k ) \ n if m : \ n tval = int ( m. group ( 1 ) ) \ n m = re. match ( r'^ tile - position - ( \ d + ) - ( \ d + ) $ ', k ) \ n if m : \ n pos = int ( m. group ( 1 ) ), int ( m. group ( 2 ) ) \ n board [ pos [ 1 ] - 1 ] [ pos [ 0 ] - 1 ] = int ( round ( math. log ( tval, 2 ) ) ) \ n \ n return board \ n \ n def execute _ move ( self, move ) : \ n key = [ 38, 40, 37, 39 ] [ move ] \ n self. send _ key _ event ('keydown ', key ) \ n time. sleep ( 0. 01 ) \ n self. send _ key _ event ('keyup ', key ) \ n time. sleep ( 0. 05 ) \ n \ nclass Hybrid2048Control ( Fast2048Control, Keyboard2048Control ) : \ n'''Control 2048 by hooking the GameManager and using keyboard inputs. \ n \ n This is safe and fast, and correctly generates keyboard events for compatibility. \ n'''\ n \ n setup = Fast2048Control. setup \ n get _ status = Keyboard2048Control. get _ status \ n get _ score = Fast2048Control. get _ score \ n get _ board = Fast2048Control. get _ board \ n execute _ move = Keyboard2048Control. execute _ move \ n # Preprocess cornell movie dialogs dataset \ n \ nfrom multiprocessing import Pool \ nimport argparse \ nimport pickle \ nimport random \ nimport os \ nfrom urllib. 
request import urlretrieve \ nfrom zipfile import ZipFile \ nfrom pathlib import Path \ nfrom tqdm import tqdm \ nfrom model. utils import Tokenizer, Vocab, PAD _ TOKEN, SOS _ TOKEN, EOS _ TOKEN \ n \ nproject _ dir = Path ( _ _ file _ _ ). resolve ( ). parent \ ndatasets _ dir = project _ dir. joinpath ('datasets /') \ ncornell _ dir = datasets _ dir. joinpath ('cornell /') \ n \ n # Tokenizer \ ntokenizer = Tokenizer ('spacy') \ n \ ndef prepare _ cornell _ data ( ) : \ n " " " Download and unpack dialogs " " " \ n \ n zip _ url ='http : / / www. mpi - sws. org / ~ cristian / data / cornell _ movie _ dialogs _ corpus. zip'\ n zipfile _ path = datasets _ dir. joinpath ('cornell. zip') \ n \ n if not datasets _ dir. exists ( ) : \ n datasets _ dir. mkdir ( ) \ n \ n # Prepare Dialog data \ n if not cornell _ dir. exists ( ) : \ n print ( f'Downloading { zip _ url } to { zipfile _ path }') \ n urlretrieve ( zip _ url, zipfile _ path ) \ n print ( f'Successfully downloaded { zipfile _ path }') \ n \ n zip _ ref = ZipFile ( zipfile _ path,'r') \ n zip _ ref. extractall ( datasets _ dir ) \ n zip _ ref. close ( ) \ n \ n datasets _ dir. joinpath ('cornell movie - dialogs corpus'). rename ( cornell _ dir ) \ n \ n else : \ n print ('Cornell Data prepared!') \ n \ n \ ndef loadLines ( fileName, \ n fields = [ " lineID ", " characterID ", " movieID ", " character ", " text " ], \ n delimiter = " + + + $ + + + " ) : \ n " " " \ n Args : \ n fileName ( str ) : file to load \ n field ( set < str > ) : fields to extract \ n Return : \ n dict < dict < str > > : the extracted fields for each line \ n " " " \ n lines = { } \ n \ n with open ( fileName,'r ', encoding ='iso - 8859 - 1') as f : \ n for line in f : \ n values = line. split ( delimiter ) \ n \ n # Extract fields \ n lineObj = { } \ n for i, field in enumerate ( fields ) : \ n lineObj [ field ] = values [ i ] \ n \ n lines [ lineObj ['lineID'] ] = lineObj \ n \ n return lines \ n \ n \ ndef loadConvers

0x7o avatar May 18 '22 09:05 0x7o

@0x7o i think the solution may be to add the newline token https://discuss.huggingface.co/t/feat-tokenizers-how-to-make-models-aware-of-structuring-linebreaks/3711 , however, without training BERT from scratch with the newline token included, quality may suffer

i can't think of a solution besides finding a model out there that doesn't do away with newlines (i.e. doesn't treat them as plain whitespace)
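
for reference, registering the token would look something like this (a sketch against the HuggingFace API, untested with this repo's TrainingWrapper):

from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
# register "\n" as a special token so it survives whitespace normalization
tokenizer.add_special_tokens({"additional_special_tokens": ["\n"]})

model = BertModel.from_pretrained("bert-base-cased")
model.resize_token_embeddings(len(tokenizer))  # new embedding row is random

ids = tokenizer("First Citizen:\nWe are accounted poor citizens.")["input_ids"]
# "\n" now gets its own id, but its embedding is untrained -- hence the
# caveat above about quality suffering without further pretraining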

lucidrains avatar May 18 '22 19:05 lucidrains

the code will also have to be modularized to accept different models and their encoders, as a lot of the logic is specific to BERT-base
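
something like a small encoder protocol could be the target here (a hypothetical sketch of the abstraction, not a committed design):

from typing import List, Protocol

class Encoder(Protocol):
    # the minimal surface TrainingWrapper would need, instead of
    # hard-coding BERT-base's tokenizer and dimensions
    seq_len: int   # model's maximum sequence length
    pad_id: int    # padding token id

    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...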

lucidrains avatar May 18 '22 19:05 lucidrains

@0x7o the notebook is not accessible (the link says it doesn't exist). Could you please share a working link? It's very important to me. Thanks

aakashgoel12 avatar Feb 22 '23 18:02 aakashgoel12

@0x7o would you share your notebook from above? If not, that's cool, or if it's long gone, I get that.

Thank you, -steve

sdake avatar Jun 30 '23 12:06 sdake

I appreciate your interest immensely, but I no longer have access to the notebook, as it has been irretrievably deleted.

0x7o avatar Jun 30 '23 12:06 0x7o