punctuator2 Adapting from previous model state

    print "Building model..."
    net = models.GRUstage2(
        rng=rng,
        x=x,
        minibatch_size=MINIBATCH_SIZE,
        n_hidden=num_hidden,
        x_vocabulary=word_vocabulary,
        y_vocabulary=punctuation_vocabulary,
        stage1_model_file_name=prev_model_file_name
    )

Number of parameters is 13667593
/theano/scan_module/scan.py", line 475, in scan
    actual_slice = seq['input'][k - mintap]
TypeError: 'NoneType' object has no attribute '__getitem__'

When adapting a previously trained model file to a new data set, I get the error above. I don't supply any pause info, so it is null by default.

May 22 '17 08:05 aliabbasjp

Hi,

thanks for reporting!

It should actually work if you don't make any changes to the original main2.py script (I see you have removed the "p=p" line). Current implementation is rather memory inefficient -- even when there are no <sil=0.000> tags present in the text, the processed dataset will have all-zeros pause arrays in it which are fed to the network during training. Haven't had a good excuse to optimize that yet.
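To illustrate the behaviour described above, here is a minimal sketch (not the actual data.py code) of how text without pause tags still ends up with an all-zeros pause array. The function name `split_words_and_pauses` is hypothetical, and the sketch assumes that a `<sil=...>` tag follows the word it belongs to.

```python
import re

# Assumed tag format, based on the <sil=0.000> example mentioned in this thread.
SIL_TAG = re.compile(r"^<sil=(\d+(?:\.\d+)?)>$")


def split_words_and_pauses(text):
    """Return (words, pauses); a word with no pause tag after it gets 0.0."""
    words, pauses = [], []
    for token in text.split():
        match = SIL_TAG.match(token)
        if match:
            if pauses:
                # Attach the pause duration to the preceding word.
                pauses[-1] = float(match.group(1))
        else:
            words.append(token)
            pauses.append(0.0)
    return words, pauses


# Text with no pause tags at all: every pause defaults to zero.
words, pauses = split_words_and_pauses("hello world how are you")
print(pauses)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The point of the sketch is only that a pause value exists for every word, whether or not the input text contains any tags.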

May 22 '17 21:05 ottokart

@ottokart p=None as I don't have pause info.

May 23 '17 07:05 aliabbasjp

@aliabbasjp yes, I understand, but try with unmodified code and follow the instructions in the readme (slightly modified below):

Data preparation. In <adaptation_data_dir> there should be *.test.txt, *.dev.txt and other *.txt (for training) files just as in <data_dir>. Pause info not needed (no <sil=0.000> tags necessary):

python data.py <data_dir> <adaptation_data_dir>

The first stage can be trained with:

python main.py <model_name> <hidden_layer_size> <learning_rate>

Adaptation stage can be trained with:

python main2.py <adapted_model_name> <hidden_layer_size> <learning_rate> <first_stage_model_path>
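Before running data.py, the file layout described above for <adaptation_data_dir> can be checked with a small script. This is only a sketch: the helper names `classify_text_files` and `missing_splits` are made up for illustration and are not part of the repository, and the grouping simply follows the file-name convention stated above.

```python
import os


def classify_text_files(directory):
    """Group the *.txt files in a directory into test, dev and training files."""
    groups = {"test": [], "dev": [], "train": []}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue  # ignore anything that is not a text file
        if name.endswith(".test.txt"):
            groups["test"].append(name)
        elif name.endswith(".dev.txt"):
            groups["dev"].append(name)
        else:
            groups["train"].append(name)  # every other *.txt is training data
    return groups


def missing_splits(directory):
    """Return the names of the splits that have no files in the directory."""
    groups = classify_text_files(directory)
    return [split for split in ("test", "dev", "train") if not groups[split]]
```

Calling `missing_splits("<adaptation_data_dir>")` with the real path should return an empty list when test, dev and training files are all present.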

Let me know if you still get an error or if some parts of the readme are unclear.

May 23 '17 07:05 ottokart

OK, I managed to get it trained!

But when testing with punctuator.py I get the following error, as I don't supply pause info on the test set either.

    if use_pauses:
        print "Using pauses"
    
        p = T.matrix('p')

        print "Loading model parameters..."
        net, _ = models.load(model_file, 1, x, p)

        print "Building model..."
        predict = theano.function(
            inputs=[x, p],
            outputs=net.y
        )

The code above, which loads the model, gives the following error:

self.inv_finder[c]))
TypeError: Missing required input: p

How can I give a dummy p for the test set when I won't have pause info during evaluation?

May 23 '17 08:05 aliabbasjp

@ottokart Basically, I only need target-domain adaptation without pause features, so I need to send an empty array of pauses to run punctuator.py. How do I do that?

May 23 '17 09:05 aliabbasjp

I added dummy pauses to the punctuator.py script. Can you pull the new version and try something like:

cat data.dev.txt | python punctuator.py <model_path> <model_output_path> 1

That 1 at the end is important.

May 23 '17 13:05 ottokart

I have used the pre-trained model Demo-Europarl-EN.pcl for punctuation prediction, and the results are as follows:

    PUNCTUATION        PRECISION  RECALL  F-SCORE
    ,COMMA                  71.9    75.5     73.7
    .PERIOD                 74.2    32.9     45.6
    ?QUESTIONMARK           58.3    11.3     18.9
    !EXCLAMATIONMARK         nan     0.0      nan
    :COLON                  55.2    26.7     36.0
    ;SEMICOLON              33.3     3.8      6.9
    -DASH                   40.6     9.7     15.7
    Overall                 72.0    55.3     62.5

    Err: 5.86%
    SER: 60.7%
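For reading these numbers: the F-SCORE column is consistent with the usual harmonic mean of precision and recall. The sketch below reproduces the COMMA value from the figures above. The `precision` helper and the explanation of the nan are assumptions: the nan for EXCLAMATIONMARK would arise if no exclamation marks were predicted at all (0/0), which the output itself does not confirm.

```python
import math


def precision(true_pos, false_pos):
    """Precision in percent; undefined (nan) when nothing was predicted."""
    predicted = true_pos + false_pos
    return 100.0 * true_pos / predicted if predicted else float("nan")


def f_score(p, r):
    """Harmonic mean of precision and recall, both given in percent."""
    if math.isnan(p) or math.isnan(r) or p + r == 0:
        return float("nan")
    return 2 * p * r / (p + r)


# COMMA row: precision 71.9, recall 75.5
print(round(f_score(71.9, 75.5), 1))  # 73.7

# Assumed EXCLAMATIONMARK case: zero predictions, so precision is 0/0
print(f_score(precision(0, 0), 0.0))  # nan
```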

With the following configuration under Ubuntu 16.04:

Theano Version: 1.0.4+10.g9feed7868

Python 3.6.8 :: Anaconda, Inc.

Can you advise how I can improve the performance to reach the baseline model?

Thanks a lot

Dick

May 02 '19 04:05 dickhung