
Move Vocab & Embedding setup code into readers

undwie opened this issue on Nov 12 '17 · 8 comments

I wrote some code that uses a jack reader (its model, pre- and post-processing) but my own training loop---see below. This should be useful for others as a starting point for, say, multi-task learning objectives. Personally, I think it's a good idea to treat this as the core way of interfacing with jack. Everything else (a CLI training script etc.) should be a light wrapper around this.

    import tensorflow as tf
    # load_config, load_embeddings, Vocab, SharedResources, Ports and the
    # XQA*/FastQA modules below all come from the jack/jtr package

    jack_train = ...  # some jack (QASetting, Answer) pairs

    emb_file = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
    embeddings = load_embeddings(emb_file, 'glove')

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50

    # the vocab is initialized from the pre-trained embeddings and shared
    # with all three modules through SharedResources
    vocab = Vocab(emb=embeddings, init_from_embeddings=True)
    shared_resources = SharedResources(vocab=vocab, config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

    tf.reset_default_graph()
    reader.setup_from_data(jack_train, is_training=True)

    # pull the loss tensor out of the model module and attach our own optimizer
    loss = reader.model_module.tensors[Ports.loss]
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
    min_op = optimizer.minimize(loss)

    sess = model_module.tf_session
    sess.run(tf.global_variables_initializer())

    # custom training loop over mini-batches (batch size 1 here)
    for epoch in range(10):
        for batch in reader.input_module.batch_generator(jack_train, 1, False):
            feed_dict = reader.model_module.convert_to_feed_dict(batch)
            loss_value, _ = sess.run((loss, min_op), feed_dict=feed_dict)
        if epoch % 5 == 1:
            print(loss_value)  # report the most recent batch loss

I think we should work on streamlining this. What I dislike about the above is the handling of vocabs and embeddings. This is reader-specific, and should be handled internally. So I'd rather it looked like:


    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    # the embedding file becomes just another config entry
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'

    shared_resources = SharedResources(config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

and then the input module (or model module) takes care of setting up vocabs and embeddings. Because we don't have this separation right now, vocab and embedding code creeps into learning code, even though it's specific to readers. The change above would fix this.
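
(For illustration, a minimal sketch of what this could look like inside the input module, reusing load_embeddings and Vocab from the first snippet; the setup hook shown here is hypothetical:)

    # Hypothetical sketch: the input module builds vocab and embeddings from
    # the config, so this code never leaks into the user's training loop.
    class XQAInputModule:
        def __init__(self, shared_resources):
            self.shared_resources = shared_resources

        def setup(self):
            config = self.shared_resources.config
            if self.shared_resources.vocab is None and 'emb_file' in config:
                embeddings = load_embeddings(config['emb_file'], 'glove')
                self.shared_resources.vocab = Vocab(
                    emb=embeddings, init_from_embeddings=True)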

To make this happen, the reader owners need to move their embedding/vocab setup code into their modules.

Any thoughts or objections?

undwie avatar Nov 12 '17 12:11 undwie

This is also useful for writing more fine-grained tests that check things during training.

I quickly turned your code here into a test: https://github.com/uclmr/jack/blob/master/tests/jack/readers/test_fastqa_loop.py

pminervini avatar Nov 12 '17 17:11 pminervini

I think vocabs and embeddings could simply be created and loaded in the shared resources rather than in the modules. Handling them in the modules would introduce unnecessary overhead for most implementations, because embedding handling is fairly reader-agnostic.
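
(A rough sketch of that alternative, with a hypothetical SharedResources constructor that loads the embeddings itself; load_embeddings is the helper from the first snippet:)

    # Hypothetical sketch: SharedResources loads the embeddings once from the
    # config, and all modules simply read them from the shared resources.
    class SharedResources:
        def __init__(self, config, vocab=None):
            self.config = config
            self.vocab = vocab
            self.embeddings = None
            if 'emb_file' in config:
                # loaded once here, shared by input/model/output modules
                self.embeddings = load_embeddings(config['emb_file'], 'glove')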

dirkweissenborn avatar Nov 13 '17 10:11 dirkweissenborn

I'd disagree. I have already seen readers that a) may or may not need character embeddings, b) use several vocabs for different parts (questions, answers, support), or c) only use pre-trained embeddings. All of these seem like reasonable decisions that are reader-specific (or input-module-specific).

This doesn't stop us from factoring out a lot of functionality, so that readers/input modules that really do share behaviour only need a single call to some create_vocab-style function, as sketched below.
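
(For example, a hypothetical create_vocab helper along those lines; the name and signature are illustrative:)

    # Hypothetical factored-out helper: readers with standard needs call this
    # once; readers with custom vocab requirements build their own instead.
    def create_vocab(config, emb_format='glove'):
        embeddings = None
        if 'emb_file' in config:
            embeddings = load_embeddings(config['emb_file'], emb_format)
        return Vocab(emb=embeddings,
                     init_from_embeddings=embeddings is not None)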

riedelcastro avatar Nov 13 '17 10:11 riedelcastro

I agree with the vocab part, but the embeddings can be provided separately already through the shared resources.

dirkweissenborn avatar Nov 13 '17 10:11 dirkweissenborn

I'd prefer it if the reader were in charge of that too. It makes setting up readers more straightforward for the user. Right now I have to think both about what I need to prepare (the embeddings) and about what the reader does to populate the shared resources. I also need to look into the reader to decide whether it needs embeddings at all (if it doesn't, I shouldn't load any). I would like the API usage to be:

  1. Setup configuration dictionary
  2. Load the reader with this configuration

I also feel that even embedding loading can get convoluted (for example, if you decide to preload only the embeddings that actually occur in the training set, or if you want to load other types of pre-trained models).
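
(A minimal sketch of that two-step usage; create_reader is a hypothetical factory that wires up the modules and resources internally:)

    # 1. set up the configuration dictionary
    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'

    # 2. load the reader with this configuration; everything else, including
    #    vocab and embedding setup, happens behind this call
    reader = create_reader('fastqa', config)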

riedelcastro avatar Nov 13 '17 11:11 riedelcastro

I don't have a strong opinion here, and I understand your point. It also makes it clear to the user where these resources come from, rather than having them auto-magically created somewhere else first. We had exactly this problem before, when people did not understand where embeddings and vocabs actually come from and how they are created. Anyway, this is a major change, so we should push it sooner rather than later.

dirkweissenborn avatar Nov 13 '17 11:11 dirkweissenborn

@riedelcastro @pminervini, anyone working on this?

dirkweissenborn avatar Nov 22 '17 15:11 dirkweissenborn

Vocab and embeddings are now decoupled. The setup code should still move into the readers themselves, though.

dirkweissenborn avatar Jul 17 '18 14:07 dirkweissenborn