
Move Vocab & Embedding setup code into readers

undwie opened this issue on Nov 12 '17 · 8 comments

I wrote some code that uses a jack reader (its model, pre- and post-processing) but my own training loop---see below. This should be useful for others as a starting point for, say, multi-task learning objectives. Personally, I think it's a good idea to treat this as the core way of interfacing with jack. Everything else (a CLI training script etc.) should be a light wrapper around this.

    import tensorflow as tf
    # load_config, load_embeddings, Vocab, SharedResources, Ports and the
    # XQA*/FastQA modules below all come from the jack/jtr package

    jack_train = ...  # some jack (QASetting, Answer) pairs

    emb_file = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
    embeddings = load_embeddings(emb_file, 'glove')

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50

    # the vocab is initialized from the pre-trained embeddings and shared
    # with all three modules through SharedResources
    vocab = Vocab(emb=embeddings, init_from_embeddings=True)
    shared_resources = SharedResources(vocab=vocab, config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

    tf.reset_default_graph()
    reader.setup_from_data(jack_train, is_training=True)

    # pull the loss tensor out of the model module and attach our own optimizer
    loss = reader.model_module.tensors[Ports.loss]
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
    min_op = optimizer.minimize(loss)

    sess = model_module.tf_session
    sess.run(tf.global_variables_initializer())

    # custom training loop over mini-batches (batch size 1 here)
    for epoch in range(10):
        for batch in reader.input_module.batch_generator(jack_train, 1, False):
            feed_dict = reader.model_module.convert_to_feed_dict(batch)
            loss_value, _ = sess.run((loss, min_op), feed_dict=feed_dict)
        if epoch % 5 == 1:
            print(loss_value)  # report the most recent batch loss

I think we should work on streamlining this. What I dislike about the above is the handling of vocabs and embeddings. This is reader-specific, and should be handled internally. So I'd rather it looked like:


    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    # the embedding file becomes just another config entry
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'

    shared_resources = SharedResources(config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

and then the input module (or model module) takes care of setting up vocabs and embeddings. Because we don't have this separation right now, vocab and embedding code creeps into learning code, even though it's specific to readers. The change above would fix this.
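
(For illustration, a minimal sketch of what this could look like inside the input module, reusing load_embeddings and Vocab from the first snippet; the setup hook shown here is hypothetical:)

    # Hypothetical sketch: the input module builds vocab and embeddings from
    # the config, so this code never leaks into the user's training loop.
    class XQAInputModule:
        def __init__(self, shared_resources):
            self.shared_resources = shared_resources

        def setup(self):
            config = self.shared_resources.config
            if self.shared_resources.vocab is None and 'emb_file' in config:
                embeddings = load_embeddings(config['emb_file'], 'glove')
                self.shared_resources.vocab = Vocab(
                    emb=embeddings, init_from_embeddings=True)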

To make this happen, the reader owners need to move their embedding/vocab setup code into their modules.

Any thoughts or objections?

undwie avatar Nov 12 '17 12:11 undwie

This is also useful for writing more fine-grained tests that check things during training.

I quickly turned your code here into a test: https://github.com/uclmr/jack/blob/master/tests/jack/readers/test_fastqa_loop.py

pminervini avatar Nov 12 '17 17:11 pminervini

I think vocabs and embeddings could simply be created and loaded in the shared resources rather than in the modules. Handling them in the modules would introduce unnecessary overhead for most implementations, because embedding handling is fairly reader-agnostic.
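
(A rough sketch of that alternative, with a hypothetical SharedResources constructor that loads the embeddings itself; load_embeddings is the helper from the first snippet:)

    # Hypothetical sketch: SharedResources loads the embeddings once from the
    # config, and all modules simply read them from the shared resources.
    class SharedResources:
        def __init__(self, config, vocab=None):
            self.config = config
            self.vocab = vocab
            self.embeddings = None
            if 'emb_file' in config:
                # loaded once here, shared by input/model/output modules
                self.embeddings = load_embeddings(config['emb_file'], 'glove')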

dirkweissenborn avatar Nov 13 '17 10:11 dirkweissenborn

I'd disagree. I have already seen readers that a) may or may not need character embeddings, b) use several vocabs for different parts (questions, answers, support), or c) only use pre-trained embeddings. All of these seem like reasonable decisions that are reader-specific (or input-module-specific).

This doesn't stop us from factoring out a lot of functionality, so that readers/input modules that really do share behaviour only need a single call to some create_vocab-style function, as sketched below.
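
(For example, a hypothetical create_vocab helper along those lines; the name and signature are illustrative:)

    # Hypothetical factored-out helper: readers with standard needs call this
    # once; readers with custom vocab requirements build their own instead.
    def create_vocab(config, emb_format='glove'):
        embeddings = None
        if 'emb_file' in config:
            embeddings = load_embeddings(config['emb_file'], emb_format)
        return Vocab(emb=embeddings,
                     init_from_embeddings=embeddings is not None)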

riedelcastro avatar Nov 13 '17 10:11 riedelcastro

I agree with the vocab part, but the embeddings can be provided separately already through the shared resources.

dirkweissenborn avatar Nov 13 '17 10:11 dirkweissenborn

I'd prefer it if the reader were in charge of that too. It makes setting up readers more straightforward for the user. Right now I have to think both about what I need to prepare (the embeddings) and about what the reader does to populate the shared resources. I also need to look into the reader to decide whether it needs embeddings at all (if it doesn't, I shouldn't load any). I would like the API usage to be:

  1. Setup configuration dictionary
  2. Load the reader with this configuration

I also feel that even embedding loading can get convoluted (for example, if you decide to preload only the embeddings that actually occur in the training set, or if you want to load other types of pre-trained models).
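
(A minimal sketch of that two-step usage; create_reader is a hypothetical factory that wires up the modules and resources internally:)

    # 1. set up the configuration dictionary
    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'

    # 2. load the reader with this configuration; everything else, including
    #    vocab and embedding setup, happens behind this call
    reader = create_reader('fastqa', config)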

riedelcastro avatar Nov 13 '17 11:11 riedelcastro

I don't have a strong opinion here, and I understand your point. It also makes it clear to the user where these resources come from, rather than having them auto-magically created somewhere else first. We had exactly this problem before, when people did not understand where embeddings and vocabs actually come from and how they are created. Anyway, this is a major change, so we should push it sooner rather than later.

dirkweissenborn avatar Nov 13 '17 11:11 dirkweissenborn

@riedelcastro @pminervini, anyone working on this?

dirkweissenborn avatar Nov 22 '17 15:11 dirkweissenborn

Vocab and embeddings are now decoupled. The setup code should still move into the readers themselves, though.

dirkweissenborn avatar Jul 17 '18 14:07 dirkweissenborn