
WIP: single embeddings matrix design

Open Adamits opened this issue 2 years ago • 46 comments

Throwing this WIP up to store all vocabs in a single embeddings matrix shared between source, target, and features. This will fix the current pointer-generator issues when we have disjoint vocabularies. I will get back to implementing it later this week.

  • Computes a single vocab_map.
  • Retains target_vocabulary and source_vocabulary for logging.
  • Moves embedding initialization onto the model.
  • Passes a single embedding layer to each module to share (see the sketch below).
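
For reference, a minimal runnable sketch of that design, assuming a PyTorch-style setup; SharedEmbeddingModel and its layer choices are illustrative, not Yoyodyne's actual classes:

import torch
from torch import nn


class SharedEmbeddingModel(nn.Module):
    """Toy model: one embedding matrix over the merged vocabulary, handed to
    both the encoder and the decoder instead of per-side matrices."""

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        # Single matrix covering source + target (+ features) symbols.
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        # Both modules index into the same matrix.
        self.encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, source, target):
        _, state = self.encoder(self.embeddings(source))
        decoded, _ = self.decoder(self.embeddings(target), state)
        return decoded


# E.g., a merged vocabulary of 100 symbols:
model = SharedEmbeddingModel(vocab_size=100, embedding_size=8, hidden_size=16)
output = model(torch.randint(100, (2, 5)), torch.randint(100, (2, 4)))
print(output.shape)  # torch.Size([2, 4, 16])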

TODO

  • I did not do anything with features. I still need to merge these into the single vocab, being careful not to break anything that requires separate features.
  • This does not handle the case where someone wants unmerged embeddings, such that the same Unicode codepoint on the source and target side should be treated as different symbols. To implement this, we will need to update the index to mark those characters with special symbols.

Adamits avatar Oct 02 '23 19:10 Adamits

I created issue #142 to document the plan here.

kylebgorman avatar Oct 04 '23 17:10 kylebgorman

I just rebased a ton of commits and want to make sure I didn't break anything.

EDIT: Hmm, I did not realize that all rebased commits would appear as changes in this PR. I have not resolved complicated git merges in a long time and thought rebasing was the right move, but I don't like how this looks...

Should I scrap this and merge master into my branch instead?

EDIT 2: I have done so; I think merging is better here.

Adamits avatar Apr 09 '24 18:04 Adamits

I'm not dogmatic about merge vs. rebase, but the way the PR looks when I review it is plenty clear as is.

kylebgorman avatar Apr 09 '24 19:04 kylebgorman

Yeah sounds good.

vocab_map would be better just called index and we should throw out the SymbolMap class.

Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

Adamits avatar Apr 09 '24 19:04 Adamits

Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

You could make it so self.index can be called that way, i.e., by percolating that ability upward...if I'm understanding correctly.

kylebgorman avatar Apr 09 '24 19:04 kylebgorman

Ok, I was working on replacing the indexing with one method as you suggested, but there is an issue here:

Our output space should be target_vocab_size. We could set it to the full vocab_size (source + target + features), but this seems theoretically odd: in many cases most of the output space is invalid, yet we would still spill probability mass into it.

If we set it to a subset of the vocabulary (target_vocab_size), then we need to somehow map those output indices back to our vocabulary. Right now I am thinking about how this impacts turning ints back into symbols, but I think it also impacts decoder modules, which need to look up the target in the shared embedding matrix.

Adamits avatar Apr 09 '24 22:04 Adamits

I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.

Adamits avatar Apr 09 '24 22:04 Adamits

Why not just make the target vocab the first or last N elements of the vocabulary and keep track of what N is? Then initialize the output space to N.

Alternatively you can keep track of the target vocab separately from the omnibus/shared vocab.
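
For concreteness, a toy sketch of the "target vocabulary comes first" layout (the variable names are made up, not Yoyodyne code): if the target symbols occupy indices 0 through N - 1 of the merged vocabulary, the output layer can have size N and its predictions are already valid indices into the shared embedding matrix, with no offset bookkeeping.

import torch
from torch import nn

target_vocab = ["a", "b", "c", "d"]   # first N entries of the merged vocabulary
source_only_vocab = ["x", "y"]        # remaining entries
vocabulary = target_vocab + source_only_vocab

embeddings = nn.Embedding(len(vocabulary), 8)
output_layer = nn.Linear(16, len(target_vocab))  # output space limited to N

hidden = torch.zeros(1, 16)
prediction = output_layer(hidden).argmax(dim=-1)  # index in 0..N-1
# The prediction is directly usable as an index into the shared embeddings:
next_input = embeddings(prediction)
print(prediction.item(), next_input.shape)  # e.g. 2 torch.Size([1, 8])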

kylebgorman avatar Apr 09 '24 22:04 kylebgorman

Still probably some clunkiness here, and I need to fix formatting. But I wanted to put this up in case you want to take a look:

I have tested this for the attentive LSTM and the pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow, I also want to make --tie_embeddings the default (i.e., have it default to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

Adamits avatar Apr 09 '24 23:04 Adamits

I have tested this for the attentive LSTM and the pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow, I also want to make --tie_embeddings the default (i.e., have it default to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

Look what we did for --log_wandb and --no_log_wandb here.

kylebgorman avatar Apr 10 '24 00:04 kylebgorman

Look what we did for --log_wandb and --no_log_wandb here.

Nevermind, that's not relevant. I was looking for a flag that defaults to True. Here's a better example.
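
For what it's worth, one common argparse pattern for a paired flag that defaults to True looks roughly like this (a generic sketch, not necessarily how Yoyodyne defines its flags):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tie_embeddings",
    action="store_true",
    default=True,
    help="share embeddings between source and target (default: enabled)",
)
parser.add_argument(
    "--no_tie_embeddings",
    action="store_false",
    dest="tie_embeddings",
    help="use separate source and target embeddings",
)
print(parser.parse_args([]).tie_embeddings)                       # True
print(parser.parse_args(["--no_tie_embeddings"]).tie_embeddings)  # False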

kylebgorman avatar Apr 10 '24 02:04 kylebgorman

I have cleaned this up and tested it. Everything works except the transducer right now, which seems to be because of the target_vocab_size manipulation that happens in the model here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/transducer.py#L43.

I think it's because target_vocab_size gets updated AFTER we initialize the single embedding matrix. This can cause index-out-of-bounds errors. I can't work on this until tonight or tomorrow. @bonham79, in the meantime, if you have any intuition about a good fix for this, please take a look!
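
A minimal illustration of the ordering problem (a toy example, not the transducer code itself): if the vocabulary grows after the embedding matrix has been built, the new indices fall outside the matrix.

import torch
from torch import nn

vocab_size = 10
embeddings = nn.Embedding(vocab_size, 4)  # matrix built with the original size
vocab_size += 2                           # e.g., more symbols added afterwards
try:
    embeddings(torch.tensor([vocab_size - 1]))  # index 11, but only 10 rows exist
except IndexError as error:
    print(error)  # index out of range in self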

Adamits avatar Apr 11 '24 18:04 Adamits

And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

Adamits avatar Apr 11 '24 18:04 Adamits

And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)

kylebgorman avatar Apr 11 '24 18:04 kylebgorman

pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)

Ah thanks, looks like that worked.

Adamits avatar Apr 11 '24 18:04 Adamits

README should also be updated to reflect.

Ok will do. I think just --tied_embeddings/--no_... right?

Would you say this simplifies the "symbol math" we have to go through, or is it worse?

Can you be more specific here? What is the symbol math?

It looks to me like you deleted the two embedding initialization static methods in the base model class, and then didn't replace them anywhere else.

I moved them from the base module to the base model to follow the design pattern we discussed. I think the two base.py modeling files made it look confusing :D.

Adamits avatar Apr 11 '24 23:04 Adamits

Ok will do. I think just --tied_embeddings/--no_... right?

Sure. I might create a section before the section on reserved symbols, call it Tied embeddings, and then say that by default we share embeddings between the source and target, but one can disable this with --no_tied_embeddings. That seems quite pithy.

Then you should also maybe mention the new class of reserved symbols in the following section: {...}.

Would you say this simplifies the "symbol math" we have to go through, or is it worse? Can you be more specific here?

Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings), more complicated, or about the same?

I moved them from the base module to the base model to follow the design pattern we discussed. I think the two base.py modeling files made it look confusing :D.

Roger that.

kylebgorman avatar Apr 11 '24 23:04 kylebgorman

It's not in your PR description, but this will close #142, and I think it will close #156 too.

kylebgorman avatar Apr 16 '24 22:04 kylebgorman

I think I have cleaned this up per your comments.

Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings), more complicated, or about the same?

I think we can argue that it is simpler since there is less to fix up after the index is created. However, this does introduce two new sources of complexity:

  1. Source tokens can now be wrapped in {...}. I cannot imagine a scenario where this impacts a user in a noticeable way beyond debugging, but it is worth noting that this kind of implicit behavior could be confusing.
  2. We now need to track both the index subspace for the features vocabulary (which we did before) and the index subspace for the target vocabulary (see the sketch after this list).
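
To illustrate both points, here is a hypothetical sketch (ToyIndex, its field names, and its layout are made up, not the actual indexes.py API) of a merged index that wraps symbols occurring only on the source side in {...} and tracks the target and features subspaces:

from dataclasses import dataclass


@dataclass
class ToyIndex:
    """One merged symbol list, with bookkeeping for the target and features
    subspaces; source-only symbols are wrapped in {...} to stay distinct."""

    target_vocabulary: list
    features_vocabulary: list
    source_only_vocabulary: list

    @property
    def symbols(self):
        # One possible layout: target first, then features, then source-only.
        return (
            self.target_vocabulary
            + self.features_vocabulary
            + [f"{{{symbol}}}" for symbol in self.source_only_vocabulary]
        )

    @property
    def target_vocab_size(self):
        return len(self.target_vocabulary)

    @property
    def features_vocab_size(self):
        return len(self.features_vocabulary)


index = ToyIndex(["a", "b", "c"], ["+PL"], ["x"])
print(index.symbols)            # ['a', 'b', 'c', '+PL', '{x}']
print(index.target_vocab_size)  # 3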

Adamits avatar Apr 22 '24 15:04 Adamits

It looks like we need to test whether there are features, since I'm getting a breakage when there aren't. I'll just modify it to reflect that...

kylebgorman avatar Apr 22 '24 15:04 kylebgorman

It looks like we need to test whether there are features, since I'm getting a breakage when there aren't. I'll just modify it to reflect that...

Oh shoot, thank you for testing.

Adamits avatar Apr 22 '24 15:04 Adamits

Okay, I made a change to indexes.py so it no longer crashes if there are no features. This does seem to work on my Polish tests, but validation accuracy is quite poor (though non-zero) on Maori, for which we have no features. The best checkpoint gives me only 11% accuracy. Can you take a look at this (after my commit 38311f530438b1154f75e83102ef3a13f134c613) and see if you can replicate? Let me know if you need the command.

kylebgorman avatar Apr 22 '24 15:04 kylebgorman

Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...

kylebgorman avatar Apr 22 '24 16:04 kylebgorman

Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...

Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.

Adamits avatar Apr 22 '24 21:04 Adamits

Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.

Yes, to confirm: by default the embedding is shared, and that ought to just work with pointer-generator architectures (which require that the embedding be shared during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings on my LSTM experiments, though.)

kylebgorman avatar Apr 22 '24 22:04 kylebgorman

I trained on the Polish abstractness data with features (pointer_generator_lstm, LSTM encoder, linear features encoder, hidden size 512, 1 layer, embedding size 128).

Here are the first 3 val accs with the master branch:

0.7530120611190796
0.8885542154312134
0.9036144614219666

And with this PR code:

0.8177710771560669
0.8885542154312134
0.9066265225410461

There will be variance since embedding initialization is a function of the embedding matrix size.

Adamits avatar Apr 22 '24 22:04 Adamits

Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.

Yes, to confirm: by default the embedding is shared, and that ought to just work with pointer-generator architectures (which require that the embedding be shared during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings on my LSTM experiments, though.)

To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine, I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail separate src/tgt vocab indices.
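
For context, a simplified sketch (not Yoyodyne's implementation) of why the combination step assumes one shared index space: the attention weights over source positions are scattered into the same vocabulary axis as the generation distribution before the two are mixed, so a source symbol's index has to mean the same thing in both distributions.

import torch

vocab_size = 6
source_indices = torch.tensor([[2, 4, 5]])    # source symbols in the shared index space
attention = torch.tensor([[0.2, 0.5, 0.3]])   # pointer weights over source positions
generation = torch.full((1, vocab_size), 1.0 / vocab_size)
gate = torch.tensor([[0.7]])                  # probability of generating vs. copying

# Scatter the copy probabilities onto the same vocabulary axis, then mix.
pointer = torch.zeros(1, vocab_size).scatter_add(1, source_indices, attention)
mixture = gate * generation + (1.0 - gate) * pointer
print(mixture)        # copied mass lands on indices 2, 4, and 5
print(mixture.sum())  # sums to 1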

Adamits avatar Apr 22 '24 22:04 Adamits

I trained on the Polish abstractness data with features (pointer_generator_lstm, LSTM encoder, linear features encoder, hidden size 512, 1 layer, embedding size 128).

Here are the first 3 val accs with the master branch:

0.7530120611190796
0.8885542154312134
0.9036144614219666

And with this PR code:

0.8177710771560669
0.8885542154312134
0.9066265225410461

There will be variance since embedding initialization is a function of the embedding matrix size.

I am getting similar results with that setup. Okay, false alarm on my part I guess. I may have installed from the wrong commit. Shall I merge?

kylebgorman avatar Apr 22 '24 22:04 kylebgorman

To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine, I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail separate src/tgt vocab indices.

Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?

kylebgorman avatar Apr 22 '24 22:04 kylebgorman

Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went NaN around 20 epochs. See if you can replicate?

readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
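# Shards 0-7 are concatenated as the training split; shard 8 (below) is held
# out for validation. DATA is assumed to be set earlier in the script.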
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly VAL="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --val "${VAL}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --source_encoder lstm \
    --features_encoder linear \
    --embedding_size 128 \
    --hidden_size 512 \
    --learning_rate .001 \
    --optimizer adam \
    --batch_size 32 \
    --max_epochs 60 \
    --dropout .3 \
    --seed 49 \
    --gradient_clip_val 3 \
    --precision 16 \
    --accelerator gpu

kylebgorman avatar Apr 22 '24 22:04 kylebgorman