yoyodyne
WIP: single embeddings matrix design
Throwing this WIP up to store all vocabs in a single embeddings matrix shared between source, target, and features. This will fix the current pointer-generator issues when we have disjoint vocabularies. I will get back to implementing it later this week.
- Computes a single `vocab_map`.
- Retains `target_vocabulary` and `source_vocabulary` for logging.
- Moves embedding initialization onto the model.
- Passes a single embedding layer to each module to share (see the sketch below).
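A minimal sketch of that shape, assuming PyTorch LSTM modules and made-up names (this is not the actual yoyodyne API):

```python
import torch
from torch import nn


class SharedEmbeddingSketch(nn.Module):
    """One embedding matrix shared by the encoder and decoder.

    `vocab_size` is the size of the merged source/target/features vocabulary.
    """

    def __init__(self, vocab_size, embedding_size, hidden_size, pad_idx=0):
        super().__init__()
        # Single embedding matrix; both sides index into it.
        self.embeddings = nn.Embedding(
            vocab_size, embedding_size, padding_idx=pad_idx
        )
        self.encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, source: torch.Tensor, target: torch.Tensor):
        # Source and target symbols share indices, so one lookup table
        # serves both sides.
        _, state = self.encoder(self.embeddings(source))
        decoded, _ = self.decoder(self.embeddings(target), state)
        return decoded
```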
TODO
- I did not do anything with features. I still need to merge these into the single vocab, being careful not to break anything that requires separate features.
- This does not handle the case that someone may want to have unmerged embeddings, such that the same unicode codepoint on the source and target side should be considered different. We will need to update the index to mark those characters with special symbols to implement this.
I created issue #142 to document the plan here.
I just rebased a ton of commits and want to make sure I didn't break anything.
EDIT: Hmmm, I did not realize that all rebased commits would appear as changes in this PR. I have not resolved complicated git merges in a long time and thought rebasing was the right move, but I don't like how this looks...
Should I scrap this and merge master into my branch instead?
EDIT 2: I have done so; I think merging is better here.
I'm not a merge vs. rebase N*zi but the way the PR looks when I review is plenty clear as is.
Yeah sounds good.
`vocab_map` would be better just called `index` and we should throw out the `SymbolMap` class.
Does this mean we would call it like `index.index` (consider that in dataset.py we currently have e.g. `self.index.vocab_map`)?
You could make it so `self.index` could be called that way, i.e., by percolating that ability upward...if I'm understanding correctly.
Ok I was working on replacing the indexing with one method as you suggested but there is an issue here:
our output space should be `target_vocab_size`. We could set this to the full `vocab_size` (source + target + features), but this seems theoretically odd, since in many cases most of the output space is invalid, yet we would still spill probability mass into it.
If we set it to a subset of the vocab (`target_vocab_size`), then we need to somehow map these output indices back to our vocabulary. Right now I was thinking about how this impacts turning ints back into symbols, but I think this also impacts decoder modules, which need to look up the target in the shared embedding matrix.
I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.
Why not just make it the case that the target vocab is the first or last N elements of the vocabulary, and then keep track of what N is? Then, initialize the output space to N.
Alternatively you can keep track of the target vocab separately from the omnibus/shared vocab.
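For illustration, a sketch of that layout under made-up vocabularies (not the actual yoyodyne index-building code): the target symbols occupy the first N indices of the merged vocabulary, so the output layer can be sized N and its predictions index directly into the shared embedding matrix with no offset arithmetic.

```python
from torch import nn

# Hypothetical vocabularies; the real index is built elsewhere.
target_vocab = ["<pad>", "<s>", "</s>", "a", "b", "c"]
source_only = ["x", "y"]
features = ["[+PL]", "[+SG]"]

# Target symbols come first, so prediction indices double as embedding indices.
symbols = target_vocab + source_only + features
index = {symbol: i for i, symbol in enumerate(symbols)}

target_vocab_size = len(target_vocab)  # N: the size of the output space.
vocab_size = len(symbols)              # source + target + features.

embeddings = nn.Embedding(vocab_size, 128)         # Shared matrix.
output_layer = nn.Linear(512, target_vocab_size)   # Scores only valid targets.
```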
Still probably some clunkiness here, and I need to fix formatting. But I wanted to put this up in case you want to take a look:
I have tested this with the attentive LSTM and the pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to make --tie_embeddings the default (i.e., defaulting to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?
Look what we did for --log_wandb and --no_log_wandb here.
Nevermind, that's not relevant. I was looking for a flag that defaults to True. Here's a better example.
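For reference, one standard argparse pattern for a boolean flag that defaults to true (a sketch only; the flag names and the actual yoyodyne argument handling may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Both options write to the same destination; the default is True.
parser.add_argument(
    "--tie_embeddings", action="store_true", default=True, dest="tie_embeddings"
)
parser.add_argument(
    "--no_tie_embeddings", action="store_false", dest="tie_embeddings"
)

args = parser.parse_args([])                        # tie_embeddings is True.
args = parser.parse_args(["--no_tie_embeddings"])   # Now False.
```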
I have cleaned this up and tested it. Everything works except the transducer right now, which seems to be because of the target_vocab_size manipulation that happens in the model here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/transducer.py#L43.
I think it's because target_vocab_size gets updated AFTER we initialize the single embedding matrix. This can cause index out of bounds errors. I can't work on this until tonight or tomorrow. @bonham79, in the meantime, if you have any intuition about a good fix for this, please take a look!
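A tiny illustration of the suspected failure mode, with made-up sizes (not the transducer code itself): if the shared embedding matrix is sized before the transducer enlarges the target vocabulary with its edit actions, those action indices fall outside the matrix.

```python
import torch
from torch import nn

vocab_size = 50
embeddings = nn.Embedding(vocab_size, 16)  # Sized from the original vocabulary.

# A transducer-style expansion afterwards (e.g., appending edit actions)
# yields symbols with indices >= vocab_size.
action_index = torch.tensor([vocab_size + 3])
embeddings(action_index)  # IndexError: index out of range in self
```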
And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(
pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)
Ah thanks, looks like that worked.
README should also be updated to reflect this.
Ok will do. I think just --tied_embeddings/--no_... right?
Would you say this simplifies the "symbol math" we have to go through, or is it worse?
Can you be more specific here? What is the symbol math?
It looks to me like you deleted the two embedding initialization static methods in the base model class, and then didn't replace them anywhere else.
I moved them from the base module to the base model to follow the design pattern we discussed. I think the two base.py modeling files made it look confusing :D.
Sure. I might create a section before the section on reserved symbols, call it Tied embeddings, and then say that by default we share embeddings between the source and target, but one can disable this with --no_tied_embeddings. That seems quite pithy.
Then you should also maybe mention the new class of reserved symbols in the following section: {...}.
Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings), more complicated, or about the same?
Roger that.
It's not in your PR description but this will close #142 and I think this will close #156 too.
I think I have cleaned this up per your comments.
I think we can argue that it is simpler since there is less to fix up after the index is created. However, this does introduce two new sources of complexity:
- Source tokens can now be wrapped in `{...}` (see the sketch below). I cannot imagine a scenario where this impacts a user in a noticeable way beyond debugging, but it is worth noting that this type of implicit behavior could be confusing.
- We now need to track both the index subspace for the feature vocabulary (which we did before) and the index subspace for the target vocabulary.
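For concreteness, one way the first point might be implemented when the merged symbol list is built (a sketch with hypothetical names; the real scheme lives in indexes.py and may differ):

```python
def build_symbols(source_vocab, target_vocab, features_vocab, tie_embeddings=True):
    """Sketch of merging the vocabularies into one symbol list.

    With tied embeddings, a symbol appearing on both sides gets one entry.
    Without tying, source symbols are wrapped in {...} so that they receive
    their own rows in the embedding matrix.
    """
    symbols = list(target_vocab)
    if tie_embeddings:
        symbols.extend(s for s in source_vocab if s not in set(target_vocab))
    else:
        symbols.extend("{" + s + "}" for s in source_vocab)
    symbols.extend(features_vocab)
    return symbols


# With tying disabled, source "a" is stored as "{a}" while target "a" keeps
# its plain form, so the two index different embeddings.
```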
It looks like we need to test if there are features, since I'm getting a breakage if there's not. I'll just modify to reflect...
Oh shoot, thank you for testing.
Okay, I made a change to indexes.py so it no longer crashes if there are no features. This does seem to work on my Polish tests, but validation accuracy is quite poor (though non-zero) on Maori, for which we have no features. The best checkpoint gives me only 11% accuracy. Can you take a look at this (after my commit 38311f530438b1154f75e83102ef3a13f134c613) and see if you can replicate? LMK if you need the command.
Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...
Ok, I will do more testing. I also just remembered a potential issue: this removes any mechanism for getting a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.
Yes, to confirm by default the embedding is shared and that ought to just work with pointer-generator architectures (which require that the embedding be shared, during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings on my LSTM experiments, though.)
I trained on the Polish abstractness data with features (pointer_generator_lstm, LSTM encoder, linear features encoder, hidden size 512, 1 layer, embedding size 128).
Here are the first 3 val accs with the master branch:
0.7530120611190796
0.8885542154312134
0.9036144614219666
And with this PR code:
0.8177710771560669
0.8885542154312134
0.9066265225410461
There will be variance since embedding initialization is a function of the embedding matrix size.
To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine. I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the source/target vocab index to be the same, but currently disjoint embeddings by necessity entail separate source/target vocab indices.
I am getting similar results with that setup. Okay, false alarm on my part I guess. I may have installed from the wrong commit. Shall I merge?
Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?
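For context, a sketch of why the indices have to line up, with illustrative shapes and names (not the yoyodyne implementation): the copy distribution is scattered from source positions into vocabulary slots, so the source symbol indices must be valid indices of the same output space the generator scores.

```python
import torch

batch, src_len, vocab_size = 2, 5, 20

gen_probs = torch.softmax(torch.randn(batch, vocab_size), dim=-1)
attention = torch.softmax(torch.randn(batch, src_len), dim=-1)
# Indices of the source symbols in the shared vocabulary.
source_indices = torch.randint(0, vocab_size, (batch, src_len))
p_gen = torch.rand(batch, 1)

# Copy probability mass lands in the vocabulary slots of the source symbols;
# this only works if source and target symbols share an index space.
ptr_probs = torch.zeros(batch, vocab_size).scatter_add(1, source_indices, attention)
final_probs = p_gen * gen_probs + (1 - p_gen) * ptr_probs
```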
Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went NaN around 20 epochs. See if you can replicate?
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"
readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly VAL="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"
yoyodyne-train \
--experiment "${LANGUAGE}" \
--train "${TRAIN}" \
--val "${VAL}" \
--model_dir "${MODEL_DIR}" \
--features_col 3 \
--arch "${ARCH}" \
--source_encoder lstm \
--features_encoder linear \
--embedding_size 128 \
--hidden_size 512 \
--learning_rate .001 \
--optimizer adam \
--batch_size 32 \
--max_epochs 60 \
--dropout .3 \
--seed 49 \
--gradient_clip_val 3 \
--precision 16 \
--accelerator gpu