PEFA WSDM 24
I have a conceptual question regarding the PEFA paper at WSDM 24 . I was able to recreate the results using the NQ320K dataset.
“”" logging.info("Gathering data augmentation..”) P_emb = LabelEmbeddingFactory.create(Y_abs, X_abs, method="pifa", normalized_Y=False)
X_aug, Y_aug = get_data_aug(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q, aug_type="v5") logging.info("Running PEFA-XS..”) run_pefa_xs(P_emb, X_aug, Y_aug, X_tst, Y_tst, lambda_erm=lambda_erm) “”” In the above part of code in pefa_xs.py :
- We already have P_emb, which is the label features built from Y_abs and X_abs. As far as I understand, X_abs and Y_abs are generated in a similar way to how we get X_trn and Y_trn.

Am I missing something about the Natural Questions data here?
In the general sense, for my custom data I would have queries and some labels in my corpus:
- Query-label pairs for training (these would create X_trn and Y_trn, and P_emb would be created using the PIFA method).
- Query-label pairs for testing (these would create X_tst and Y_tst, i.e., the test set we evaluate on).

My question is: what data does PEFA use to get the second component, pifa_emb? (For Natural Questions we are using X_aug and Y_aug to generate it, which are nothing but the query embeddings and the sparse label matrix for those queries, respectively.)
What is the difference between X_abs and X_trn?
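For concreteness, here is roughly how I picture building the train/test matrices for my custom data (just a sketch, not the actual preprocessing; `encode()` below is a stand-in for whatever query/document encoder is used):

```python
import numpy as np
import scipy.sparse as smat

def encode(texts, dim=4):
    # Placeholder for a real sentence encoder (the one that would produce
    # X.trn.npy / X.tst.npy); here it just returns random vectors.
    return np.random.randn(len(texts), dim).astype(np.float32)

train_queries = ["who wrote hamlet", "capital of france"]
test_queries = ["author of hamlet"]
n_labels = 3                                   # number of documents/labels in the corpus
train_pairs = [(0, 1), (1, 2)]                 # (query_idx, label_idx) relevance pairs
test_pairs = [(0, 1)]

X_trn = encode(train_queries)                  # training query embeddings
X_tst = encode(test_queries)                   # test query embeddings

def build_Y(pairs, n_queries, n_labels):
    # Sparse query-to-label matrix built from (query_idx, label_idx) pairs.
    rows, cols = zip(*pairs)
    vals = np.ones(len(pairs), dtype=np.float32)
    return smat.csr_matrix((vals, (rows, cols)), shape=(n_queries, n_labels))

Y_trn = build_Y(train_pairs, len(train_queries), n_labels)
Y_tst = build_Y(test_pairs, len(test_queries), n_labels)
```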
@OctoberChang
@NitinAggarwal1,
Thanks for your interest in our WSDM 2024 paper.
First, I recommend taking a closer look at our data preprocessing script [1] as well as the original NCI GitHub repo [2].
In summary, as described in the comments of the following code snippet from pefa_xs.py:
```python
X_trn = smat_util.load_matrix(f"{input_emb_dir}/X.trn.npy")      # trn set emb from real query text
X_tst = smat_util.load_matrix(f"{input_emb_dir}/X.tst.npy")      # tst set emb from real query text
X_abs = smat_util.load_matrix(f"{input_emb_dir}/X.trn.abs.npy")  # trn set emb from doc's abstract+title text
X_doc = smat_util.load_matrix(f"{input_emb_dir}/X.trn.doc.npy")  # trn set emb from doc's content (first 512 tokens)
X_d2q = smat_util.load_matrix(f"{input_emb_dir}/X.trn.d2q.npy")  # trn set emb from docT5query using doc's content
```
Q1: How do we get the document embeddings (derived from the title+abstract)?
- `X_abs` is the training set embeddings derived from the document's abstract + title text
- `Y_abs` is the diagonal doc-to-doc label matrix
- Thus, `P_emb = Y_abs.T.dot(X_abs)`, which is essentially the document embeddings of the abstract+title text.
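As a sanity check, here is a minimal sketch (plain numpy/scipy, not the pecos `LabelEmbeddingFactory` implementation) showing that with a diagonal `Y_abs` the PIFA aggregation just returns the abstract+title embeddings:

```python
import numpy as np
import scipy.sparse as smat

L, d = 4, 8                                        # num docs (= labels), embedding dim
X_abs = np.random.randn(L, d).astype(np.float32)   # one abstract+title embedding per doc
Y_abs = smat.identity(L, format="csr")             # diagonal doc-to-doc label matrix

P_emb = Y_abs.T.dot(X_abs)                         # PIFA-style aggregation: Y^T X
assert np.allclose(P_emb, X_abs)                   # identity Y_abs => P_emb is just X_abs
```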
Q2: How do we get the additional augmented data sources? Similar to the NCI paper, we use their pre-processed data augmentation:
- `X_doc`: the document embeddings from the document's full content
- `X_d2q`: the document embeddings from pseudo queries generated by a Seq2Seq model
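Conceptually (this is only an assumed illustration; the authoritative logic is `get_data_aug(..., aug_type="v5")` in `pefa_xs.py`), each augmentation source contributes extra query-like rows paired with its own sparse query-to-label matrix:

```python
import numpy as np
import scipy.sparse as smat

def stack_aug(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q):
    # Stack real-query rows, doc-content rows, and docT5query pseudo-query rows
    # into one augmented training set (assumed sketch of the augmentation idea).
    X_aug = np.vstack([X_trn, X_doc, X_d2q])
    Y_aug = smat.vstack([Y_trn, Y_doc, Y_d2q], format="csr")
    return X_aug, Y_aug

# Toy shapes: 5 real queries, 3 docs, 3 pseudo queries, 3 labels, dim 4.
X_aug, Y_aug = stack_aug(
    np.zeros((5, 4), np.float32), np.zeros((3, 4), np.float32), np.zeros((3, 4), np.float32),
    smat.csr_matrix((5, 3)), smat.identity(3, format="csr"), smat.identity(3, format="csr"),
)
print(X_aug.shape, Y_aug.shape)   # (11, 4) (11, 3)
```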
Q3: What's the difference between `X_trn` and `X_abs`?
- `X_trn` is the training set query embeddings (derived from query keywords)
- `X_abs` is the training set document embeddings derived from the document's abstract + title text.
I hope these FAQs answer most of your questions. If you still have other questions, feel free to ask.
References
- [1] https://github.com/amzn/pecos/blob/mainline/examples/pefa-wsdm24/data/proc_nq320k.py
- [2] https://github.com/solidsea98/Neural-Corpus-Indexer-NCI