PEFA WSDM 24
I have a conceptual question regarding the PEFA paper at WSDM 24 . I was able to recreate the results using the NQ320K dataset.
“”" logging.info("Gathering data augmentation..”) P_emb = LabelEmbeddingFactory.create(Y_abs, X_abs, method="pifa", normalized_Y=False)
X_aug, Y_aug = get_data_aug(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q, aug_type="v5") logging.info("Running PEFA-XS..”) run_pefa_xs(P_emb, X_aug, Y_aug, X_tst, Y_tst, lambda_erm=lambda_erm) “”” In the above part of code in pefa_xs.py :
- We already have P_emb, which is the label features built from Y_abs and X_abs. As far as I understand, X_abs and Y_abs are generated in a similar way to how we get X_trn and Y_trn.

Am I missing something about the Natural Questions data here?
In the general sense, for my custom data I would have queries and some labels in my corpus:
- Query-label pairs for training (these would create X_trn and Y_trn, and P_emb would be created using the PIFA method).
- Query-label pairs for testing (these would create X_tst and Y_tst, i.e., the test set we evaluate on).

My question is: what data does PEFA use to get the second component, pifa_emb? (For Natural Questions we are using X_aug and Y_aug to generate it, which are nothing but the query embeddings and the sparse label matrix for those queries, respectively.)
What is the difference between X_abs and X_trn?
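For concreteness, here is roughly how I picture building the train/test matrices for my custom data (just a sketch, not the actual preprocessing; `encode()` below is a stand-in for whatever query/document encoder is used):

```python
import numpy as np
import scipy.sparse as smat

def encode(texts, dim=4):
    # Placeholder for a real sentence encoder (the one that would produce
    # X.trn.npy / X.tst.npy); here it just returns random vectors.
    return np.random.randn(len(texts), dim).astype(np.float32)

train_queries = ["who wrote hamlet", "capital of france"]
test_queries = ["author of hamlet"]
n_labels = 3                                   # number of documents/labels in the corpus
train_pairs = [(0, 1), (1, 2)]                 # (query_idx, label_idx) relevance pairs
test_pairs = [(0, 1)]

X_trn = encode(train_queries)                  # training query embeddings
X_tst = encode(test_queries)                   # test query embeddings

def build_Y(pairs, n_queries, n_labels):
    # Sparse query-to-label matrix built from (query_idx, label_idx) pairs.
    rows, cols = zip(*pairs)
    vals = np.ones(len(pairs), dtype=np.float32)
    return smat.csr_matrix((vals, (rows, cols)), shape=(n_queries, n_labels))

Y_trn = build_Y(train_pairs, len(train_queries), n_labels)
Y_tst = build_Y(test_pairs, len(test_queries), n_labels)
```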
@OctoberChang
@NitinAggarwal1,
Thanks for your interest in our WSDM 2024 paper.
First, I recommend taking a closer look at our data preprocessing script [1] as well as the original NCI GitHub repo [2].
In summary, as described in the comments of the following code snippet from pefa_xs.py:
```python
X_trn = smat_util.load_matrix(f"{input_emb_dir}/X.trn.npy")      # trn set emb from real query text
X_tst = smat_util.load_matrix(f"{input_emb_dir}/X.tst.npy")      # tst set emb from real query text
X_abs = smat_util.load_matrix(f"{input_emb_dir}/X.trn.abs.npy")  # trn set emb from doc's abstract+title text
X_doc = smat_util.load_matrix(f"{input_emb_dir}/X.trn.doc.npy")  # trn set emb from doc's content (first 512 tokens)
X_d2q = smat_util.load_matrix(f"{input_emb_dir}/X.trn.d2q.npy")  # trn set emb from docT5query using doc's content
```
Q1: How do we get the document embeddings (derived from the title+abstract)?
- `X_abs` is the training set embeddings derived from the document's abstract + title text
- `Y_abs` is the diagonal doc-to-doc label matrix
- Thus, `P_emb = Y_abs.T.dot(X_abs)`, which is essentially the document embeddings of the abstract+title text.
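As a sanity check, here is a minimal sketch (plain numpy/scipy, not the pecos `LabelEmbeddingFactory` implementation) showing that with a diagonal `Y_abs` the PIFA aggregation just returns the abstract+title embeddings:

```python
import numpy as np
import scipy.sparse as smat

L, d = 4, 8                                        # num docs (= labels), embedding dim
X_abs = np.random.randn(L, d).astype(np.float32)   # one abstract+title embedding per doc
Y_abs = smat.identity(L, format="csr")             # diagonal doc-to-doc label matrix

P_emb = Y_abs.T.dot(X_abs)                         # PIFA-style aggregation: Y^T X
assert np.allclose(P_emb, X_abs)                   # identity Y_abs => P_emb is just X_abs
```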
Q2: How do we get the additional augmented data sources? Similar to the NCI paper, we use their pre-processed data augmentation:
- `X_doc`: the document embeddings from the document's full content
- `X_d2q`: the document embeddings from pseudo queries generated by a Seq2Seq model
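Conceptually (this is only an assumed illustration; the authoritative logic is `get_data_aug(..., aug_type="v5")` in `pefa_xs.py`), each augmentation source contributes extra query-like rows paired with its own sparse query-to-label matrix:

```python
import numpy as np
import scipy.sparse as smat

def stack_aug(X_trn, X_doc, X_d2q, Y_trn, Y_doc, Y_d2q):
    # Stack real-query rows, doc-content rows, and docT5query pseudo-query rows
    # into one augmented training set (assumed sketch of the augmentation idea).
    X_aug = np.vstack([X_trn, X_doc, X_d2q])
    Y_aug = smat.vstack([Y_trn, Y_doc, Y_d2q], format="csr")
    return X_aug, Y_aug

# Toy shapes: 5 real queries, 3 docs, 3 pseudo queries, 3 labels, dim 4.
X_aug, Y_aug = stack_aug(
    np.zeros((5, 4), np.float32), np.zeros((3, 4), np.float32), np.zeros((3, 4), np.float32),
    smat.csr_matrix((5, 3)), smat.identity(3, format="csr"), smat.identity(3, format="csr"),
)
print(X_aug.shape, Y_aug.shape)   # (11, 4) (11, 3)
```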
Q3: What's the difference between `X_trn` and `X_abs`?
- `X_trn` is the training set query embeddings (derived from query keywords)
- `X_abs` is the training set document embeddings derived from the document's abstract + title text.
I hope these FAQs answer most of your questions. If you still have other questions, feel free to ask.
References
- [1] https://github.com/amzn/pecos/blob/mainline/examples/pefa-wsdm24/data/proc_nq320k.py
- [2] https://github.com/solidsea98/Neural-Corpus-Indexer-NCI