
How is your train_data for dssm organized? What is the data format?

Open chouchou1988 opened this issue 8 years ago • 7 comments

I have seen your demo dssm/single/dssm_v3.py and would like to know how your data is organized. For example, what is the format of query.train.pickle?

chouchou1988 avatar Jun 02 '17 06:06 chouchou1988

Yeah, I also wonder what the data format is.

RominYue avatar Jun 05 '17 09:06 RominYue

Same question here

BinQuake avatar Jun 05 '17 21:06 BinQuake

In the author's post http://liaha.github.io/models/2016/06/21/dssm-on-tensorflow.html , he says that the model input looks like 46238:1 24108:1 24016:1 5618:1 8818:1, where each pair stands for tri-letter index:number of occurrences. What confuses me is whether we must pre-process all the tri-letters to build their indexes first. That seems time-consuming.
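For what it is worth, building the index only takes one pass over the corpus. Below is a minimal sketch of how such index:count lines could be produced. The function names (tri_letters, build_vocab, encode) are hypothetical, and the '#' word-boundary marker follows the word-hashing description in the DSSM paper; whether the author's own preprocessing works this way is an assumption.

```python
from collections import Counter


def tri_letters(text):
    """Split text into letter trigrams, marking word boundaries with '#'."""
    grams = []
    for word in text.lower().split():
        padded = "#" + word + "#"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def build_vocab(texts):
    """Assign each distinct tri-letter an integer index in first-seen order."""
    vocab = {}
    for text in texts:
        for gram in tri_letters(text):
            vocab.setdefault(gram, len(vocab))
    return vocab


def encode(text, vocab):
    """Format one text as 'index:count' pairs, skipping unknown tri-letters."""
    counts = Counter(vocab[g] for g in tri_letters(text) if g in vocab)
    return " ".join(f"{idx}:{n}" for idx, n in sorted(counts.items()))


# Toy corpus (assumption: real data would be the query and title files).
corpus = ["this is a cat", "hello cat"]
vocab = build_vocab(corpus)
print(encode("hello cat", vocab))
```

The vocabulary is built once and reused for every query and title, so the per-line cost afterwards is just a dictionary lookup per tri-letter.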

RominYue avatar Jun 06 '17 01:06 RominYue

https://github.com/liaha/dssm/blob/master/single/dssm_v3.py

I think the pull_batch function seems to accept already pre-processed data as input (query/title -> n-gram vector -> one-hot encoded vector). The input is simply a batch of one-hot encoded vectors like [[1, 0, 1, ...., 0], ..., [1, 0, 1, ...., 0]].

So you need to convert the query and the document title to one-hot encoded vectors before feeding them to the tensor.

In my case, I used scikit-learn for it. http://scikit-learn.org/stable/modules/feature_extraction.html
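A rough sketch of what that could look like with scikit-learn's CountVectorizer is below. The toy queries and titles are made up, and the settings (character trigrams within word boundaries, binary entries) are one reasonable choice rather than what the repo itself uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (assumption: real inputs would be your own queries and titles).
queries = ["this is a cat", "hello cat"]
titles = ["a cat picture", "hello world"]

# Character trigrams inside word boundaries; binary=True gives 0/1 entries
# instead of occurrence counts.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3), binary=True)

# Fit on both sides so queries and titles share one trigram vocabulary.
vectorizer.fit(queries + titles)
query_vectors = vectorizer.transform(queries)
title_vectors = vectorizer.transform(titles)

print(query_vectors.shape, title_vectors.shape)
```

Both results are SciPy sparse matrices with one row per text and one column per trigram, so they have the same width and can be sliced into batches.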

sehoi avatar Jun 09 '17 05:06 sehoi

Actually after reading the code, I think the original file format doesn't matter. If you are converting your tri-gram data into a sparse matrix then it should be fine. Just change the lines handling input data in the code. The training model handles matrices anyway.
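As an illustration of the kind of change I mean, here is a minimal sketch that slices a SciPy sparse matrix into dense batches. The name iter_batches is hypothetical and this is not the repo's pull_batch; it only shows that any sparse matrix can be turned into per-batch arrays.

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_batches(matrix, batch_size):
    """Yield consecutive row slices of a sparse matrix as dense float32 arrays."""
    for start in range(0, matrix.shape[0], batch_size):
        yield matrix[start:start + batch_size].toarray().astype(np.float32)


# Toy data: 5 rows over a 4-term vocabulary.
toy = csr_matrix(np.eye(5, 4))
batches = list(iter_batches(toy, batch_size=2))
print([b.shape for b in batches])
```

The last batch is smaller when the row count is not a multiple of batch_size, so the training loop needs to either accept that or drop the remainder.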

BinQuake avatar Jun 09 '17 20:06 BinQuake

Yeah, I also have a question about what the data format looks like.

zhongyunuestc avatar Sep 06 '17 12:09 zhongyunuestc

I am saving my data as a matrix where the rows are the query/document sentences and the columns are the vocabulary. I put a '1' wherever a word in the sentence matches a word in the vocab.

For example, suppose my query file contains "this is a cat. hello cat" and my vocab consists of "this, is, a, cat, hello". Then my query matrix looks like:

1, 1, 1, 1, 0
0, 0, 0, 1, 1

I am creating a sparse matrix out of it by using scipy.sparse.csr_matrix()
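Roughly, the construction looks like the sketch below. The variable names are only for illustration, and the two sentences are already split and stripped of punctuation.

```python
import numpy as np
from scipy.sparse import csr_matrix

vocab = ["this", "is", "a", "cat", "hello"]
word_to_col = {word: col for col, word in enumerate(vocab)}
sentences = ["this is a cat", "hello cat"]

rows, cols = [], []
for row, sentence in enumerate(sentences):
    # A set keeps each word at most once per sentence, so entries stay 0/1.
    for word in set(sentence.split()):
        if word in word_to_col:
            rows.append(row)
            cols.append(word_to_col[word])

data = np.ones(len(rows), dtype=np.float32)
query_matrix = csr_matrix((data, (rows, cols)), shape=(len(sentences), len(vocab)))
print(query_matrix.toarray())
```

Building from (row, column) pairs avoids ever allocating the full dense matrix, which matters once the vocabulary gets large.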

Am I doing this right?

RobbLang avatar Feb 05 '19 17:02 RobbLang