comments/questions

Open cmacdonald opened this issue 3 years ago • 3 comments

  • What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?
  • Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use.
  • Keeping files as .tsv.zip isn't as helpful as, for instance, keeping them as .tsv.gz, which can be opened directly as a stream.

cmacdonald avatar Apr 17 '21 06:04 cmacdonald

Hi Craig,

"What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion. Each one contains half of the MS MARCO passage collection.

"Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use." -> Thanks for the suggestion. I didn't include those files because they are huge, since terms are repeated. I'll try to find them and add them to the data folder.


AdeDZY avatar Apr 21 '21 00:04 AdeDZY

I try to work with gzip files, as they can be read and written as streams (indeed, I patched the bert_term_sample_to_json.py script to write gzip files automatically). Generated with m=100, the output deepctcollection.gz is much smaller than either test_results.tsv.zip or test_results.tsv.gz:

$ ls -lh
total 8.3G
-rw-r--r-- 1 craigm csstaff 446M Apr 20 11:53 deepctcollection.gz
-rw-r--r-- 1 craigm csstaff 4.0G Apr 16 22:24 test_results.tsv.gz
-rw-r--r-- 1 craigm csstaff 4.0G Nov 26  2019 test_results.tsv.zip
(pyterrier) [craigm@trhead collection_pred_1]$ less deepctcollection.gz
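
To illustrate the streaming point, a gzip-compressed TSV can be read line by line with only the Python standard library. This is a minimal sketch: the helper name iter_tsv_rows and the tiny demo file are assumptions made for illustration and are not part of the DeepCT repository.

```python
import gzip
import os
import tempfile


def iter_tsv_rows(path):
    """Yield each line of a gzip-compressed TSV as a list of fields.

    The file is decompressed incrementally, so it is never fully
    loaded into memory.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")


# Demo on a tiny made-up file (not real DeepCT output).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("0\tfoo bar\n1\tbaz\n")

rows = list(iter_tsv_rows(demo_path))
print(rows)
```

A .zip archive, by contrast, has to be opened through zipfile and a named member selected first, which is why .tsv.gz is more convenient for this kind of pipeline.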

I also had to align the docids to account for empty documents, by changing bert_term_sample_to_json.py as follows:

            if not selected_tokens:
                output_file.write(did + '\t' + ' \n') # added by craig
                e += 1
                continue
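
The idea behind that change can be sketched in isolation. The code below is not the actual bert_term_sample_to_json.py: the input format (docid plus a term-to-weight dict), the function names, and the TSV output layout are all assumptions for illustration. Only the placeholder line written for empty documents mirrors the patch above.

```python
import gzip
import os
import tempfile


def expand_terms(term_weights, m=100):
    """Repeat each term round(m * weight) times; terms rounding to 0 vanish."""
    tokens = []
    for term, weight in term_weights.items():
        tokens.extend([term] * int(round(m * weight)))
    return tokens


def write_collection(docs, path, m=100):
    """Write one line per document to a gzip file, including empty documents."""
    with gzip.open(path, "wt", encoding="utf-8") as output_file:
        for did, term_weights in docs:
            selected_tokens = expand_terms(term_weights, m)
            if not selected_tokens:
                # Placeholder line keeps docids aligned with the input order.
                output_file.write(did + "\t" + " \n")
                continue
            output_file.write(did + "\t" + " ".join(selected_tokens) + "\n")


# Made-up weights; document "1" ends up with no selected terms.
docs = [
    ("0", {"deep": 0.03, "ct": 0.02}),
    ("1", {"the": 0.001}),
    ("2", {"bert": 0.01}),
]
out_path = os.path.join(tempfile.mkdtemp(), "deepctcollection.gz")
write_collection(docs, out_path, m=100)

with gzip.open(out_path, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Without the placeholder, document "1" would be skipped entirely and every later line would shift up by one position relative to the input.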

cmacdonald avatar Apr 21 '21 11:04 cmacdonald

Thanks for providing the numbers! I have updated the data folder with test_results.tsv.gz files.

I also uploaded the bert_term_sample_to_json.py output for MS MARCO to weighted_documents/.

AdeDZY avatar May 09 '21 04:05 AdeDZY