ir_datasets
TREC Podcasts
Dataset Information:
A TREC task run in 2020-21. This is a placeholder as I learn more about the task.
Links to Resources:
- https://podcastsdataset.byspotify.com/
Dataset ID(s):
<propose dataset ID(s), and where they fit in the hierarchy>
Supported Entities
- [ ] docs
- [ ] queries
- [ ] qrels
- [ ] scoreddocs
- [ ] docpairs
Additional comments/concerns/ideas/etc.
I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should the documents be individual passages or entire episodes?
Following the lead of msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way: the transcripts come as chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.
So keeping entire episodes as documents may seem more natural. But there's a problem there too: then the qrels do not line up with the doc_ids (since the qrels include the timestamp).
I think what I'll do is have both versions, something like this:
- `spotify-podcasts` (docs) -- full episodes, keeping everything from the original source
- `spotify-podcasts/chunked` (docs) -- 2-minute chunks, starting on each minute (see the sketch below). These will be heavily processed, with fields being `doc_id`, `text`, `episode_id`, and `start_timestamp` (though `doc_id` itself is just a concatenation of `episode_id` and `start_timestamp`)
- `spotify-podcasts/chunked/trec-podcasts-{2020,2021}` (docs, queries, qrels)
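For the chunked corpus, here's a minimal sketch of how the 2-minute windows could be cut from an episode transcript. The word-level `(time, token)` input format, the `chunk_episode` helper, and the `_` separator in `doc_id` are assumptions for illustration, not a spec:

```python
from typing import Iterator, List, NamedTuple, Tuple

class ChunkedDoc(NamedTuple):
    doc_id: str
    text: str
    episode_id: str
    start_timestamp: float

def chunk_episode(episode_id: str,
                  words: List[Tuple[float, str]],
                  window: float = 120.0,
                  stride: float = 60.0) -> Iterator[ChunkedDoc]:
    """Cut one episode transcript into 2-minute chunks that start on each minute.

    `words` is assumed to be a list of (start_time_seconds, token) pairs taken
    from the episode's transcript timings.
    """
    if not words:
        return
    last_time = words[-1][0]
    start = 0.0
    while start <= last_time:
        chunk_tokens = [tok for t, tok in words if start <= t < start + window]
        if chunk_tokens:
            yield ChunkedDoc(
                # the separator between episode_id and start_timestamp is a guess
                doc_id=f'{episode_id}_{start:g}',
                text=' '.join(chunk_tokens),
                episode_id=episode_id,
                start_timestamp=start,
            )
        start += stride
```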
This setup has the following nice qualities:
- All source information is available (via `spotify-podcasts`)
- Qrels have doc_ids that line up with the corpus (via `spotify-podcasts/chunked`)
- Should be easy to use in the chunked setting, with a single simple `text` field
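Assuming the IDs above eventually get registered, usage of the chunked variant would follow the normal ir_datasets pattern (the dataset ID and doc fields below are the proposed ones, not anything that exists yet):

```python
import ir_datasets

# Proposed ID from above -- not registered yet, so this is only how usage would look.
dataset = ir_datasets.load('spotify-podcasts/chunked/trec-podcasts-2021')

# Chunked docs would carry the fields proposed above:
# doc_id, text, episode_id, start_timestamp
for doc in dataset.docs_iter()[:3]:
    print(doc.doc_id, doc.start_timestamp, doc.text[:80])

# Qrels line up with the chunked doc_ids, so standard evaluation tooling applies.
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```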