ir_datasets
TREC Podcasts
Dataset Information:
A TREC task run in 2020-21. This is a placeholder as I learn more about the task.
Links to Resources:
- https://podcastsdataset.byspotify.com/
Dataset ID(s):
<propose dataset ID(s), and where they fit in the hierarchy>
Supported Entities
- [ ] docs
- [ ] queries
- [ ] qrels
- [ ] scoreddocs
- [ ] docpairs
Additional comments/concerns/ideas/etc.
I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should the documents be individual passages or entire episodes?
Following the lead of msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way: the transcripts come as chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.
So keeping entire episodes as documents may seem more natural. But there's a problem there too: then the qrels do not line up with the doc_ids (since the qrels include the timestamp).
I think what I'll do is have both versions, something like this:
- `spotify-podcasts` (docs) -- full episodes, keeping everything from the original source
- `spotify-podcasts/chunked` (docs) -- 2-minute chunks, starting on each minute (see the sketch below). These will be heavily processed, with fields being `doc_id`, `text`, `episode_id`, and `start_timestamp` (though `doc_id` itself is just a concatenation of `episode_id` and `start_timestamp`)
- `spotify-podcasts/chunked/trec-podcasts-{2020,2021}` (docs, queries, qrels)
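For the chunked corpus, here's a minimal sketch of how the 2-minute windows could be cut from an episode transcript. The word-level `(time, token)` input format, the `chunk_episode` helper, and the `_` separator in `doc_id` are assumptions for illustration, not a spec:

```python
from typing import Iterator, List, NamedTuple, Tuple

class ChunkedDoc(NamedTuple):
    doc_id: str
    text: str
    episode_id: str
    start_timestamp: float

def chunk_episode(episode_id: str,
                  words: List[Tuple[float, str]],
                  window: float = 120.0,
                  stride: float = 60.0) -> Iterator[ChunkedDoc]:
    """Cut one episode transcript into 2-minute chunks that start on each minute.

    `words` is assumed to be a list of (start_time_seconds, token) pairs taken
    from the episode's transcript timings.
    """
    if not words:
        return
    last_time = words[-1][0]
    start = 0.0
    while start <= last_time:
        chunk_tokens = [tok for t, tok in words if start <= t < start + window]
        if chunk_tokens:
            yield ChunkedDoc(
                # the separator between episode_id and start_timestamp is a guess
                doc_id=f'{episode_id}_{start:g}',
                text=' '.join(chunk_tokens),
                episode_id=episode_id,
                start_timestamp=start,
            )
        start += stride
```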
This setup has the following nice qualities:
- All source information is available (via `spotify-podcasts`)
- Qrels have doc_ids that line up with the corpus (via `spotify-podcasts/chunked`)
- Should be easy to use in the chunked setting, with a single simple `text` field
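Assuming the IDs above eventually get registered, usage of the chunked variant would follow the normal ir_datasets pattern (the dataset ID and doc fields below are the proposed ones, not anything that exists yet):

```python
import ir_datasets

# Proposed ID from above -- not registered yet, so this is only how usage would look.
dataset = ir_datasets.load('spotify-podcasts/chunked/trec-podcasts-2021')

# Chunked docs would carry the fields proposed above:
# doc_id, text, episode_id, start_timestamp
for doc in dataset.docs_iter()[:3]:
    print(doc.doc_id, doc.start_timestamp, doc.text[:80])

# Qrels line up with the chunked doc_ids, so standard evaluation tooling applies.
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```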