ir_datasets

TREC Podcasts

Open • seanmacavaney opened this issue 4 years ago • 1 comment

Dataset Information:

TREC Task in 2020-21. This is a placeholder as I learn more about this task.

Links to Resources:

  • https://podcastsdataset.byspotify.com/

Dataset ID(s):

<propose dataset ID(s), and where they fit in the hierarchy>

Supported Entities

  • [ ] docs
  • [ ] queries
  • [ ] qrels
  • [ ] scoreddocs
  • [ ] docpairs

Additional comments/concerns/ideas/etc.

seanmacavaney, Mar 03 '21 12:03

I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should the documents be individual passages or entire episodes?

Following the lead from msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way: it comes as chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.

So keeping entire episodes as documents may seem more natural. But there's a problem there too: then the qrels do not line up with the doc_ids (since the qrels include the timestamp).

I think what I'll do is have both versions, something like this:

  • spotify-podcasts (docs) -- full episodes, keeping everything from the original source
  • spotify-podcasts/chunked (docs) -- 2-minute chunks, starting on each minute. These will be heavily processed, with fields being doc_id, text, episode_id, and start_timestamp (though doc_id itself is just a concatenation of episode_id and start_timestamp)
  • spotify-podcasts/chunked/trec-podcasts-{2020,2021} (docs, queries, qrels)
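
A rough sketch of how the chunked docs could be built, in Python. This is only an illustration under assumptions: the input is taken to be a list of (time in seconds, word) pairs per episode, the `_` separator in doc_id is a guess, and `ChunkedDoc`, `make_doc_id` and `chunk_episode` are hypothetical names, not anything in ir_datasets.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

CHUNK_SECONDS = 120   # each chunk covers 2 minutes
STRIDE_SECONDS = 60   # a new chunk starts on each minute, so chunks overlap


@dataclass(frozen=True)
class ChunkedDoc:
    """Hypothetical record with the four fields proposed for the chunked corpus."""
    doc_id: str
    text: str
    episode_id: str
    start_timestamp: int


def make_doc_id(episode_id: str, start_timestamp: int) -> str:
    # Assumption: doc_id is episode_id and start_timestamp joined by "_".
    return f"{episode_id}_{start_timestamp}"


def chunk_episode(episode_id: str,
                  timed_words: Iterable[Tuple[float, str]]) -> List[ChunkedDoc]:
    """Split one episode into overlapping 2-minute chunks.

    timed_words is assumed to be (start time in seconds, word) pairs.
    """
    words = sorted(timed_words)
    if not words:
        return []
    last_time = words[-1][0]
    docs = []
    start = 0
    while start <= last_time:
        end = start + CHUNK_SECONDS
        # Keep every word whose start time falls inside [start, end).
        text = " ".join(word for time, word in words if start <= time < end)
        if text:
            docs.append(ChunkedDoc(make_doc_id(episode_id, start), text,
                                   episode_id, start))
        start += STRIDE_SECONDS
    return docs


if __name__ == "__main__":
    example = [(0.5, "hello"), (61.0, "world"), (130.0, "again")]
    for doc in chunk_episode("ep1", example):
        print(doc.doc_id, "|", doc.text)
```

Rescanning the word list for every window is quadratic in the worst case; a real implementation would more likely use a sliding window over the sorted words.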

This setup has the following nice qualities:

  • All source information is available (via spotify-podcasts)
  • Qrels have doc_ids that line up with the corpus (via spotify-podcasts/chunked)
  • Should be easy to use in the chunked setting, with a single simple text field
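
To illustrate the qrels point, here is a small Python sketch. It assumes that a qrels entry identifies its target as an episode ID followed by `_` and a start time in seconds (for example `ep1_60.0`); that format, the IDs, and the helper names are assumptions for illustration only.

```python
from typing import Tuple


def split_qrel_id(qrel_doc_id: str) -> Tuple[str, int]:
    """Split an assumed '<episode_id>_<seconds>' identifier into its parts."""
    episode_id, _, seconds = qrel_doc_id.rpartition("_")
    # The timestamp may be written as a float ("60.0"); truncate to whole seconds.
    return episode_id, int(float(seconds))


def to_chunked_doc_id(qrel_doc_id: str) -> str:
    """Normalize a qrels identifier to the assumed chunked doc_id form."""
    episode_id, start_timestamp = split_qrel_id(qrel_doc_id)
    return f"{episode_id}_{start_timestamp}"


if __name__ == "__main__":
    episode_doc_ids = {"ep1"}                          # full-episode corpus
    chunked_doc_ids = {"ep1_0", "ep1_60", "ep1_120"}   # chunked corpus

    qrel_target = "ep1_60.0"
    # With whole episodes as docs, the qrels identifier matches nothing.
    print(qrel_target in episode_doc_ids)
    # With chunked docs, the normalized identifier is a real doc_id.
    print(to_chunked_doc_id(qrel_target) in chunked_doc_ids)
```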

seanmacavaney, Apr 30 '21 09:04