Sharding with customized doc-shard mapping

Open saltonz opened this issue 5 years ago • 1 comments

Hi, I am trying to use PISA for selective search research, and I am confused about the sharding operation.

When using the --shard-file option, what format should the shard file be in so that users can define their own sharding rules?

Thank you!

saltonz avatar Dec 02 '19 20:12 saltonz

I'm assuming you're talking about the partition_fwd_index command, right? Note that the option is --shard-files (plural), so what you're supposed to pass is a list of files, each of which describes one shard. Each file contains the "titles" of the documents within that shard. For ClueWeb, say, the title would be the TREC ID.
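If your doc-to-shard mapping already lives in memory or in some other format, a small script can turn it into per-shard title files. The following is only a sketch under assumptions: it writes one title per line (check the sharding documentation for the exact expected layout), and the function name `write_shard_files`, the `shard_NNN.txt` file naming, and the sample IDs are made up for illustration and are not part of PISA.

```python
from collections import defaultdict
from pathlib import Path


def write_shard_files(doc_to_shard, out_dir):
    """Write one text file per shard listing that shard's document titles.

    doc_to_shard maps a document title (e.g. a TREC ID) to an integer shard ID.
    Assumption: one title per line; the file names are arbitrary.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group titles by shard, keeping the input order within each shard.
    shards = defaultdict(list)
    for title, shard_id in doc_to_shard.items():
        shards[shard_id].append(title)

    paths = []
    for shard_id in sorted(shards):
        path = out_dir / f"shard_{shard_id:03d}.txt"
        path.write_text("".join(f"{title}\n" for title in shards[shard_id]))
        paths.append(path)
    return paths


# Hypothetical mapping; the IDs below are placeholders, not real documents.
example_mapping = {
    "clueweb09-en0000-00-00000": 0,
    "clueweb09-en0000-00-00001": 1,
    "clueweb09-en0000-00-00002": 0,
}
```

Calling `write_shard_files(example_mapping, "shards")` returns the list of files it created, and those paths are what you would then pass to --shard-files.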

Here's more about the entire process: https://pisa.readthedocs.io/en/latest/sharding.html#partition-fwd-index

Let me know if you have more questions, if something is unclear in the documentation, or if you find bugs there. I'll be happy to help. Also, feel free to contact me directly on Slack if you prefer: https://join.slack.com/t/pisa-engine/shared_invite/enQtNjM1NTk3NzIyMjE0LTQ3ZjI1MmU2ZjAyODE4YjNiNTY5YWYzMjg5Njc5ZDM5MzhhZDBiMGE5MTFhMTViN2ZjNzg0OTkzMDAwMDg3YTE

elshize avatar Dec 02 '19 21:12 elshize