Sharding with customized doc-shard mapping
Hi, I am trying to use PISA for selective search research and am confused about the sharding operation.
When using the --shard-file option, what format should the shard file have so that users can define their own sharding rules?
Thank you!
I'm assuming you're talking about the partition_fwd_index command, right? Note that the option is --shard-files, plural, so what you're supposed to pass is a list of files, each of which describes a shard. Such a file contains the "titles" of the documents within that shard. So for, say, ClueWeb, that would be the TRECID.
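As a rough illustration of producing such files from a custom doc-shard mapping, here is a minimal Python sketch. It assumes each shard file lists one document title per line, which is worth checking against the sharding documentation. The helper names, the shard-NNN.txt naming scheme, and the example TRECIDs are all made up for illustration and are not part of PISA.

```python
from collections import defaultdict
from pathlib import Path


def group_titles_by_shard(doc_to_shard):
    """Group document titles by shard ID, keeping input order within a shard."""
    shards = defaultdict(list)
    for title, shard in doc_to_shard.items():
        shards[shard].append(title)
    return dict(shards)


def write_shard_files(doc_to_shard, out_dir):
    """Write one text file per shard and return the paths in shard-ID order.

    Assumption: one title per line. File names are a hypothetical scheme.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard, titles in sorted(group_titles_by_shard(doc_to_shard).items()):
        path = out_dir / f"shard-{shard:03d}.txt"
        path.write_text("".join(f"{title}\n" for title in titles))
        paths.append(path)
    return paths


# Hypothetical mapping from document title (here a TRECID) to shard ID.
mapping = {
    "clueweb09-en0000-00-00000": 0,
    "clueweb09-en0000-00-00001": 1,
    "clueweb09-en0000-00-00002": 0,
}
```

Calling write_shard_files(mapping, "shards") would then produce one file per shard, and the resulting paths are what you would pass to --shard-files.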
Here's more about the entire process: https://pisa.readthedocs.io/en/latest/sharding.html#partition-fwd-index

Let me know if you have more questions, or if something's unclear in the documentation (or if you find bugs there). I'll be happy to help.

Also, feel free to contact me directly on Slack if you prefer: https://join.slack.com/t/pisa-engine/shared_invite/enQtNjM1NTk3NzIyMjE0LTQ3ZjI1MmU2ZjAyODE4YjNiNTY5YWYzMjg5Njc5ZDM5MzhhZDBiMGE5MTFhMTViN2ZjNzg0OTkzMDAwMDg3YTE