Pyserini.encode : 'texts': batch_info['text'], KeyError: 'text'
I have a JSONL file whose entries look like this:
{'id': 'NCT01740609', 'contents': "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers@&The purpose of this study is to evaluate the safety, tolerability, pharmacokinetics and immunogenicity of single escalating doses PF-06342674.@&None@&COMPLETED@&['Healthy']@&ALL@&False@&18 Years@&None@&['Phase 1', 'RN168', 'Healthy Volunteers']@&None@&None@&None@&Inclusion Criteria:\n\n* Male subjects and female of non-childbearing potential subjects between the ages of 18 and 55.\n* BMI between 18.5 to 32 kg/m2.\n* Total body weight ≥40 kg and ≤120 kg.\n\nExclusion Criteria:\n\n* Previous treatment with an antibody within 6 months prior to Day 1.\n* Pregnant or nursing females; females of childbearing potential.\n* History of sensitivity to heparin or heparin-induced thrombocytopenia.@&ALL@&None@&None@&None@&None@&2014-06@&COMPLETED"}
- The delimiter in this case is @&.
- There are 21 fields in total, separated by 20 occurrences of the delimiter.
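To make the layout of `contents` concrete, here is a minimal sketch that splits the string on the delimiter and pairs each segment with a field name. The field names are the ones passed to `--fields` in the command; the helper `split_contents` and the shortened sample values are illustrative assumptions and are not part of Pyserini.

```python
# Field names in the order they are passed to --fields in the command.
FIELDS = [
    "brief_title", "brief_summary", "detailed_description", "overall_status",
    "condition", "gender", "gender_based", "minimum_age", "maximum_age",
    "keyword", "mesh_term", "drugs", "diseases", "Eligibility", "sex",
    "organ", "adverse_events", "serious_affect", "country",
    "completion_date", "Status",
]
DELIMITER = "@&"


def split_contents(contents, fields=FIELDS, delimiter=DELIMITER):
    """Hypothetical helper: map each delimited segment to its field name."""
    parts = contents.split(delimiter)
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} segments, got {len(parts)}")
    return dict(zip(fields, parts))


# Shortened version of the sample record (long texts abbreviated).
sample_values = [
    "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers",
    "The purpose of this study is to evaluate the safety ...",
    "None", "COMPLETED", "['Healthy']", "ALL", "False", "18 Years", "None",
    "['Phase 1', 'RN168', 'Healthy Volunteers']", "None", "None", "None",
    "Inclusion Criteria: ...",
    "ALL", "None", "None", "None", "None", "2014-06", "COMPLETED",
]
sample_contents = DELIMITER.join(sample_values)

record = split_contents(sample_contents)
print(record["minimum_age"], record["completion_date"])
```

The length check is there so that a record with a missing or extra delimiter fails loudly instead of silently shifting values into the wrong fields.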
I am trying to encode the JSONL corpus with the dense encoder:
python -m pyserini.encode input --corpus transformed_data.jsonl --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --delimiter "@&" --shard-id 0 --shard-num 1 output --embeddings pyserini_embeddings --to-faiss encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --batch 32 --device cpu
Running the above command produces this error:
Output:
481384it [00:11, 41781.69it/s]
0%| | 0/15044 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/site-packages/pyserini/encode/main.py", line 138, in <module>
    'texts': batch_info['text'],
KeyError: 'text'
I inspected main.py in pyserini/encode. The parser for the --fields argument is:
input_parser.add_argument('--fields', help='fields that contents in jsonl has (in order)', nargs='+', default=['text'], required=False)
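One thing worth checking here is how `nargs='+'` interacts with the way the field names are written in the command. The sketch below rebuilds only the `--fields` argument on a standalone parser and uses `shlex.split` to approximate POSIX shell word splitting; the shortened field list is an illustrative assumption. It suggests that the shell strips the quotes but keeps the commas, so they end up inside the parsed names.

```python
import argparse
import shlex

# Standalone parser with the same --fields definition as shown above.
parser = argparse.ArgumentParser()
parser.add_argument('--fields', help='fields that contents in jsonl has (in order)',
                    nargs='+', default=['text'], required=False)

# Comma-separated, quoted names, written the same way as in the command.
argv_with_commas = shlex.split("--fields 'brief_title', 'brief_summary', 'Status'")
print(parser.parse_args(argv_with_commas).fields)
# ['brief_title,', 'brief_summary,', 'Status']

# Space-separated names, which is what nargs='+' expects.
argv_with_spaces = shlex.split("--fields brief_title brief_summary Status")
print(parser.parse_args(argv_with_spaces).fields)
# ['brief_title', 'brief_summary', 'Status']
```

Separating the names with spaces only avoids the stray commas. On its own it would not be expected to fix the `KeyError`, because none of the names is `text`.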
After the arguments are parsed, the encoding loop is:
collection_iterator = JsonlCollectionIterator(args.input.corpus, args.input.fields, args.input.docid_field, delimiter)
with embedding_writer:
    for batch_info in collection_iterator(batch_size, args.input.shard_id, args.input.shard_num):
        kwargs = {
            'texts': batch_info['text'],
            'titles': batch_info['title'] if 'title' in args.encoder.fields else None,
            'expands': batch_info['expand'] if 'expand' in args.encoder.fields else None,
            'fp16': args.encoder.fp16,
            'max_length': args.encoder.max_length,
            'add_sep': args.encoder.add_sep,
        }
This means the encoding loop always reads a "text" field from each batch and only optionally uses the "title" and "expand" fields, so a corpus whose fields have other names fails with the KeyError above. Is it possible to extend this to any number of user-defined fields? Thanks!
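The failure can be reproduced without Pyserini. The sketch below assumes that each batch is a dict keyed by `id` plus the names given to `--fields`, which is what the `KeyError` suggests. `make_batch` is a simplified stand-in for the iterator, and `merge_into_text` is a hypothetical workaround that joins chosen fields into a single `text` entry. Neither is Pyserini's actual code.

```python
def make_batch(rows, fields, delimiter):
    """Simplified stand-in for the iterator: one list per field name, plus 'id'."""
    batch = {'id': [], **{name: [] for name in fields}}
    for row in rows:
        batch['id'].append(row['id'])
        for name, value in zip(fields, row['contents'].split(delimiter)):
            batch[name].append(value)
    return batch


def merge_into_text(batch, fields, sep=' '):
    """Hypothetical workaround: add a 'text' entry joining the chosen fields."""
    merged = dict(batch)
    merged['text'] = [sep.join(values)
                      for values in zip(*(batch[name] for name in fields))]
    return merged


# Made-up, shortened record with three fields.
rows = [{'id': 'NCT01740609',
         'contents': 'A Study Title@&A short summary@&COMPLETED'}]
fields = ['brief_title', 'brief_summary', 'Status']
batch = make_batch(rows, fields, '@&')

missing_key = None
try:
    batch['text']  # same lookup as 'texts': batch_info['text']
except KeyError as err:
    missing_key = err.args[0]
print('missing key:', missing_key)

merged = merge_into_text(batch, ['brief_title', 'brief_summary'])
print(merged['text'])
```

The same merge could be applied when the JSONL file is written, so that each document carries a single `text` field and the default `--fields` value applies. Whether that is acceptable depends on whether the encoder should see the fields separately.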