Pyserini.encode : 'texts': batch_info['text'], KeyError: 'text'

Open ashishakkumar opened this issue 11 months ago • 0 comments

I have a JSONL file in which each line is a dictionary like this:

{'id': 'NCT01740609', 'contents': "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers@&The purpose of this study is to evaluate the safety, tolerability, pharmacokinetics and immunogenicity of single escalating doses PF-06342674.@&None@&COMPLETED@&['Healthy']@&ALL@&False@&18 Years@&None@&['Phase 1', 'RN168', 'Healthy Volunteers']@&None@&None@&None@&Inclusion Criteria:\n\n* Male subjects and female of non-childbearing potential subjects between the ages of 18 and 55.\n* BMI between 18.5 to 32 kg/m2.\n* Total body weight ≥40 kg and ≤120 kg.\n\nExclusion Criteria:\n\n* Previous treatment with an antibody within 6 months prior to Day 1.\n* Pregnant or nursing females; females of childbearing potential.\n* History of sensitivity to heparin or heparin-induced thrombocytopenia.@&ALL@&None@&None@&None@&None@&2014-06@&COMPLETED"}

  • The delimiter in this case is @&
  • The total number of fields is 20, all separated by the delimiter
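To make the layout concrete, here is a minimal sketch of how one `contents` string maps onto named fields. The helper `split_contents` and the three-field record are hypothetical and only for illustration; they are not part of Pyserini. The length check is a quick way to confirm that a record has as many delimited pieces as there are field names.

```python
# Hypothetical illustration: split one `contents` string on the "@&" delimiter
# and pair the pieces with field names. Not Pyserini code.
DELIMITER = "@&"


def split_contents(contents, field_names, delimiter=DELIMITER):
    """Return a dict mapping each field name to its piece of `contents`."""
    parts = contents.split(delimiter)
    if len(parts) != len(field_names):
        # A mismatch means the record and the field list disagree.
        raise ValueError(
            f"expected {len(field_names)} fields, found {len(parts)}"
        )
    return dict(zip(field_names, parts))


# Shortened, made-up record with three fields instead of the full set.
example_fields = ["brief_title", "brief_summary", "overall_status"]
example_contents = "A Study Title@&A short summary.@&COMPLETED"

record = split_contents(example_contents, example_fields)
print(record["overall_status"])
```

Running the same check over every line of the real file, with the full field list, would show whether any record has a different number of pieces than expected.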

I am trying to encode the documents (JSONL) using the dense encoder:

python -m pyserini.encode input --corpus transformed_data.jsonl --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --delimiter "@&" --shard-id 0 --shard-num 1 output --embeddings pyserini_embeddings --to-faiss encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' --batch 32 --device cpu
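A side note on the command above: the field names are separated by commas as well as spaces. A POSIX shell splits on whitespace only, so each comma stays attached to the preceding name, and an `argparse` option declared with `nargs='+'` (as the `--fields` option is, per the parser definition quoted further down) receives those tokens unchanged. The sketch below reproduces this with `shlex` and a stand-alone parser using a shortened field list. Whether Pyserini later strips the commas is not verified here.

```python
import argparse
import shlex

# Stand-alone parser that mirrors the quoted `--fields` definition.
parser = argparse.ArgumentParser()
parser.add_argument('--fields', nargs='+', default=['text'], required=False)

# Shortened version of the field list, written the same way as in the command.
command_tail = "--fields 'brief_title', 'brief_summary', 'Status'"

# shlex.split applies POSIX-shell-like word splitting and quote removal.
tokens = shlex.split(command_tail)
args = parser.parse_args(tokens)

print(args.fields)  # ['brief_title,', 'brief_summary,', 'Status']
```

Separating the names with spaces only (`--fields brief_title brief_summary Status`) avoids the trailing commas.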

The error after running the above command is:

Output :

481384it [00:11, 41781.69it/s]
0%| | 0/15044 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/site-packages/pyserini/encode/main.py", line 138, in
    'texts': batch_info['text'],
KeyError: 'text'

I inspected main.py in pyserini/encode. The parser for the --fields argument is:

input_parser.add_argument('--fields', help='fields that contents in jsonl has (in order)', nargs='+', default=['text'], required=False)

After this parsing, the fields are passed to the collection iterator and each batch is consumed as follows:

collection_iterator = JsonlCollectionIterator(args.input.corpus, args.input.fields, args.input.docid_field, delimiter)
with embedding_writer:
    for batch_info in collection_iterator(batch_size, args.input.shard_id, args.input.shard_num):
        kwargs = {
            'texts': batch_info['text'],
            'titles': batch_info['title'] if 'title' in args.encoder.fields else None,
            'expands': batch_info['expand'] if 'expand' in args.encoder.fields else None,
            'fp16': args.encoder.fp16,
            'max_length': args.encoder.max_length,
            'add_sep': args.encoder.add_sep,
        }

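This suggests that the encoding loop always requires a "text" field and only optionally uses the "title" and "expand" fields, so a batch built from other field names has no "text" key and raises the KeyError. Is it possible to extend this to an arbitrary set of user-defined fields? Thanks!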
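One possible workaround, sketched here under assumptions rather than taken from Pyserini: merge the custom fields into a single string per document and expose it under the "text" key the loop reads. The helper `merge_fields_into_text` is hypothetical, and the sketch assumes that a batch is a dict mapping each field name (plus "id") to a list with one value per document, which the quoted loop suggests but which is not confirmed here.

```python
# Hypothetical helper, not part of Pyserini. Assumes `batch_info` maps each
# field name (and "id") to a list with one entry per document in the batch.
def merge_fields_into_text(batch_info, field_names, separator=" "):
    """Join the chosen fields of each document into one string under 'text'."""
    columns = [batch_info[name] for name in field_names]
    merged = [
        separator.join(str(value) for value in values)
        for values in zip(*columns)
    ]
    return {"id": batch_info["id"], "text": merged}


# Made-up batch with two documents and two custom fields.
batch = {
    "id": ["d1", "d2"],
    "brief_title": ["Title one", "Title two"],
    "condition": ["Healthy", "Asthma"],
}

print("text" in batch)  # False: batch["text"] would raise KeyError
merged_batch = merge_fields_into_text(batch, ["brief_title", "condition"])
print(merged_batch["text"])  # ['Title one Healthy', 'Title two Asthma']
```

The same merge could instead be done once on the JSONL file before encoding, so that each record carries a single text field. Either way the encoder sees one concatenated string per document, which loses the field boundaries.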

ashishakkumar, Mar 19 '24 18:03