ICLR2023-OpenReviewData
Cannot crawl the data from the OpenReview website
Hi there, I tried to run parse_data.py to crawl data from OpenReview. Unfortunately, it did not work. The error messages are below. Can anybody give me a hand? Thank you!
```
ipython parse_data.py
Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3809
Number of submissions: 3809
Number of papers (including old): 4874
0%| | 0/4874 [00:00<?, ?it/s]
0%| | 0/4874 [00:00<?, ?it/s]

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/dongxingshuai/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/dongxingshuai/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py", line 166, in filter_data
    withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
File ~/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py:195
    190 # In[59]:
    191
    192
    193 # filter data in a pool of processes
    194 with Pool(8) as p:
--> 195     filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))
    198 # In[60]:
    199
    200
    201 # create dataframe
    202 ratings = pd.DataFrame(filtered_notes)

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
    247 try:
    248     it = super(tqdm_notebook, self).__iter__()
--> 249     for obj in it:
    250         # return super(tqdm...) will not catch exception
    251         yield obj
    252 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
   1179 time = self._time
   1181 try:
-> 1182     for obj in iterable:
   1183         yield obj
   1184         # Update and possibly print the progressbar.
   1185         # Note: does not call self.update(1) for speed optimisation.

File ~/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
    866 if success:
    867     return value
--> 868 raise value

IndexError: list index out of range
```
Hi, could you give more context for your error? I see you are using a `.py` file. Is it the same as the `.ipynb` notebook?
As a quick fix, you could also try re-running the program; as I recall, this could be caused by a network error.
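One way to gather that context is to apply the function serially instead of inside the pool and record which items raise. The sketch below is illustrative only: `find_failing_items` is a hypothetical helper that is not part of the repository, and the toy data merely assumes that each item is a list of note dicts with an `invitation` key.

```python
def find_failing_items(func, items):
    """Apply func to each item serially; return (index, exception) pairs for failures."""
    failures = []
    for index, item in enumerate(items):
        try:
            func(item)
        except Exception as exc:  # collect instead of aborting, so every bad item is found
            failures.append((index, exc))
    return failures


# Toy stand-ins (assumed shapes, not real OpenReview data)
def first_invitation(item):
    return item[0]['invitation']


toy_notes = [[{'invitation': 'A'}], [], [{'invitation': 'B'}]]
failures = find_failing_items(first_invitation, toy_notes)
for index, exc in failures:
    print(index, type(exc).__name__, exc)
```

In the script itself this would presumably be called as `find_failing_items(filter_data, notes)`, after which the offending entries of `notes` can be inspected directly.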
@fedebotu Thank you very much for your reply.
I converted the .ipynb file to .py. I have tried a few times, and every run failed with the same error.
@DongXingshuai I found out why. Apparently, a `meta_note` was not available for one of the papers (hence the error). Wrapping the lookup in a `try`/`except` fixed it!
Here is the updated `filter_data` function:
```python
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # parse each note
    withdraw = 0
    try:
        # filter meta note
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        # check withdrawn
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except Exception:
        # note: simple pass for no meta notes (withdraw stays 0)
        pass
    # decision
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except Exception:
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content'].keys()]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes]  # review lengths
    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    if decision:
        data['decision'] = decision
    return data
```
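For reference, the failure happens because the list comprehension yields an empty list when an item has no notes or when every note's invitation contains `'Paper'`, so `meta_note[0]` raises `IndexError`. A possible alternative to the `try`/`except`, sketched here with a hypothetical helper `is_withdrawn` and made-up invitation strings (not real OpenReview values), is to test the list before indexing it:

```python
def is_withdrawn(item):
    """Return 1 if the first meta note marks a withdrawn submission, else 0."""
    meta_note = [d for d in item if 'Paper' not in d['invitation']]
    if not meta_note:
        # no meta note available: treat the paper as not withdrawn
        return 0
    return 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0


# Made-up invitation strings, for illustration only
only_reviews = [{'invitation': 'Conf/Paper1/-/Official_Review'}]
withdrawn = [{'invitation': 'Conf/-/Withdrawn_Submission'}]
active = [{'invitation': 'Conf/-/Blind_Submission'}]

print(is_withdrawn(only_reviews), is_withdrawn(withdrawn), is_withdrawn(active))
```

The explicit check handles only the empty-list case, so other unexpected problems (for example a note without an `invitation` key) would still surface as errors rather than being silently skipped.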
@fedebotu thank you very much.