ICLR2023-OpenReviewData

Cannot crawl the data from the OpenReview website

Open DongXingshuai opened this issue 1 year ago • 4 comments

Hi there, I tried to run parse_data.py to crawl data from OpenReview. Unfortunately, it did not work; the error messages are below. Can anybody give me a hand? Thank you!

```
ipython parse_data.py
Offset: 0 Data: 0
Offset: 1000 Data: 1000
Offset: 2000 Data: 2000
Offset: 3000 Data: 3000
Offset: 4000 Data: 3809
Number of submissions: 3809
Number of papers (including old): 4874
  0%|          | 0/4874 [00:00<?, ?it/s]
  0%|          | 0/4874 [00:00<?, ?it/s]

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/dongxingshuai/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/dongxingshuai/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py", line 166, in filter_data
    withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
File ~/research_associate/nlp/ICLR2023-OpenReviewData-main/notebooks/parse_data.py:195
    190 # In[59]:
    191
    192
    193 # filter data in a pool of processes
    194 with Pool(8) as p:
--> 195     filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))
    198 # In[60]:
    199
    200
    201 # create dataframe
    202 ratings = pd.DataFrame(filtered_notes)

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/notebook.py:249, in tqdm_notebook.__iter__(self)
    247 try:
    248     it = super(tqdm_notebook, self).__iter__()
--> 249     for obj in it:
    250         # return super(tqdm...) will not catch exception
    251         yield obj
    252     # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/anaconda3/envs/nlp/lib/python3.8/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
   1179 time = self._time
   1181 try:
-> 1182     for obj in iterable:
   1183         yield obj
   1184         # Update and possibly print the progressbar.
   1185         # Note: does not call self.update(1) for speed optimisation.

File ~/anaconda3/envs/nlp/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
    866 if success:
    867     return value
--> 868 raise value

IndexError: list index out of range
```

DongXingshuai · Sep 07 '23 03:09

Hi, could you give more context for your error? I see you are using a .py file; is it the same as the .ipynb notebook? As a quick fix, you could also try re-running your program, since as far as I remember this can be caused by a network error.
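
If it does turn out to be a flaky connection, something like the sketch below could retry the download automatically instead of restarting by hand. This is untested, and `crawl_notes` is just a stand-in for whatever function in your parse_data.py actually fetches the notes:

```python
import time

def crawl_with_retries(crawl_notes, max_retries=3, wait_s=30):
    """Retry a flaky download a few times before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            return crawl_notes()
        except Exception as err:  # e.g. a dropped connection to OpenReview
            print(f"Attempt {attempt}/{max_retries} failed: {err}")
            if attempt == max_retries:
                raise
            time.sleep(wait_s)  # give the server a moment before retrying
```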

fedebotu · Sep 07 '23 07:09

@fedebotu Thank you very much for your reply.

I converted the .ipynb file to .py. I have tried a few times, and every run failed with the same problem.

DongXingshuai · Sep 07 '23 08:09

@DongXingshuai I found out why. Apparently, a meta_note was not available for one of the papers, hence the IndexError. Wrapping it in a try... except fixed it!

Here is the updated `filter_data` function:

```python
def filter_data(item,
                review_keys=['summary_of_the_paper', 'strength_and_weaknesses', 'clarity,_quality,_novelty_and_reproducibility', 'summary_of_the_review'],
                decision=True):
    """Filter only ratings, confidence, withdraw status and decisions"""
    # parse each note
    withdraw = 0
    try:
        # filter meta note
        meta_note = [d for d in item if 'Paper' not in d['invitation']]
        # check withdrawn
        withdraw = 1 if 'Withdrawn_Submission' in meta_note[0]['invitation'] else 0
    except:
        # note: simple pass for no meta notes
        pass
    # decision
    if decision:
        try:
            if withdraw == 0:
                decision_note = [d for d in item if 'Decision' in d['invitation']]
                decision = decision_note[0]['content']['decision']
            else:
                decision = ''
        except:
            decision = ''
    # filter reviewer comments
    comment_notes = [d for d in item
                     if 'Official_Review' in d['invitation'] and 'recommendation' in d['content'].keys()]
    comment_notes = sorted(comment_notes, key=lambda d: d['number'])[::-1]
    ratings = [int(note['content']['recommendation'].split(':')[0]) for note in comment_notes]
    confidences = [int(note['content']['confidence'].split(':')[0]) for note in comment_notes]
    review_lengths = [sum(len(note['content'][key].split()) for key in review_keys) for note in comment_notes]  # review lengths

    data = {'ratings': ratings, 'confidences': confidences, 'withdraw': withdraw, 'review_lengths': review_lengths}
    if decision: data['decision'] = decision
    return data
```
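
For reference, the function gets applied further down in parse_data.py roughly like this (pieced together from the traceback above; the imports are my guess, and `notes` is the list of per-submission note lists collected earlier in the script):

```python
from multiprocessing import Pool

import pandas as pd
from tqdm import tqdm  # the script uses the notebook variant of tqdm, per the traceback

# filter the crawled notes in a pool of worker processes
with Pool(8) as p:
    filtered_notes = list(tqdm(p.imap(filter_data, notes), total=len(notes)))

# build a dataframe of ratings / confidences / decisions from the filtered records
ratings = pd.DataFrame(filtered_notes)
```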

fedebotu · Sep 07 '23 12:09

@fedebotu Thank you very much.

DongXingshuai · Sep 11 '23 00:09