
Feature request: Paginated fetches for large lists

Open · logan-pugh opened this issue on Jun 16 '20 · 3 comments

As we know from issues like #57, #63, and #106, SharePoint Online (and uncustomized on-premises versions) has a 5,000-record limit for views and queries.

A workaround for this limitation is to fetch records in batches smaller than the 5,000-item threshold. An example of this using ListItemCollectionPosition can be found here, but it requires indexes on all queried columns, which may not always be possible.

A naïve solution that I have working is to use the ID column and Where clauses to build batches of up to row_limit items and loop until there are no more results. This can silently miss records if the data has a gap in IDs wider than row_limit, though: a window that happens to be empty ends the loop even when higher IDs still exist.

Regardless, here is a function that works for me. Perhaps this could be improved upon and included as a feature in SharePlum:

def get_sharepoint_items_paged(
    sp_list,
    id_column='ID',
    where_clause=None,
    fields=None,
    view_name=None,
    row_limit=1000,
):
    """Fetch all items from sp_list in batches, windowed on id_column."""
    data = []
    current_id = 0

    def load_next_result_set():
        # Query the half-open ID window [current_id, current_id + row_limit).
        min_id = current_id
        max_id = current_id + row_limit

        query = dict()
        if where_clause:
            # Prepend the caller's filter to the pagination bounds.
            query['Where'] = [
                'And',
                where_clause,
                'And',
                ('Geq', id_column, str(min_id)),
                ('Lt', id_column, str(max_id)),
            ]
        else:
            query['Where'] = [
                'And',
                ('Geq', id_column, str(min_id)),
                ('Lt', id_column, str(max_id)),
            ]

        return sp_list.GetListItems(
            view_name=view_name,
            fields=fields,
            query=query,
            row_limit=row_limit,
        )

    result_set = load_next_result_set()
    while len(result_set) > 0:
        # Advance the window; stop as soon as a batch comes back empty.
        current_id += row_limit
        data.extend(result_set)
        result_set = load_next_result_set()
    return data

id_column defaults to ID (I assume this is present on every list, but the user may want to use another column), and where_clause is optional (I needed it for my use case); it is simply prepended to the pagination clauses. Otherwise I think the arguments are fairly self-explanatory (passed through to GetListItems as-is). Oh, and row_limit defines the batch size (defaulting to 1000).
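For context, here is roughly how I call it against SharePoint Online; the site URL, credentials, and list name below are placeholders, not my real values:

from shareplum import Office365, Site
from shareplum.site import Version

# Placeholder site URL, credentials, and list name; substitute your own.
authcookie = Office365(
    'https://example.sharepoint.com',
    username='user@example.com',
    password='password',
).GetCookies()
site = Site(
    'https://example.sharepoint.com/sites/MySite',
    version=Version.v365,
    authcookie=authcookie,
)
sp_list = site.List('My Large List')

items = get_sharepoint_items_paged(sp_list, row_limit=1000)
print(len(items))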

Thoughts?

logan-pugh · Jun 16 '20

How would you know beforehand how to break up the query? Are you saying you would use the total number of row IDs in the list and limit each subset query to ensure that you didn't exceed 5000?

jasonrollins · Oct 08 '20

@jasonrollins No idea what the best approach is, to be honest; this is just a hack I threw together to get something working. I am not a SharePoint expert, but this seems like a common enough pain point to try something to address it.

I am no longer using SharePlum myself, but that is more to do with moving away from SharePoint than any kind of knock against this library. Appreciate your work!

logan-pugh · Oct 12 '20

My revised version:

def get_sharepoint_items_paged(
    sp_list,
    where_clause=None,
    fields=None,
    view_name=None,
    page_size=1000,
):
    data = []
    last_id = 0

    def load_next_result_set():
        query = dict()
        if where_clause:
            query['Where'] = [
                'And',
                where_clause,
                ('Gt', 'ID', str(last_id)),
            ]
        else:
            query['Where'] = [
                ('Gt', 'ID', str(last_id)),
            ]
        # Order by ID so each page starts exactly where the last one ended.
        query['OrderBy'] = ['ID']

        return sp_list.GetListItems(
            view_name=view_name,
            fields=fields,
            query=query,
            row_limit=page_size,
        )

    result_set = load_next_result_set()
    while len(result_set) > 0:
        # Resume from the highest ID returned; note that fields, if given,
        # must include 'ID' for this lookup to work.
        last_id = int(result_set[-1]['ID'])
        data.extend(result_set)
        result_set = load_next_result_set()
    return data
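Unlike the windowed version above, this orders by ID and filters with Gt on the last ID actually returned, so gaps in the ID sequence can't end the loop early. If SharePlum adopted something like this, it could also be exposed as a generator so huge lists don't have to sit in memory all at once; a sketch (iter_sharepoint_items_paged is just a suggested name, using the same GetListItems call as above):

def iter_sharepoint_items_paged(sp_list, fields=None, view_name=None, page_size=1000):
    # Yield items lazily, one page of up to page_size at a time.
    last_id = 0
    while True:
        query = {
            'Where': [('Gt', 'ID', str(last_id))],
            'OrderBy': ['ID'],
        }
        result_set = sp_list.GetListItems(
            view_name=view_name,
            fields=fields,
            query=query,
            row_limit=page_size,
        )
        if not result_set:
            return
        # As above, fields (if given) must include 'ID'.
        last_id = int(result_set[-1]['ID'])
        for item in result_set:
            yield item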

appcorner · Jun 09 '22