Office365-REST-Python-Client icon indicating copy to clipboard operation
Office365-REST-Python-Client copied to clipboard

Exceed 5000 list items limit

Open mingyuanhua opened this issue 3 years ago • 10 comments

I am able to access lists via this api, but always obtain empty items. The list is quite large, which contains 50,000+ records. Also get the same results with no data when using Postman; however, there are data if using AJAX.

After a long time research, I find the root cause is the 5000 list items limit in sharepoint API. I am trying to use solve this problem with caml query, but failed when using itemPosition with the error: ClientRequestException: ('-2147024860, Microsoft.SharePoint.SPQueryThrottledException', 'The attempted operation is prohibited because it exceeds the list view threshold.'

Please kindly give me some help and suggestions to achieve the data.

Part of my codes is shown below:

itemPosition = None
while True:
    query_text="<View><ViewFields><FieldRef Name='ID'/></ViewFields><Query></Query><RowLimit>5000</RowLimit></View>"
    caml_query = CamlQuery.parse(query_text)
    caml_query.ListItemCollectionPosition = itemPosition;
    items = list_source.get_items(caml_query)
    ctx.load(items)
    ctx.execute_query()

    itemPosition = items.ListItemCollectionPosition
    print(itemPosition)
    if itemPosition == None:
        break                

mingyuanhua avatar Aug 16 '21 06:08 mingyuanhua

I am curious how to use itemPosition in python, which might solve this issue. @vgrem

mingyuanhua avatar Aug 20 '21 05:08 mingyuanhua

I have having the same issue, I can read the directories but everytime I try to return a list of files I either get an empty list or 500 max limit error @mingyuanhua have you solved this?

stefanstapinski avatar Sep 14 '21 15:09 stefanstapinski

I have having the same issue, I can read the directories but everytime I try to return a list of files I either get an empty list or 500 max limit error @mingyuanhua have you solved this?

Nope. Eventually I connected to the db directly rather than using the API.

mingyuanhua avatar Sep 19 '21 02:09 mingyuanhua

How do you connect DB ? Which DB ?

rangannanaik avatar Dec 21 '21 08:12 rangannanaik

Any solutions if list item has more than 5000 records how to handle ?

rangannanaik avatar Dec 21 '21 08:12 rangannanaik

Any solution for this ?

pecalleja avatar Feb 22 '22 00:02 pecalleja

I could not, I now just do individual calls i.e. (i'm still looking to figure on how to do bulk calls (i.e. use this function but call 100 at once) FILE_URL = '/sites/MySite/Folder/File.xlsm' ctx.web.get_file_by_server_relative_url(FILE_URL).get().execute_query()

nafets33 avatar Feb 22 '22 17:02 nafets33

Is the ListItemCollectionPosition class currently supported? And if so, what type of paging info should we give as input when calling the class? I tried to call

ListItemCollectionPosition("Paged=TRUE&PagedPrev=TRUE&p_ID=114957") , so that it starts from the item with id 114957, and then

qry=CamlQuery(listitem_collection_position=position) qry.ViewXml='.......' items = large_list.get_items(create_paged_query(10)).execute_query()

it results in

ClientRequestException: ('-1, Microsoft.Data.OData.ODataException', "A type named 'ListItemCollectionPosition' could not be resolved by the model. When a model is available, each type name must resolve to a valid type.", "400 Client Error: Bad Request for url.

Any idea how to solve the pagination issue? Currently, i have a large list with more than 50.000 files, so the only way to query it is by using paging. I can use the paged_query function from your examples to query the first 5000 files, but then how could I get the next page of results?

Thanks.

theodoriss avatar Apr 27 '22 15:04 theodoriss

Hi guys,

I have the same problem here, impossible to work with ListItemCollectionPosition. So after some hours, I decided to code a quick workaround. It might not be the most elegant, but it works.

The idea is to have two queries:

  • one that select items between two numerical values of a field (ID / UID should be the best), separated by a batch_size
  • the other checking if there are items left above a numerical value

And then in a while loop, I run the first query and do my stuff (currently pass). But if the len of the items collection is 0, I check if there are more items above the greater numerical values. If no, I quit the function.

Here is the code (which inside a method of a class). Hope it helps!

        def get_id_batch_query(low_value, high_value):
            qry = CamlQuery()
            qry.ViewXml = f"""
            <View Scope='RecursiveAll'>
                <Query><Where><And>
                    <Geq><FieldRef Name='ID' /><Value Type='Integer'>{low_value}</Value></Geq>
                    <Lt><FieldRef Name='ID' /><Value Type='Integer'>{high_value}</Value></Lt>
                </And></Where></Query>
            </View>
            """
            return qry
        
        def get_id_above_query(value):
            qry = CamlQuery()
            qry.ViewXml = f"""
            <View Scope='RecursiveAll'>
                <Query><Where>
                    <Geq><FieldRef Name='ID' /><Value Type='Integer'>{value}</Value></Geq>
                </Where></Query>
            </View>
            """
            return qry

        def more_id_to_get(s_list, value):
            list_qry = get_id_above_query(value)
            items = s_list.get_items(list_qry).execute_query()
            if len(items) == 0:
                return False
            else:
                return True
            

        id_count = 0
        while True:
            list_qry = get_id_batch_query(id_count, id_count+batch_size)
            items = s_list.get_items(list_qry).execute_query()
            
            if len(items) == 0:
                if not more_id_to_get(s_list, id_count):
                    print("No more items to get")
                    print("Stopping above id_count " + str(id_count))
                    return
            else:
                print("Loaded items count: {0}".format(len(items)))
                for index, item in enumerate(items):
                    pass
                    #print(item.properties)
            
            id_count += batch_size

winklefr avatar May 23 '22 16:05 winklefr

    id_count = 0
    while True:
        list_qry = get_id_batch_query(id_count, id_count+batch_size)
        items = s_list.get_items(list_qry).execute_query()

What is the s_list here? Cuz when I try to list folder that contains more than 5000 it returns 0 files there, so there is no point in passing that list. I tried sth like that https://github.com/vgrem/Office365-REST-Python-Client/blob/master/examples/sharepoint/lists/read_large_list.py to load list, but there is no page like that, so now I am not sure what I'm supposed to pass as s_list. My site looks like that: site -> - folder 1 - folder 2 ------ - folder_that_contains_over_5k_files - folder 3

so my list for s_list should be: large_list = ctx.web.lists.get_by_title("folder 2/folder_that_contains_over_5k_files") Am I right?

mpinior avatar Sep 01 '22 08:09 mpinior

Hello,

Sorry not responding sooner.

s_list is the item returned when you connect to sharepoint and request your list:

# Initiate connection and get data as SharePoint list object
            ctx = ClientContext(site_url).with_credentials(UserCredential(username, password))
            sp_lists = ctx.web.lists
            s_list = sp_lists.get_by_title(liste_name)  

It works fine for me, I don't use the inbuilt functions, I only use the basic ones.

But then when I cleaned my code I did it a bit differently. I wrote a "Connector" class with the following method to initiate the connection:

    def get_data_from_source (self, data_id: int):
        """Connect to client source and return raw data

        Args:
            data_id (int): number representing the data in the config file

        Returns:
            depends on client source: raw data, types depends of client source
        """
        
        if self.source == "SP":
            # Get relavant value from config file
            site_url = self.config['Connection']['site_url']
            username = self.config['Connection']['username']
            password = self.config['Connection']['password']
            data_name = self.config['Client_lists']['list_names'][data_id]
            data_info = self.config[data_name]
            
            # Initiate connection and get data as SharePoint list object
            ctx = ClientContext(site_url).with_credentials(UserCredential(username, password))
            sp_lists = ctx.web.lists
            data = sp_lists.get_by_title(data_name) 
            
        return data_name, data_info, data

self.confing being a conf file with my connection info and liste names, etc.

Then I have another method that calls the "data" returned by this method (wich contains, amongst other, the sharepoint list).

Hope it helps.

winklefr avatar Oct 04 '22 09:10 winklefr

Dear @vgrem. Thank you for providing Office365-REST-Python-Client/examples/sharepoint/lists/read_large_list.py, which caters for Sharepoint lists that give the error if retrieved in the normal way.

I have tried to work out how to do the same for folders that contain 10,000 or more files, and I have not succeeded. Can you please provide some hints as to how this might be possible? Thank you very much.

MauriceSnell avatar Nov 01 '22 02:11 MauriceSnell

Greetings @MauriceSnell and all,

here is an important update, in 2.3.14 version a few improvements in terms for working with large lists have been introduced

By large the list with the amount of items which exceeds 5000 items are considered

Namely, to return all the items the following options could be considered:

ClientObjectCollection.get_all() new method

get_all() method to retrieve all the items in a collection, regardless of the size, for example:

def print_progress(items):
    """
    :type items: office365.sharepoint.listitems.collection.ListItemCollection
    """
    print("Items read: {0}".format(len(items)))

page_size = 5000  #corresponds to default list view threshold
ctx = ClientContext(site_url).with_credentials(client_credentials)
large_list = ctx.web.lists.get_by_title(list_title)

all_items = target_list.items.get_all(page_size, print_progress).execute_query()

ClientObjectCollection.paged(n) method

It returns paged result, the following example demonstarte how to enumerate items across all the pages:

page_size = 5000

ctx = ClientContext(site_url).with_credentials(client_credentials)
large_list = ctx.web.lists.get_by_title(list_title)
paged_items = large_list.items.paged(page_size).get().execute_query()
for index, item in enumerate(paged_items):  # type: int, ListItem
    print("{0}: {1}".format(index, item.id))

vgrem avatar Nov 21 '22 19:11 vgrem