simple-salesforce icon indicating copy to clipboard operation
simple-salesforce copied to clipboard

Bulk query only returns the first batch

Open skamensky opened this issue 6 years ago • 8 comments

results.json() in the bulk API file referenced in the link below returns a list of batch ids that the query created. When I queried 50+ columns on the lead object with 300k+ leads, it was broken into two batches. The bulk API only returns the first batch. The line responsible is url_query_results = "{}{}{}".format(url, '/', result.json()[0]) which only uses the first batch

https://github.com/simple-salesforce/simple-salesforce/blob/231aadd3a41570690137575bd12b9d5e0a6bb6ac/simple_salesforce/bulk.py#L156-L161

a simple fix would be this:

        if operation == 'query':
            query_result = []
            for batch_result in result.json():
                url_query_results = "{}{}{}".format(url, '/', batch_result)
                batch_result_json = call_salesforce(url=url_query_results, method='GET',
                                                session=self.session,
                                                headers=self.headers).json()
                query_result.extend(batch_result_json)
            return query_result

skamensky avatar Nov 04 '18 05:11 skamensky

Proper way is to use "queryAll" operator I suppose. This is implemented here as I see: https://github.com/simple-salesforce/simple-salesforce/pull/259 But to use "queryAll" operator you should use API version >= 39.0.

Totorokrut avatar Nov 04 '18 10:11 Totorokrut

Using the queryAll endpoint just includes deleted and archived records. It doesn't include everything in one single batch (as per documentation )

Executes the specified SOQL query. Unlike the Query resource, QueryAll will return records that have been deleted because of a merge or delete. QueryAll will also return information about archived Task and Event records. QueryAll is available in API version 29.0 and later.

The pull request you mentioned still has this issue. See the code below that still uses results.json()[0] instead of looping through the batch Id's and getting all batch data.

https://github.com/simple-salesforce/simple-salesforce/blob/acd426280acb9c462fcf2c6bde982919b3b939b1/simple_salesforce/bulk.py#L157-L162

skamensky avatar Nov 04 '18 14:11 skamensky

I thought there was a query_more example in the documentation where you could check and see if a "page 1" or "next page" existed in the results and then you could query the next "page" of data?

espoelstra avatar Jan 09 '19 16:01 espoelstra

In agreement with @skamensky on this one: the salesforce documentation states that queryAll returns merged and deleted data in addition to active data. However, the use of query_all for this library is misleading because (as it states in the README) query_all is "As a convenience, to retrieve all of the results in a single local method call use."

I think updating the README to reflect the proper usage and implementing @skamensky's proposed changes would work well.

I see there is a fix: https://github.com/simple-salesforce/simple-salesforce/pull/281

Anything holding it from being merged to master?

wyattshapiro avatar Jan 16 '19 23:01 wyattshapiro

We are waiting on test to be written for that PR

andscoop avatar Feb 07 '19 17:02 andscoop

@skamensky Thanks for the suggestion, I updated my pull request for Bulk queryAll #259

petergtam avatar Apr 17 '19 04:04 petergtam

Just a quick question, would we see this update come in soon?

ghost avatar Feb 04 '20 15:02 ghost

The quick fix mentioned works and should be merged asap. I don't see a reason or dependency why this can't be done.

Phlogi avatar Feb 09 '22 10:02 Phlogi