simple-salesforce
Bulk query only returns the first batch
result.json() in the bulk API file linked below returns a list of batch IDs that the query created. When I queried 50+ columns on the Lead object with 300k+ leads, the query was split into two batches. The library's bulk API only returns the first batch.
The line responsible is
url_query_results = "{}{}{}".format(url, '/', result.json()[0])
which only uses the first batch
https://github.com/simple-salesforce/simple-salesforce/blob/231aadd3a41570690137575bd12b9d5e0a6bb6ac/simple_salesforce/bulk.py#L156-L161
A simple fix would be this:
if operation == 'query':
    query_result = []
    for batch_result in result.json():
        url_query_results = "{}{}{}".format(url, '/', batch_result)
        batch_result_json = call_salesforce(url=url_query_results, method='GET',
                                            session=self.session,
                                            headers=self.headers).json()
        query_result.extend(batch_result_json)
    return query_result
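For anyone who wants to try the idea outside the library, here is a minimal, self-contained sketch of the same loop. It is not the library's code: collect_query_results, fake_fetch, the example.invalid URLs and the ids are all made up for illustration, and the HTTP call is replaced by a stub so it runs without a Salesforce org.

```python
from typing import Callable, Dict, Iterable, List


def collect_query_results(
    base_url: str,
    result_ids: Iterable[str],
    fetch_json: Callable[[str], List[dict]],
) -> List[dict]:
    """Fetch every result set listed for a bulk query and merge the records.

    fetch_json is a hypothetical stand-in for the GET request that the
    library performs with call_salesforce(...).json().
    """
    records: List[dict] = []
    for result_id in result_ids:
        # One request per id, instead of only the first entry of the list.
        records.extend(fetch_json("{}/{}".format(base_url, result_id)))
    return records


# Stubbed responses (assumed shape: each URL returns a list of records).
BASE_URL = "https://example.invalid/job/1/batch/1/result"
FAKE_RESPONSES: Dict[str, List[dict]] = {
    BASE_URL + "/752A": [{"Id": "00Q1"}, {"Id": "00Q2"}],
    BASE_URL + "/752B": [{"Id": "00Q3"}],
}


def fake_fetch(url: str) -> List[dict]:
    """Return the canned response for a URL instead of calling Salesforce."""
    return FAKE_RESPONSES[url]


all_records = collect_query_results(BASE_URL, ["752A", "752B"], fake_fetch)
print(len(all_records))
```

With only the first id, the 00Q3 record would be silently dropped, which matches the behaviour described above.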
The proper way is to use the "queryAll" operation, I suppose. As far as I can see, this is implemented here: https://github.com/simple-salesforce/simple-salesforce/pull/259 But to use the "queryAll" operation you need API version >= 39.0.
Using the queryAll endpoint just includes deleted and archived records. It doesn't return everything in one single batch (as per the documentation):
Executes the specified SOQL query. Unlike the Query resource, QueryAll will return records that have been deleted because of a merge or delete. QueryAll will also return information about archived Task and Event records. QueryAll is available in API version 29.0 and later.
The pull request you mentioned still has this issue. See the code below, which still uses result.json()[0] instead of looping through the batch IDs and fetching the data for every batch.
https://github.com/simple-salesforce/simple-salesforce/blob/acd426280acb9c462fcf2c6bde982919b3b939b1/simple_salesforce/bulk.py#L157-L162
I thought there was a query_more example in the documentation where you could check whether a "page 1" or "next page" existed in the results and then query the next "page" of data?
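For reference, that kind of paging looks roughly like the sketch below. It applies to regular REST query results (which carry "done" and "nextRecordsUrl" fields), not to the bulk batch results this issue is about. FakeClient, iter_all_records and the sample URL are hypothetical; the stub only mimics the query / query_more calls, as I understand them from the README, so the example runs offline.

```python
from typing import Dict, Iterator, List


class FakeClient:
    """Hypothetical stand-in for a Salesforce client, serving canned pages."""

    def __init__(self, pages: List[Dict]):
        self._pages = pages

    def query(self, soql: str) -> Dict:
        # The first page is returned by the initial query.
        return self._pages[0]

    def query_more(self, next_records_identifier: str,
                   identifier_is_url: bool = False) -> Dict:
        # In this stub the page index is encoded after the last "-".
        index = int(next_records_identifier.rsplit("-", 1)[1])
        return self._pages[index]


def iter_all_records(client, soql: str) -> Iterator[Dict]:
    """Yield records from every page until the response says it is done."""
    page = client.query(soql)
    while True:
        yield from page["records"]
        if page["done"]:
            break
        page = client.query_more(page["nextRecordsUrl"], identifier_is_url=True)


client = FakeClient([
    {"records": [{"Id": "a"}], "done": False, "nextRecordsUrl": "/query/x-1"},
    {"records": [{"Id": "b"}], "done": True},
])
ids = [record["Id"] for record in iter_all_records(client, "SELECT Id FROM Lead")]
print(ids)
```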
I agree with @skamensky on this one: the Salesforce documentation states that queryAll returns merged and deleted data in addition to active data. However, the name query_all in this library is misleading, because the README introduces it with "As a convenience, to retrieve all of the results in a single local method call use", which describes fetching every page of results rather than including deleted records.
I think updating the README to reflect the proper usage and implementing @skamensky's proposed changes would work well.
I see there is a fix: https://github.com/simple-salesforce/simple-salesforce/pull/281
Is anything holding it back from being merged to master?
We are waiting on tests to be written for that PR.
@skamensky Thanks for the suggestion. I updated my pull request for Bulk queryAll, #259.
Just a quick question: will we see this update come in soon?
The quick fix mentioned works and should be merged ASAP. I don't see any reason or dependency that would prevent this.