RForcecom icon indicating copy to clipboard operation
RForcecom copied to clipboard

Is speed an issue?

Open ax42 opened this issue 9 years ago • 9 comments

Hi

RForceCom feels slow -- it's taking about 4 minutes to execute a query which returns about 18'500 records with 9 variables.

    SELECT Id, Field, CreatedById, CreatedDate, IsDeleted, NewValue, OldValue, OpportunityId,     Opportunity.Name 
                FROM OpportunityFieldHistory 
                WHERE Opportunity.CreatedDate >= ", startDate, "T00:00:00.000Z")

I've not started tracing through this to figure out where the issue could be (SF? Network? XML vs JSON?). What's the best way to start digging into this issue?

ax42 avatar Oct 15 '15 08:10 ax42

Ok, I've done some ghetto profiling, and I think the line

xls <- lapply(lapply(xns, xmlToList), enlist)

in rforcecom.query() is the issue. Using a query which takes overall 4.2 seconds (427 records, so no need to refetch), around 1 second is for the data transfer (which is the same if I use another REST tool), and 2.2 seconds are used for the line above (so 2.2x the time the data transfer takes).

Creating a query which returns more records (~1'400) makes the discrepancy worse:

  • 2.3s for the curl
  • 7.299s for the xmlToList (3.2x the time of the curl)

I'd be interested in trying to improve this. Is there a faster XML library? Does one really need to convert the whole XML to a list? Would JSON be quicker?

ax42 avatar Oct 15 '15 14:10 ax42

More updates:

  • Asking for the results in JSON is not quicker (in fact it's slower to fetch JSON than XML in my tests, by a factor of 2)
  • I think the xml2 library will be faster, busy checking this out

ax42 avatar Oct 15 '15 22:10 ax42

OK, I've done some further benchmarking -- it seems that we can get some good speed increases if we use curl instead or RCurl (~50% speed improvement on a small test) and xml2 instead of XML.

@hiratake55 -- I think refactoring the whole library will be a challenge to do in one shot, what is your appetite to do this one function at a time? It will mean requiring both xml libraries and both curl libraries (as some functions are likely to use the one, and others the other.

ax42 avatar Oct 16 '15 13:10 ax42

Hi @ax42 ,

Thank you for contacting me, I'll check xml2 package fits RForcecom or not.

hiratake55 avatar Oct 17 '15 09:10 hiratake55

@ax42 Have you considered using the Bulk API features of the package? Below are some timings with roughly 650K records and I've pulled 50K in a second or two. Salesforce caches the queries, so they become faster if you repeat them.

Note: rforcecom.bulkQuery is a convenience wrapper I've written around rforcecom.submitBulkQuery and friends, but I'll submit PR today and hopefully have it be included in package as standard function. I'll post its code here as well, so you have it.

# before salesforce caching
# Salesforce runtime = 69 seconds, so 11 second overhead from R
system.time(d <- rforcecom.bulkQuery(session, 
                     soqlQuery = "Select Id from Account", 
                     object = 'Account', 
                     interval_seconds=1, 
                     max_attempts=100, 
                     verbose=FALSE))
   user  system elapsed 
   0.24    0.05    80.52 

dim(d)
[1] 644239      1

# after salesforce caching
# Salesforce runtime = 16 seconds, so 10 second overhead from R
system.time(d <- rforcecom.bulkQuery(session, 
                     soqlQuery = "Select Id from Account", 
                     object = 'Account', 
                     interval_seconds=1, 
                     max_attempts=100, 
                     verbose=FALSE))
   user  system elapsed 
   0.24    0.05    26.17 

dim(d)
[1] 644239      1

rforcecom.bulkQuery

rforcecom.bulkQuery <- function(session,
                               soqlQuery,
                               object,
                               interval_seconds=5,
                               max_attempts=100, 
                               verbose=FALSE){

  job_info <- rforcecom.createBulkJob(session, operation='query', object=object)
  batch_query_info <- rforcecom.submitBulkQuery(session,
                                                jobId=job_info$id,
                                                query=soqlQuery)
  status_complete <- FALSE
  z <- 1
  Sys.sleep(interval_seconds)
  while (z < max_attempts & !status_complete){
    if (verbose){
      message(paste0("Pass #", z))
    }
    Sys.sleep(interval_seconds)
    batch_query_status <- rforcecom.checkBatchStatus(session,
                                                     jobId=batch_query_info$jobId,
                                                     batchId=batch_query_info$id)
    status_complete <- (batch_query_status$state=='Completed')
    z <- z + 1
  }
  if (!status_complete) {
    message(paste('Issue with batches submitted', print(proceed_on_batches)))
    batch_query_details <- NULL
    tryCatch({
      batch_query_details <- rforcecom.getBatchDetails(session,
                                                       jobId=batch_query_info$jobId,
                                                       batchId=batch_query_info$id)
    }, error=function(e){
    })
    # close the job
    close_job_info <- rforcecom.closeBulkJob(session, jobId=job_info$id)
    return(batch_query_details)
  }
  batch_query_details <- rforcecom.getBatchDetails(session,
                                                   jobId=batch_query_info$jobId,
                                                   batchId=batch_query_info$id)

  batch_query_recordset <- rforcecom.getBulkQueryResult(session,
                                                        jobId=batch_query_info$jobId,
                                                        batchId=batch_query_info$id,
                                                        resultId=batch_query_details$result)
  close_job_info <- rforcecom.closeBulkJob(session, jobId=job_info$id)

  return(batch_query_recordset)
}

StevenMMortimer avatar Oct 18 '15 17:10 StevenMMortimer

That's cool, thanks. I'd not seen the bulk API pieces as I seem to be using v0.7 (off CRAN).

One thing the bulk query does not seem to be able to do is deal with foreign keys (e.g. fetching the details of an Account owner). SalesForce returns: Foreign Key Relationships not supported in Bulk Query

So it's probably really useful in some situations (straight dumps) but not in others (complex queries), although it may be faster to pull straight dumps off SF and combine them in R than run complex queries.

ax42 avatar Oct 18 '15 18:10 ax42

Yes, for large joins I would recommend pulling straight dumps of each object and joining in R (I use the dplyr package for joins). For small sets I would recommend using the SOAP-based function rforcecom.query just to cut down on R code. I haven't run performance tests to see who does the join faster on large datasets (R vs. Salesforce), but would be interesting to know.

If you want to experiment with some of the Bulk functions (since it's not on CRAN yet) you can install from the maintainer's Github or mine

Github Install

library(devtools)
install_github('ReportMort/RForcecom')

StevenMMortimer avatar Oct 18 '15 19:10 StevenMMortimer

I've installed the github version from the maintainer, and just copy/pasted the code you kindly provided. I'll try and get some benchmarks done over the next few days (although it seems our SalesForce instance is a lot smaller than yours).

My workflow so far has been to use http://dataloader.io to help me formulate my queries, and then I run them with rforcecom.query, which is why I have been focusing on benchmarking and/or improving that function specifically. Moving to xml2 should already make a huge difference.

The advantage of letting SalesForce do the joins is that you don't have to worry about consistency and you always get your dataset back exactly like you want it (especially if you are calling a bunch of lookup fields in a query). Each approach has its applications in the right place, and having both available in the library is great!

ax42 avatar Oct 18 '15 19:10 ax42

Any update on this, my queries from SFDC are extremely slow thus far.

robertocross avatar Dec 01 '17 19:12 robertocross