RForcecom icon indicating copy to clipboard operation
RForcecom copied to clipboard

Speedup/refactor proposal for rforcecom.query

Open ax42 opened this issue 9 years ago • 6 comments

I have an idea on how to improve up rforcecom.query, but I'm struggling with the mechanics of the XML, so if someone can point me in the right direction (@reportmort?) that would be helpful.

I think the function should run as follows:

  • fetch the first hunk of xml
  • extract all the //records (into a list?)
  • check the //done entry, if it's false
    • fetch the next lot of xml, add the //records into the list(?) above
  • Run through the records, find all that have a "type" attribute (which means they are a lookup field)
    • For all the child records, rename them to ParentName.ChildName (so the Name child in the Opportunity record become Opportunity.Name
    • Shift these records up a level in the tree (append and delete original?)
  • Rerun this step until it returns no records (because the tree could be multiple levels deep)

I think this will be faster / cleaner than the existing code (which builds an empty data.frame and then populates cell-by-cell) as it can also use locators which are hopefully quite optimised.

So far, I've got:

function(session, soqlQuery, queryAll=FALSE){

 # Retrieve XML via REST API

 if(queryAll){
   endpointPath <- rforcecom.api.getSoqlAllEndpoint(session['apiVersion'])
 } else {
   endpointPath <- rforcecom.api.getSoqlEndpoint(session['apiVersion'])
 }
 URL <- paste(session['instanceURL'], endpointPath, curlEscape(soqlQuery), sep="")

 fetchXML <- function(session, URL) {
   h <- basicHeaderGatherer()
   t <- basicTextGatherer()
   OAuthString <- paste("Bearer", session['sessionID'])
   httpHeader <- c("Authorization"=OAuthString, "Accept"="application/xml")
   curlPerform(url=URL, httpheader=httpHeader, headerfunction = h$update, writefunction = t$update, ssl.verifypeer=F)

   # BEGIN DEBUG
   if(exists("rforcecom.debug") && rforcecom.debug){ message(URL) }
   if(exists("rforcecom.debug") && rforcecom.debug){ message(t$value()) }
   # END DEBUG  

   return(t$value())
 }

 system.time(
   xml.in <- fetchXML(session, URL)
 )

 # Parse XML
 x.root <- xmlRoot(xmlParse(xml.in, asText=T))

 records.xml <- x.root['//records']

..but as I said, I'm really struggling with getting the XML stripped apart.

ax42 avatar Oct 23 '15 17:10 ax42

Another advantage of the above is that I think it will work with arbitrarily deep nested queries (e.g. SELECT Id, Name, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account from Issue #15

ax42 avatar Oct 23 '15 17:10 ax42

@ax42 I wrote a separate function just for parsing query response XML that can handle nested parent-child relationships and follows an algorithm similar to your suggested pseudocode. You can find that function on a branch in my forked repository: https://github.com/ReportMort/RForcecom/blob/query-refactor/R/rforcecom.utils.R

You can download and test this improved version of rforcecom.query by installing from my branch. I may open a pull request if things seem acceptable to you.

library(devtools)
install_github('ReportMort/RForcecom', ref='query-refactor')

This function will obey your global setting for stringsAsFactors = F, so that addresses Issue https://github.com/hiratake55/RForcecom/issues/21 and it also handles the nested relationships described in Issue https://github.com/hiratake55/RForcecom/issues/15. The one thing is that it no longer uses type.convert, so numeric and integer variables stay as characters because the XML is characters. I think we can address the type conversion (including dates) at another time since I'd rather leave the user to decide how to type convert for their specific implementation.

One last thing I didn't test for an improvement in speed, but I think the lower bound on performance is always going to be how fast you can parse the XML, which is tough to get going quickly when it's millions of characters long. That said, I'm not sure how much we can optimize at the moment, but would love your input.

StevenMMortimer avatar Oct 24 '15 20:10 StevenMMortimer

Thanks -- I'll try and poke around with it on Sunday. Splitting things into pieces makes it easier to measure, and then if necessary, refactor them one-by-one.

ax42 avatar Oct 24 '15 23:10 ax42

FWIW, I'll be using line profiling as detailed here: http://adv-r.had.co.nz/Profiling.html One of the quotes from that page is "unlist(x, use.names = FALSE) is much faster than unlist(x)" so there may already be a quick win there.

ax42 avatar Oct 24 '15 23:10 ax42

First report back:

  • Works with the nested query SELECT Id, Name, CreatedDate, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account where CreatedDate > 2015-08-01T00:00:00.000Z :+1:
  • The query SELECT Id, Amount, Name, Account.Name, Owner.Name, Owner.Alias, Master_Opportunity__r.Name FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z causes the function to choke with Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 1, 0

I'm passing the result of x.root <- xmlRoot(xmlParse(xml.in, asText=T)) to the function, where xml.in is the result payload from the request.

ax42 avatar Oct 25 '15 12:10 ax42

Thanks for checking. I just committed a new version of query_parser() that works for the list of queries shown below. I don't have the object Master_Opportunity__r, but it should work and behave like the references to Account and Owner in that same query you have. The code was a rush job this morning. I can review and clean it up later, but figured you might want something to play around with today if you've got the time.

Seemingly Working Queries:

columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Name, CreatedDate FROM Account")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Name, CreatedDate, (SELECT ActivityDate, Description FROM ActivityHistories) FROM Account where CreatedDate > 2015-08-01T00:00:00.000Z")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Amount, Name, Account.Name, Owner.Name FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z")
columnTest <- rforcecom.query(session, soqlQuery = "SELECT Id, Amount, Name, Account.Name, Owner.Name, Owner.Alias FROM Opportunity WHERE Opportunity.CreatedDate >=2014-06-01T00:00:00.000Z")

StevenMMortimer avatar Oct 25 '15 17:10 StevenMMortimer