simple-salesforce icon indicating copy to clipboard operation
simple-salesforce copied to clipboard

Simpler output format for successful queries

Open kkgthb opened this issue 5 years ago • 12 comments

I've found that when I want the output of an SOQL query in Python, it's usually to get down and dirty with the actual values of the data. I don't typically need the "metadata" included in Salesforce.queryall()'s output.

I'd love to contribute code to give simple_salesforce an option to return "ready-to-crunch" data, but I'm not sure if there's anyone out there but me looking for something like this.

I haven't finished this (never did get it working w/ OrderedDict output), though it's "good enough" for personal use since I'm fine with processing normal dicts (I now write code like allcontacts = stripJunkSimpleSalesforce(sf.query_all(queryString)), but before I work on "tidying it up" for sharing with the world (e.g. adding some error handling for when there was an error returning the data, figuring out how to return it as an OrderedDict for consistency with the normal output, etc.), I guess I'm curious:

Is a function like this something that the simple_salesforce user community would be interested in seeing as part of native simple_salesforce functionality?

(Either as an added API call that can be wrapped around return results the way I am, or perhaps as something that gets called if an optional parameter is passed to a query() or query_all() command.)

My code's functionality is to:

  1. Strip out excessive nesting, making everything a much simpler "list of dicts, perhaps with some more lists of dicts inside where there were SELECT subqueries"
  2. Make "lookup fields" from a "parent" record simply a key-value pair in the record that has the lookup -- more like the way you type the SOQL query in the first place
  • (i.e. if it's a "Contact" query and you said "Account.Name" in the SOQL query, simply let the "Contact" dict have an "Account.Name" key, rather than burying it in a "Name" property of a separate dict under 'Account')
  1. Strip out any key-value pairs that aren't data as requested in the SOQL query

My code so far is:

def stripJunkSimpleSalesforce(structure):
    from collections import OrderedDict
    import json
    def fixdict(struc, keyChainList, parentListRecord, parentListAddColsAndVals, parentListKillCols, passingContext):
        if 'records' in struc.keys() and type(struc['records']) is list:
            parentListRecord[keyChainList[0]] = struc['records']
            fixlist(parentListRecord[keyChainList[0]], 'fixDict')
        else:
            if 'attributes' in struc.keys():
                struc.pop('attributes')
            for k in struc.keys():
                if k != 'attributes':
                    if type(struc[k]) in [dict, OrderedDict]:
                        parentListAddColsAndVals, parentListKillCols = fixdict(struc[k], keyChainList + [k], parentListRecord, parentListAddColsAndVals, parentListKillCols, 'fixDict')
                    if type(struc[k]) not in [None, list, dict, OrderedDict] and len(keyChainList) > 0:
                        parentListAddColsAndVals['.'.join(keyChainList+[k])] = struc[k]
                        if passingContext == 'fixDict':
                            parentListKillCols.add(keyChainList[0])
        return parentListAddColsAndVals, parentListKillCols
    def fixlist(struc, passingContext):
        if type(struc) is list and len(struc) > 0:
            for x in struc:
                if type(x) in [dict, OrderedDict]:
                    listAddColsAndVals, listKillCols = fixdict(x, [], x, {}, set(), 'fixList')
                    if len(listAddColsAndVals) > 0:
                        for k,v in listAddColsAndVals.items():
                                x[k] = v
                    if len(listKillCols) > 0:
                        for k in listKillCols:
                            if k in x.keys():
                                x.pop(k)
        return
    outerStructure = None
    # I can't get this working for Ordered Dicts, so giving up and having it return a plain dict.  Leaving some junk from earlier attempt behind.
    if type(structure) in [dict, OrderedDict] and 'records' in structure.keys() and type(structure['records']) is list:
        if type(structure) is OrderedDict:
            outerStructure = json.loads(json.dumps(structure))['records']
        else:
            outerStructure = structure['records']
    if type(outerStructure) is list:
        fixlist(outerStructure, 'outermost')
    return outerStructure

kkgthb avatar Nov 07 '18 16:11 kkgthb

I would very much love to see this function in the package. Right now it just returns a huge OrderDict with dirty nested and nested data. Ideally what I want to do is get the data and load it into pandas.DataFrame

jerrytim avatar Nov 12 '18 20:11 jerrytim

You can just do a json.dumps(result) to get the JSON structure back out of the OrderedDict. I have seen a couple examples on StackOverflow of using pandas or another library to ingest that information.

espoelstra avatar Dec 12 '18 15:12 espoelstra

You can just do a json.dumps(result) to get the JSON structure back out of the OrderedDict. I have seen a couple examples on StackOverflow of using pandas or another library to ingest that information.

json.dumps(result) returns a JSON string, but still the junk response is there. I use this function to clean up the response:

def parse_sf_records(nested_dict):
    """Recursively parse the nested dictionaries returned by Salesforce Simple API library,
    :param nested_dict: Nested dictionary object
    :return: Flattened dictionary representing record
    """
    for k, v in nested_dict.items():
        if k == 'attributes':
            nested_dict.pop(k)
        elif isinstance(v, dict):
            clean_dict = parse_sf_records(v)

            for child_key, clean_value in clean_dict.items():
                clean_key = "{}.{}".format(k, child_key)
                nested_dict[clean_key] = clean_value

            nested_dict.pop(k)

    return nested_dict

jerrytim avatar Dec 17 '18 16:12 jerrytim

I agree there needs to be some work on how the responses are formatted, especially with the difference between "results" and "records" depending on whether you did a query or a POST.

I would love to see it more similar to the Requests API for handling responses, have a resp.status_code for the return code of the call, resp.json to get the raw result, and maybe resp.text to nicely format any error messages that were triggered/returned by the request if they weren't JSON.

Your data cleaning function seems pretty useful as well ensuring that things are consistent and usable.

espoelstra avatar Jan 09 '19 16:01 espoelstra

@espoelstra I too would like to go this route. I'm leaning towards all calls returning the requests.response object and then adding convenience flags for some of the methods to return cleaned/scrubbed data. This will give the most flexibility and predictability for users.

andscoop avatar Feb 07 '19 19:02 andscoop

I agree that the ordered dictionary format is unpleasant to work with, to say the least. Here's a quick function I created to convert the ordered dictionary format from Simple Salesforce's query/query_all methods into a Pandas DataFrame:

import pandas as pd

def convert_simple_salesforce_ordered_dictionary_to_pandas_dataframe(ordered_dictionary):

   # Get records list from the ordered dictionary
   ord_dict_records = ordered_dictionary['records']
   dictionary = {}

   # Create the dictionary with the proper formatting to import to the Pandas Dataframe
   # Note, special case for 'attributes' key since it's a nested ordered dictionary
   # Example format: d = {'col1': [1, 2], 'col2': [3, 4]}
   # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
   for key, value in ord_dict_records[0].items():
        if key == 'attributes':
            for attribute_key, attribute_value in value.items():
                dictionary[attribute_key] = []
        else:
            dictionary[key] = []

   # populate the dictionary. The keys being the future pandas dataframe headers
   # and the values being appended to the dictionary value list
   for record in ord_dict_records:
        for key, value in record.items():
            if key == 'attributes':
                for attribute_key, attribute_value in value.items():
                    dictionary[attribute_key].append(attribute_value)
            else:
                dictionary[key].append(value)

    return pd.DataFrame(data=dictionary)

sergeiwallace avatar Mar 05 '19 02:03 sergeiwallace

Adding my code here in case it helps someone writing their own function, definitely could be improved upon upon but does what I need it to. By default the variable parents is set to a list with the value I most commonly use, I just left it as an example. Would be interested to see how people are optimizing for larger queries.

import pandas as pd 

def ordered_dict_to_dataframe(results, parents=['Licensee_Report__r']):
    """ 
     results: OrderedDict from SOQL simple_salesforce query
     parents: List of strings, names of parent objects
     output: data frame of query results
    """
    out_df = pd.DataFrame(results['records'])
    for parent in parents:
        parent_data = out_df[parent].values
        map(lambda x: x.pop('attributes'), parent_data)
        parent_df = pd.DataFrame(list(parent_data))
        out_df = pd.concat([out_df,parent_df ], axis=1)
    out_df.drop(parents+['attributes'], axis=1, inplace=True)
    return out_df

alexforrest avatar Jul 10 '19 21:07 alexforrest

Going to drop this in here - hope it helps someone!

def make_dict(query):
    query = json.loads(json.dumps(query))
    return query

patricktrainer avatar Dec 05 '19 18:12 patricktrainer

Hi all,

The below code will work for "bottom up" queries. That means if you select from a child up. Example: SELECT Account.Name, Opportunity.StageName FROM Opportinity

Top down approach won't work: Example SELECT Name, (SELECT StageName FROM Opportunities) FROM Account

Here is the code:

def recursive_walk(od_field: OrderedDict):
    """
    Recursively flattens each row the results of simple salesforce.
    Only works for bottom up queries.
    :param od_field: results returned by simple salesforce (multiple objects)
    :return: returns a flattened list of dictionaries
    """
    d = {}
    for k in od_field.keys():
        if isinstance(od_field[k], OrderedDict) & (k != 'attributes'):
            if 'attributes' in od_field[k].keys():
                ret_df = recursive_walk(od_field[k])
                d = {**d, **ret_df}
        else:
            if k != 'attributes':
                obj = ''.join([char for char in od_field['attributes']['type'] if char.isupper()])
                d[f'{obj}.{k}'] = od_field[k]
    return d

def transform_sf_result_set_rec(query_results: OrderedDict):
    """
    Recursively flattens the results of simple salesforce. It needs flattening when  selecting
    multiple objects.
    :param query_results:
    :return:
    """
    data = []
    for res in query_results:
        d = recursive_walk(res)
        data.append(d)
    return data

This is how you call it. And make sure to pass only the ['records'] of your result.

# dict_results  < -- This is the result of your SOQL
array_dicts = transform_sf_result_set_rec(dict_results['records'])
pd.DataFrame(array_dicts)

orion512 avatar Mar 12 '20 22:03 orion512

Hi all,

The below code will work for "bottom up" queries. That means if you select from a child up. Example: SELECT Account.Name, Opportunity.StageName FROM Opportinity

Hey @orion512 - thanks so much for sharing this, it's helped me a lot. One thing I've noticed is that if two fields are selected from the same relationship, only one will be returned in the dataframe.

For example, applying the function to the following query will only return Customer Number and Created By columns.

SELECT CustomerNumber, CreatedBy.Name, LastModifiedBy.Name FROM Customer

The raw query results look like this -

CreatedBy.Name OrderedDict([('attributes', OrderedDict([('type', 'User'), ('url', '/services/data/v42.0/sobjects/User/00524000001mXXXXX')])), ('Name', 'Example Name One')])

LastModifiedBy.Name OrderedDict([('attributes', OrderedDict([('type', 'User'), ('url', '/services/data/v42.0/sobjects/User/0051p00000AXXXXX')])), ('Name', 'Example Name Two')])

Do you know how the function could be amended to resolve this?

p-hather avatar Dec 03 '20 18:12 p-hather

If you are still looking for a solution, I wrote this today. Works for relationship / bottom up queries. 🍻

def query_all_as_dataframe(self, query, include_deleted=False, include_parent_relationship=False, include_attributes=False, **kwargs):
        attributes = search(r"(?<=SELECT|select)(.*)(?=FROM|from)", query).group()
        relationships = findall(r"([A-z]+\.[A-z]+)", attributes)

        records = list(self.query_all_iter(query, include_deleted=include_deleted, **kwargs))
        dataframe = DataFrame(records)

        rm = []
        for relationship in relationships:
            object_name, attr = relationship.split(".")

            dataframe[f"{object_name}_{attr}"] = DataFrame(list(dataframe[object_name]))[0].apply(lambda x: x if x == None else x[attr])
            if object_name not in rm: rm.append(object_name)
            
        if not include_attributes: dataframe = dataframe.drop(columns=["attributes"])
        if not include_parent_relationship: dataframe = dataframe.drop(columns=rm)
        
        return dataframe

uname-n avatar Jan 26 '23 22:01 uname-n

Stumbled upon this thread because I ran into the same issue.

I used this post to come up with a workable solution => FreeCodeCamp

This works well for relationship / bottom-up queries.


from collections.abc import MutableMapping

# generator function
def _flatten_dict_gen(d, parent_key, sep, root_key="Root", drop_keys=[]):
    for k, v in d.items():
        if k not in drop_keys:
            new_key = f"{parent_key}{sep}{k}" if parent_key else f"{root_key}{sep}{k}"
            if isinstance(v, MutableMapping):
                yield from flatten_dict(v, new_key, sep=sep, drop_keys=drop_keys).items()
            else:
                yield new_key, v


def flatten_dict(d: MutableMapping, parent_key: str = '', sep: str = '.', root_key="", drop_keys=[]):
    return dict(_flatten_dict_gen(d, parent_key, sep, root_key=root_key, drop_keys=drop_keys))

calling the function

# convert to dict and flatten
flattened_records = [
    flatten_dict(x, root_key='RootObject', drop_keys=['attributes']) 
    for x in query_results['records']
]

enel-ihopper avatar Mar 17 '23 19:03 enel-ihopper