pyorient icon indicating copy to clipboard operation
pyorient copied to clipboard

Slowdown when querying large number of edges (~1000) and saving

Open sray1 opened this issue 9 years ago • 2 comments

OrientDB v 2.1.11, Python 2.7

Hi,

We are trying to perform a query of all connected in/out edges (numbering ~5000) from a set of vertices in our db and would like to save these edges' information like rid and various properties using this code:

query_string = """ SELECT @rid as edge_id, out.coding.code as outcode, out.coding.system as outsystem, out.primaryCoding.system as subject_system, out.primaryCoding.code as subject_code, out.primaryName as subject, predicate, in.primaryCoding.system as object_system, in.primaryCoding.code as object_code, in.primaryName as object, in.coding.code as incode, in.coding.system as insystem, semmedData FROM ( SELECT expand({}_SemmedRelationship) FROM V WHERE (primaryCoding in {}) {})"""
coding_str = ''.join([' OR coding CONTAINS ' + x for x in vertex_ids])

# pull out all in and out semmed relationships from code(s) (vertices)
out_query = query_string.format("out", vertex_ids, coding_str).replace("\n","").replace("'","")
in_query = query_string.format("in", vertex_ids, coding_str).replace("\n","").replace("'","")    

print "Performing out query ... "
print out_query
results_out = client.query(out_query,50000)
out_common_edges = [results_out[x]._OrientRecord__o_storage for x in range(len(results_out))] 
print len(out_common_edges), " found"

Here is the profile of the query represented above when run on the orientdb console:

orientdb {db=NewNewYork2}> explain SELECT @rid as edge_id, out.coding.code as outcode, out.coding.system as outsystem, out.primaryCoding.system as subject_system, out.primaryCoding.code as subject_code, out.primaryName as subject, predicate, in.primaryCoding.system as object_system, in.primaryCoding.code as object_code, in.primaryName as object, in.coding.code as incode, in.coding.system as insystem, semmedData FROM ( SELECT expand(out_SemmedRelationship) FROM V WHERE (primaryCoding in [#11:3580077, #11:3340220, #11:2161528, #11:4019406, #11:3410703, #11:2834440, #11:1760609, #11:1918014, #11:1807250, #11:2411355, #11:3161405, #11:3414125, #11:1498397, #11:2119476, #11:3879648, #11:2709654, #11:3589947, #11:2093892, #11:3262158, #11:1930206, #11:3180706, #11:2460316, #11:3953474, #11:3626507, #11:3475792, #11:1783170, #11:3976802, #11:3592963]) OR coding CONTAINS #11:3580077 OR coding CONTAINS #11:3340220 OR coding CONTAINS #11:2161528 OR coding CONTAINS #11:4019406 OR coding CONTAINS #11:3410703 OR coding CONTAINS #11:2834440 OR coding CONTAINS #11:1760609 OR coding CONTAINS #11:1918014 OR coding CONTAINS #11:1807250 OR coding CONTAINS #11:2411355 OR coding CONTAINS #11:3161405 OR coding CONTAINS #11:3414125 OR coding CONTAINS #11:1498397 OR coding CONTAINS #11:2119476 OR coding CONTAINS #11:3879648 OR coding CONTAINS #11:2709654 OR coding CONTAINS #11:3589947 OR coding CONTAINS #11:2093892 OR coding CONTAINS #11:3262158 OR coding CONTAINS #11:1930206 OR coding CONTAINS #11:3180706 OR coding CONTAINS #11:2460316 OR coding CONTAINS #11:3953474 OR coding CONTAINS #11:3626507 OR coding CONTAINS #11:3475792 OR coding CONTAINS #11:1783170 OR coding CONTAINS #11:3976802 OR coding CONTAINS #11:3592963)
Profiled command '{documentReads:5631,current:#14:28708870,recordReads:5631,limit:-1,fetchingFromTargetElapsed:939,expandElapsed:0,evaluated:5631,user:#5:3,elapsed:1021.9517,resultType:collection,resultSize:5631}' in 1.030000 sec(s): {"@type":"d","@version":0,"documentReads":5631,"current":"#14:28708870","recordReads":5631,"limit":-1,"fetchingFromTargetElapsed":939,"expandElapsed":0,"evaluated":5631,"user":"#5:3","elapsed":1021.9517,"resultType":"collection","resultSize":5631,"@fieldTypes":"documentReads=l,current=x,recordReads=l,fetchingFromTargetElapsed=l,expandElapsed=l,evaluated=l,user=x,elapsed=f"}

The issue is that there seems to be a slowdown issue when saving the query results via pyorient in results_out, where this would take about 10 mins or more. My questions are: is it possible do querying of large number of edges and save them and if so, do I need to do some optimization to this process to my code (i.e. split the querying into smaller batches and then saving the results)?

sray1 avatar Jun 07 '16 22:06 sray1

hi @sray1 How heavy is the slowdown? I mean from what you pasted the query with orient console took 1021 seconds ~15 min, What it takes to do that with pyorient ? Some overhead is due to the parsing into a python object, but should be not so big.

mogui avatar Jun 23 '16 13:06 mogui

Hi @mogui,

I figured out the reason for the slowdown. For the edge class SemmedRelationship we use for in our db, there is the property SemmedData which is type EmbeddedSet and has Linked_Class to class SemmedDatum. SemmedDatum has 21 properties of strings and integers. Occasionally, some of the edges will have up to 35000 items in SemmedData as you can see below:

orientdb {db=NewNewYork2}> SELECT @rid AS edge_id, out.coding.code AS outcode, out.coding.system AS outsystem,     out.primaryCoding.system AS subject_system, out.primaryCoding.code AS subject_code,     out.primaryName AS subject, predicate, in.primaryCoding.system AS object_system,     in.primaryCoding.code AS object_code, in.primaryName AS object, in.coding.code AS incode,     in.coding.system AS insystem, semmedData.size() AS sDsize, semmedData FROM (SELECT expand(out_SemmedRelationship) FROM V     WHERE (primaryCoding IN [#11:3648344, #11:4112572, #11:1846185, #11:4047548, #11:2548544, #11:3582502, #11:3135954, #11:2876836, #11:3298224, #11:3046971, #11:3059364, #11:1896646, #11:2797524, #11:1854760, #11:2502210, #11:3436577, #11:2117886, #11:1487787, #11:2397909, #11:2179130, #11:2544438, #11:2430850, #11:2382907, #11:2270295, #11:2114527, #11:2899698, #11:2414091, #11:1652791, #11:3083110, #11:2686014, #11:1925427, #11:2625708, #11:2922123, #11:3163160])  OR coding CONTAINS #11:3648344 OR coding CONTAINS #11:4112572 OR coding CONTAINS #11:1846185 OR coding CONTAINS #11:4047548 OR coding CONTAINS #11:2548544 OR coding CONTAINS #11:3582502 OR coding CONTAINS #11:3135954 OR coding CONTAINS #11:2876836 OR coding CONTAINS #11:3298224 OR coding CONTAINS #11:3046971 OR coding CONTAINS #11:3059364 OR coding CONTAINS #11:1896646 OR coding CONTAINS #11:2797524 OR coding CONTAINS #11:1854760 OR coding CONTAINS #11:2502210 OR coding CONTAINS #11:3436577 OR coding CONTAINS #11:2117886 OR coding CONTAINS #11:1487787 OR coding CONTAINS #11:2397909 OR coding CONTAINS #11:2179130 OR coding CONTAINS #11:2544438 OR coding CONTAINS #11:2430850 OR coding CONTAINS #11:2382907 OR coding CONTAINS #11:2270295 OR coding CONTAINS #11:2114527 OR coding CONTAINS #11:2899698 OR coding CONTAINS #11:2414091 OR coding CONTAINS #11:1652791 OR coding CONTAINS #11:3083110 OR coding CONTAINS #11:2686014 OR coding CONTAINS #11:1925427 OR coding CONTAINS #11:2625708 OR coding CONTAINS #11:2922123 OR coding CONTAINS #11:3163160) order by sDsize desc limit 20

----+------+------+------+-------+--------+---------+---------+---------+---------+---------+---------+------------+------------+----------------+------------------
#   |@CLASS|incode|sDsize|outcode|insystem|outsystem|predicate|semmedDat|object_co|edge_id  |subject_c|subject_syst|object_syste|object          |subject           
----+------+------+------+-------+--------+---------+---------+---------+---------+---------+---------+------------+------------+----------------+------------------
0   |null  |[2]   |35525 |[6]    |[2]     |[6]      |PROCES...|[35525]  |116154003|#14:14...|90688005 |http://sn...|http://sn...|Patient (person)|Chronic renal f...
1   |null  |[2]   |35525 |[5]    |[2]     |[5]      |PROCES...|[35525]  |116154003|#14:14...|46177005 |http://sn...|http://sn...|Patient (person)|End stage renal...
2   |null  |[2]   |10073 |[5]    |[2]     |[5]      |PROCES...|[10073]  |116154003|#14:89...|266553002|http://sn...|http://sn...|Patient (person)|Renal failure u...
3   |null  |[2]   |10073 |[5]    |[2]     |[5]      |PROCES...|[10073]  |116154003|#14:89...|42399005 |http://sn...|http://sn...|Patient (person)|Renal failure s...
4   |null  |[3]   |2981  |[6]    |[3]     |[6]      |PROCES...|[2981]   |67822003 |#14:50...|90688005 |http://sn...|http://sn...|Child (person)  |Chronic renal f...
5   |null  |[3]   |2981  |[6]    |[3]     |[6]      |PROCES...|[2981]   |410601007|#14:50...|90688005 |http://sn...|http://sn...|Childhood age...|Chronic renal f...
6   |null  |[3]   |2981  |[5]    |[3]     |[5]      |PROCES...|[2981]   |67822003 |#14:50...|46177005 |http://sn...|http://sn...|Child (person)  |End stage renal...
7   |null  |[3]   |2981  |[5]    |[3]     |[5]      |PROCES...|[2981]   |410601007|#14:50...|46177005 |http://sn...|http://sn...|Childhood age...|End stage renal...
8   |null  |[3]   |1538  |[6]    |[3]     |[6]      |PROCES...|[1538]   |371565004|#14:10...|90688005 |http://sn...|http://sn...|Rattus norveg...|Chronic renal f...
9   |null  |[3]   |1538  |[6]    |[3]     |[6]      |PROCES...|[1538]   |68485006 |#14:10...|90688005 |http://sn...|http://sn...|Old World rat...|Chronic renal f...
10  |null  |[3]   |1538  |[5]    |[3]     |[5]      |PROCES...|[1538]   |371565004|#14:10...|46177005 |http://sn...|http://sn...|Rattus norveg...|End stage renal...
11  |null  |[3]   |1538  |[5]    |[3]     |[5]      |PROCES...|[1538]   |68485006 |#14:10...|46177005 |http://sn...|http://sn...|Old World rat...|End stage renal...
12  |null  |[3]   |680   |[6]    |[3]     |[6]      |PROCES...|[680]    |125676002|#14:17...|90688005 |http://sn...|http://sn...|Person (person) |Chronic renal f...
13  |null  |[3]   |680   |[6]    |[3]     |[6]      |PROCES...|[680]    |257496000|#14:17...|90688005 |http://sn...|http://sn...|Person (person) |Chronic renal f...
14  |null  |[3]   |680   |[5]    |[3]     |[5]      |PROCES...|[680]    |125676002|#14:17...|46177005 |http://sn...|http://sn...|Person (person) |End stage renal...
15  |null  |[3]   |680   |[5]    |[3]     |[5]      |PROCES...|[680]    |257496000|#14:17...|46177005 |http://sn...|http://sn...|Person (person) |End stage renal...
16  |null  |[2]   |623   |[6]    |[2]     |[6]      |PROCES...|[623]    |385435006|#14:13...|90688005 |http://sn...|http://sn...|Individual (p...|Chronic renal f...
17  |null  |[2]   |623   |[5]    |[2]     |[5]      |PROCES...|[623]    |385435006|#14:13...|46177005 |http://sn...|http://sn...|Individual (p...|End stage renal...
18  |null  |[3]   |588   |[5]    |[3]     |[5]      |PROCES...|[588]    |371565004|#14:17...|266553002|http://sn...|http://sn...|Rattus norveg...|Renal failure u...
19  |null  |[3]   |588   |[5]    |[3]     |[5]      |PROCES...|[588]    |68485006 |#14:17...|266553002|http://sn...|http://sn...|Old World rat...|Renal failure u...
----+------+------+------+-------+--------+---------+---------+---------+---------+---------+---------+------------+------------+----------------+------------------

20 item(s) found. Query executed in 6.193 sec(s).

If I run this same query through pyorient, it would take a very long time to save the results pertaining to edges over approximately 1000 embedded items in SemmedData. If I filter out these edges with the most items, the query pertaining to pyorient reduces to only a few seconds or similar to the console performance. Fortunately, the edges with the most items are those which are the least useful ones for our analysis so filtering these works well for us. Please feel free to close this issue.

sray1 avatar Jun 23 '16 23:06 sray1