__cmp__ errors triggered in address_dupe_pairs (dedupe.py)
This appears to be a variation on issue #9: a type mismatch being triggered in postal/utils/enum.py. I am hoping you can provide some input on how to track down the root cause so I can remedy it, or, failing that, some suggestions for how to trap this sort of thing and drop the non-result; in the example below the EMR process ran for ~2 hours before finally failing. The input data should be clean, so I am not sure how to debug these errors.
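To be concrete about what I mean by trapping this and dropping the non-result, something like the following is what I have in mind. It is only a rough sketch: safe_dupe_status, classify_pair and the ((uid1, uid2), value) record shape are placeholders I made up from the traceback, not lieu's actual API.

import logging

def safe_dupe_status(classify_pair, record):
    # record is assumed to look like ((uid1, uid2), value); on a
    # classification error, log the offending pair and drop it rather
    # than letting one bad record kill the whole EMR job
    try:
        return [(record[0], classify_pair(record))]
    except (TypeError, AttributeError) as exc:
        logging.warning("dropping pair %r: %s", record[0], exc)
        return []

# usage in the Spark pipeline (flatMap, so dropped records simply disappear):
# rdd.flatMap(lambda record: safe_dupe_status(classify_pair, record))

Anyway, here is the first failure: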
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
.filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
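The failing line in postal/utils/enum.py just delegates to self.value.__cmp__(other), and under Python 2 long.__cmp__ refuses anything that is not a long. A stand-in class (not postal's real EnumValue, which may do more than this) reproduces the exact TypeError once an enum member gets compared against a tuple:

class StandInEnumValue(object):
    # stand-in only, mimicking the delegation on enum.py line 16
    def __init__(self, value):
        self.value = value
    def __cmp__(self, other):
        return self.value.__cmp__(other)

status = StandInEnumValue(9L)   # 9 is EXACT_DUPLICATE per the verbose log further down
(status, 1.0) in (status, StandInEnumValue(7L))
# TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'

So the membership test itself is fine; the problem is that one side of it is a tuple rather than a bare status value.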
After adding more verbose logging to utils/enum.py, this is what I see:
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
.filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 20, in __cmp__
raise Exception, err
Exception: failed to __cmp__ 'EXACT_DUPLICATE' this: '9' that: '(EXACT_DUPLICATE, 1.0)' error: 'long.__cmp__(x,y) requires y to be a 'long', not a 'tuple''
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
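The thing that jumps out of that message is the right-hand side: the value being compared is '(EXACT_DUPLICATE, 1.0)', i.e. a (dupe class, similarity) pair rather than a bare dupe class, so the lambda's destructuring is binding address_dupe_status to the whole pair. If that reading is right, a defensive unpack in the filter would at least get past it. The sketch below is my guess, not how lieu actually structures these values; duplicate_status is whatever object the existing lambda already imports.

DUPE_CLASSES = (duplicate_status.EXACT_DUPLICATE,
                duplicate_status.LIKELY_DUPLICATE)

def keep_pair(value):
    address_dupe_status, is_sub_building_dupe = value
    if isinstance(address_dupe_status, tuple):
        # tolerate a (dupe_class, similarity) pair; keep only the class
        address_dupe_status = address_dupe_status[0]
    return address_dupe_status in DUPE_CLASSES and is_sub_building_dupe

# .filter(lambda ((uid1, uid2), value): keep_pair(value))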
More example errors and details:
----
dupe class None == NEEDS_REVIEW
a1 '{'house_number': u'400', 'house': u'Daubendiek Karen', 'lon': -121.502926, 'phone': u'+1 916 321 4500', 'postcode': u'95814', 'country': u'US', 'lat': 38.579005, 'road': u'Capitol Mall'}' a2 '{'house_number': u'400', 'house': u'(Simpson Timber Company)', 'lon': -121.502922, 'phone': u'+1 916 492 9616', 'postcode': u'95814', 'country': u'US', 'lat': 38.579029, 'road': u'Capitol Mall'}'
p1 'Country Code: 1 National Number: 9163214500' p2 'Country Code: 1 National Number: 9164929616'
have True
same False
different True
-----
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 420, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 352, in revised_dupe_class
raise Exception, e
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'
More:
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
if dupe_class == None:
File "/usr/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'
This was triggered by the following:
if dupe_class == None:
    print "DUPE CLASS IS NONE"
    return duplicate_status.NEEDS_REVIEW
Now trying if dupe_class.value == None instead, but the error does suggest that something, somewhere, is creating a postal.EnumValue(foo) instance where foo is None...
Once more, this time with type(dupe_class) == types.NoneType, because...
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
if dupe_class.value == None:
AttributeError: 'NoneType' object has no attribute 'value'
type(dupe_class) == types.NoneType appears to have fixed (or at least trapped) the problem.
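One footnote on the None case: since an == against one of these enum values ends up in __cmp__ (that is what the trace above shows), any equality test routes through long.__cmp__, which is exactly why if dupe_class == None: blew up. An identity check never touches __cmp__, so it should behave the same as the types.NoneType test I settled on. Quick illustration, again with a stand-in class rather than postal's real EnumValue:

import types

class StandInEnumValue(object):   # same stand-in as earlier, not the real class
    def __init__(self, value):
        self.value = value
    def __cmp__(self, other):
        return self.value.__cmp__(other)

for dupe_class in (None, StandInEnumValue(9L)):
    print(dupe_class is None)                    # never calls __cmp__
    print(type(dupe_class) == types.NoneType)    # also safe; what I ended up using
# by contrast, dupe_class == None routes through __cmp__ and raises
# TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'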