__cmp__ errors triggered in address_dupe_pairs (dedupe.py)
This appears to be a variation on issue #9: a type mismatch being triggered in postal/utils/enum.py. I am hoping you can provide some input on how to track down the root cause so I can remedy it, or, failing that, some suggestions for how to trap this sort of thing and drop the non-result; in the example below the EMR process ran for ~2 hours before finally failing. The input data should be clean, so I am not sure how to debug these errors.
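To be concrete about what I mean by trapping this and dropping the non-result, something like the following is what I have in mind. It is only a rough sketch: safe_dupe_status, classify_pair and the ((uid1, uid2), value) record shape are placeholders I made up from the traceback, not lieu's actual API.

import logging

def safe_dupe_status(classify_pair, record):
    # record is assumed to look like ((uid1, uid2), value); on a
    # classification error, log the offending pair and drop it rather
    # than letting one bad record kill the whole EMR job
    try:
        return [(record[0], classify_pair(record))]
    except (TypeError, AttributeError) as exc:
        logging.warning("dropping pair %r: %s", record[0], exc)
        return []

# usage in the Spark pipeline (flatMap, so dropped records simply disappear):
# rdd.flatMap(lambda record: safe_dupe_status(classify_pair, record))

Anyway, here is the first failure: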
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
.filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
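The failing line in postal/utils/enum.py just delegates to self.value.__cmp__(other), and under Python 2 long.__cmp__ refuses anything that is not a long. A stand-in class (not postal's real EnumValue, which may do more than this) reproduces the exact TypeError once an enum member gets compared against a tuple:

class StandInEnumValue(object):
    # stand-in only, mimicking the delegation on enum.py line 16
    def __init__(self, value):
        self.value = value
    def __cmp__(self, other):
        return self.value.__cmp__(other)

status = StandInEnumValue(9L)   # 9 is EXACT_DUPLICATE per the verbose log further down
(status, 1.0) in (status, StandInEnumValue(7L))
# TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'

So the membership test itself is fine; the problem is that one side of it is a tuple rather than a bare status value.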
After adding more verbose logging to utils/enum.py, this is what I see:
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
.filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 20, in __cmp__
raise Exception, err
Exception: failed to __cmp__ 'EXACT_DUPLICATE' this: '9' that: '(EXACT_DUPLICATE, 1.0)' error: 'long.__cmp__(x,y) requires y to be a 'long', not a 'tuple''
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
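The thing that jumps out of that message is the right-hand side: the value being compared is '(EXACT_DUPLICATE, 1.0)', i.e. a (dupe class, similarity) pair rather than a bare dupe class, so the lambda's destructuring is binding address_dupe_status to the whole pair. If that reading is right, a defensive unpack in the filter would at least get past it. The sketch below is my guess, not how lieu actually structures these values; duplicate_status is whatever object the existing lambda already imports.

DUPE_CLASSES = (duplicate_status.EXACT_DUPLICATE,
                duplicate_status.LIKELY_DUPLICATE)

def keep_pair(value):
    address_dupe_status, is_sub_building_dupe = value
    if isinstance(address_dupe_status, tuple):
        # tolerate a (dupe_class, similarity) pair; keep only the class
        address_dupe_status = address_dupe_status[0]
    return address_dupe_status in DUPE_CLASSES and is_sub_building_dupe

# .filter(lambda ((uid1, uid2), value): keep_pair(value))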
More example errors and details:
----
dupe class None == NEEDS_REVIEW
a1 '{'house_number': u'400', 'house': u'Daubendiek Karen', 'lon': -121.502926, 'phone': u'+1 916 321 4500', 'postcode': u'95814', 'country': u'US', 'lat': 38.579005, 'road': u'Capitol Mall'}' a2 '{'house_number': u'400', 'house': u'(Simpson Timber Company)', 'lon': -121.502922, 'phone': u'+1 916 492 9616', 'postcode': u'95814', 'country': u'US', 'lat': 38.579029, 'road': u'Capitol Mall'}'
p1 'Country Code: 1 National Number: 9163214500' p2 'Country Code: 1 National Number: 9164929616'
have True
same False
different True
-----
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 420, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 352, in revised_dupe_class
raise Exception, e
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'
More:
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
if dupe_class == None:
File "/usr/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'
This was triggered by the following:
if dupe_class == None:
    print "DUPE CLASS IS NONE"
    return duplicate_status.NEEDS_REVIEW
Now trying if dupe_class.value == None instead, but the error does suggest that something, somewhere, is creating a postal.EnumValue(foo) instance where foo is None...
Once more, this time with type(dupe_class) == types.NoneType, because...
Traceback (most recent call last):
File "/usr/bin/dedupe_geojson", line 420, in <module>
is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
fuzzy_street_name=fuzzy_street_names)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
if dupe_class.value == None:
AttributeError: 'NoneType' object has no attribute 'value'
type(dupe_class) == types.NoneType appears to have fixed (or at least trapped) the problem.
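One footnote on the None case: since an == against one of these enum values ends up in __cmp__ (that is what the trace above shows), any equality test routes through long.__cmp__, which is exactly why if dupe_class == None: blew up. An identity check never touches __cmp__, so it should behave the same as the types.NoneType test I settled on. Quick illustration, again with a stand-in class rather than postal's real EnumValue:

import types

class StandInEnumValue(object):   # same stand-in as earlier, not the real class
    def __init__(self, value):
        self.value = value
    def __cmp__(self, other):
        return self.value.__cmp__(other)

for dupe_class in (None, StandInEnumValue(9L)):
    print(dupe_class is None)                    # never calls __cmp__
    print(type(dupe_class) == types.NoneType)    # also safe; what I ended up using
# by contrast, dupe_class == None routes through __cmp__ and raises
# TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'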