splink
splink copied to clipboard
bug: "can't take logarithm of zero"
What happens?
I think we need to add in some safeguards when calculating log2(*bayes_factors) AS match_weight.
I didn't include a reproducible example, but I think you can see how this would come about:
- if any of the BFs are 0
- or, if they are so small that with floating point errors the multiplication of them is 0
then the arg passed to log2 will be zero.
Two options?
- add in a tiny delta like
LOG2(.00000000000000000000000001 + bf1*bf2*bf3...) - use
CASE WHEN <args> > 0 THEN LOG2(<args>> ELSE "-inf" ENDor similar
OutOfRangeException: Out of Range Error: cannot take logarithm of zero
The above exception was the direct cause of the following exception:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
| /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:1798 in predict │
│ │
│ 1795 │ │ for sql in sqls: │
│ 1796 │ │ │ self._enqueue_sql(sql["sql"], sql["output_table_name"]) │
│ 1797 │ │ │
│ ❱ 1798 │ │ predictions = self._execute_sql_pipeline(input_dataframes) │
│ 1799 │ │ self._predict_warning() │
│ 1800 │ │ │
│ 1801 │ │ [b.drop_materialised_id_pairs_dataframe() for b in exploding_br_with_id_tables] │
│ │
│ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:650 in │
│ _execute_sql_pipeline │
│ │
│ 647 │ │ │ │ │ use_cache, │
│ 648 │ │ │ │ ) │
│ 649 │ │ │ except Exception as e: │
│ ❱ 650 │ │ │ │ raise e │
│ 651 │ │ │ finally: │
│ 652 │ │ │ │ self._pipeline.reset() │
│ 653 │
│ │
│ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:644 in │
│ _execute_sql_pipeline │
│ │
│ 641 │ │ │ output_tablename_templated = self._pipeline.queue[-1].output_table_name │
│ 642 │ │ │ │
│ 643 │ │ │ try: │
│ ❱ 644 │ │ │ │ dataframe = self._sql_to_splink_dataframe_checking_cache( │
│ 645 │ │ │ │ │ sql_gen, │
│ 646 │ │ │ │ │ output_tablename_templated, │
│ 647 │ │ │ │ │ use_cache, │
│ │
│ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:910 in │
│ _sql_to_splink_dataframe_checking_cache │
│ │
│ 907 │ │ │ │ print(df_pd) # noqa: T201 │
│ 908 │ │ │
│ 909 │ │ else: │
│ ❱ 910 │ │ │ splink_dataframe = self._execute_sql_against_backend( │
│ 911 │ │ │ │ sql, output_tablename_templated, table_name_hash │
│ 912 │ │ │ ) │
│ 913 │ │ │ self._intermediate_table_cache.executed_queries.append(splink_dataframe) │
│ │
│ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/duckdb/linker.py:220 in │
│ _execute_sql_against_backend │
│ │
│ 217 │ │ AS │
│ 218 │ │ ({sql}) │
│ 219 │ │ """ │
│ ❱ 220 │ │ self._log_and_run_sql_execution(sql, templated_name, physical_name) │
│ 221 │ │ │
│ 222 │ │ return DuckDBDataFrame(templated_name, physical_name, self) │
│ 223 │
│ │
│ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:724 in │
│ _log_and_run_sql_execution │
│ │
│ 721 │ │ │ except Exception: │
│ 722 │ │ │ │ pass │
│ 723 │ │ │ │
│ ❱ 724 │ │ │ raise SplinkException( │
│ 725 │ │ │ │ f"Error executing the following sql for table " │
│ 726 │ │ │ │ f"`{templated_name}`({physical_name}):\n{final_sql}" │
│ 727 │ │ │ │ f"\n\nError was: {e}" │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
SplinkException: Error executing the following sql for table
`__splink__df_predict`(__splink__df_predict_d7948ec58):
CREATE TABLE __splink__df_predict_d7948ec58 AS
(
WITH __splink__df_concat_with_tf AS (
SELECT
*
FROM __splink__df_concat_with_tf_7a64735cd
), __splink__df_blocked AS (
SELECT
"l"."raw__politician__id" AS "raw__politician__id_l",
"r"."raw__politician__id" AS "raw__politician__id_r",
"l"."politician__first_name" AS "politician__first_name_l",
"r"."politician__first_name" AS "politician__first_name_r",
"l"."tf_politician__first_name" AS "tf_politician__first_name_l",
"r"."tf_politician__first_name" AS "tf_politician__first_name_r",
"l"."politician__first_name_sorted" AS "politician__first_name_sorted_l",
"r"."politician__first_name_sorted" AS "politician__first_name_sorted_r",
"l"."politician__first_name_gender" AS "politician__first_name_gender_l",
"r"."politician__first_name_gender" AS "politician__first_name_gender_r",
"l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l",
"r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r",
"l"."politician__last_name" AS "politician__last_name_l",
"r"."politician__last_name" AS "politician__last_name_r",
"l"."tf_politician__last_name" AS "tf_politician__last_name_l",
"r"."tf_politician__last_name" AS "tf_politician__last_name_r",
"l"."politician__last_name_sorted" AS "politician__last_name_sorted_l",
"r"."politician__last_name_sorted" AS "politician__last_name_sorted_r",
"l"."politician__party" AS "politician__party_l",
"r"."politician__party" AS "politician__party_r",
"l"."politician__state" AS "politician__state_l",
"r"."politician__state" AS "politician__state_r",
"l"."politician__city" AS "politician__city_l",
"r"."politician__city" AS "politician__city_r",
"l"."src" AS "src_l",
"r"."src" AS "src_r",
"l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
'0' AS match_key
FROM __splink__df_concat_with_tf AS l
INNER JOIN __splink__df_concat_with_tf AS r
ON (
(
(
l.src <> r.src
)
AND (
l.politician__first_name_sorted = r.politician__first_name_sorted
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
)
WHERE
l."raw__politician__id" < r."raw__politician__id"
UNION ALL
SELECT
"l"."raw__politician__id" AS "raw__politician__id_l",
"r"."raw__politician__id" AS "raw__politician__id_r",
"l"."politician__first_name" AS "politician__first_name_l",
"r"."politician__first_name" AS "politician__first_name_r",
"l"."tf_politician__first_name" AS "tf_politician__first_name_l",
"r"."tf_politician__first_name" AS "tf_politician__first_name_r",
"l"."politician__first_name_sorted" AS "politician__first_name_sorted_l",
"r"."politician__first_name_sorted" AS "politician__first_name_sorted_r",
"l"."politician__first_name_gender" AS "politician__first_name_gender_l",
"r"."politician__first_name_gender" AS "politician__first_name_gender_r",
"l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l",
"r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r",
"l"."politician__last_name" AS "politician__last_name_l",
"r"."politician__last_name" AS "politician__last_name_r",
"l"."tf_politician__last_name" AS "tf_politician__last_name_l",
"r"."tf_politician__last_name" AS "tf_politician__last_name_r",
"l"."politician__last_name_sorted" AS "politician__last_name_sorted_l",
"r"."politician__last_name_sorted" AS "politician__last_name_sorted_r",
"l"."politician__party" AS "politician__party_l",
"r"."politician__party" AS "politician__party_r",
"l"."politician__state" AS "politician__state_l",
"r"."politician__state" AS "politician__state_r",
"l"."politician__city" AS "politician__city_l",
"r"."politician__city" AS "politician__city_r",
"l"."src" AS "src_l",
"r"."src" AS "src_r",
"l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
'1' AS match_key
FROM __splink__df_concat_with_tf AS l
INNER JOIN __splink__df_concat_with_tf AS r
ON (
(
(
l.src <> r.src
)
AND (
l.politician__last_name_sorted = r.politician__last_name_sorted
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
)
WHERE
l."raw__politician__id" < r."raw__politician__id"
AND NOT (
COALESCE(
(
(
(
l.src <> r.src
)
AND (
l.politician__first_name_sorted = r.politician__first_name_sorted
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
),
FALSE
)
)
UNION ALL
SELECT
"l"."raw__politician__id" AS "raw__politician__id_l",
"r"."raw__politician__id" AS "raw__politician__id_r",
"l"."politician__first_name" AS "politician__first_name_l",
"r"."politician__first_name" AS "politician__first_name_r",
"l"."tf_politician__first_name" AS "tf_politician__first_name_l",
"r"."tf_politician__first_name" AS "tf_politician__first_name_r",
"l"."politician__first_name_sorted" AS "politician__first_name_sorted_l",
"r"."politician__first_name_sorted" AS "politician__first_name_sorted_r",
"l"."politician__first_name_gender" AS "politician__first_name_gender_l",
"r"."politician__first_name_gender" AS "politician__first_name_gender_r",
"l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l",
"r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r",
"l"."politician__last_name" AS "politician__last_name_l",
"r"."politician__last_name" AS "politician__last_name_r",
"l"."tf_politician__last_name" AS "tf_politician__last_name_l",
"r"."tf_politician__last_name" AS "tf_politician__last_name_r",
"l"."politician__last_name_sorted" AS "politician__last_name_sorted_l",
"r"."politician__last_name_sorted" AS "politician__last_name_sorted_r",
"l"."politician__party" AS "politician__party_l",
"r"."politician__party" AS "politician__party_r",
"l"."politician__state" AS "politician__state_l",
"r"."politician__state" AS "politician__state_r",
"l"."politician__city" AS "politician__city_l",
"r"."politician__city" AS "politician__city_r",
"l"."src" AS "src_l",
"r"."src" AS "src_r",
"l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
'2' AS match_key
FROM __splink__df_concat_with_tf AS l
INNER JOIN __splink__df_concat_with_tf AS r
ON (
(
(
l.src <> r.src
) AND (
l.politician__state = r.politician__state
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
)
WHERE
l."raw__politician__id" < r."raw__politician__id"
AND NOT (
COALESCE(
(
(
(
l.src <> r.src
)
AND (
l.politician__first_name_sorted = r.politician__first_name_sorted
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
),
FALSE
)
OR COALESCE(
(
(
(
l.src <> r.src
)
AND (
l.politician__last_name_sorted = r.politician__last_name_sorted
)
)
AND (
l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <>
r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)"
)
),
FALSE
)
)
), __splink__df_comparison_vectors AS (
SELECT
"raw__politician__id_l",
"raw__politician__id_r",
"politician__first_name_l",
"politician__first_name_r",
"politician__first_name_sorted_l",
"politician__first_name_sorted_r",
"politician__first_name_gender_l",
"politician__first_name_gender_r",
CASE
WHEN "politician__first_name_l" IS NULL OR "politician__first_name_r" IS NULL
THEN -1
WHEN politician__first_name_l[1] = politician__first_name_r[1]
AND (
LENGTH(politician__first_name_r) = 1 OR LENGTH(politician__first_name_l) = 1
)
THEN 5
WHEN "politician__first_name_l" = "politician__first_name_r"
THEN 4
WHEN ARE_ALIASES(politician__first_name_l, politician__first_name_r)
THEN 3
WHEN (
DAMERAU_LEVENSHTEIN("politician__first_name_l", "politician__first_name_r") <= 1
)
OR (
(
DAMERAU_LEVENSHTEIN("politician__first_name_l", "politician__first_name_r") <= 2
)
AND (
STRLEN(politician__first_name_l) >= 5
)
)
OR (
(
(
CONTAINS(politician__first_name_l, politician__first_name_r)
AND LENGTH(politician__first_name_r) >= 3
)
OR (
CONTAINS(politician__first_name_r, politician__first_name_l)
AND LENGTH(politician__first_name_l) >= 3
)
)
)
OR (
"politician__first_name_sorted_l" = "politician__first_name_sorted_r"
)
THEN 2
WHEN politician__first_name_gender_l = politician__first_name_gender_r
THEN 1
ELSE 0
END AS gamma_politician__first_name,
"tf_politician__first_name_l",
"tf_politician__first_name_r",
"tf_politician__first_name_gender_l",
"tf_politician__first_name_gender_r",
"politician__last_name_l",
"politician__last_name_r",
"politician__last_name_sorted_l",
"politician__last_name_sorted_r",
CASE
WHEN "politician__last_name_l" IS NULL OR "politician__last_name_r" IS NULL
THEN -1
WHEN "politician__last_name_l" = "politician__last_name_r"
THEN 2
WHEN (
DAMERAU_LEVENSHTEIN("politician__last_name_l", "politician__last_name_r") <= 1
)
OR (
(
DAMERAU_LEVENSHTEIN("politician__last_name_l", "politician__last_name_r") <= 2
)
AND (
STRLEN(politician__last_name_l) >= 5
)
)
OR (
(
(
CONTAINS(politician__last_name_l, politician__last_name_r)
AND LENGTH(politician__last_name_r) >= 3
)
OR (
CONTAINS(politician__last_name_r, politician__last_name_l)
AND LENGTH(politician__last_name_l) >= 3
)
)
)
OR (
"politician__last_name_sorted_l" = "politician__last_name_sorted_r"
)
THEN 1
ELSE 0
END AS gamma_politician__last_name,
"tf_politician__last_name_l",
"tf_politician__last_name_r",
"politician__party_l",
"politician__party_r",
CASE
WHEN "politician__party_l" = "politician__party_r"
THEN 1
WHEN "politician__party_l" IS NULL OR "politician__party_r" IS NULL
THEN -1
ELSE 0
END AS gamma_party,
"politician__state_l",
"politician__state_r",
"politician__city_l",
"politician__city_r",
CASE
WHEN (
"politician__city_l" IS NULL OR "politician__city_r" IS NULL
)
AND (
"politician__state_l" IS NULL OR "politician__state_r" IS NULL
)
THEN -1
WHEN (
"politician__city_l" = "politician__city_r"
)
AND (
"politician__state_l" = "politician__state_r"
)
THEN 2
WHEN (
"politician__city_l" = "politician__city_r"
)
OR (
"politician__state_l" = "politician__state_r"
)
THEN 1
ELSE 0
END AS gamma_city_state,
"src_l",
"src_r",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
match_key
FROM __splink__df_blocked
), __splink__df_match_weight_parts AS (
SELECT
"raw__politician__id_l",
"raw__politician__id_r",
"politician__first_name_l",
"politician__first_name_r",
"politician__first_name_sorted_l",
"politician__first_name_sorted_r",
"politician__first_name_gender_l",
"politician__first_name_gender_r",
gamma_politician__first_name,
"tf_politician__first_name_l",
"tf_politician__first_name_r",
"tf_politician__first_name_gender_l",
"tf_politician__first_name_gender_r",
CASE
WHEN gamma_politician__first_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_politician__first_name = 5
THEN CAST(1.0633529488606062e-180 AS DOUBLE)
WHEN gamma_politician__first_name = 4
THEN CAST(134.4620704350685 AS DOUBLE)
WHEN gamma_politician__first_name = 3
THEN CAST(16.68908639059793 AS DOUBLE)
WHEN gamma_politician__first_name = 2
THEN CAST(9.78866591115e-313 AS DOUBLE)
WHEN gamma_politician__first_name = 1
THEN CAST(2.0336191245597566e-24 AS DOUBLE)
WHEN gamma_politician__first_name = 0
THEN CAST(0.31856082028220833 AS DOUBLE)
END AS bf_politician__first_name,
CASE
WHEN gamma_politician__first_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_politician__first_name = 5
THEN CAST(1 AS DOUBLE)
WHEN gamma_politician__first_name = 4
THEN (
CASE
WHEN NOT COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") IS NULL
THEN POWER(
CAST(0.006170059161530446 AS DOUBLE) / (
CASE
WHEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") >=
COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l")
THEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r")
ELSE COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l")
END
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_politician__first_name = 3
THEN (
CASE
WHEN NOT COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") IS NULL
THEN POWER(
CAST(0.006170059161530446 AS DOUBLE) / (
CASE
WHEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") >=
COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l")
THEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r")
ELSE COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l")
END
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_politician__first_name = 2
THEN CAST(1 AS DOUBLE)
WHEN gamma_politician__first_name = 1
THEN (
CASE
WHEN NOT COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r") IS NULL
THEN POWER(
CAST(0.6262998096190533 AS DOUBLE) / (
CASE
WHEN COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r") >=
COALESCE("tf_politician__first_name_gender_r", "tf_politician__first_name_gender_l")
THEN COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r")
ELSE COALESCE("tf_politician__first_name_gender_r", "tf_politician__first_name_gender_l")
END
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_politician__first_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_politician__first_name,
"politician__last_name_l",
"politician__last_name_r",
"politician__last_name_sorted_l",
"politician__last_name_sorted_r",
gamma_politician__last_name,
"tf_politician__last_name_l",
"tf_politician__last_name_r",
CASE
WHEN gamma_politician__last_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_politician__last_name = 2
THEN CAST(2072.5200422351572 AS DOUBLE)
WHEN gamma_politician__last_name = 1
THEN CAST(22.29208621508213 AS DOUBLE)
WHEN gamma_politician__last_name = 0
THEN CAST(1.033614963801202e-59 AS DOUBLE)
END AS bf_politician__last_name,
CASE
WHEN gamma_politician__last_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_politician__last_name = 2
THEN (
CASE
WHEN NOT COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r") IS NULL
THEN POWER(
CAST(0.0004647912544256984 AS DOUBLE) / (
CASE
WHEN COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r") >=
COALESCE("tf_politician__last_name_r", "tf_politician__last_name_l")
THEN COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r")
ELSE COALESCE("tf_politician__last_name_r", "tf_politician__last_name_l")
END
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_politician__last_name = 1
THEN CAST(1 AS DOUBLE)
WHEN gamma_politician__last_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_politician__last_name,
"politician__party_l",
"politician__party_r",
gamma_party,
CASE
WHEN gamma_party = 1
THEN CAST(3.139473469428194 AS DOUBLE)
WHEN gamma_party = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_party = 0
THEN CAST(0.08290316547026433 AS DOUBLE)
END AS bf_party,
"politician__state_l",
"politician__state_r",
"politician__city_l",
"politician__city_r",
gamma_city_state,
CASE
WHEN gamma_city_state = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_city_state = 2
THEN CAST(583.62425872987 AS DOUBLE)
WHEN gamma_city_state = 1
THEN CAST(5.782229188377593 AS DOUBLE)
WHEN gamma_city_state = 0
THEN CAST(0.018929761894094878 AS DOUBLE)
END AS bf_city_state,
"src_l",
"src_r",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
match_key
FROM __splink__df_comparison_vectors
)
SELECT
LOG2(
CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name *
bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state
) AS match_weight,
CASE
WHEN bf_politician__first_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_politician__first_name = CAST('infinity' AS DOUBLE)
OR bf_politician__last_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_politician__last_name = CAST('infinity' AS DOUBLE)
OR bf_party = CAST('infinity' AS DOUBLE)
OR bf_city_state = CAST('infinity' AS DOUBLE)
THEN 1.0
ELSE (
CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name *
bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state
) / (
1 + (
CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name *
bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state
)
)
END AS match_probability,
"raw__politician__id_l",
"raw__politician__id_r",
"politician__first_name_l",
"politician__first_name_r",
"politician__first_name_sorted_l",
"politician__first_name_sorted_r",
"politician__first_name_gender_l",
"politician__first_name_gender_r",
gamma_politician__first_name,
"tf_politician__first_name_l",
"tf_politician__first_name_r",
"tf_politician__first_name_gender_l",
"tf_politician__first_name_gender_r",
bf_politician__first_name,
bf_tf_adj_politician__first_name,
"politician__last_name_l",
"politician__last_name_r",
"politician__last_name_sorted_l",
"politician__last_name_sorted_r",
gamma_politician__last_name,
"tf_politician__last_name_l",
"tf_politician__last_name_r",
bf_politician__last_name,
bf_tf_adj_politician__last_name,
"politician__party_l",
"politician__party_r",
gamma_party,
bf_party,
"politician__state_l",
"politician__state_r",
"politician__city_l",
"politician__city_r",
gamma_city_state,
bf_city_state,
"src_l",
"src_r",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l",
"DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r",
match_key
FROM __splink__df_match_weight_parts
WHERE
LOG2(
CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name *
bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state
) >= -26.57542474467195
ORDER BY
1 NULLS LAST
)
Error was: Out of Range Error: cannot take logarithm of zero
To Reproduce
sorry, if you really want I can come up with something, but I think you might be better able to come up with it than me.
OS:
duckdb
Splink version:
4.9.11
Have you tried this on the latest master branch?
- no I haven't
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- no
Yeah, I agree. I think (but not sure, haven't thought that hard) that this is the same as the infinity protection. Which would lead me to think the case statement would be the best option (just for symmetry)
I have wondered before whether there's a third option (which also could potentially deal with the infinity case statements) of having this logic on the python side rather than in the sql.
e.g. could we clamp the value of the bf_ to between some values (e.g. between 1e-100, 1e100 or something), and issue a warning
Another thing I've wondered a bit about is whether we should move to the bf_ being match weights in the sql, which would then be additive. That would at least avoid (possibly?) the floating point issue.
Hey attempting to test out this package and I am receiving this error. I can see the columns that make up this function call failure in sql, but what exactly would be the way to research and address this problem?
This is occurring on my first test dataset, so I do not know how to address it or if it is incorrect column profiling.
Hi there, I am also running into this issue, using the duckdb backend. Does anyone know a workaround to fix this issue, at least for the moment? I understand where the issue comes from based on the discussion above, but do not know where to start with the suggested solutions, e.g. adding a tiny delta to prevent log(0).
A simple workaround fix is duplicate a single row with a new unique id. Just one row that is an exact match to another. For me that was enough to fix it.
It'd be great if someone could find a reprex for this issue. I've not actually encountered it myself. not doubting it exists - I suspect it happens with data of a certain type that we don't usually encounter, possibly such as certain values having no dupes (as vfrank66 alludes to).
If not a reprex, @JohnHenningsen are you able to post a screenshot of the match weight charts - it's possible that provides some insights...
In any case, we should hopefully be able to get round to fixing fairly soon, once Splink 4 is released (which has been absorbing most of our time for some months now)
@RobinL Im just re-reading your original response, and yes I think we should switch to combining match weights additively, otherwise Im pretty sure we will run into floating point errors. So that might make this whole thing moot?
Yeah, agree, it's definitely the right solution. The problem is it's quite a big job because all the visualisations and dashboards expect data in the current format
Thanks for the helpful suggestions everyone! Unfortunately our cluster is down at the moment but I will try the simple workaround and help reproduce this issue as soon as possible.
To give a bit of context, aside from a few columns of categorical data we are relying on a product description column to match records. That column contains a 3-10 word string, and we came up with some custom comparisons based on array_intersect. It is quite likely that there are no exact matches, as we have high variance in the data entry for this column.