bigquery-utils
bigquery-utils copied to clipboard
Bug: Jaccard UDF giving incorrect results in some cases
I found a bug in the the Jaccard distance function https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/jaccard.sqlx (the function below is a copy of the Jaccard UDF, cast as a TEMP FUNCTION)
CREATE TEMP FUNCTION Jaccard(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
let intersectSize = 0;
if(a && b) {
// Split word into characters, and make characters unique
const sa = [...new Set(String(a).split(''))]
const sb = [...new Set(String(b).split(''))]
const la = (sa.length > sb.length) ? sa.length : sb.length
const lb = (sa.length > sb.length) ? sb.length : sa.length
for(let i = 0; i < la; i++) {
for(let j = 0; j < lb; j++) {
intersectSize += (sa[i] === sb[j]) ? 1 : 0
}
}
// Compute Jaccard distance
const union = (sa.length + sb.length) - intersectSize
return parseFloat(((intersectSize / union) * 100 + Number.EPSILON) / 100).toFixed(2)
} else {
return 0.0
}
""";
with datas as (
select
'12' as a,
'132' as b
)
select
Jaccard(a, b)
from datas
This should give 2/3 = 0.667 (since here there the intersection is size 2, and the union is size 3), but instead gives 0.25.
I believe the bug can be fixed by replacing
const la = (sa.length > sb.length) ? sa.length : sb.length
const lb = (sa.length > sb.length) ? sb.length : sa.length
with
const la = sa.length
const lb = sb.length
With the current version of these lengths, the array indices become out of bounds.
It might be possible to further optimize and generalize this function to arrays of any types, rather than just strings by casting it in SQL
create temp function `Jaccard`(a ANY TYPE, b ANY TYPE)
returns FLOAT64 as (
(select count(distinct agrp) from unnest(a) as agrp inner join unnest(b) as bgrp on agrp = bgrp)/
(select count(1) from (select * from unnest(a) union distinct select * from unnest(b)))
);
with datas as (
select
['1', '2'] as a,
['1', '3', '2'] as b,
[1, 2] as a_int,
[1, 3, 2] as b_int
)
select
Jaccard(a, b),
Jaccard(a_int, b_int)
from datas
(See https://stackoverflow.com/a/36705483)