
spark-lucenerdd-examples

Examples of spark-lucenerdd.

Datasets and Entity Linkage

The following pairs of datasets are used to demonstrate the accuracy/quality of the record linkage methods. Note that the goal here is to demonstrate the user-friendliness of the spark-lucenerdd library; no optimization is attempted.

| Dataset                   | Domain        | Attributes                           | Accuracy (top-1) | References                               |
|---------------------------|---------------|--------------------------------------|------------------|------------------------------------------|
| DBLP vs ACM (articles)    | Bibliographic | title, authors, venue, year          | 0.98             | Benchmark datasets for entity resolution |
| DBLP vs Scholar (articles)| Bibliographic | title, authors, venue, year          | 0.953            | Benchmark datasets for entity resolution |
| Amazon vs Google (products)| E-commerce   | name, description, manufacturer, price | 0.58           | Benchmark datasets for entity resolution |
| Abt vs Buy (products)     | E-commerce    | name, description, manufacturer, price | 0.64           | Benchmark datasets for entity resolution |

The accuracy reported above is computed by selecting the first result from the top-K list of results as the linked entity.

All datasets are available in Spark-friendly Parquet format here; the original datasets are available here.
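Conceptually, linking two datasets with spark-lucenerdd amounts to indexing one dataset and turning each record of the other into a Lucene query string. The sketch below shows only that query-generator piece as plain Scala; the `Article` case class and the escaping rule are illustrative assumptions, not the library's API.

```scala
// Hypothetical bibliographic record, mirroring the DBLP/ACM attributes above.
case class Article(title: String, authors: String, venue: String, year: String)

// Build a Lucene query string from one record. A function of this shape
// (record => query string) is what a linkage call would consume.
// The escaping below is deliberately simplistic: it blanks out Lucene
// special characters rather than escaping them properly.
def linkageQuery(a: Article): String = {
  val cleanTitle = a.title.replaceAll("""[+\-&|!(){}\[\]^"~*?:\\/]""", " ").trim
  s"title:($cleanTitle)"
}

println(linkageQuery(Article("Efficient query processing", "J Doe", "VLDB", "1999")))
// title:(Efficient query processing)
```

In practice one would also weight or combine the other attributes (authors, venue, year) in the query; the accuracy figures above come from the repository's own query generators, not this sketch.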

Spatial linkage between countries and capitals

This example loads all countries from a Parquet file containing the fields "name" and "shape" (shapes are mostly polygons in WKT):

val allCountries = spark.read.parquet("data/spatial/countries-poly.parquet")

Then, it loads all capitals from a Parquet file containing the fields "name" and "shape" (shapes are mostly points in WKT):

val capitals = spark.read.parquet("data/spatial/capitals.parquet")

A ShapeLuceneRDD instance is created on the countries and a linkageByRadius is performed on the capitals. The output is presented in the logs.
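The flow can be sketched as follows. The `toWktPoint` helper is a hypothetical illustration of how a capital's coordinates are encoded in the "shape" field; the `ShapeLuceneRDD` calls are shown only as comments because they require spark-lucenerdd on the classpath, and the exact signatures should be checked against the library's API.

```scala
// Hypothetical helper: encode a longitude/latitude pair as a WKT point,
// the format expected in the capitals' "shape" field.
def toWktPoint(lon: Double, lat: Double): String = s"POINT ($lon $lat)"

// With spark-lucenerdd available, the linkage would look roughly like:
//   val shapes = ShapeLuceneRDD(allCountries)   // index country polygons
//   val linked = shapes.linkageByRadius(capitals, ...)  // query each capital point
// (argument names/order are assumptions; see the ShapeLuceneRDD API)

println(toWktPoint(23.72, 37.98))
// POINT (23.72 37.98)
```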

Development

Usage (spark-submit)

Install Java, SBT and clone the project

git clone https://github.com/zouzias/spark-lucenerdd-examples.git
cd spark-lucenerdd-examples
sbt compile assembly

Download and extract Apache Spark under your home directory, update the spark-submit.sh script accordingly, and run

./spark-linkage-*.sh

to run the record linkage examples, and ./spark-search-capitals.sh to run a search example.
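The helper scripts typically wrap a spark-submit invocation along these lines. This is a configuration sketch only: the main class name and the jar path are illustrative assumptions, so check the repository's scripts for the actual values.

```shell
# Illustrative spark-submit invocation; class name and jar path are assumptions.
SPARK_HOME="$HOME/spark"
"$SPARK_HOME/bin/spark-submit" \
  --master "local[*]" \
  --class org.zouzias.spark.lucenerdd.examples.linkage.LinkageACMvsDBLP \
  target/scala-*/spark-lucenerdd-examples-assembly-*.jar
```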

Usage (docker)

Set up Docker and, assuming that you have a docker-machine named default, type

./startZeppelin.sh

to start an Apache Zeppelin instance with preloaded notebooks.