Enhancements to libpostal integration: Fetch model from HDFS/object store

Open jornfranke opened this issue 3 months ago • 0 comments

Thank you for the great integration of libpostal described in https://github.com/apache/sedona/issues/2074

I would like to propose an enhancement that makes it more usable in enterprise environments. There, a Spark cluster usually has no Internet connectivity, and users have no direct access to the nodes. This makes the libpostal integration difficult to use, because it needs to download the model from the Internet.

Based on the libpostal integration pull request https://github.com/apache/sedona/pull/2077, I can see that a config "spark.sedona.libpostal.dataDir" is accepted. It defaults to a local temporary directory, because libpostal can only load from a local filesystem.

I propose the following addition: accept a folder on HDFS or an object store (e.g. S3) as the data directory. For a larger job with many nodes, loading the model from HDFS or an object store is much more efficient than loading it from the Internet, which may also be unavailable (no connectivity, server down, etc.).

Since libpostal expects a local directory, I propose the following: if spark.sedona.libpostal.dataDir is set to, for example, "s3a://blabla/libpostal", the integration uses the Hadoop dependency of Spark to list the contents of the dataDir (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), e.g. via

FileSystem.get(URI.create(dataDir), sparkContext.hadoopConfiguration).listFiles(new Path(dataDir), true)
...

It then copies all the files to a local temporary directory using the FileSystem class (if not already done) and points libpostal to it.
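
To make the "if not already done" part concrete, here is a minimal sketch of the caching logic in plain Java. Everything in it is hypothetical and not taken from Sedona or libpostal: the class name `LibpostalDataDir`, the `RemoteCopier` interface, and the `_COPY_COMPLETE` marker file are names I made up for illustration. The actual copy is left behind an interface; in Spark it could be backed by the Hadoop FileSystem class linked above (for example its `copyToLocalFile` method), but that wiring is an assumption and is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, for illustration only; not part of Sedona or libpostal. */
final class LibpostalDataDir {

    /** Stand-in for the real copy step (e.g. one backed by Hadoop's FileSystem). */
    interface RemoteCopier {
        void copyDirectory(URI remoteDir, Path localDir) throws IOException;
    }

    // Written only after a complete copy, so a half-copied directory is never reused.
    private static final String MARKER = "_COPY_COMPLETE";

    /** True when dataDir has a non-local scheme such as hdfs:// or s3a://. */
    static boolean isRemote(String dataDir) {
        int sep = dataDir.indexOf("://");
        return sep > 0 && !dataDir.substring(0, sep).equals("file");
    }

    /**
     * Returns a local directory that libpostal can load from.
     * Local paths are returned as-is; remote ones are copied once into cacheRoot.
     * Synchronized so that concurrent tasks in one JVM do not copy twice.
     */
    static synchronized Path resolve(String dataDir, Path cacheRoot, RemoteCopier copier) {
        if (!isRemote(dataDir)) {
            return dataDir.startsWith("file://")
                    ? Paths.get(URI.create(dataDir))
                    : Paths.get(dataDir);
        }
        // One cache directory per remote location, derived from the URI string.
        Path localDir = cacheRoot.resolve("libpostal-" + Integer.toHexString(dataDir.hashCode()));
        Path marker = localDir.resolve(MARKER);
        if (Files.exists(marker)) {
            return localDir; // already copied by an earlier call
        }
        try {
            Files.createDirectories(localDir);
            copier.copyDirectory(URI.create(dataDir), localDir);
            Files.write(marker, new byte[0]); // mark the copy as complete
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return localDir;
    }
}
```

A caller would pass the configured dataDir and a cache root such as the current temporary directory, then hand the returned path to libpostal. The sketch only guards against repeated copies within a single JVM; several executors on the same node could still copy at the same time, which would need a file lock or similar.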

Additionally, I propose that the Apache Sedona documentation include a small shell script showing how to fetch the data from the Internet, so that a user can upload it to HDFS or an object store (e.g. S3). Maybe something similar to https://github.com/openvenues/libpostal/blob/master/src/libpostal_data.in

@james-willis

jornfranke Sep 23 '25 20:09