[feature request] support gz compressed GeoJSONs

Open indus opened this issue 3 years ago • 2 comments

It would be nice to support gz-compressed GeoJSONs for input and output, for bigger datasets. Using streams, it could look something like this:

const fs = require('fs');
const zlib = require('zlib');

let outStream;
if (this.options.gz) {
    // Compress on the fly and pipe into the .gz file
    outStream = zlib.createGzip();
    outStream.on('error', (err) => console.log(err.stack));
    let writeStream = fs.createWriteStream(`${file}.gz`);
    outStream.pipe(writeStream);
} else {
    outStream = fs.createWriteStream(file);
}

#358 #501

indus avatar Jan 06 '22 11:01 indus

It would probably be more useful to support FlatGeobuf ~which can be a lot smaller~ than geojson.gz.

edit: FlatGeobuf has no built-in compression, but it has a built-in spatial index and it streams well, so performance is likely better in many cases. Another nice thing about FlatGeobuf is that it enforces consistency of geometry types.

http://switchfromshapefile.org/

chapmanjacobd avatar Apr 12 '22 13:04 chapmanjacobd

The tool I feed the data into doesn't support FlatGeobuf, but it does accept gzipped JSONs. So "more useful" depends.

indus avatar Apr 12 '22 14:04 indus

I've changed my mind. tippecanoe now supports FlatGeobuf input. So now this would be useful for me as well ;-)

indus avatar Nov 13 '22 19:11 indus

A demo of how to do it using only the command line. It may not fit your use case, as it's not inside the mapshaper code itself and needs a Unix-like system.

# Get data
wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_2_counties.zip
# uncompress
unzip ne_10m_admin_2_counties.zip
# Convert to GeoJSON
ogr2ogr -f GeoJSON ne_10m_admin_2_counties.geojson ne_10m_admin_2_counties.shp -lco WRITE_NAME=NO -lco RFC7946=YES
# Compress to gz and keep original geojson
gzip -k ne_10m_admin_2_counties.geojson
# Uncompress the gzipped GeoJSON on the fly; use the - argument for input
# and output so mapshaper reads stdin and writes stdout, then compress the result
zcat ne_10m_admin_2_counties.geojson.gz \
        | mapshaper -i - -filter '"ME,VT,NH,MA,CT,RI".indexOf(REGION) > -1' -o - format=geojson \
        | gzip -c > filtered.geojson.gz

# Note that you can use GDAL too to uncompress the gz, e.g. instead
# of the "zcat ne_10m_admin_2_counties.geojson.gz" part.
# Commented below
# ogr2ogr -f GeoJSON /vsistdout /vsigzip/ne_10m_admin_2_counties.geojson.gz ne_10m_admin_2_counties.geojson

ThomasG77 avatar Nov 19 '22 23:11 ThomasG77

@ThomasG77 I tried something similar on Windows but it didn't work for me. I was unable to get an output from mapshaper bigger than 2GB. Gzip would bring down the size of the final file, but it hasn't helped me with mapshaper's limits.

So I've tried to split the output like so:

mapshaper-xl 12GB ./veryLarge_OSM_extract.shp `
<# some processing #>
-each "part = this.id % 5" `
-split part `
-o "./outpath/" extension=.ndjson format=geojson ndjson

This splits the output into multiple parts that can be concatenated easily. But it doesn't work with pipes at all, as the output stream gets closed for every file part, which stops the pipe. Another drawback is that you have to add a property to split on. It would be nice if split allowed an expression (like the one shown for each) on its own. But that's not a big issue for me, as I can strip the temporary property later on.

indus avatar Nov 20 '22 19:11 indus

I just found out that the split function actually supports expressions. So what it boils down to is this:

# (additional processing steps omitted)
mapshaper-xl 12GB ./veryLarge_OSM_extract.shp \
-split "this.id % 5" \
-o "./outpath/" extension=.ndjson format=geojson ndjson
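Conceptually, `-split` with an expression just buckets features by the expression's value and writes one file per bucket. A rough sketch of that idea in plain JavaScript (illustrative data, not mapshaper's internals):

```javascript
// Bucket features by a split expression, one output layer per bucket.
const features = Array.from({ length: 12 }, (_, id) => ({
  type: 'Feature', id, properties: {}, geometry: null,
}));

const parts = new Map();
for (const f of features) {
  const key = f.id % 5;                    // the split expression: this.id % 5
  if (!parts.has(key)) parts.set(key, []);
  parts.get(key).push(f);
}
// Each entry of `parts` would then be written out as its own .ndjson file.
```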

I have added a PR to make this clearer in the REFERENCE.

indus avatar Dec 05 '22 08:12 indus

@mbloch I was working on GZ support for an hour today before I realized that you have implemented this just yesterday 😅 Thank you!

indus avatar Dec 07 '22 15:12 indus

@indus I added GZ support in the simplest possible way, and there is room for improvement. The web interface doesn't support .gz files. And mapshaper uncompresses the entire file into memory, which limits the uncompressed file size to ~2GB (I think, typically). Mapshaper is able to read uncompressed CSV and JSON files incrementally, which means that you can load larger files if they are uncompressed.

mbloch avatar Dec 07 '22 22:12 mbloch

I've just seen these drawbacks as well. For writing, I find it a huge improvement to have an option to gzip the output. Reading is actually limited to 512 MB: Buffer.toString() throws an error when the string gets bigger.

indus avatar Dec 07 '22 23:12 indus

Here is a description of the error: https://cmdcolin.github.io/posts/2021-10-30-spooky https://stackoverflow.com/questions/68230031/cannot-create-a-string-longer-than-0x1fffffe8-characters-in-json-parse

indus avatar Dec 07 '22 23:12 indus

Of course you're right, it's 512 MB... I'll look into increasing that limit.

mbloch avatar Dec 07 '22 23:12 mbloch

I published an update that increases the maximum uncompressed size of gzipped GeoJSON and CSV files. The new maximum should be around 2GB (the max size of a Buffer in most environments, the last time I checked). After the update, I was able to import a gzipped 1.82GB GeoJSON file.

mbloch avatar Dec 08 '22 01:12 mbloch