
tune `module.shard.complexity`

missinglink opened this issue 5 months ago · 10 comments

Each geometry is split into 'shards', where each shard contains a maximum of n vertices. This process is used to ensure that PIP operations are fast over large polygons.
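
For intuition, here's a minimal sketch (plain JavaScript) of why sharding speeds up PIP: candidate shards are first narrowed by a cheap bounding-box check (presumably an index lookup in the real database), and the exact point-in-ring test then only walks a shard's ≤ n vertices rather than the whole polygon. All names and the data layout below are illustrative, not the project's actual API:

```js
// illustrative sketch only; not the pelias/spatial implementation.
// A 'shard' here is a small polygon ring plus its bounding box.

// standard ray-casting point-in-ring test; ring = [[lon, lat], ...]
function pointInRing(x, y, ring) {
  let inside = false
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i]
    const [xj, yj] = ring[j]
    if ((yi > y) !== (yj > y) && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside
    }
  }
  return inside
}

// cheap bbox prefilter first, then the exact test only walks
// a shard's <= n vertices instead of the full polygon
function pip(shards, x, y) {
  return shards.filter((s) => {
    const [minX, minY, maxX, maxY] = s.bbox
    if (x < minX || x > maxX || y < minY || y > maxY) return false
    return pointInRing(x, y, s.ring)
  })
}

// usage: two toy shards belonging to one geometry
const shards = [
  { id: 'nz:1/0', bbox: [0, 0, 1, 1], ring: [[0, 0], [1, 0], [1, 1], [0, 1]] },
  { id: 'nz:1/1', bbox: [1, 0, 2, 1], ring: [[1, 0], [2, 0], [2, 1], [1, 1]] },
]
console.log(pip(shards, 1.5, 0.5).map((s) => s.id)) // -> [ 'nz:1/1' ]
```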

The current default is 200. This results in many shards being created; as a result, over 50% of the entire database file can be taken up by the shard table.

The official sqlite3_analyzer tool can be used to display the byte usage per table/index. The following diff shows the difference between using the default setting of 200 and a setting of 2000:

.rw-r--r--@ 2.1Gi peter 11 Jul 16:22 -I  zcta.spatial.shard200.db
.rw-r--r--@ 1.8Gi peter 11 Jul 16:21 -I  zcta.spatial.shard500.db
.rw-r--r--@ 1.7Gi peter 11 Jul 16:03 -I  zcta.spatial.shard2000.db
.rw-r--r--@ 1.6Gi peter 11 Jul 16:20 -I  zcta.spatial.shard100000.db
[image: sqlite3_analyzer byte-usage comparison between shard settings]
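
If sqlite3_analyzer isn't to hand, a rough per-table/index byte count can also be pulled from SQLite's dbstat virtual table. Below is a sketch using the better-sqlite3 package; it assumes the bundled SQLite was compiled with SQLITE_ENABLE_DBSTAT_VTAB, which isn't guaranteed:

```js
// usage: node dbstat.js zcta.spatial.shard200.db   (script name illustrative)
// rough per-table/index byte usage via SQLite's dbstat virtual table
// (requires a SQLite build compiled with SQLITE_ENABLE_DBSTAT_VTAB)
const Database = require('better-sqlite3')

const db = new Database(process.argv[2], { readonly: true })
const rows = db
  .prepare('SELECT name, SUM(pgsize) AS bytes FROM dbstat GROUP BY name ORDER BY bytes DESC')
  .all()

for (const { name, bytes } of rows) {
  console.log(`${(bytes / 1024 / 1024).toFixed(1).padStart(9)} MiB  ${name}`)
}
```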

This issue is open to explore this default setting further. Each dataset will have a different 'sweet spot', with datasets containing many large polygons being more impacted.

If increasing this setting is shown to have negligible impact on PIP performance then it would be ideal to increase it in order to reduce disk usage and make page caching more effective.

missinglink avatar Jul 11 '25 14:07 missinglink

The setting is already configurable here and overridable via the CLI with a command such as ... import --tweak_module_shard_complexity=2000 zcta

missinglink avatar Jul 11 '25 14:07 missinglink

I will try to do some stress tests next week using https://github.com/jawg/pelias-server-stress and your 4 examples.

Should I use only zcta data? Because I feel like WOF polygons aren't that detailed.

Joxit avatar Jul 12 '25 17:07 Joxit

I just used zcta because it's small and builds in like a minute on my laptop. My latest PR has some new notes on how to generate the data, and code updates for the latest data version, if you'd like to use that too.

WOF probably provides a better variety of complexity, particularly for the country polygons which are usually pretty complex.

It's one of those settings that's kinda hard to find a universal value for, since each dataset will produce a different average(shards/geometry).

I guess tuning it to WOF would make the most sense.

missinglink avatar Jul 13 '25 10:07 missinglink

Here's a script which can be used to generate a bunch of database variations with different shard complexity settings:

# fan one GeoJSON stream out to five parallel imports via bash process substitution
sqlite3 /data/wof/whosonfirst-data-admin-nz-latest.db 'SELECT json_extract(body, "$") FROM geojson' | tee \
  >(node bin/spatial.js --db=nz.wof.shard200.db import --tweak_module_shard_complexity=200 whosonfirst) \
  >(node bin/spatial.js --db=nz.wof.shard500.db import --tweak_module_shard_complexity=500 whosonfirst) \
  >(node bin/spatial.js --db=nz.wof.shard1000.db import --tweak_module_shard_complexity=1000 whosonfirst) \
  >(node bin/spatial.js --db=nz.wof.shard2000.db import --tweak_module_shard_complexity=2000 whosonfirst) \
  >(node bin/spatial.js --db=nz.wof.shard100000.db import --tweak_module_shard_complexity=100000 whosonfirst) \
  > /dev/null
.rw-r--r--@ 426Mi peter 14 Jul 14:52 -I  nz.wof.shard200.db
.rw-r--r--@ 379Mi peter 14 Jul 14:52 -I  nz.wof.shard500.db
.rw-r--r--@ 360Mi peter 14 Jul 14:52 -I  nz.wof.shard1000.db
.rw-r--r--@ 350Mi peter 14 Jul 14:52 -I  nz.wof.shard2000.db
.rw-r--r--@ 344Mi peter 14 Jul 14:52 -I  nz.wof.shard100000.db

missinglink avatar Jul 14 '25 12:07 missinglink

I did some basic k6 load testing of these files against the more complex /query/pip/_view/pelias endpoint, with the following results:

complexity 200

avg=6.04ms min=1.09ms med=5.28ms max=51.65ms p(90)=9.96ms p(95)=11.62ms
avg=6.19ms min=1.1ms  med=5.25ms max=52.92ms p(90)=10.61ms p(95)=12.66ms
avg=6.13ms min=1.3ms  med=5.31ms max=66.01ms p(90)=10.12ms p(95)=11.84ms

complexity 500

avg=6.25ms min=1.14ms med=5.43ms max=42.13ms p(90)=10.46ms p(95)=12.19ms
avg=6.22ms min=1.25ms med=5.48ms max=44.13ms p(90)=10.27ms p(95)=11.98ms
avg=6.37ms min=1.12ms med=5.58ms max=44.35ms p(90)=10.6ms  p(95)=12.41ms

complexity 1000

avg=6.49ms min=1.32ms med=5.73ms max=46.09ms p(90)=10.82ms p(95)=12.56ms
avg=6.61ms min=1.28ms med=5.76ms max=66.12ms p(90)=10.89ms p(95)=12.79ms
avg=6.55ms min=1.2ms  med=5.8ms  max=48.95ms p(90)=10.7ms  p(95)=12.56ms

complexity 2000

avg=6.68ms min=1.4ms  med=5.82ms max=72.2ms  p(90)=11.11ms p(95)=12.95ms
avg=6.73ms min=1.24ms med=5.91ms max=56.34ms p(90)=11.11ms p(95)=13.23ms
avg=6.76ms min=1.39ms med=5.91ms max=42.2ms  p(90)=11.18ms p(95)=13.25ms

complexity 100000 (i.e. nearly every polygon is a single shard)

avg=10.79ms min=2.29ms med=9.54ms max=49.67ms p(90)=18.24ms p(95)=20.91ms
avg=10.97ms min=2.5ms  med=9.38ms max=70.35ms p(90)=18.7ms  p(95)=22.65ms
avg=10.67ms min=2.33ms med=9.44ms max=58.37ms p(90)=17.95ms p(95)=20.97ms
k6 run --vus 20 --iterations 10000 load.js

cat load.js
import http from 'k6/http'
const baseurl = 'http://localhost:3000/query/pip/_view/pelias'

// every iteration queries the same fixed point (central Wellington, NZ)
export default function () {
  const lon = '174.77607'
  const lat = '-41.28655'
  http.get(`${baseurl}/${lon}/${lat}`)
}
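
One caveat with the script above: every request hits the same point, so the same shards (and SQLite pages) stay hot in the cache. A variant that spreads requests over a bounding box may give a more pessimistic, and arguably more realistic, reading. The bbox below is a rough illustrative one for New Zealand, not a value from this thread (the Gatling test further down takes a similar per-region bbox approach for France):

```js
import http from 'k6/http'
const baseurl = 'http://localhost:3000/query/pip/_view/pelias'

// rough illustrative bbox around New Zealand (approximate values)
const bbox = { lonMin: 166.5, lonMax: 178.5, latMin: -47.3, latMax: -34.4 }
const rand = (min, max) => min + Math.random() * (max - min)

export default function () {
  // fresh random point each iteration; note many points will fall in the
  // sea and return zero rows, which still exercises the candidate lookup
  const lon = rand(bbox.lonMin, bbox.lonMax).toFixed(5)
  const lat = rand(bbox.latMin, bbox.latMax).toFixed(5)
  http.get(`${baseurl}/${lon}/${lat}`)
}
```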

missinglink avatar Jul 14 '25 13:07 missinglink

As expected, we see the filesize reduce as the shard complexity rises; likewise, we see latency increase, with a cliff at the higher complexities:

Looks like maybe a value between 200 and 1000 offers a nice tradeoff between disk usage and latency, although even a 4.16% performance penalty might be worth spending more disk to avoid 🤷‍♂️ haha, I guess this is why I made it tuneable.

@Joxit interested in seeing if you come to similar conclusions.

| Shard Complexity | File Size | Median Latency (mean of 3 runs) |
|---|---|---|
| 200 | 426Mi | ~5.28ms |
| 500 | 379Mi (-11.1%) | ~5.50ms (+4.16%) |
| 1000 | 360Mi (-15.49%) | ~5.76ms (+9.09%) |
| 2000 | 350Mi (-17.84%) | ~5.88ms (+11.36%) |
| 100000 | 344Mi (-19.24%) | ~9.45ms (+78.97%) |

missinglink avatar Jul 14 '25 13:07 missinglink

Also worth noting: running these benchmarks on my laptop, it's going to be tricky to get a good reading. I turned off info-level logging in pelias.json to avoid seeing the HTTP logs, which IIRC increased performance.

missinglink avatar Jul 14 '25 13:07 missinglink

Some more interesting stats, looking at the average number of shards produced per geometry with the different shard complexity settings:

It shows that even the 100000 setting generates a bunch of shards for the large geometries, presumably the country entry and maybe some of the region features.

| shard complexity | min | avg | max |
|---|---|---|---|
| 200 | 1 | 16.5630773291106 | 2746 |
| 500 | 1 | 7.20797044074954 | 1154 |
| 1000 | 1 | 4.12298759567168 | 700 |
| 2000 | 1 | 2.634336236474 | 567 |
| 100000 | 1 | 1.61084718923199 | 436 |
sqlite3 nz.wof.shard100000.db 'SELECT AVG(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);'
1.61084718923199

sqlite3 nz.wof.shard2000.db 'SELECT AVG(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);'
2.634336236474

sqlite3 nz.wof.shard1000.db 'SELECT AVG(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);'
4.12298759567168

sqlite3 nz.wof.shard500.db 'SELECT AVG(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);'
7.20797044074954

sqlite3 nz.wof.shard200.db 'SELECT AVG(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);'
16.5630773291106

missinglink avatar Jul 14 '25 13:07 missinglink

Gatling Stress Test x Shard Complexity

Hi there, I did some tests with Gatling to check how the shard complexity affects the performance of the Spatial Server.

I built 5 versions of the spatial database (200, 500, 1000, 2000 and 100000) using the --tweak_module_shard_complexity= option on the import command.

Spatial database stats

All versions contain data from France. For my investigation I used the sqlean SQLite extension for the stats_median, stats_p95 and stats_p99 functions, and SpatiaLite for the ST_NPoints function.

Database sizes

These stats are for reference; I also included our spatial database.

SELECT COUNT(*) from shard;

| shard complexity | size | shards |
|---|---|---|
| 200 | 5505M | 1020436 |
| 500 | 5015M | 481531 |
| 1000 | 4892M | 348422 |
| 2000 | 4852M | 302893 |
| 100000 | 4839M | 284247 |
| jawg-world-200 | 58G | 14432731 |

From a space point of view, we notice that starting from 1000 we no longer see a clear difference for this dataset. We might think that staying between 500 and 1000 would be good enough.

Number of shards per geometry

My first stat was the same as yours; I just added some percentiles, since I wanted to check how often sharding is triggered.

How to read the table: "With shard complexity 200, 95% of geometries are composed of 13 or fewer shards, and the biggest polygon is composed of 4414 shards."

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);

| shard complexity | avg | min | 50th | 95th | 99th | max |
|---|---|---|---|---|---|---|
| 200 | 3.91913139995314 | 1 | 1 | 13 | 31 | 4414 |
| 500 | 1.84938914557193 | 1 | 1 | 5 | 12 | 2652 |
| 1000 | 1.33816486348431 | 1 | 1 | 2 | 6 | 2139 |
| 2000 | 1.16330418284538 | 1 | 1 | 1 | 3 | 1834 |
| 100000 | 1.09169153483656 | 1 | 1 | 1 | 2 | 1572 |
| jawg-world-200 | 1.5722054216771 | 1 | 1 | 2 | 12 | 16300 |

I was quite surprised by this result; I thought something had gone wrong when I saw the 75th percentile at 1, which is why I replaced it with the 95th and above. My hunch now is that 1000 might be a bit too high and could make the benefit of using shards useless.

A new exercise, where I excluded simple geometries. How to read the table: "With shard complexity 200, 95% of the geometries composed of at least 2 shards are composed of 19 or fewer shards, and the biggest polygon is composed of 4414 shards."

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id HAVING COUNT(*) > 1);

| shard complexity | avg | min | 50th | 95th | 99th | max |
|---|---|---|---|---|---|---|
| 200 | 6.99740397058359 | 2 | 4 | 19 | 47 | 4414 |
| 500 | 5.25934556940084 | 2 | 3 | 12 | 38 | 2652 |
| 1000 | 5.10140674492268 | 2 | 2 | 11 | 44 | 2139 |
| 2000 | 6.36394600731677 | 2 | 2 | 16 | 71 | 1834 |
| 100000 | 7.67430807939614 | 2 | 2 | 23 | 102 | 1572 |
| jawg-world-200 | 10.7230934399526 | 2 | 4 | 32 | 95 | 16300 |

Now the averages for 500+ are way higher, especially in our world dataset.

Number of points per shard

Now I wanted to see how many shards are hitting the threshold.

How to read the table: "With shard complexity 200, 95% of shards are made of 187 or fewer nodes, and the biggest shard is made of 199 nodes (the threshold)."

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT ST_NPoints(geom) AS count FROM shard);

| shard complexity | avg | min | 50th | 95th | 99th | max |
|---|---|---|---|---|---|---|
| 200 | 100.195913315485 | 4 | 108 | 187 | 197 | 199 |
| 500 | 206.746982021926 | 4 | 205 | 454 | 490 | 499 |
| 1000 | 283.836373707745 | 4 | 196 | 831 | 960 | 999 |
| 2000 | 325.78730112614 | 4 | 178 | 1278 | 1784 | 1999 |
| 100000 | 347.529166534739 | 4 | 170 | 1195 | 2604 | 89976 |
| jawg-world-200 | 59.067824239224 | 4 | 33 | 171 | 193 | 199 |

This one is interesting too; the 50th percentile for the world dataset is way lower than the others. I feel like having a 50th percentile near half of the complexity would be a good configuration. That means 200 and 500 are good values for France.

Gatling Configuration

The original KPIs and parameters were the same as the old stress test I did some years ago: 95th percentile below 750ms, 75k users for 60s. But something went wrong: with the exact same specs, the result was bad. I found out that it was introduced by #89, which makes sense since the commit adds many changes. The 95th percentile moved from 28ms using pelias/spatial:master-2025-07-14-de64a9c3ad54d0f0a9cefc282fcaba0c974c6901 to 43.338s (yeah, seconds instead of ms) with pelias/spatial:master-2025-06-27-656fa8d5100ad1c071632f732840b74f4c345940.

So I will reduce the number of users for this stress test (60k instead of 75k) and use the latest spatial version.

Since I'm using a French dataset, I will query only within French regions. This is the environment (USERS_COUNT=60000, USERS_RAMP_TIME=60, SERVER_URL=http://$SPATIAL_IP/query/pip/_view/pelias) and the CSV I used:

Region,LatMin,LatMax,LngMin,LngMax
AUVERGNE-RHONE-ALPES,44.1154,46.804,2.0629,7.1859
BOURGOGNE-FRANCHE-COMTE,46.1559,48.4001,2.8452,7.1435
BRETAGNE,47.278,48.9008,-5.1413,-1.0158
CENTRE-VAL DE LOIRE,46.3471,48.9411,0.053,3.1286
CORSE,41.3336,43.0277,8.5347,9.56
GRAND EST,47.4202,50.1692,3.3833,8.2333
HAUTS-DE-FRANCE,48.8372,51.089,1.3797,4.2557
ILE-DE-FRANCE,48.1205,49.2413,1.4465,3.5587
NORMANDIE,48.1799,50.0722,-1.9485,1.8027
NOUVELLE-AQUITAINE,42.7775,47.1758,-1.7909,2.6116
OCCITANIE,42.3331,45.0467,-0.3272,4.8456
PAYS DE LA LOIRE,46.2664,48.568,-2.6245,0.9167
PROVENCE-ALPES-COTE D'AZUR,42.9818,45.1268,4.2303,7.7188
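
Presumably the stress tool turns each CSV row into uniformly random query points inside that region's bbox. Here's an illustrative sketch of that idea in plain Node, not the actual pelias-server-stress code; only the first rows of the CSV are inlined and the server URL is a stand-in:

```js
const SERVER_URL = 'http://localhost:3000/query/pip/_view/pelias' // stand-in for http://$SPATIAL_IP/query/pip/_view/pelias

// first rows of the CSV above; remaining rows elided for brevity
const csv = `Region,LatMin,LatMax,LngMin,LngMax
AUVERGNE-RHONE-ALPES,44.1154,46.804,2.0629,7.1859
BRETAGNE,47.278,48.9008,-5.1413,-1.0158
ILE-DE-FRANCE,48.1205,49.2413,1.4465,3.5587`

const regions = csv.trim().split('\n').slice(1).map((line) => {
  const [region, latMin, latMax, lngMin, lngMax] = line.split(',')
  return { region, latMin: +latMin, latMax: +latMax, lngMin: +lngMin, lngMax: +lngMax }
})

const rand = (min, max) => min + Math.random() * (max - min)

// pick a random region, then a uniform random point inside its bbox
function randomQueryUrl() {
  const r = regions[Math.floor(Math.random() * regions.length)]
  const lng = rand(r.lngMin, r.lngMax).toFixed(5)
  const lat = rand(r.latMin, r.latMax).toFixed(5)
  return `${SERVER_URL}/${lng}/${lat}` // endpoint takes /lon/lat, as in the k6 script above
}

console.log(randomQueryUrl())
```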

Specs

OS: Debian 12.10
CPU: 4 threads (2.3GHz)
RAM: 15GB
Arch: linux/amd64
Docker Version: 28.3.2
Container: `pelias/spatial:master-2025-07-14-de64a9c3ad54d0f0a9cefc282fcaba0c974c6901`

Results

All tests were run at least twice, to allow for warmup. I checked the CPU and other stats and nothing was wrong: the CPU for 200-2000 was between 80% and 100%, and the max iowait was 15%, decreasing over the course of the stress test. For the second test, the CPU was between 70% and 90%.

pelias/spatial:master-2025-06-25-02bda6265ab8d9283b73a795479dbc4e13ba5eb0
(latencies in ms)

| shard complexity | KO | % KO | Cnt/s | Min | 50th | 75th | 95th | 99th | Max | Mean | Std Dev |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0 | 0% | 983.607 | 1 | 13 | 22 | 48 | 76 | 585 | 18 | 20 |
| 500 | 0 | 0% | 983.607 | 1 | 17 | 27 | 52 | 81 | 626 | 21 | 22 |
| 1000 | 0 | 0% | 983.607 | 2 | 230 | 348 | 540 | 736 | 14986 | 310 | 752 |
| 2000 | 0 | 0% | 833.333 | 8 | 520 | 675 | 49121 | 56877 | 58722 | 6226 | 14639 |
| 100000 | 50994 | 85% | 645.161 | 3 | 1414 | 60000 | 60001 | 60004 | 60106 | 28236 | 29741 |

My conclusion here is, without much surprise, that tuning the shard complexity affects the overall performance. The difference between 200 and 500 is quite small, so 500 might be a good starting point. I don't know whether, on a full-size build, the difference will be pronounced enough for this change to be interesting in terms of performance and disk space. In any case, I don't think we should go any higher (at least as the default value).

pelias/spatial:master-2025-07-14-de64a9c3ad54d0f0a9cefc282fcaba0c974c6901
| shard complexity | KO | % KO | Cnt/s | Min | 50th | 75th | 95th | 99th | Max | Mean | Std Dev |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0 | 0% | 983.607 | 1 | 6 | 9 | 16 | 30 | 196 | 7 | 7 |
| 500 | 0 | 0% | 983.607 | 1 | 6 | 9 | 14 | 24 | 136 | 7 | 5 |
| 1000 | 0 | 0% | 983.607 | 1 | 7 | 11 | 21 | 39 | 607 | 9 | 15 |
| 2000 | 0 | 0% | 983.607 | 1 | 10 | 17 | 71 | 154 | 2612 | 23 | 96 |
| 100000 | 46229 | 77% | 618.557 | 4 | 1369 | 60000 | 60001 | 60001 | 60047 | 28233 | 29684 |

I didn't check the code to see why there is such a difference in performance, but we can still see that starting from 1000 the max is 5 times higher than at 500. Shard complexity 100000 is still too high and kills performance.

So I have the same conclusion as you @missinglink: the best values might be between 200 and 1000, with 500 being the best compromise.

Joxit avatar Jul 17 '25 13:07 Joxit

This new PR seems to do miracles:

.rw-r--r--@ 533Mi peter 10 Sep 06:25 -I  zcta.spatial.db
.rw-r--r--@ 2.1Gi peter 11 Jul 17:13 -I  zcta.spatial.shard200.db

missinglink avatar Sep 10 '25 04:09 missinglink