dataverse
dataverse copied to clipboard
GDCC/Geosearch indexing
What this PR does / why we need it: Adds bounding box indexing in preparation for providing geosearch or other features
Which issue(s) this PR closes:
Closes #
Special notes for your reviewer:
Suggestions on how to test this:
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: design doc: https://docs.google.com/document/d/1aQCIM6wbAdgMNLTVSUhvuHQKzWDAvQ5XoYGbfint3P4/edit?usp=sharing
Coverage increased (+0.008%) to 19.989% when pulling b5383f4420337fe3b4d3335bf0b827d0ac083e6e on GlobalDataverseCommunityConsortium:GDCC/geosearch into 11abccfb61d28e3b9e0695cb7be85d775a83ce59 on IQSS:develop.
Per discussion at the DCM that the improved geospatial indexing could be useful independent of adding further geospatial search/display capabilities, I'm moving this PR from draft to review-ready. Not sure what the Rocky shellspec issue is - probably worth re-running the build to see if there's any real issue.
I'm not sure it is possible to make an IT test yet. The solr query that works is as shown:
In the UI, I don't think we can submit filterqueries and with the API, you can submit the filter query part but I haven't seen how to send the pt and d parameters yet. If that is possible, an IT test would be straight forward.
Otherwise, to test one could use the solr console (dropping the jetty.host-localhost recommended for our solr install to make it accessible, and possibly doing an aws ec2 authorize-security-group-ingress --group-id <id> --protocol tcp --port 8983 --cidr <your ip>/32
to let a browser access the console.)
I'm on 202438a. When I run DataversesIT.testImportDDI it fails with this:
INFO: {"status":"ERROR","message":"Command [DatasetCreate dataset:3] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: Exception writing document id dataset_3_draft to the index; possible analysis error: DocValuesField "solr_bboxtype__minX" appears more than once in this document (only one value is allowed per field)"}
@qqmyers can you please take a look? I did update to the latest schema.xml you provided. I'll merge the latest from develop into this branch to trigger a Jenkins run. Thanks.
In c7c16d4 I updated the Search API two allow two new geo-related parameters: geo_point
and geo_radius
:
I created a dataset using lat/logs from a box I drew and copied from https://boundingbox.klokantech.com
westlimit=-71.187346; southlimit=42.33661; eastlimit=-71.043056; northlimit=42.409599
If I use a radius of 1 km, I can't find the dataset:
$ curl "http://localhost:8080/api/search?q=*&geo_point=42.3,-71.1&geo_radius=1"
{"status":"OK","data":{"q":"*","total_count":0,"start":0,"spelling_alternatives":{},"items":[],"count_in_response":0}}
If I use a radius of 4 km, I get a hit:
$ curl "http://localhost:8080/api/search?q=*&geo_point=42.3,-71.1&geo_radius=4"
{"status":"OK","data":{"q":"*","total_count":1,"start":0,"spelling_alternatives":{},"items":[{"name":"Dataverse HQ","type":"dataset","url":"https://doi.org/10.5072/FK2/MKEOMD"...
Right now this is API-only. I'm not sure how much work it would be to add it to the GUI.
Right now both geo_point
and geo_radius
are required to trigger geospatial search. Instead should I set geo_radius
to a sane default (1 km? 10 km? 100 km?) and only make geo_point
required to trigger geospatial search?
Do you like the names of these new parameters? My goal was to make them user friendly. geo_point
maps to pt
in Solr. geo_radius
maps to d
in Solr. You can read about these parameters at https://solr.apache.org/guide/8_11/spatial-search.html and here's a screenshot from those docs:
I added a geospatial search test in 7272de6.
@qqmyers I just noticed that if I comment out northLongitude
...
//.add("northLongitude",
// Json.createObjectBuilder()
// .add("value", "42.409599")
// .add("typeClass", "primitive")
// .add("multiple", false)
// .add("typeName", "northLongitude")
//)
... it gets populated in solr_srpt
and solr_bboxtype
with the value from southLongitude
:
"solr_srpt":["ENVELOPE(-71.187346,-71.043056,42.33661,42.33661)"],
"solr_bboxtype":["ENVELOPE(-71.187346,-71.043056,42.33661,42.33661)"],
"westLongitude":["-71.187346"],
"southLongitude":["42.33661"],
"eastLongitude":["-71.043056"],
What do you think? Bug or feature? 😄
p.s. Yes, I know it should be northLatitude
and southLatitude
. There's a bug tracking this:
- #5645
I think the indexing code assumes you're talking about a point (or line) if you leave one coord out so it just replicates the value for the other extreme. So ~feature.
@qqmyers ok, but I did go ahead and fix a bug when there was only one coordinate: e5187b2af4
Next up: names. I haven't heard any objections to geo_point
and geo_radius
so I'm planning on sticking with them. Unless we want spatial_point
and spatial_radius
? I'm pretty sure we're limiting ourselves to Earth, though, so I think "geo" is fine (and shorter!). 🌎
On the Solr field side we currently have solr_bboxtype
and solr_srpt
. None of our existing fields start with solr_
so I'm thinking of lopping that off.
Any reason not to shorted to just bbox
? Our Solr fields usually don't end with type
and just bbox
is what's shown in the docs: https://solr.apache.org/guide/8_11/spatial-search.html#bboxfield
I don't know what to say about RPT. SpatialRecursivePrefixTreeFieldType is a mouthful. I'm not in love with solr_srpt
but the only example I see in the docs is location_rpt
. Perhaps we should go with that.
Update: In 28215fb512 I renamed the Solr fields:
-
solr_srpt
is nowgeolocation
-
solr_bboxtype
is nowboundingBox
Feedback welcome!
I added some error checking and updated the docs based on how the code is now.
I talked to @atrisovic about this feature and she immediately asked "Can you sort by distance?" Good question! 😄 I haven't looked into this but maybe? See https://solr.apache.org/guide/8_11/spatial-search.html#distance-sorting-or-boosting-function-queries for sort=geodist() asc
. I'd need more data to test with.
I'm also ok with this being API only for now. That's how I wrote it up in the User Guide: https://dataverse-guide--8239.org.readthedocs.build/en/8239/user/find-use-data.html#geospatial-search
I'm ready for review. We could keep hacking but there's lots of other stuff to do as well. 😄 I think this is an excellent toehold into geospatial search!
Moved back to "This Sprint" so that @pdurbin's recent changes can be reviewed. (@pdurbin reviewed the initial PR and was fine with the changes Jim made.)
grooming
- Looking at the NetCDF deliverables.
- This looks like it's a related issue that is related to active work, but was done before we were trying to tag each issue
- Adding it to the backlog as cleared, and tagging it.