dataverse icon indicating copy to clipboard operation
dataverse copied to clipboard

GDCC/Geosearch indexing

Open qqmyers opened this issue 3 years ago • 4 comments

What this PR does / why we need it: Adds bounding box indexing in preparation for providing geosearch or other features

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation: design doc: https://docs.google.com/document/d/1aQCIM6wbAdgMNLTVSUhvuHQKzWDAvQ5XoYGbfint3P4/edit?usp=sharing

qqmyers avatar Nov 11 '21 00:11 qqmyers

Coverage Status

Coverage increased (+0.008%) to 19.989% when pulling b5383f4420337fe3b4d3335bf0b827d0ac083e6e on GlobalDataverseCommunityConsortium:GDCC/geosearch into 11abccfb61d28e3b9e0695cb7be85d775a83ce59 on IQSS:develop.

coveralls avatar Nov 11 '21 00:11 coveralls

Per discussion at the DCM that the improved geospatial indexing could be useful independent of adding further geospatial search/display capabilities, I'm moving this PR from draft to review-ready. Not sure what the Rocky shellspec issue is - probably worth re-running the build to see if there's any real issue.

qqmyers avatar Jul 11 '22 17:07 qqmyers

I'm not sure it is possible to make an IT test yet. The solr query that works is as shown: image

In the UI, I don't think we can submit filterqueries and with the API, you can submit the filter query part but I haven't seen how to send the pt and d parameters yet. If that is possible, an IT test would be straight forward.

Otherwise, to test one could use the solr console (dropping the jetty.host-localhost recommended for our solr install to make it accessible, and possibly doing an aws ec2 authorize-security-group-ingress --group-id <id> --protocol tcp --port 8983 --cidr <your ip>/32 to let a browser access the console.)

qqmyers avatar Sep 15 '22 23:09 qqmyers

I'm on 202438a. When I run DataversesIT.testImportDDI it fails with this:

INFO: {"status":"ERROR","message":"Command [DatasetCreate dataset:3] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: Exception writing document id dataset_3_draft to the index; possible analysis error: DocValuesField "solr_bboxtype__minX" appears more than once in this document (only one value is allowed per field)"}

@qqmyers can you please take a look? I did update to the latest schema.xml you provided. I'll merge the latest from develop into this branch to trigger a Jenkins run. Thanks.

pdurbin avatar Oct 13 '22 17:10 pdurbin

In c7c16d4 I updated the Search API two allow two new geo-related parameters: geo_point and geo_radius:

Screen Shot 2022-10-24 at 11 52 26 AM

I created a dataset using lat/logs from a box I drew and copied from https://boundingbox.klokantech.com

westlimit=-71.187346; southlimit=42.33661; eastlimit=-71.043056; northlimit=42.409599

Screen Shot 2022-10-24 at 11 30 51 AM

If I use a radius of 1 km, I can't find the dataset:

$ curl "http://localhost:8080/api/search?q=*&geo_point=42.3,-71.1&geo_radius=1"

{"status":"OK","data":{"q":"*","total_count":0,"start":0,"spelling_alternatives":{},"items":[],"count_in_response":0}}

If I use a radius of 4 km, I get a hit:

$ curl "http://localhost:8080/api/search?q=*&geo_point=42.3,-71.1&geo_radius=4"

{"status":"OK","data":{"q":"*","total_count":1,"start":0,"spelling_alternatives":{},"items":[{"name":"Dataverse HQ","type":"dataset","url":"https://doi.org/10.5072/FK2/MKEOMD"...

Right now this is API-only. I'm not sure how much work it would be to add it to the GUI.

Right now both geo_point and geo_radius are required to trigger geospatial search. Instead should I set geo_radius to a sane default (1 km? 10 km? 100 km?) and only make geo_point required to trigger geospatial search?

Do you like the names of these new parameters? My goal was to make them user friendly. geo_point maps to pt in Solr. geo_radius maps to d in Solr. You can read about these parameters at https://solr.apache.org/guide/8_11/spatial-search.html and here's a screenshot from those docs:

Screen Shot 2022-10-24 at 12 03 42 PM

pdurbin avatar Oct 24 '22 16:10 pdurbin

I added a geospatial search test in 7272de6.

@qqmyers I just noticed that if I comment out northLongitude...

//.add("northLongitude",
//        Json.createObjectBuilder()
//                .add("value", "42.409599")
//                .add("typeClass", "primitive")
//                .add("multiple", false)
//                .add("typeName", "northLongitude")
//)

... it gets populated in solr_srpt and solr_bboxtype with the value from southLongitude:

    "solr_srpt":["ENVELOPE(-71.187346,-71.043056,42.33661,42.33661)"],
    "solr_bboxtype":["ENVELOPE(-71.187346,-71.043056,42.33661,42.33661)"],
    "westLongitude":["-71.187346"],
    "southLongitude":["42.33661"],
    "eastLongitude":["-71.043056"],

What do you think? Bug or feature? 😄

p.s. Yes, I know it should be northLatitude and southLatitude. There's a bug tracking this:

  • #5645

pdurbin avatar Oct 25 '22 18:10 pdurbin

I think the indexing code assumes you're talking about a point (or line) if you leave one coord out so it just replicates the value for the other extreme. So ~feature.

qqmyers avatar Oct 25 '22 18:10 qqmyers

@qqmyers ok, but I did go ahead and fix a bug when there was only one coordinate: e5187b2af4

Next up: names. I haven't heard any objections to geo_point and geo_radius so I'm planning on sticking with them. Unless we want spatial_point and spatial_radius? I'm pretty sure we're limiting ourselves to Earth, though, so I think "geo" is fine (and shorter!). 🌎

On the Solr field side we currently have solr_bboxtype and solr_srpt. None of our existing fields start with solr_ so I'm thinking of lopping that off.

Any reason not to shorted to just bbox? Our Solr fields usually don't end with type and just bbox is what's shown in the docs: https://solr.apache.org/guide/8_11/spatial-search.html#bboxfield

I don't know what to say about RPT. SpatialRecursivePrefixTreeFieldType is a mouthful. I'm not in love with solr_srpt but the only example I see in the docs is location_rpt. Perhaps we should go with that.


Update: In 28215fb512 I renamed the Solr fields:

  • solr_srpt is now geolocation
  • solr_bboxtype is now boundingBox

Feedback welcome!

pdurbin avatar Oct 25 '22 18:10 pdurbin

I added some error checking and updated the docs based on how the code is now.

I talked to @atrisovic about this feature and she immediately asked "Can you sort by distance?" Good question! 😄 I haven't looked into this but maybe? See https://solr.apache.org/guide/8_11/spatial-search.html#distance-sorting-or-boosting-function-queries for sort=geodist() asc. I'd need more data to test with.

I'm also ok with this being API only for now. That's how I wrote it up in the User Guide: https://dataverse-guide--8239.org.readthedocs.build/en/8239/user/find-use-data.html#geospatial-search

I'm ready for review. We could keep hacking but there's lots of other stuff to do as well. 😄 I think this is an excellent toehold into geospatial search!

pdurbin avatar Oct 26 '22 21:10 pdurbin

Moved back to "This Sprint" so that @pdurbin's recent changes can be reviewed. (@pdurbin reviewed the initial PR and was fine with the changes Jim made.)

scolapasta avatar Oct 31 '22 19:10 scolapasta

grooming

  • Looking at the NetCDF deliverables.
  • This looks like it's a related issue that is related to active work, but was done before we were trying to tag each issue
  • Adding it to the backlog as cleared, and tagging it.

mreekie avatar Mar 30 '23 18:03 mreekie