Bulk access to documents using bulk input

Open kltm opened this issue 10 years ago • 36 comments

The bulk download of annotations with gene product input is an occasionally used feature of AmiGO 1.x that I suspect might be missed.

The easiest implementation would be a series of queries to the GOlr server, with the results hashed on the perl side as we go to get rid of dupes.
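As a rough illustration of the idea above (written in Python rather than perl, purely as a sketch): split the input IDs into batches, issue one query per batch, and keep the results in a hash keyed on document ID so that duplicates fall out. The `fetch_batch` callable and the `id` key are hypothetical stand-ins for whatever actually talks to GOlr.

```python
from typing import Callable, Dict, Iterator, List

Doc = Dict[str, str]


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def bulk_fetch(
    ids: List[str],
    fetch_batch: Callable[[List[str]], List[Doc]],  # hypothetical GOlr caller
    batch_size: int = 50,
    key: str = "id",  # assumed unique document key
) -> List[Doc]:
    """Run one query per batch and drop duplicate documents as we go."""
    seen: Dict[str, Doc] = {}
    for batch in chunked(ids, batch_size):
        for doc in fetch_batch(batch):
            # First occurrence wins; later duplicates are ignored.
            seen.setdefault(doc[key], doc)
    return list(seen.values())
```

The dictionary preserves first-seen order, so the output follows the order in which documents came back from the batches.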

kltm avatar Feb 26 '14 18:02 kltm

Similarly, a user has expressed interest in being able to create term ID lists:

http://jira.geneontology.org/browse/GO-360

http://amigo1.geneontology.org/cgi-bin/amigo/search.cgi?search_constraint=term&termfields=acc&ont=all&gptype=all&speciesdb=all&taxid=all&evcode=all&action=new-search&search_query=GO:0000166%0A%0DGO:0001893%0A%0DGO:0005080%0A%0DGO:0005625%0A%0DGO:0005819%0A%0DGO:0005975%0A%0DGO:0006417

kltm avatar Mar 18 '14 15:03 kltm

Also: http://jira.geneontology.org/browse/GO-359

kltm avatar Mar 28 '14 01:03 kltm

I think this would be better done with the batching on the server side--it would be more robust, we wouldn't have to deal with sync, and it would give easier access in a more traditional way (bulk remote clients and bots). I'll have to check if the perl API is up for it. I would also have to consider whether the linking library in the perl API is sufficient.

kltm avatar Apr 14 '14 23:04 kltm

http://jira.geneontology.org/browse/GO-389

kltm avatar Apr 18 '14 17:04 kltm

Looks like at least part of the functionality is going to be frustrated by kltm/bbop-js#17. A separate accordion widget is a bit of a thing on its own, but I'm not sure how to proceed with complete functionality without it...

kltm avatar May 03 '14 00:05 kltm

Creating something that makes sense in the context of AmiGO 2 is taking rather more work than expected, mostly due to a digression with kltm/bbop-js#17, which itself is more work than expected. Thinking about bumping this to its own milestone and getting what's in there out as fast as possible.

kltm avatar May 05 '14 03:05 kltm

[Will edit this list in place.] I would guess another very full day or two days for the JS frontend; still need:

  • live_filters spinner
  • new bs3 filter_shield (can use bs3 progress as spinner here)
  • search field selection for IDs
  • results selection
    • in-browser results in light form (a la the primitive search's handler)
    • TSV download (proxied through the perl)

After all of that, the backend work should be easier:

  • clumping (or not) of the queries
  • results production

I want to say three days total, but am a little leery after how much time had to be sunk into the accordion.

kltm avatar May 05 '14 23:05 kltm

http://jira.geneontology.org/browse/GO-429

kltm avatar May 22 '14 20:05 kltm

After some discussion with @cmungall and @hdietze, will explore (again?) if batching is really necessary. And the perl client?

To get this out the door faster, it may be that we should just switch to POST and set the bar fairly low, putting all of the responsibility on the JS client to compose a large query and then setting the limits to something reasonable through experimentation.
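A rough sketch of what composing one large query on the client side could look like, in Python for brevity rather than the JS client itself. The field name `annotation_class` in the usage note, the `max_ids` cap, and the choice to put the ID list in `fq` are assumptions for illustration, not the actual manager code.

```python
from typing import Iterable, List
from urllib.parse import urlencode


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_bulk_query(field: str, ids: Iterable[str], max_ids: int = 1000) -> str:
    """Compose field:("A" OR "B" ...) from de-duplicated, trimmed IDs."""
    unique: List[str] = list(dict.fromkeys(i.strip() for i in ids if i.strip()))
    if not unique:
        raise ValueError("no identifiers given")
    if len(unique) > max_ids:
        # The cap is the "reasonable limit" to be found by experimentation.
        raise ValueError(f"{len(unique)} identifiers exceeds limit of {max_ids}")
    return f"{field}:(" + " OR ".join(quote_term(i) for i in unique) + ")"


def build_post_body(field: str, ids: Iterable[str], rows: int = 100) -> str:
    """Form-encode the parameters for a POST to a Solr select handler."""
    params = {
        "q": "*:*",
        "fq": build_bulk_query(field, ids),
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)
```

For example, `build_bulk_query("annotation_class", ["GO:0000166", "GO:0001893"])` gives `annotation_class:("GO:0000166" OR "GO:0001893")`.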

kltm avatar Aug 29 '14 00:08 kltm

Well, looking at the issues and requests above again, I think I can explain the rationale better this time around.

One set of issues that users are having seems to be along the lines of: I want to put in a bunch of term ids and get a listing of those terms I can work with (link, download, etc.). This is essentially the way the current bulk interface prototype is heading, and would probably be workable for non-huge numbers by composing large queries (will require some new stuff in the manager, but probably not too bad).

That said, this is not the actual issue open here. The other set of issues (this issue) is along the lines of: I want to put in random gps and get annotation data; I want to put in terms and get gp (annotation) data. This would require the Solr equivalent of an RDB join, and I think can only practically be done with an initial query to get the "key ids" from one doctype and then using another query to get the wanted data from the target doctype.

In large or complicated cases, this is maybe not something that one would want to handle in a single pass from the client (time), and breaking it up (a la the matrix tool) is rather unwieldy in practice. Reading this thread though, it seems my final deciding factor for wanting to do it on the server was to make it easier to directly create links and kick-ins for the service, so that these bulk pages could be treated in much the same way that term and gp pages are currently treated.

Shelving those reasons for the moment though, if one was willing to sacrifice easy kick-ins and the single-step satisfaction of going straight from gp symbols to terms, a possible workable interface would be:

  • nice bulk search: live filters, search fields, and bulk input (essentially the current bulk mock-up)
    • maybe needs a download field selector too?
  • clicking search will produce the results (maybe just a preview, 100?)
  • these results will include buttons or links (a la current live search pages) that have options like: download, get direct annotations for these IDs, get all annotations for these IDs

One thing that would be lost in an interface like this, at least in the beginning, would be filtering on the second step (e.g. with these term ids, give me all annotations with this evidence); but you can imagine either extending this or feeding it into itself (we'd need kick-in to get things like TE results links to work) so that somebody could take multiple steps through the different document types, filtering and joining with the next one. Sort of a shopping cart of the moment.

If this makes sense, I think this might be a way to go for now: we'd get some parts running immediately, and could grow it out into other needed functionality (at the cost of single steps). It would also mean that the perl bits could be ignored for a while longer (possibly allowing us to stall long enough to get rid of them completely). If it sounds right, I'll add another issue for basic bulk search, and let this be the second step.

kltm avatar Aug 29 '14 06:08 kltm

Re: joins - the golr documents are denormalized and should not require joins for non-boutique queries. E.g. fetching annotations or entities by term ids would be achieved via the closure field. It may be the case that performance would be poor, but no join required.
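To make the "no join" point concrete, here is a toy in-memory sketch in Python. Each annotation document already carries the closure of its term, so fetching by term ID is a single filter over one doctype. The field names `annotation_class` and `isa_partof_closure` are assumed here, and the IDs are invented placeholders rather than real GO or gene product identifiers.

```python
from typing import Dict, Iterable, List

# Invented, denormalized annotation documents: the closure list holds the
# directly annotated term plus its ancestors.
TOY_ANNOTATIONS: List[Dict] = [
    {"bioentity": "GP:one", "annotation_class": "GO:child",
     "isa_partof_closure": ["GO:child", "GO:parent"]},
    {"bioentity": "GP:two", "annotation_class": "GO:parent",
     "isa_partof_closure": ["GO:parent"]},
    {"bioentity": "GP:three", "annotation_class": "GO:other",
     "isa_partof_closure": ["GO:other"]},
]


def annotations_for_terms(
    docs: Iterable[Dict], term_ids: Iterable[str], use_closure: bool = True
) -> List[Dict]:
    """Return annotations matching any term ID, optionally via the closure."""
    wanted = set(term_ids)
    hits = []
    for doc in docs:
        if use_closure:
            values = doc.get("isa_partof_closure", [])
        else:
            values = [doc["annotation_class"]]  # direct annotations only
        if wanted.intersection(values):
            hits.append(doc)
    return hits
```

With `use_closure=True`, asking for `GO:parent` also returns the annotation made to `GO:child`; with `use_closure=False`, only the direct annotation comes back.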

cmungall avatar Aug 29 '14 19:08 cmungall

Talked again; the current plan is to 1) get the bulk download working and 2) get more gp information into the association doctype (which should meet most of the needs for most of the users without needing complicated "join" code). Will change title to reflect this.

kltm avatar Aug 29 '14 20:08 kltm

Still needs new manager functionality (in progress) and results widget (#104).

kltm avatar Sep 16 '14 23:09 kltm

Will put up to start getting some feedback. Still needs:

  • spinner integration across widget-lets (may need go-mme/noctua-like pre/post hooks)
  • download/activity button integration
  • convert all widgets to take manager as argument
  • results needs active checkboxes
  • needs better default download options
  • our button needs to close around response variable? (refactored into different widget instead)
  • want to get the glyphs working
  • button injection into different widget (weird, but will look better)
  • feedback and style cleaning (explanations, text, and getting the bits to fit a little better)

kltm avatar Sep 23 '14 00:09 kltm

Whelp. There is a same-origin policy (implemented in jQuery) so that cross-site POSTs are coerced into GETs. This is alluded to in the documentation, but never spelled out; I explicitly forced the example (localhost:8080 vs localhost:9999 for the exact same request). I believe this means nothing (not CORS, not different arguments--nothing) can help here except for a server-level policy change.

To make this work as it stands, this means that we will either have to: a) proxy Solr through AmiGO's Apache/standalone (oi) or b) make deployments raise their GET max to something closer to the limits of what we want/performance. I feel like the latter is more sensible right now.
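Whichever option is chosen, the client still has to know whether a composed query fits under the GET limit a deployment allows. A minimal Python sketch of that check, greedily packing IDs into queries that stay under a maximum URL length; the base URL, the field name, and the limit are placeholders, not real deployment values.

```python
from typing import List
from urllib.parse import urlencode


def build_url(base_url: str, field: str, ids: List[str]) -> str:
    """Build a GET URL carrying one disjunctive query over `ids`."""
    clause = " OR ".join(f'"{i}"' for i in ids)
    return base_url + "?" + urlencode({"q": f"{field}:({clause})"})


def pack_ids(
    base_url: str, field: str, ids: List[str], max_url_len: int
) -> List[List[str]]:
    """Split `ids` into batches whose encoded URLs fit in `max_url_len`."""
    batches: List[List[str]] = []
    current: List[str] = []
    for ident in ids:
        if len(build_url(base_url, field, [ident])) > max_url_len:
            raise ValueError(f"{ident!r} alone exceeds the URL limit")
        candidate = current + [ident]
        if len(build_url(base_url, field, candidate)) <= max_url_len:
            current = candidate
        else:
            # Adding this ID would overflow: close the batch, start a new one.
            batches.append(current)
            current = [ident]
    if current:
        batches.append(current)
    return batches
```

If `pack_ids` returns a single batch, the whole input fits in one GET; more than one batch means either the limit needs raising or the query needs splitting.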

kltm avatar Sep 23 '14 05:09 kltm

Bumped nginx on our labs servers, getting around 5-6s responses with all fields searched in ontology, sometimes a little more, but it's probably sufficient considering the use case for the tool. I worry though that this could hose other queries coming in. Testing and a lower limit? Also, will have to start warming production to the fact that the limits will have to be bumped on whatever they have in front of Solr. (Solr has very nice high limits, BTW.)

kltm avatar Sep 23 '14 06:09 kltm

New sub-tasks in comment: https://github.com/kltm/amigo/issues/69#issuecomment-56462524

kltm avatar Sep 25 '14 05:09 kltm

Awesome, v useful.

  • fields should be grouped together (e.g. all 3 gene/product fields should be contiguous). Same order as column order is OK.
  • ordering should have the most used fields first (e.g. inferred annotation is more useful than annotation extension closure)
  • search should be case-insensitive - e.g. for any label/name search

The page does suffer slightly from what at the last meeting Helen called over-exposing the architecture. This is actually really useful for power users, but I think the average user on landing on this page may be a bit overwhelmed as to what to do next. Having a simple list of real IDs might help. Having some sensible default selections may also help (and a "select all" button). The difference between g/p, g/p label and g/p name will be utterly impenetrable to those not steeped in amigoness. Having an extra field in the yaml for a full-text description of the field and exposing this here would help.

Change the information from:

"This Bulk Search is specialized on the Annotations search personality: Associations between GO terms and genes or gene products. Bulk Search will let you get information on lists of input identifiers."

to:

"This form allows you to search for and download associations between GO terms and genes or gene products. Enter one or more identifiers or symbols into the form and use the checkboxes on the right to select the identifier type. For example, if you enter a list of gene symbols, select "gene/product"."

Finally, we'll ultimately want to exploit this front end for TE queries as well. They're very similar. This allows a disjunctive query over a set of entity IDs. TE is simply performing a statistical test over the results of this.

cmungall avatar Sep 27 '14 20:09 cmungall

I agree with pretty much everything; however, because of the way that these pages are generated, the specialized text and particular orders are not currently represented in the YAML files and would have to be added and threaded through. The choice is to do it the "right" way, which will take more effort, or to add special code for our use cases to the client. Although, in this case, there may not be much choice but the latter...

Related, the text overlap in other places between the labels and the ids is confusing here and should probably be cleaned up as a separate ticket (again, the framework mechanism is too cute in other places, exposing structure here).

I'm not sure about usability, but I'd actually rather make the users select things by hand: the query grows rather quickly and anything that might limit that I feel is a plus.

Also, I rather feel that the display is too busy (I was using this as a test bed for the trickier widgets for #104) and thought about getting rid of the filter widget.

kltm avatar Sep 28 '14 08:09 kltm

Test set from http://jira.geneontology.org/browse/GO-621

O60760 O14684 Q9H845 P24752 O60488 O15120 Q9NUQ2 P15121 P42330 P18054 O75342 P16050 O15296 P09917 P20292 Q9BYJ1 P16152 P04798 P05177 Q16678 P20813 P33261 P10632 P11712 P10635 P05181 P51589 Q7Z449 P08684 Q02928 Q9HBI6 P78329 Q6NT55 Q08477 P98187 Q9Y271 Q9NS75 P16444 Q9H4A9 P34913 P19440 P36269 P07203 P22352 P36969 P15428 P04180 Q9Y5X9 Q7L5N7 Q6P1A2 P09960 Q15722 Q9NPC1 Q16873 P49137 Q96N66 O15496 Q9BZM1 P04054 P14555 Q9UNK4 Q9NZK7 Q9BZM2 Q9NZ20 P47712 P0C869 Q9UP65 Q86XP0 P39877 O60733 Q6P1J6 Q13258 Q9Y5Y4 P41222 P34995 P43116 P43115 P35408 Q9H7Z7 Q15185 P43088 P43119 Q16647 Q14914 P23219 P35354 P21731 P24557 P33527 O15439 Q96AD5 Q9UKU0 P53816 O95864 O60427 Q9NYP7 Q9NXB9 P38571 Q8NF37 Q99487 O14975 P35610 O95573 Q6E213 P18283 Q5TCH4 P33121 P33260 Q05469 Q9BX93 O75828 Q8TBF2 Q6P531 Q9UJ14 O75715 P59796 Q96SL4 Q3MJ16 Q68DD2 Q9NRZ5 Q9ULC5 Q86TX2 P49753 Q8N9L9 O00154 O14734 Q8WXI4 Q9NPJ3 Q14032 Q13304 Q99735 O14880 Q8TDS5 Q96P68 Q8N8N7 Q92887 O15438 O95255 Q5T3U5 Q96J66 P52895 Q15067 Q9NP80 Q04828 P13584 Q9UNU6 Q9H4B8 Q9HCS2 P14550 Q8NCG7 Q9Y4D2 Q99685 Q9BX51 O94956 Q8N6M6 Q9H4A4 Q9HAU8 Q92959

kltm avatar Oct 31 '14 17:10 kltm

Some comments on current status of bulk search (not sure if this is redundant with some of the extensive info above).

GO ID use case

  • This is v useful when you have a set of GO IDs
  • It's not completely clear what box to click on the right. "GO class (direct) (annotation_class)" is pretty opaque. And it doesn't do what is expected. It appears to use the closure (which is actually what most users want, even if they don't have the language to express it)
  • Arriving at a list of GO IDs is obviously a challenge for many users; it will be awesome when this is hooked up to, e.g., shopping carts
  • Entering in GO term labels is fun, but again most people don't know the strings by memory, and getting the required list may be a challenge for some. Hooking up to the SG annotator would be awesome here.

Gene use case

  • there are checkboxes for bioentity label and bioentity name, but none for the ID. Most will have IDs
  • Guaranteed that all kinds of mad IDs will be entered. If we are serious about this functionality we should be sure we have similar behavior for term enrichment (the gene use case can be seen as a degenerate case of TE)

Genericity

It's great that this is driven by the schema metadata... from a CS perspective. But I think the experience will be alienating for a user here. Perhaps a more fruitful long term approach would be grebe with the ability to enter lists?

Breaking resolution into a separate step

There is no way for the user to see which subset of entered IDs resolve. Really resolution and bulk query are separate concerns. There are many cases where we might want to plug in the resolution part (e.g. TE) and many places where we want to include the bulk query part.
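A small Python sketch of what treating resolution as its own step might look like: the entered strings are partitioned into those that resolve to a canonical ID and those that do not, and only the resolved set is handed on to the bulk query (or to TE). The lookup index is a hypothetical stand-in for whatever service does the real resolution, and the entries are invented.

```python
from typing import Dict, Iterable, List, Tuple


def build_index(records: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Map lowercased IDs and labels to the canonical ID (toy resolver)."""
    index: Dict[str, str] = {}
    for record in records:
        index[record["id"].lower()] = record["id"]
        index[record["label"].lower()] = record["id"]
    return index


def resolve(
    entered: Iterable[str], index: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return (entered -> canonical ID, list of inputs that did not resolve)."""
    resolved: Dict[str, str] = {}
    unresolved: List[str] = []
    for raw in entered:
        token = raw.strip()
        if not token:
            continue  # skip blank lines in pasted input
        canonical = index.get(token.lower())
        if canonical is None:
            unresolved.append(token)
        else:
            resolved[token] = canonical
    return resolved, unresolved
```

Showing the `unresolved` list back to the user before running the query addresses the "which subset resolved" gap, independent of what consumes the resolved IDs afterwards.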

Text annotation can be seen as a special case of resolution. For example, try the default query here: https://monarchinitiative.org/annotate/text

You should end up with a box that says "35 terms found" and a button to search with these 35.

Another scenario is where the user enters IDs one at a time, e.g. https://monarchinitiative.org/analyze/phenotypes/

Both text annotation (for terms or genes) and autocomplete based list building are equally useful for GO

cmungall avatar Apr 28 '16 00:04 cmungall

(The plain ID is just "bioentity"--it's being driven off of the metadata for other displays.)

There is a bit of a question of scope here. There is an unarguable power-tool aspect here--this obviously needs inside knowledge--but it does possibly fill a niche as is. It was relatively easy to create given that it is bootstrapped from the metadata, but it is obviously not particularly useful to most users (hence it never graduating from labs).

I think it would be good to determine if the current tool, or something similar, has any prospects (it was made for an initial narrow use case) and then try and work out from there. If a grebe-ified version would be more useful to users, we should probably just branch it off and try again now that the troubling parts work better.

Once label and ID resolution is brought in, I think we should scrap this and try for a new tool--there is just so much packed into it that there is unlikely to be much that could be salvaged from here.

The cart chugging along again would be so nice...

kltm avatar Apr 28 '16 00:04 kltm

Still interest here: http://jira.geneontology.org/browse/GO-1224

We really should just go ahead and add this at some point, or something; @mcourtot, does the tool as it stands look usable to you for some use cases?

  • http://tomodachi.berkeleybop.org/amigo/bulk_search/annotation
  • http://tomodachi.berkeleybop.org/amigo/bulk_search/ontology
  • http://tomodachi.berkeleybop.org/amigo/bulk_search/bioentity

kltm avatar May 23 '16 22:05 kltm

Hi Seth,

I had a quick look and it is quite unclear what the categories are (in the attached screenshot); what is the difference between the first and fourth checkboxes for example?

[screenshot: screen shot 2016-05-24 at 11 38 34]

Also re download, it would be good to have "download as GAF, GPAD, txt" or similar instead of the columns choice maybe? This is a bit overwhelming.

Finally, when searching for GO IDs, QuickGO allows searching for the exact terms, or for the terms and their descendants (the latter being the default, typically desired, behaviour), while AmiGO seems to search only the exact ID.

In short, the functionality seems to be pretty much there, but the UI is a bit too complex IMHO.

Cheers, Melanie

mcourtot avatar May 24 '16 10:05 mcourtot

There could be some simplification added (e.g. GAF download on the annotation bulk download), and certainly better explanatory text for the different fields, but fundamentally the reason we essentially get these bulk widgets and pages for "free" is that they run right along the software patterns that are used underneath.

kltm avatar May 24 '16 18:05 kltm

There is still interest in this feature: http://jira.geneontology.org/browse/GO-1277

vanaukenk avatar Jul 08 '16 15:07 vanaukenk

Talking to @hattrill: for the annotation search we should default to checked boxes for going from gene names/ids to terms.

kltm avatar May 14 '18 14:05 kltm

Note that these exist in a hidden "live" form:

  • Gene/product/bioentity: http://amigo.geneontology.org/amigo/bulk_search/bioentity
  • Annotation: http://amigo.geneontology.org/amigo/bulk_search/annotation
  • Ontology term/class: http://amigo.geneontology.org/amigo/bulk_search/ontology

kltm avatar Jul 12 '18 17:07 kltm

Nice ! Can we add links from the 'Search' menu?

pgaudet avatar Jul 12 '18 20:07 pgaudet

Although you might want to remove 'This should not be displayed (bioentity_internal_id)' ... (although it's also in the AmiGO download)

pgaudet avatar Jul 12 '18 20:07 pgaudet