recogito2

Networks

Open eltonteb opened this issue 8 years ago • 15 comments

If Recogito is able to identify where (in a text) any annotation is made, does this mean that it's able to identify whether one or more annotations are made within the same sentence? And could that information then be used to construct simple "network" diagrams of the relationships (based on co-occurrence)?

eltonteb avatar Oct 15 '16 07:10 eltonteb

Technically, this is feasible, of course. (We've discussed something similar at some point with Chiara, I think. This may have been a thread on the Commons forum as well.)

However: a plaintext file has no notion of "sentence". At least you'd need to do some minimal processing. Chunking based on punctuation, a bit of math to build the network based on character offset/-length/-overlap of annotations and sentences, etc. I'm not convinced this functionality should ever be part of Recogito. Successful tools are usually those that do one thing (and that well), and we're already struggling with almost more breadth than we can handle in the given time and budget.
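A minimal sketch of the processing described above, in Python: it chunks plain text on sentence-final punctuation and counts pairs of annotations whose character offsets fall into the same chunk. The function names and the `(label, offset)` input shape are assumptions made for illustration; they are not part of Recogito.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_spans(text):
    """Return (start, end) character spans, splitting after ., ! or ? plus whitespace."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))  # trailing chunk without a following space
    return spans


def cooccurrences(text, annotations):
    """Count label pairs whose offsets fall inside the same sentence span.

    annotations: iterable of (label, char_offset) pairs (assumed shape).
    """
    edges = Counter()
    for start, end in sentence_spans(text):
        labels = sorted({label for label, offset in annotations if start <= offset < end})
        for pair in combinations(labels, 2):
            edges[pair] += 1
    return edges


text = "Odysseus sailed from Troy to Ithaca. He stopped at Pylos."
annotations = [(name, text.index(name)) for name in ("Troy", "Ithaca", "Pylos")]
edges = cooccurrences(text, annotations)
```

Punctuation-based chunking of this kind is crude (abbreviations and unpunctuated editions will break it), which is part of why the notion of "sentence" is contested in this thread.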

I've created a "future projects" label for ideas such as this. Does that sound ok?

rsimon avatar Oct 15 '16 12:10 rsimon

@eltonteb: Since a "sentence" is not necessarily a well-defined thing (also given the varying editorial practices for ancient texts), all those "automagical" co-occurrence approaches are somewhat limited to word proximity contexts. We have discussed the chances and limitations of such approaches several times before (they lack any notion of meaning or intention), as well as the challenges with "too dense" and "too sparse" areas, both of which compromise the readability of the network.
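As an illustration of the proximity-based alternative, here is a hedged Python sketch that links two annotations whenever their character offsets lie within a fixed window. The function name, the `(label, offset)` input shape, and the default window of 50 characters are assumptions chosen for the example.

```python
from collections import Counter


def proximity_edges(annotations, max_distance=50):
    """Link annotations whose character offsets differ by at most max_distance.

    annotations: iterable of (label, char_offset) pairs (assumed shape).
    Returns a Counter mapping sorted label pairs to co-occurrence counts.
    """
    ordered = sorted(annotations, key=lambda item: item[1])
    edges = Counter()
    for i, (label_a, offset_a) in enumerate(ordered):
        for label_b, offset_b in ordered[i + 1:]:
            if offset_b - offset_a > max_distance:
                break  # later annotations are even further away
            if label_a != label_b:
                edges[tuple(sorted((label_a, label_b)))] += 1
    return edges


sample = [("Troy", 21), ("Ithaca", 29), ("Pylos", 51)]
narrow = proximity_edges(sample, max_distance=10)
wide = proximity_edges(sample, max_distance=25)
```

The window size directly controls the density problem mentioned above: a narrow window leaves sparse passages disconnected, while a wide one turns dense passages into a hairball.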

@rsimon: I agree that you should avoid creating bloatware. On the other hand such extended application scenarios are really adding value to the whole system. My suggestion would be that you allow for the inclusion of "Third Party Tools": Basically an external webpage that reads a parameter: http://example.com/tool?document=####### and then tries to load http://recogito.pelagios.org/document/#######/downloads/csv (or ...downloads/tei or something made specifically for this purpose) and does stuff with it.

If a user "subscribes" to such a tool in his/her settings page, a new button could be added that, when clicked, opens the tool's URL with the current document ID in a new tab (maybe with target="_blank", maybe with target="toolname#######").
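A rough sketch of the tool side of this handshake, in Python: it reads the `document` query parameter and derives the download URL using the pattern quoted above. The function name and the sample ID `abc123` are hypothetical, and the `/downloads/csv` path is taken from the suggestion in this comment rather than from a confirmed Recogito route.

```python
from urllib.parse import parse_qs, quote, urlparse

RECOGITO_BASE = "http://recogito.pelagios.org"


def csv_url_for_request(tool_url):
    """Derive the Recogito CSV download URL from a tool URL with ?document=<id>."""
    query = parse_qs(urlparse(tool_url).query)
    if "document" not in query:
        raise ValueError("missing 'document' query parameter")
    doc_id = query["document"][0]
    # Percent-encode the ID so it is safe to place in a URL path.
    return f"{RECOGITO_BASE}/document/{quote(doc_id, safe='')}/downloads/csv"


csv_url = csv_url_for_request("http://example.com/tool?document=abc123")
```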

So in the easiest scenario the tools would work only for public datasets (but some distributed auth/secret mechanism could be added).

I would be volunteering to write a co-occurrence-network tool with such a loosely coupled recogito integration, if there is enough demand. (But I will have no time for it before January!)

What do you think?

efi avatar Oct 15 '16 12:10 efi

Wow - that sounds fantastic! And it's really what I'm hoping the download channels would be used for. I'd be excited to see something along those lines (and happy to shape and extend the downloads to whatever is needed to make things work. At the moment, all of them are rather rudimentary.)

rsimon avatar Oct 15 '16 13:10 rsimon

Concerning permissions: yes, for public docs this won't cause any problems. For non-open things: social login (Google, GitHub and Twitter) is on our list. I guess this would allow us, in theory, to integrate different applications on restricted documents?

rsimon avatar Oct 15 '16 13:10 rsimon

OK, just a proof of concept for now:

Input: a Recogito document from which to extract its ID (example: xzephxecjjybkd from http://recogito.pelagios.org/document/xzephxecjjybkd/part/1/edit)

Tool prototype URL: formed from this ID (it can be part of the path if only ASCII characters are used), e.g. http://recogito-network.textgraph.science:2224/document/xzephxecjjybkd

Output: a list of annotations that can highlight others to which they are "connected" on hover
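The input-to-URL step above can be sketched in Python as follows. The helper names are hypothetical; the two URLs are the ones quoted in this comment.

```python
import re
from urllib.parse import quote, urlparse

TOOL_BASE = "http://recogito-network.textgraph.science:2224"


def document_id(recogito_url):
    """Extract the document ID from a Recogito URL of the form /document/<id>/..."""
    match = re.match(r"^/document/([^/]+)", urlparse(recogito_url).path)
    if match is None:
        raise ValueError(f"no document ID found in {recogito_url!r}")
    return match.group(1)


def tool_url(recogito_url):
    """Build the prototype tool URL, with the document ID as a path segment."""
    return f"{TOOL_BASE}/document/{quote(document_id(recogito_url), safe='')}"


example = tool_url("http://recogito.pelagios.org/document/xzephxecjjybkd/part/1/edit")
```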

It's now easy to put this on a map, display edge weights, display an offset-based "timeline", etc from here. But I'll have to postpone that a few months.

As for the permissions thing: It's usually more complicated than one thinks beforehand. I have struggled several times before to get such a system running, but yeah: first things first...

efi avatar Oct 16 '16 13:10 efi

Yes, re-thought it myself yesterday and social login won't solve the issue. There's probably no way around some form of token exchange, or looking into some sort of single-sign-on framework. So let's keep it for the distant future and make it work for the public docs for now ;-)

rsimon avatar Oct 16 '16 14:10 rsimon

Re data: is your plan to fetch CSV (annotations) and/or TEI (for the text, to determine sentences)? Or just annotations CSV, simply going via character distance?

Finally: question of visualization. Would that be in the scope of your tool as well? Or would you just deliver e.g. downloadable edge/node lists and people should work further in Gephi et al.?

rsimon avatar Oct 16 '16 14:10 rsimon

Re token-exchange: Agreed. ;-)

Maybe one more thing to consider is providing a means of secure cross-origin XHR. So far my prototype downloads the CSV on the server side, which is OK for me, but it prevents a purely browser-based, lightweight solution that anyone could create, for example, using GitHub Pages only.

I planned on including a visualization as well, but that would be based on a graph representation, which is dead-easy to make downloadable for Gephi as well. So far I of course only focused on the co-occurrence calculation (on the browser side, may be a bit slow for huge documents).
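To illustrate the Gephi export, here is a small Python sketch that serialises a weighted edge mapping as a CSV edge list with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet import recognises. The function name and the input shape (a dict keyed by label pairs) are assumptions for this example and do not describe the prototype's actual code.

```python
import csv
import io


def edges_to_csv(edges):
    """Serialise {(source, target): weight} as a Gephi-style CSV edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


edge_csv = edges_to_csv({("Ithaca", "Troy"): 2, ("Ithaca", "Pylos"): 1})
```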

Well, you did click my link to the current prototype for the tool, right? :-)

efi avatar Oct 16 '16 14:10 efi

<Well, you did click my link to the current prototype for the tool, right? :-)>

No! Thought it was just meant as example, for the future & didn't expect things to work already :-D

That's really fantastic :-) I'll think about a good way of linking that in to the Recogito document pages. And I'll get the CSV escaping fixed next week, right after the IIIF stuff (which shouldn't take more than 2-3 days).

rsimon avatar Oct 16 '16 14:10 rsimon

Hi @efi, have you ever followed up on this further and/or are you interested in picking it up again? I’ve now been experimenting with an “API” for client-side plug-ins that could add extra JS components to the annotation stats page. I guess the graph view could be remodeled into one in theory?

rsimon avatar Mar 03 '18 07:03 rsimon

Hi @rsimon,

I remember, I did some further test implementations towards a connected map/network view but haven't really improved on that prototype back then.

Given your new "API" I think (for the average sized annotated text) it is totally fine to perform all calculations for such a visualisation on the client side.

It would certainly be fun to try out a few things, and in the beginning it should be fine to just play around in the page's JavaScript console. So unless you expect fast results, we should be able to work something out together. Do you have some form of documentation or starting point for me?

efi avatar Mar 07 '18 13:03 efi

Hi @efi,

no, it's mostly in the idea stage for the moment ;-) The plan is that Recogito would provide JSON endpoints for the plugin to get...

  • all annotations in this document (might be a lot of data, but should allow doing pretty much everything)
  • various stats, like unique tags, places/people/etc. (these could, of course, be computed from the annotations, too, so they would simply be convenience APIs)

And then provide a DIV for the plugin to use on the screen.
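As a sketch of why the stats endpoints would be a convenience only, the Python snippet below derives tag counts directly from a list of annotations. The JSON shape used here (a `bodies` list whose entries carry `type` and `value`) is an assumption for illustration and should be checked against the real annotation responses.

```python
from collections import Counter


def tag_counts(annotations):
    """Count tag values across annotations (assumed shape: bodies with type/value)."""
    return Counter(
        body["value"]
        for annotation in annotations
        for body in annotation.get("bodies", [])
        if body.get("type") == "TAG" and "value" in body
    )


# Hypothetical sample data, not a real Recogito response.
sample_annotations = [
    {"bodies": [{"type": "PLACE"}, {"type": "TAG", "value": "harbour"}]},
    {"bodies": [{"type": "TAG", "value": "harbour"}, {"type": "TAG", "value": "island"}]},
    {"bodies": []},
]
counts = tag_counts(sample_annotations)
```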

Documentation: we'd need to make that up as we go. At the moment, I can however give you samples of what the existing JSON responses for annotations and places look like:

(I think you may have used this for your demo anyway? Or did you go via the CSV download?)

The advantage over an approach that goes via an external application would be that the plugin operates inside the "private" workspace of a user, so it would work for non-public documents, without the plugin developer having to deal with authorization-related issues, OAuth, tokens or similar. The drawback, obviously, would be that 3rd party code... is operating inside the private workspace, so these plugins would have to be from trusted sources. But users wouldn't be able to install them themselves anyway. At most, they would be able to select them from a pool of available plugins installed by the Recogito admin.

Once plugin-specific server side processing is needed, we'd need to think about additional ways on how to facilitate that. But one step at a time...

rsimon avatar Mar 08 '18 08:03 rsimon

Hi @rsimon ,

I had to use the CSV file to get a notion of each annotation's position in the text (char-offset in "ANCHOR") to be able to calculate co-occurrence.

As of now, such an entry is missing from the annotations API endpoint.

It was also quite nice to get the coordinates merged directly into the annotation row, without the round trip of manually joining bodies>>uri and items>>is_conflation_of>>*uri*<<representative_point.

efi avatar Mar 09 '18 10:03 efi

The char offset is in the annotations API endpoint (anchor field in each annotation object). But, as you say, the coordinates are not merged into the annotations, but only available indirectly through the places endpoint.

Those, however, are things we can discuss adding as extra options for the plugin. (Either as a modification to the JSON API, or as client-side functionality provided through some sort of utility library made available by default to plugins.)
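A utility of that kind could look roughly like the Python sketch below, which joins annotation body URIs against the places response to attach coordinates. The field names (`bodies`, `uri`, `anchor`, `items`, `is_conflation_of`, `representative_point`) are the ones mentioned in this thread; the exact nesting, the `char-offset:21` anchor format, and the sample values are assumptions.

```python
def index_places(places):
    """Map each URI under is_conflation_of to its place's representative_point."""
    index = {}
    for item in places.get("items", []):
        point = item.get("representative_point")
        for record in item.get("is_conflation_of", []):
            index[record["uri"]] = point
    return index


def attach_coordinates(annotations, places):
    """Return one row per annotation body whose URI resolves to a known place."""
    index = index_places(places)
    rows = []
    for annotation in annotations:
        for body in annotation.get("bodies", []):
            uri = body.get("uri")
            if uri in index:
                rows.append({"anchor": annotation.get("anchor"), "uri": uri, "point": index[uri]})
    return rows


# Hypothetical sample data, not real Recogito responses.
places = {"items": [{"representative_point": [26.24, 39.96],
                     "is_conflation_of": [{"uri": "http://example.org/place/1"}]}]}
annotations = [
    {"anchor": "char-offset:21", "bodies": [{"uri": "http://example.org/place/1"}]},
    {"anchor": "char-offset:29", "bodies": [{"uri": "http://example.org/place/unknown"}]},
]
rows = attach_coordinates(annotations, places)
```

Building the URI index once keeps the join linear in the number of annotations, which matters if a plugin receives all annotations of a large document.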

rsimon avatar Mar 09 '18 10:03 rsimon

Wow. Should have seen the anchor field... OK, the weekend is near! ;-)

Yes, that can then be handled flexibly. I will see how far I get with what's already there.

efi avatar Mar 09 '18 10:03 efi