coralnet Workflow needed: Facilitate pooling more data into a single source or classifier

Open StephenChan opened this issue 3 years ago • 0 comments

@kriegman @beijbom This is my attempt to generalize item 1B from our recent email conversation and add some more ideas/thoughts. Let me know if I'm off the mark, though.

Pooling more data into a single classifier is useful for making more robust classifiers, perhaps even classifiers that work well across different reef regions. Pooling more data into a single source can be appealing from a project management standpoint.

Some functionality routes to consider:

1. Copy images from one source to another, also duplicating the underlying jpg/png files on S3. Optionally copy points and/or annotations as well.
   - We agreed that this is something we don't want to encourage at NOAA's data-volume levels.
2. Copy images from one source to another, creating new Image references in the database, but using the same S3 files to save S3 space and copying time. Optionally copy points and/or annotations as well.
   - Per the discussion in issue #149, I can't think of any technical problems that would make this impossible.
   - However, the Browse interfaces could get slow at 100,000s of images (this is not necessarily insurmountable, but would certainly take extra work to ensure the interface scales better than it currently does). And there may or may not be concerns with increasing the size of the site-wide Image and Annotation DB tables with many duplicate objects.
3. Parent/child source relationships, in which the child source syncs with the parent's images/points, and also inherits annotations, but the child can replace annotations as desired. A source can have one or more parents.
   - This does seem pretty powerful, but the inheritance semantics could get complex or confusing. If the parent deletes images/points, then what happens to the child's unique annotations for those images/points? Unless syncing of images/points has some exceptions built in - but after enough exceptions, at what point does the source cease to look like a child of the parent?
   - There could also be major implications for the source UI. It seems we would need to give the child source a workflow for annotating the parent's images, while disabling all the ways to modify the parent's images/points. However, the child should still be able to manage its own unique images. This could affect the UI for quite a few different pages.
   - Again, if the parent's images are included in the child's Browse pages, then the Browse interfaces could get slow at 100,000s of images.
   - Issue #174 is tangentially related, but has different motivations compared to this issue.
4. Parent/child source relationships, in which the child source's classifier is trained on the child's images/points/annotations and parent's images/points/annotations, but there is nothing else the child can do with the parent's data. A source can have one or more parents.
   - The parent source's images don't appear in the child source's Browse pages, for example. Perhaps the child source's members would have to visit the parent source itself to see the parent source's data in detail.
   - When the parent adds or modifies images/annotations, those changes automatically propagate to the child's classifier training.
   - Parent labelset must be the same or a subset of the child labelset. There could be complications here, such as what to do when a parent source adds a label to its labelset.
5. Rethink how classifiers are defined; allow creating a classifier which points to an arbitrary subset of sources, but does not necessarily belong to a particular source.
   - The classifier's effective labelset is the union of all the pointed sources' labelsets.
   - There needs to be a way to define an owner/admin (or multiple owners/admins) for such a classifier.
   - The classifier must be discoverable somehow.
   - There might be limits on how much data a classifier can point to, or preapproval/prerequisites for creating this kind of classifier, or something similar. Otherwise, someone could create a bunch of redundant classifiers which point to the 10 biggest sources on CoralNet to overload our training pipeline.
6. Define a concept of 'source networks' which accomplishes point 5 as well as logically grouping the sources themselves together. This might be similar to the concept of 'affiliations' that we thought up a long time ago (issue #66).
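
To make the two labelset rules above concrete (a parent's labelset must be the same as or a subset of the child's in the training-only parent/child route, and a source-independent classifier's effective labelset is the union of its sources' labelsets), here is a rough Python sketch. The `Source` class and both helper functions are hypothetical stand-ins for illustration, not CoralNet's actual models or API, and the label names are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a source; not the real CoralNet model."""
    name: str
    labelset: frozenset[str]
    parents: tuple[Source, ...] = ()


def invalid_parents(child: Source) -> list[str]:
    """Names of parents whose labelset is NOT a subset of the child's."""
    return [p.name for p in child.parents if not p.labelset <= child.labelset]


def effective_labelset(sources: list[Source]) -> frozenset[str]:
    """Union of labelsets for a classifier pointing to several sources."""
    return frozenset().union(*(s.labelset for s in sources))


# Made-up example data.
reef_a = Source("reef_a", frozenset({"Porites", "Sand"}))
reef_b = Source("reef_b", frozenset({"Porites", "Turf"}))
pooled = Source("pooled", frozenset({"Porites", "Sand"}), parents=(reef_a, reef_b))

print(invalid_parents(pooled))                       # ['reef_b']
print(sorted(effective_labelset([reef_a, reef_b])))  # ['Porites', 'Sand', 'Turf']
```

Here `reef_b` fails the subset check because "Turf" is not in the child's labelset. That is the same situation that would arise if a previously valid parent added a new label, so any real implementation would need to decide whether to block that change, ignore the extra label during training, or flag the child for review.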

Nov 03 '21 23:11 StephenChan
