
Scramble non-open access finder datasets to avoid copyright issues

Open · cboulanger opened this issue on Oct 24, 2022 · 4 comments

This is not an issue with AnyStyle itself but concerns the availability of more specialized training material. I'm posting it here anyway because it affects my AnyStyle-based workflow.

I have a lot of finder annotations that I would be willing to share, but I cannot, because the source material is copyrighted. One can of course always distribute the model itself, but the model is tied to the version of the engine and cannot be mixed and matched the way source annotations can. I wonder whether this is a more general problem that could be solved by a training data format for the finder that preserves the information which goes into the model, but stores it in a way that does not allow the source to be reverse-engineered (or at least makes doing so not worth the effort).
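For illustration, a rough Ruby sketch of the kind of encoding I have in mind (all names hypothetical): replace each word with a deterministic surrogate that keeps the surface features a CRF typically looks at (capitalization, token length, presence of digits or punctuation) while discarding the actual words. Dictionary-based features would of course be lost on surrogates, so this is only a starting point:

```ruby
require 'digest'

# Replace an alphabetic token with a same-length surrogate that preserves
# initial capitalization; tokens without letters (years, punctuation) pass
# through unchanged.
def scramble(token)
  return token unless token.match?(/[[:alpha:]]/)
  surrogate = Digest::SHA256.hexdigest(token)[0, token.length].tr('0-9', 'ghijklmnop')
  token.match?(/\A[[:upper:]]/) ? surrogate.capitalize : surrogate
end

line = 'Foucault, Michel. Discipline and Punish. 1975.'
puts line.split(/\s+/).map { |t| scramble(t) }.join(' ')
```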

cboulanger · Oct 24, 2022

Yes, we have the same issue with the default finder model, which is trained on a more comprehensive set than res/finder. The university in question gave us permission to publish the model, but we cannot share the copyrighted sources.

I doubt there is a good solution for packaging the sources as proposed, because they would have to be decrypted at some point before training. In fact, publishing the model is not such a bad solution, I think. The main issue is that incrementally training the model does not currently work. In theory Wapiti should support this, but I have always run into issues; as far as I remember, I never figured out whether that is in fact a bug or whether incrementally training a compacted model is simply not possible.
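For reference, the incremental run I mean looks roughly like this, invoked from Ruby (flags per the Wapiti manual as I remember them; whether this works with a model saved with -c/--compact is exactly the part I never resolved):

```ruby
# Preload the existing finder model with -m and continue training on new
# data; failure with compacted models is the open question mentioned above.
system('wapiti', 'train',
       '-m', 'finder.mod',      # existing model to continue from
       'new-annotations.txt',   # additional training sequences
       'finder-updated.mod')    # output model
```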

inukshuk · Oct 25, 2022

openalex.org, for example, publishes abstracts as an inverted index "due to legal constraints". That could of course be reverse-engineered. On the other hand, combining it with skipping low-entropy lines (#199) would produce text that could not be reconstructed in any useful way and would probably avoid copyright issues.
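To make the reverse-engineering point concrete, here is a small Ruby sketch of an OpenAlex-style inverted index (word to list of positions); inverting it back to the original text is trivial, which is why the index alone offers little protection:

```ruby
text  = 'to be or not to be'
index = Hash.new { |h, k| h[k] = [] }
text.split.each_with_index { |word, pos| index[word] << pos }
# => {"to"=>[0, 4], "be"=>[1, 5], "or"=>[2], "not"=>[3]}

# Reconstruction: expand to (position, word) pairs and sort by position.
pairs = index.flat_map { |word, positions| positions.map { |pos| [pos, word] } }
puts pairs.sort_by(&:first).map(&:last).join(' ')
# => "to be or not to be"
```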

cboulanger · Oct 25, 2022

Sorry, I'm not sure I follow. Publishing an inverted index is very similar to publishing the compacted finder model, no? But that does not address the specific issue that, when training the model yourself from scratch, you need access to the source text. The finder model requires, for each full text (i.e., for each sequence), every line of text (i.e., the tokens) in its original order; note that dropping low-entropy "lines" in the finder context would mean dropping entire books, not individual lines. So if you wanted to protect the content during training, I think you would need at a minimum signed binaries and DRM technology, which is not a direction I'd envision this open-source project taking.
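For context, finder training data tags every line of the document in order, which is why the full text cannot simply be elided; from memory, the .ttx files look roughly like this (labels and spacing approximate):

```
meta  | Journal of Example Studies 12 (2021)
title | On the Example
text  | The first line of the body ...
text  | ... and every following line, in order,
ref   | Doe, J. (2020). A cited work. Example Press.
```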

inukshuk · Oct 25, 2022

Ah, you're right, I forgot that a finder sequence is the entire document, so #199 only makes sense for parser sequences. So that won't work. DRM technology is not what I have in mind, though, just a form of encoding that would escape copyright while still allowing the model to be trained.

According to what people have told me, bibliographies and footnotes count as "facts" and are not copyrightable, so they could be published. That leaves the main body of text. I wonder what a pre-publishing step would look like that alters the text in ways that make it non-reversible while still containing the training information. Word order probably matters, so shuffling the words would probably degrade the model's quality considerably. But one could segment the body into sentences and shuffle the sentences. I wonder whether copyright would cover individual sentences that are out of order.
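A minimal Ruby sketch of that pre-publishing step (naive regex segmentation and hypothetical file names, just to illustrate the idea):

```ruby
# Segment the body into sentences on terminal punctuation, then shuffle the
# sentences while keeping each one intact, so word order within a sentence
# (and thus most local context) is preserved.
body      = File.read('body.txt')
sentences = body.split(/(?<=[.!?])\s+/)
File.write('body-scrambled.txt', sentences.shuffle.join("\n"))
```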

cboulanger · Oct 25, 2022