ud-annotatrix

Github integration

Open yaskevich opened this issue 5 years ago • 10 comments

Supershort formulations of tasks:

  • add "fork from GitHub url" button to the /annotatrix page
  • add "fork from GitHub url" textbox to the / page
  • support for different url schemes (e.g. raw.githubusercontent.com, github.com, etc.)
  • button to submit a pull request
  • show diffs before finalizing the PR (could rely on GitHub's interface for this)

It seems that code to interact with GitHub is already in the app's code base. However, there is no information on how a user could start using this feature, and it is not clear from which step development should continue.

yaskevich avatar Jun 26 '19 21:06 yaskevich

The first step is to implement loading of a corpus from a GitHub url.

jonorthwash avatar Jun 27 '19 02:06 jonorthwash

Ok, I tested, it works. Maybe it would be reasonable to set a size limit for downloadable corpora, since I found some issues related to big files from GitHub (#377).

yaskevich avatar Jun 27 '19 17:06 yaskevich

No, we allow whatever the user's system won't crash on, and file a bug against notatrix regarding inefficient parsing / memory usage.

jonorthwash avatar Jun 28 '19 12:06 jonorthwash

> Ok, I tested, it works.

Which part? Was it already implemented??

jonorthwash avatar Jul 01 '19 14:07 jonorthwash

Uploading a corpus from Github link was implemented before, and it works.

As for big files, that is a notatrix problem (discussed in #377; the notatrix issue is here: https://github.com/keggsmurph21/notatrix/issues/6).

yaskevich avatar Jul 01 '19 20:07 yaskevich

> Uploading a corpus from Github link was implemented before, and it works.

That's great, but the remaining pieces of this issue still need to be implemented.

jonorthwash avatar Jul 03 '19 12:07 jonorthwash

I've managed to connect the app to GitHub. As far as I can see, there are some parts of the code related to GitHub, but they are more like a draft. When the user's credentials are put in the app config and the user clicks the login button, OAuth authentication is initiated, which makes two menu items appear: "Manage repositories" and "Manage permissions" (but they don't provide an interface to manage anything). The current goal is forking; the next is the PR. I separated the function that only loads content from the one that loads and forks.

[Screenshot: 2019-07-04 17_28_42-UD Annotatrix]

As I understand, a full workflow could be like this:

  • the user clicks "fork" in the app
  • a modal with a URL input field opens, and the user enters a link to a specific corpus
  • the name of the repo is extracted from the URL, and a request to fork the repo into the user's GitHub account is sent
  • the corpus is downloaded into Annotatrix, and the user is able to edit it
  • when the user saves the state, all changes are stored in the app
  • when the user presses the PR button:
    • the fork is downloaded
    • the changed corpus is copied to the fork
    • the changes are pushed to the fork
    • a PR to the original repo is prepared
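
The "name of the repo is extracted from the URL" step could look roughly like the sketch below. It is only an illustration, not code from the app: `parseGithubUrl` is a hypothetical helper, the example links are made up, and it assumes branch names without slashes and only the two link shapes shown in the comments.

```javascript
// Hypothetical helper, not taken from the ud-annotatrix code base.
// Splits a link to a file on GitHub into the parts needed to fork the repo
// and to fetch the file. Assumes the branch name contains no "/".
function parseGithubUrl(link) {
  let url;
  let parts;
  try {
    url = new URL(link);
    parts = url.pathname.split("/").filter(Boolean).map(decodeURIComponent);
  } catch (err) {
    return null; // not an absolute URL, or malformed percent-encoding
  }

  // https://raw.githubusercontent.com/OWNER/REPO/BRANCH/path/to/file
  if (url.hostname === "raw.githubusercontent.com" && parts.length >= 4) {
    const [owner, repo, branch, ...rest] = parts;
    return { owner, repo, branch, path: rest.join("/") };
  }

  // https://github.com/OWNER/REPO/blob/BRANCH/path/to/file (or .../raw/...)
  const isFileView = parts[2] === "blob" || parts[2] === "raw";
  if (url.hostname === "github.com" && parts.length >= 5 && isFileView) {
    const [owner, repo, , branch, ...rest] = parts;
    return { owner, repo, branch, path: rest.join("/") };
  }

  return null; // unsupported link shape
}

console.log(
  parseGithubUrl("https://github.com/example-user/example-corpus/blob/master/dev.conllu")
);
```

Returning `null` for anything unrecognized lets the modal show an error before any fork request is sent.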

yaskevich avatar Jul 04 '19 23:07 yaskevich

Subtasks of GitHub integration related to downloading a corpus, editing it, pushing it back, and preparing a PR (back end). Completed items are checked.

  • [x] Connect to Github https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Request permissions to access a repo https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Create DB table to store data of repo https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Download a file from a repo via... https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] NodeJS request (to be replaced with API)
    • [x] Content API (less than 1 MB)
    • [x] Database API (big files)
  • [x] Interact with DB (implementing a class for fork/repo metadata in DB) https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Fork a repo https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Push changes to user's fork https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Content API (less than 1 MB)
    • [x] Database API (big files)
  • [x] Prepare PR request to original repo https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Remove DB data of the repo, if corpus is deleted in the app https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
  • [x] Binding functions together (fork→commit→PR workflow)
    • [x] Add pre-check for fork (whether fork of this repo was already made by user) https://github.com/yaskevich/ud-annotatrix/commit/3be04a6361c8cb245f86d46e4d1e73d6d3f997e7
    • [x] Add pre-check for existing PR of this repo to original one (because only one active PR is allowed on Github) https://github.com/yaskevich/ud-annotatrix/commit/3be04a6361c8cb245f86d46e4d1e73d6d3f997e7
    • [x] Different requests to Github API (commit) depending on file size https://github.com/yaskevich/ud-annotatrix/commit/42ea3b69cb8dc184b1f85c9738b9866b7400c2b3 + https://github.com/yaskevich/ud-annotatrix/commit/d904e2d336e6fa2b1ab45e35693b2fe329452f99 (workaround for cases when user has empty Github profile, so API doesn't provide email and name)
    • [x] ~~Allow~~ Optimize code for loading corpus from a user's own repo (no fork) https://github.com/yaskevich/ud-annotatrix/commit/881772d7f056d42d981855f50c49c7551294103c
    • [x] Fix Treebank Settings page (currently, it is a stub) as #364
      • [x] Fix HTML markup https://github.com/yaskevich/ud-annotatrix/commit/46673957bf83f043ca78adfc07df5173cab21118
      • [x] Create database back-end for treebank properties https://github.com/yaskevich/ud-annotatrix/commit/46673957bf83f043ca78adfc07df5173cab21118
      • [x] Implement UI & server-side interaction when changing settings (AJAX) https://github.com/yaskevich/ud-annotatrix/commit/a65bbaa8aaf1220e09411a7cdfb1d514a40314ca https://github.com/yaskevich/ud-annotatrix/commit/03277cbf32ec16f956bbeb4571ff4d45478b55fd
      • [x] Bind with access checking (item below) https://github.com/yaskevich/ud-annotatrix/commit/e556debe274cb0245acf0eb9459f6b6e6ea40285
  • [x] Refactor/unify functions interacting with external API https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Migrate request → axios (actively maintained, supports async/await & promises interfaces)
  • [x] Adjust UI
    • [x] Remove dummy buttons https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Add buttons for fork-related functions https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Add indication of fork-commit-PR state of Github corpus https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Add modal box for user's commit message https://github.com/yaskevich/ud-annotatrix/commit/eb812fe8ae568c1d1151adcd10c2cab3fa55427d
    • [x] Add modal box for displaying commit result (clickable link) https://github.com/yaskevich/ud-annotatrix/commit/3be04a6361c8cb245f86d46e4d1e73d6d3f997e7
    • [x] Make UI for creating a PR (title, description, modification flag, draft flag) https://github.com/yaskevich/ud-annotatrix/commit/3be04a6361c8cb245f86d46e4d1e73d6d3f997e7
    • [x] Add modal box for displaying PR result (clickable link) https://github.com/yaskevich/ud-annotatrix/commit/3be04a6361c8cb245f86d46e4d1e73d6d3f997e7
    • [x] Add UI elements for Github login and Github file loading to main page https://github.com/yaskevich/ud-annotatrix/commit/5a253810c7d1e9e7be3017aed32d78212c8e229c
    • [x] Access check
      • [x] Display in a different manner user's and other's corpora (on main page) ~~Make clickable only links to user's Github corpora~~ https://github.com/yaskevich/ud-annotatrix/commit/bf75a54115ab5c1e3316b06758504df2ad4be943
      • [x] Allow GitHub actions on a corpus only ~~if it is user's corpus (on editor page)~~ if it is allowed by Treebank settings (see above, fix for #364) https://github.com/yaskevich/ud-annotatrix/commit/e556debe274cb0245acf0eb9459f6b6e6ea40285
    • [x] Disable PR button when user works in own original repository (not in other's fork) https://github.com/yaskevich/ud-annotatrix/commit/727f14a035a1b314d0e3d1b98fc8aecdb57fe909
  • [ ] Misc.
    • [x] Create Annotatrix Github account (to mark with it commits from the app, and for testing purposes) @annotatrix
    • [x] Write a section on Github integration in application readme (https://github.com/yaskevich/ud-annotatrix/commit/09a44add1cfde6331fe50a340f11ef9ef68c2d1f).
    • [x] Log to file (for debugging purposes) (https://github.com/yaskevich/ud-annotatrix/commit/e12534e55b73c0a1782414ad2d42a68db121c3f2) Further work is in #378
    • [x] Implement autologin into GitHub on app start (if the user was already logged in) https://github.com/yaskevich/ud-annotatrix/commit/0614de09e34c5551c95ac4d54ecdd98a2360302d + https://github.com/yaskevich/ud-annotatrix/commit/aa2a10b76346c8034e090b54f0c554f036b9d456 + https://github.com/yaskevich/ud-annotatrix/commit/8ab80bb22073cf0fdf6ee943bcd3efd194f3b099 (cookie expiration time moved to config)
  • [x] Bug fixes
    • [x] Wrong treebank ID is sent from the client code (https://github.com/yaskevich/ud-annotatrix/commit/dd97c2ae889d01767d324331162567645d9b5cd2)
    • [x] QuotaExceededError was not handled safely, which prevented the treebank from being saved on the server https://github.com/yaskevich/ud-annotatrix/commit/3d06173218df22e7cc688f696238d6182ab875e6
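
To illustrate the PR-related items in the list (the pre-check for an existing PR and the PR form with title, description, modification flag and draft flag), here is a minimal sketch of the two requests as plain objects. The function and argument names are hypothetical and not from the app; the paths and field names follow the GitHub REST API for pull requests as far as I know it.

```javascript
// Hypothetical helpers, not taken from the ud-annotatrix code base.
// They only build request descriptions; sending them (e.g. with axios) is left out.

// Pre-check: is there already an open PR from this fork branch to the original repo?
function existingPrQuery({ upstreamOwner, repo, forkOwner, branch }) {
  const params = new URLSearchParams({
    state: "open",
    head: `${forkOwner}:${branch}`, // "user:branch" form expected by the API
  });
  return { method: "GET", path: `/repos/${upstreamOwner}/${repo}/pulls?${params}` };
}

// The PR itself, built from the fields of the PR form.
function pullRequestPayload({
  upstreamOwner, repo, forkOwner, branch, base,
  title, description, allowMaintainerEdits = true, draft = false,
}) {
  return {
    method: "POST",
    path: `/repos/${upstreamOwner}/${repo}/pulls`,
    body: {
      title,
      body: description,
      head: `${forkOwner}:${branch}`, // where the changes live (the fork)
      base,                           // branch of the original repo to merge into
      maintainer_can_modify: allowMaintainerEdits,
      draft,
    },
  };
}

console.log(existingPrQuery({
  upstreamOwner: "upstream", repo: "corpus", forkOwner: "alice", branch: "master",
}));
```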

Note on the workflow description in my previous comment: it is not necessary to make a local copy of the repo to push the changes.
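
For a small file, pushing without any local repo comes down to one request to the Contents API. The sketch below only builds that request; `contentsUpdateRequest` and its argument names are hypothetical, and `currentSha` is assumed to be the blob SHA of the file version being replaced (obtained from an earlier read of the file).

```javascript
// Hypothetical helper, not taken from the ud-annotatrix code base.
// Builds a "create or update file contents" request for the GitHub REST API.
function contentsUpdateRequest({ owner, repo, branch, path, text, message, currentSha }) {
  // Encode each path segment separately so that "/" separators are kept.
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  return {
    method: "PUT",
    path: `/repos/${owner}/${repo}/contents/${encodedPath}`,
    body: {
      message,                                               // commit message
      content: Buffer.from(text, "utf8").toString("base64"), // file body, base64-encoded
      sha: currentSha,                                       // SHA of the file being replaced
      branch,
    },
  };
}

console.log(contentsUpdateRequest({
  owner: "alice", repo: "corpus", branch: "master",
  path: "data/my corpus.conllu", text: "hello",
  message: "Update corpus from Annotatrix", currentSha: "<sha of current file>",
}));
```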

yaskevich avatar Jul 08 '19 18:07 yaskevich

The previous comment is a tracker for the set of functions that interact with the GitHub API, for anyone who would like to follow the development of the project but hasn't read this thread before.

yaskevich avatar Jul 22 '19 22:07 yaskevich

Some thoughts on the challenges of working with files, after some experience with the GitHub API.

File size and GitHub limits

GitHub and its API are designed to deal with small files (less than 1 MB) rather than large ones.

Thus, it's easy to get a small file or push/commit it; it all goes smoothly via the Contents API. But it becomes trickier for files bigger than 1 MB. After I made the small-file case work, I had to switch to the Tree/Blob API for those: I read a directory, look for the file I need, get its metadata, and then fetch it as a blob and decode it.

This API can process files bigger than 1 MB, but the documentation notes: "This API supports blobs up to 100 megabytes in size." GitHub, however, is already unhappy when a user pushes something bigger than 50 MB.
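
A minimal sketch of the size-based choice and of decoding a fetched blob, under stated assumptions: the function names are hypothetical, the byte values for "1 MB" and "100 megabytes" are my guess at how the limits are counted, and the blob object is assumed to have the `content` and `encoding` fields that the Blob API returns.

```javascript
// Hypothetical helpers, not taken from the ud-annotatrix code base.
const CONTENTS_LIMIT = 1 * 1024 * 1024; // assumed byte value for "1 MB"
const BLOB_LIMIT = 100 * 1024 * 1024;   // assumed byte value for "100 megabytes"

// Decide which API to use for a file of the given size (from its metadata).
function pickDownloadApi(sizeInBytes) {
  if (sizeInBytes < CONTENTS_LIMIT) return "contents";
  if (sizeInBytes <= BLOB_LIMIT) return "blob";
  return "too-large"; // would need something else, e.g. LFS
}

// Turn a blob response into text. The base64 payload may contain line breaks,
// which Node's base64 decoder ignores.
function decodeBlob(blob) {
  if (blob.encoding !== "base64") return blob.content;
  return Buffer.from(blob.content, "base64").toString("utf8");
}

console.log(pickDownloadApi(5 * 1024 * 1024));
console.log(decodeBlob({ encoding: "base64", content: "aGVs\nbG8=\n" }));
```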

For testing I use corpora from the Universal Dependencies repositories, and I haven't come across any corpus larger than 100 MB. In general, it seems possible to handle bigger files with Git Large File Storage (I haven't had any experience with it yet). However, it would be an issue for Annotatrix.

As for updating a corpus from the app, the situation is similar to data loading (easy for small files and tricky for bigger ones). Pushing changes to GitHub from code is a sequence of queries to GitHub's low-level API. Technically, it is like assembling a commit and replacing a whole tree: not a single request, but 7 requests.
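
To give an idea of what such a sequence looks like, here is one plausible plan built on the Git database endpoints (refs, commits, blobs, trees). It is an assumption on my side and lists six steps, so it is not necessarily the same seven requests the app makes; `planCommitRequests` is a hypothetical name, and `{headSha}` is a placeholder filled in from the first response.

```javascript
// Hypothetical sketch, not taken from the ud-annotatrix code base.
// Lists the low-level requests for committing one changed file to a branch.
function planCommitRequests({ owner, repo, branch }) {
  const base = `/repos/${owner}/${repo}/git`;
  return [
    { step: "read the branch head", method: "GET", path: `${base}/ref/heads/${branch}` },
    { step: "read the head commit to get its tree", method: "GET", path: `${base}/commits/{headSha}` },
    { step: "upload the corpus as a blob", method: "POST", path: `${base}/blobs` },
    { step: "create a tree that points at the new blob", method: "POST", path: `${base}/trees` },
    { step: "create a commit for that tree", method: "POST", path: `${base}/commits` },
    { step: "move the branch to the new commit", method: "PATCH", path: `${base}/refs/heads/${branch}` },
  ];
}

for (const { step, method, path } of planCommitRequests({
  owner: "alice", repo: "corpus", branch: "master",
})) {
  console.log(`${method} ${path}  (${step})`);
}
```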

As for files bigger than 100 MB, the only way would be programmatic interaction with LFS, which is poorly documented, since using it through GitHub from code is a very rare case.

In general, it looks like the more common large files become among UD users, the less suitable GitHub is as a storage service. E.g., on limits for repos: "We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."

We have to keep those things in mind.

yaskevich avatar Aug 18 '19 16:08 yaskevich