
Release process (from #287, item 8)

pgrosu opened this issue on Jun 16, 2015 · 0 comments

I'm opening this issue since it has not been posted yet (after June 15). This is in regard to item 8 of #287. I posted some of these steps previously in https://github.com/ga4gh/schemas/issues/323#issuecomment-110227872, but I will generalize them below, which should make the release process a little more straightforward:

1 -> Defining Features and Goals: Start with the goals and the features to implement. (N.B.: These can be the next goals based on a previous version of the API.) We should also list the goals for how the API will be used and how those features fit in to provide the intended capabilities, such as:

The API should provide the ability to request ad-hoc aggregation of ReadGroups into a ReadGroupSet for on-the-fly alignments.
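
As a purely illustrative example of how such a goal could later be made concrete, the request payload for this kind of ad-hoc aggregation might look like the sketch below; none of these field names exist in the schemas and they are only an assumption:

# Purely hypothetical request payload for the goal above; none of these
# field names are defined in the schemas.
aggregate_request = {
    "readGroupIds": ["rg-1", "rg-2"],      # ReadGroups to aggregate ad hoc
    "readGroupSetName": "on-the-fly-set",  # name of the resulting ReadGroupSet
}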

2 -> Acceptance Criteria via BDD: Using those goals/features, document the acceptance criteria for each via behaviour-driven development (BDD). Here we should use a standard such as the Gherkin syntax, where we define the scenarios and what we would expect. Below is one specific example for searching for reads using a dataset id:

Feature: Retrieving Reads
  Scenario: Retrieving reads using a dataset id
    Given a dataset id
    When the reads for that dataset are requested
    Then a '200 OK' status is returned
    And the matching reads are returned
    And the response should include the dataset id
    And the response should include a list of reads
    And the response should be presented in JSON format using schema X (lines 12-19)
    And the dataset id should be listed first
    And the reads should be nested as in JSON schema X (lines 42-49)
    ...
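
To make such a scenario executable, each step can be bound to code. Below is a minimal sketch using the Python behave library (only a few of the steps are shown, and the search_reads helper is a placeholder I made up, not part of the schemas):

# steps/retrieve_reads_steps.py -- illustrative step bindings for the scenario
# above; search_reads is a placeholder, not a real client from the schemas.
import json

from behave import given, then, when


def search_reads(dataset_id):
    """Placeholder client call; a real implementation would hit the API."""
    return 200, json.dumps({"datasetId": dataset_id, "reads": []})


@given("a dataset id")
def step_given_dataset_id(context):
    context.dataset_id = "example-dataset-1"   # sample id for illustration


@when("the reads for that dataset are requested")
def step_when_reads_requested(context):
    context.status, context.body = search_reads(context.dataset_id)


@then("a '200 OK' status is returned")
def step_then_status_ok(context):
    assert context.status == 200


@then("the response should include a list of reads")
def step_then_reads_have_list(context):
    payload = json.loads(context.body)
    assert isinstance(payload.get("reads"), list)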

3 -> Information Flow Model and Underlying Data Models: Based on the above features, goals, and behaviours, we next define the Information Flow Models that would support those features. These can be in the form of UML diagrams, and through them we want to understand the flow of information and how the underlying supporting data models would be connected to each other.

Here the appropriate decisions between Thrift, Protobuf, Avro, etc. have to be made to match the schema requirements and other necessary definitions (i.e. algorithms, etc.) in order to fully define and document the intent of the APIs.
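
As a purely illustrative sketch of how the underlying data models might connect to one another (the class and field names below are assumptions for discussion, not the actual schema definitions):

# Illustrative sketch of how the data models might relate to each other;
# the class and field names are assumptions, not the actual GA4GH schemas.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReadGroup:
    id: str
    sample_id: str


@dataclass
class ReadGroupSet:
    id: str
    dataset_id: str                     # links the set back to its Dataset
    read_groups: List[ReadGroup] = field(default_factory=list)


@dataclass
class Dataset:
    id: str
    read_group_sets: List[ReadGroupSet] = field(default_factory=list)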

For each implemented feature, transform each BDD scenario into a test so that this becomes a test-driven development process. For each component of the schema, a test has to be provided with the expected results. These would be just like unit tests using actual data - which can be referenced - and should include any additional definitions/descriptions regarding the processing and the output. Below is an example of a test after defining a read in the schema:

[READ TEST 1]
TEST DESCRIPTION: This test will take a BAM file and produce the JSON format 
                  as defined in schema X lines 12-19.
INPUT:   BAM file 
           [LOCATION: http://.../some_file.bam]
PROGRAM: test_read2json.py -input INPUT -output formatted_reads.json 
           [LOCATION: http://.../test_read2json.py]
OUTPUT:  formatted_reads.json 
           [LOCATION: http://.../formatted_reads.json]
DOCUMENTATION: [LOCATION: http://.../test_read2json.html]
EXAMPLES:      [LOCATION: http://.../test_read2json.html]
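
A minimal sketch of what such a test_read2json.py could look like, assuming pysam is available and using a deliberately simplified set of output fields (a real version would have to follow the layout defined in schema X):

# test_read2json.py -- minimal sketch only; the output fields are a
# simplification and would need to follow the layout defined in schema X.
import argparse
import json

import pysam  # assumes pysam is installed for BAM access


def bam_to_json(bam_path, json_path):
    """Read alignments from a BAM file and write them out as JSON."""
    reads = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for rec in bam:
            if rec.is_unmapped:        # skip unmapped reads for simplicity
                continue
            reads.append({
                "name": rec.query_name,
                "sequence": rec.query_sequence,
                "referenceName": rec.reference_name,
                "position": rec.reference_start,
                "cigar": rec.cigarstring,
                "mappingQuality": rec.mapping_quality,
            })
    with open(json_path, "w") as out:
        json.dump({"reads": reads}, out, indent=2)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert BAM reads to JSON")
    parser.add_argument("-input", required=True, help="input BAM file")
    parser.add_argument("-output", required=True, help="output JSON file")
    args = parser.parse_args()
    bam_to_json(args.input, args.output)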

It is critical that any additions/enhancements to the schemas require at least one test to be added, and that all previous tests continue to pass - including any security-based tests.

4 -> Architecture and Workflow: Here it would be good to describe - in detail if possible - the components of the server architecture and of the client framework, including their interaction. This would allow implementors to understand the possibilities for storing and transmitting data for the above features/goals, step-wise behaviours, information flow model, and underlying data models. This should include information regarding the security and authentication scheme, including resource and scope access. Error handling would be explicitly defined here, as sketched below.
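
As an illustration only, an explicit error contract could be as simple as a mapping from error conditions to HTTP status codes plus a machine-readable error body; the codes and field names below are assumptions:

# Illustrative error contract only; the status codes and field names are
# assumptions to show what "explicitly defined error handling" might mean.
ERROR_STATUS = {
    "notFound": 404,        # unknown dataset/read group id
    "badRequest": 400,      # malformed or missing query parameters
    "unauthorized": 401,    # missing or invalid credentials
    "forbidden": 403,       # caller lacks the required resource/scope access
}


def error_body(code, message):
    """Build the machine-readable error payload returned to the client."""
    return {"error": {"code": code, "message": message}}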

5 -> Resource Models for the API: The next step is to specify the resource models:

Based on the above behaviours (BDD), we would need to define the required resource models, which are:

Root Resource

This would be the point of contact to which the clients would connect.

Discovery of Resources

These are the points from which resource discovery would take place, and through which the API capabilities would be accessed. Each of these API calls would request something to be processed (i.e. a query, an RPC, etc.).

Resources

These can be defined by the functionality they perform (i.e. search resource, item retrieval resource, collection group resource, etc.). A detailed description for each would need to be provided, including the inputs and outputs. These would be like the test contracts above. Here the requirements for content-type negotiation and versioning would be explicitly defined. A minimal sketch of these resource types is given below.
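
A minimal sketch of a root resource, a discovery resource, and a search resource, using Flask purely for illustration (the paths, payloads, and field names are assumptions, not the API definition):

# Illustrative only: the paths, payloads, and field names are assumptions,
# not the actual API definition.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/")                        # root resource: the client's point of contact
def root():
    return jsonify({"name": "example-api", "version": "0.7",
                    "discovery": "/capabilities"})


@app.route("/capabilities")            # discovery resource: lists API capabilities
def capabilities():
    return jsonify({"resources": ["/reads/search", "/datasets/<id>"]})


@app.route("/reads/search", methods=["POST"])   # search resource: processes a query
def search_reads():
    query = request.get_json()
    dataset_id = query.get("datasetId")
    # ... look up the reads for dataset_id; an empty result is shown here
    return jsonify({"datasetId": dataset_id, "reads": []})


if __name__ == "__main__":
    app.run()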

API Resource Style

With each API capability above, the following would need to be defined:

5.1) Contract Definition

a) What does the resource represent? b) How does it fit with the Information Model?

5.2) API Classification (borrowed from the Richardson Maturity Model)

What type of resource is it and how is it classified? Below are the four types of levels for API classification:

Level 0: RPC oriented
Level 1: Resource oriented
Level 2: HTTP verbs
Level 3: Hypermedia

We can mix these levels if we find such a mix most optimal, though it would have to be justified and documented, as for any API component. A small illustration of the difference between levels follows.
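
Purely as an illustration (the URLs and payloads are made up), the same read lookup at Level 0 versus Level 2 might look like this using the Python requests library:

# Made-up URLs and payloads, shown only to contrast the classification levels.
import requests

BASE = "http://example.org/api"   # placeholder host

# Level 0 (RPC oriented): one endpoint, the action lives in the payload.
requests.post(BASE + "/rpc", json={"method": "getReads", "datasetId": "d1"})

# Level 2 (resource oriented + HTTP verbs): the resource is in the URL and
# the HTTP verb carries the action.
requests.get(BASE + "/datasets/d1/reads")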

5.3) Implementation Details and Wire Format

These would be the detailed representation of how the resource would function and the type of wire format it would utilize. Here would be the details and definitions of how it interacts and communicates among the different components of the infrastructure, as well as within the information model. This would include how it would be transmitted with the specific media type being implemented (i.e. binary JSON, etc.) - including the associated frameworks/protocols (i.e. HTTP/2, etc.) - with reasoning on why it is most optimal. This should include information regarding why it was chosen, with implementation, throughput, and timing measurements provided subsequently (i.e. is the wire format self-describing by embedding metadata into the payload?). With each API - and its associated data model(s) - at least one test would need to be associated, with the required sample data as well as the acceptance criteria of the test. The API should be built with extensibility and evolvability in mind.
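
As a small illustration of the self-describing question raised above (the field names and binary layout are arbitrary), a JSON payload carries its field names on the wire, while a fixed binary layout depends on an external schema:

# Illustration of the self-describing trade-off: JSON embeds field names in
# every payload, while a fixed binary layout relies on an external schema.
import json
import struct

read = {"position": 10000, "mappingQuality": 60}

self_describing = json.dumps(read).encode("utf-8")   # field names travel on the wire
schema_dependent = struct.pack("<IB", read["position"], read["mappingQuality"])

print(len(self_describing), "bytes vs", len(schema_dependent), "bytes")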

6 -> Testability

This would include all the required testing criteria for unit testing and other types of tests across all the APIs. It is critical that any additions/enhancements to the schemas require at least one test to be added, and that all previous tests continue to pass - including any security-based tests.

7 -> Roadmap and release schedule

Next we map out steps 1 through 5 with specific dates for delivery, including code-freezes. For step 6 we map out a development phase, the testing phases, and then the release phase. Each component of the release should be tagged accordingly, such as 0.7-development-alpha, 0.7-development-beta, 0.7-test-integration, 0.7-development-smoke, 0.7-test-regression, 0.7-production. The unit tests are already associated, since every change or addition to the schema would require at least one test by the submitter, and all previous tests would need to pass - including any security-based tests.

The whole roadmap and code-freezes would need to be documented, and posted clearly on the main GitHub page (https://github.com/ga4gh and https://github.com/ga4gh/schemas), as well as on the GA4GH website (http://ga4gh.org) under an appropriate heading regarding the API status/release schedule.

This is just a starting point for us to work from, and by no means written in stone, so please feel free to contribute so that we can decide as a group.

Thanks, Paul

pgrosu · Jun 16 '15 13:06