Formalize sqibble

Open DominikRafacz opened this issue 3 years ago • 2 comments

The sqibble is an idea that has not been formalized yet, and the package could benefit from formalizing it in several ways.

We can define a sqibble as a tibble containing at least one column of type sq. Exactly one of the sq columns has the special role of being the "sequence" column. A sqibble also has an attribute column_roles, which is a named character vector with at least one element. That mandatory element is named sequence, and its value is the name of the "sequence" column (which will usually be "sequence" as well).
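To make the definition concrete, here is a minimal sketch of what a constructor could look like. The name new_sqibble is hypothetical, and a plain data.frame with character columns stands in for a tibble with sq columns so that the example runs in base R; the real thing would also have to check column types.

```r
# Hypothetical constructor (not part of tidysq). A data.frame with character
# columns stands in for a tibble with sq columns.
new_sqibble <- function(data, sequence = "sequence") {
  # The column named as the "sequence" column must actually exist.
  stopifnot(is.data.frame(data), sequence %in% names(data))
  # column_roles maps role name -> actual column name.
  attr(data, "column_roles") <- c(sequence = sequence)
  data
}

sqb <- new_sqibble(
  data.frame(sq = c("ACGT", "TTGA"), id = c("s1", "s2"),
             stringsAsFactors = FALSE),
  sequence = "sq"
)
attr(sqb, "column_roles")
```

Here the "sequence" role points at a column that is actually called sq, which is the situation read_fasta produces today.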

Other columns in the sqibble can also have roles specified. In that case, the column_roles attribute maps a column's role (the role name is determined by the functions that use and generate the column) to its actual name (which can change). Another frequently used role will probably be "name": a column that holds the name of each sequence.
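Because actual column names can change while roles stay fixed, anything that renames a column would have to update column_roles too. A rough sketch of that bookkeeping, assuming a hypothetical helper rename_sqibble_column and again using a plain data.frame as a stand-in:

```r
# Hypothetical helper: rename a column and keep column_roles in sync.
rename_sqibble_column <- function(x, old, new) {
  stopifnot(old %in% names(x))
  roles <- attr(x, "column_roles")
  names(x)[names(x) == old] <- new
  # Any role that pointed at the old name now points at the new one.
  roles[roles == old] <- new
  attr(x, "column_roles") <- roles
  x
}

sqb <- data.frame(sq = c("ACGT", "TTGA"), id = c("s1", "s2"),
                  stringsAsFactors = FALSE)
attr(sqb, "column_roles") <- c(sequence = "sq", name = "id")

renamed <- rename_sqibble_column(sqb, "sq", "sequence")
attr(renamed, "column_roles")
```

The role names (sequence, name) are untouched; only the values they map to follow the rename.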

By specifying roles in this way, we will be able to create a function (working title: extract_role_column) that extracts the column with a given role from a sqibble. If no such column is available, the function will either emit a warning and return a column of NA, or raise an error. The user will be able to specify the security level, as with other functions.
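A possible shape for extract_role_column is sketched below. The on_missing argument and its two values are assumptions made for illustration; how this would tie into the package's existing security-level handling is left open.

```r
# Sketch of the proposed extract_role_column; the on_missing argument is an
# assumption, not an existing tidysq parameter.
extract_role_column <- function(x, role, on_missing = c("warning", "error")) {
  on_missing <- match.arg(on_missing)
  roles <- attr(x, "column_roles")
  # The role is missing if it is not registered or its column is gone.
  if (!role %in% names(roles) || !roles[[role]] %in% names(x)) {
    msg <- sprintf("no column with role '%s'", role)
    if (on_missing == "error") stop(msg, call. = FALSE)
    warning(msg, call. = FALSE)
    return(rep(NA, nrow(x)))
  }
  x[[roles[[role]]]]
}

sqb <- data.frame(sq = c("ACGT", "TTGA"), stringsAsFactors = FALSE)
attr(sqb, "column_roles") <- c(sequence = "sq")

seqs <- extract_role_column(sqb, "sequence")
# No "name" role is registered, so this warns and returns NA values.
nms <- suppressWarnings(extract_role_column(sqb, "name"))
```

With on_missing = "error" the second call would stop instead of returning NA.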

Why do we need such a formalization? It will allow us to write functions that operate on sqibble objects instead of functions that take several vectors, one of which is a sequence vector. An example of such a function is write_fasta, which currently takes two vectors: x and name. With a formalization like the one described above, the function could instead take a single parameter, a sqibble. The requirement would be that the sqibble has columns with the roles "sequence" (which, recall, is a general requirement on sqibbles) and "name". A call to

write_fasta(some_sqibble)

will then be equivalent to a call to

write_fasta(x = some_sqibble %>% extract_role_column("sequence"), name = some_sqibble %>% extract_role_column("name"))

which currently, with unformalized sqibbles, looks like this:

write_fasta(x = some_sqibble %>% pull("whatever-name-sequence-column-has-i-have-no-freaking-idea"), name = some_sqibble %>% pull("whatever-name-name-has"))

This would make the package easier to use and would also be more convenient for potential developers.
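To illustrate the equivalence described above, here is a rough sketch of a sqibble-based entry point that resolves the two roles and delegates to a vector-based writer. Both function names are hypothetical, and write_fasta_vectors is only a simplified stand-in, not the actual tidysq write_fasta.

```r
# Simplified stand-in for a vector-based writer; NOT tidysq::write_fasta.
write_fasta_vectors <- function(x, name, file) {
  # Interleave ">name" header lines with sequence lines.
  writeLines(as.vector(rbind(paste0(">", name), x)), file)
}

# Hypothetical sqibble-based entry point: resolve roles, then delegate.
write_fasta_sqibble <- function(sqibble, file) {
  roles <- attr(sqibble, "column_roles")
  stopifnot(all(c("sequence", "name") %in% names(roles)))
  write_fasta_vectors(
    x = sqibble[[roles[["sequence"]]]],
    name = sqibble[[roles[["name"]]]],
    file = file
  )
}

sqb <- data.frame(sq = c("ACGT", "TTGA"), id = c("s1", "s2"),
                  stringsAsFactors = FALSE)
attr(sqb, "column_roles") <- c(sequence = "sq", name = "id")

out <- tempfile(fileext = ".fasta")
write_fasta_sqibble(sqb, out)
readLines(out)
```

The caller never needs to know that the columns are actually called sq and id.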

DominikRafacz, Feb 28 '21 01:02

By the way, read_fasta currently returns a sqibble whose sequence column is named sq, which can sometimes be problematic or confusing. Perhaps it would be better to use the name "sequence"?

DominikRafacz, Feb 28 '21 01:02

By the way, read_fasta currently returns a sqibble whose sequence column is named sq, which can sometimes be problematic or confusing. Perhaps it would be better to use the name "sequence"?

We should use "name" or "id" and "sequence". Thoughts on the naming convention @leonjessen?

michbur, Mar 08 '21 11:03