tidysq
Formalize sqibble
`sqibble` is not yet a formalized idea, and the package could benefit from formalizing it in several ways.
We can define a `sqibble` as a `tibble` containing at least one column of type `sq`. Exactly one of the `sq` columns has the special role of being the "sequence" column. A `sqibble` also has an attribute `column_roles`, a named character vector with at least one element. That element has the name `sequence` and a value equal to the name of the "sequence" column (which is usually "sequence").
Other columns in the `sqibble` can also have roles specified. The mapping between a column's role (the role name is determined by the functions that use and generate the column) and its actual name (which can change) is stored in the `column_roles` attribute. Another frequently used role will probably be "name", a column that holds the name of each sequence.
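To make the definition concrete, here is a minimal sketch of how such an object could be built and validated. The helper name `as_sqibble` and its checks are my assumptions for illustration, not existing tidysq API; only `tibble()` and `sq()` are real functions.

```r
library(tibble)
library(tidysq)

# Hypothetical helper (not part of tidysq): validate a tibble against the
# proposed definition and attach the column_roles attribute.
as_sqibble <- function(x, column_roles = c(sequence = "sequence")) {
  stopifnot(
    is_tibble(x),
    is.character(column_roles),
    # the "sequence" role is mandatory
    "sequence" %in% names(column_roles),
    # every role must point at an existing column
    all(column_roles %in% names(x)),
    # the "sequence" column must be of type sq
    inherits(x[[column_roles[["sequence"]]]], "sq")
  )
  attr(x, "column_roles") <- column_roles
  x
}

fasta_like <- tibble(
  sq = sq(c("ACGT", "TTGACA")),
  name = c("seq_1", "seq_2")
)

# Roles map to actual column names, which may differ from the role names.
some_sqibble <- as_sqibble(fasta_like, c(sequence = "sq", name = "name"))
attr(some_sqibble, "column_roles")
```

Keeping the roles in a plain attribute means the columns can be renamed freely, as long as `column_roles` is updated alongside.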
Specifying roles in this way lets us create a function (working title: `extract_role_column`) that extracts the column with a given role from a `sqibble`. If no such column is available, the function either returns a column of NA with a warning or raises an error; the user will be able to specify the security level (as with other functions).
Why do we need such a formalization? It will allow us to write functions that operate on such objects instead of functions that take several vectors, one of which is a sequence vector. An example of such a function is currently `write_fasta`, which takes two vectors: `x` and `name`. With a formalization like the one described above, the function could instead take a single parameter, a `sqibble`. The requirement would be for the `sqibble` to have columns with the roles "sequence" (which, recall, is a general requirement on a `sqibble`) and "name". A call to
write_fasta(some_sqibble)
will then be equivalent to a call to
write_fasta(x = some_sqibble %>% extract_role_column("sequence"), name = some_sqibble %>% extract_role_column("name"))
which currently, with unformalized sqibbles, looks like this:
write_fasta(x = some_sqibble %>% pull("whatever-name-sequence-column-has-i-have-no-freaking-idea"), name = some_sqibble %>% pull("whatever-name-name-has"))
This would make the package easier to use and would be a further convenience for potential developers.
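For illustration, a role-aware wrapper around the current function could look roughly like the sketch below. The name `write_fasta_sqibble` is hypothetical, and the sketch assumes `write_fasta` keeps its current `x`, `name` and `file` arguments.

```r
library(tibble)
library(tidysq)

# Hypothetical wrapper (not part of tidysq): resolve the "sequence" and "name"
# roles, then delegate to the existing two-vector write_fasta().
write_fasta_sqibble <- function(x, file, ...) {
  roles <- attr(x, "column_roles")
  stopifnot(all(c("sequence", "name") %in% names(roles)))
  write_fasta(
    x = x[[roles[["sequence"]]]],
    name = x[[roles[["name"]]]],
    file = file,
    ...
  )
}

some_sqibble <- tibble(
  sq = sq(c("ACGT", "TTGACA")),
  name = c("seq_1", "seq_2")
)
attr(some_sqibble, "column_roles") <- c(sequence = "sq", name = "name")

out <- tempfile(fileext = ".fasta")
write_fasta_sqibble(some_sqibble, out)
readLines(out)
```

The caller never has to know what the sequence and name columns are actually called; only the roles matter.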
By the way, `read_fasta` currently returns a `sqibble` with the name `sq` for the sequence column, which can sometimes be problematic or confusing. Perhaps it would be better to use the name "sequence"?
We should use "name" or "id" and "sequence". Thoughts on the naming convention @leonjessen?