[DISCUSSION] Splitting utils.py into multiple modules
This has been bothering me for a long time, so this morning I decided to take a stab at this.
The utils.py has grown to over 150 functions over the years and close to 5,000 loc :/ Having everything in one file obviously is a bad idea not only for performance but also maintainability of the codebase.
After 3 hours of work, this is a dictionary that I plan to use to carry function names to new modules, where a given function name (represented by the keys of the dict below) in anvio/utils.py, will become a function in a new module (represented by he values in the dict).
For instance, according to this, get_total_memory_usage function in anvio/utils.py would be a function in anvio/utils/system.py and so on.
My question is, does the function names to categories below make sense? Is it done well, or do you think we can refine it more? For instance, I hate anviohelp and misc categories. The first one is a bad name, the second is a useless category. Does anyone have any suggestions regarding those functions or should we go with this?
Once this is finalized I will use this dictionary to generate these new modules, and hopefully to write a migration script that can go through all files in the codebase to update import statements (I have no idea how to do that at this point, but I will document things here of course).
FUNCTION_TO_MODULE = {
'get_total_memory_usage': 'anvio.utils.system',
'display_top_memory_usage': 'anvio.utils.system',
'get_available_program_names_in_active_environment': 'anvio.utils.system',
'is_program_exists': 'anvio.utils.system',
'check_h5py_module': 'anvio.utils.system',
'rev_comp': 'anvio.utils.sequences',
'rev_comp_gene_calls_dict': 'anvio.utils.sequences',
'get_GC_content_for_sequence': 'anvio.utils.sequences',
'get_synonymous_and_non_synonymous_potential': 'anvio.utils.sequences',
'get_list_of_AAs_for_gene_call': 'anvio.utils.sequences',
'get_list_of_codons_for_gene_call': 'anvio.utils.sequences',
'get_translated_sequence_for_gene_call': 'anvio.utils.sequences',
'translate': 'anvio.utils.sequences',
'get_most_likely_translation_frame': 'anvio.utils.sequences',
'get_codon_order_to_nt_positions_dict': 'anvio.utils.sequences',
'nt_seq_to_nt_num_array': 'anvio.utils.sequences',
'nt_seq_to_RC_nt_num_array': 'anvio.utils.sequences',
'nt_seq_to_codon_num_array': 'anvio.utils.sequences',
'nt_seq_to_RC_codon_num_array': 'anvio.utils.sequences',
'_nt_seq_to_codon_num_array': 'anvio.utils.sequences',
'is_amino_acid_functionally_conserved': 'anvio.utils.sequences',
'convert_sequence_indexing': 'anvio.utils.sequences',
'summarize_alignment': 'anvio.utils.sequences',
'restore_alignment': 'anvio.utils.sequences',
'serialize_args': 'anvio.utils.commandline',
'format_cmdline': 'anvio.utils.commandline',
'run_command': 'anvio.utils.commandline',
'start_command': 'anvio.utils.commandline',
'run_command_STDIN': 'anvio.utils.commandline',
'get_command_output_from_shell': 'anvio.utils.commandline',
'get_cmd_line': 'anvio.utils.commandline',
'get_gene_caller_ids_from_args': 'anvio.utils.commandline',
'gzip_compress_file': 'anvio.utils.files',
'gzip_decompress_file': 'anvio.utils.files',
'tar_extract_file': 'anvio.utils.files',
'store_array_as_TAB_delimited_file': 'anvio.utils.files',
'store_dataframe_as_TAB_delimited_file': 'anvio.utils.files',
'store_dict_as_TAB_delimited_file': 'anvio.utils.files',
'get_column_data_from_TAB_delim_file': 'anvio.utils.files',
'get_columns_of_TAB_delim_file': 'anvio.utils.files',
'is_all_columns_present_in_TAB_delim_file': 'anvio.utils.files',
'transpose_tab_delimited_file': 'anvio.utils.files',
'get_vectors_from_TAB_delim_matrix': 'anvio.utils.files',
'get_TAB_delimited_file_as_dictionary': 'anvio.utils.files',
'get_BLAST_tabular_output_as_dict': 'anvio.utils.files',
'get_samples_txt_file_as_dict': 'anvio.utils.files',
'get_primers_txt_file_as_dict': 'anvio.utils.files',
'get_groups_txt_file_as_dict': 'anvio.utils.files',
'get_yaml_as_dict': 'anvio.utils.files',
'get_chunk': 'anvio.utils.files',
'concatenate_files': 'anvio.utils.files',
'get_file_md5': 'anvio.utils.files',
'split_fasta': 'anvio.utils.fasta',
'get_num_sequences_in_fasta': 'anvio.utils.fasta',
'get_all_ids_from_fasta': 'anvio.utils.fasta',
'check_fasta_id_formatting': 'anvio.utils.fasta',
'check_fasta_id_uniqueness': 'anvio.utils.fasta',
'get_read_lengths_from_fasta': 'anvio.utils.fasta',
'remove_sequences_with_only_gaps_from_fasta': 'anvio.utils.fasta',
'get_FASTA_file_as_dictionary': 'anvio.utils.fasta',
'get_GC_content_for_FASTA_entries': 'anvio.utils.fasta',
'unique_FASTA_file': 'anvio.utils.fasta',
'store_dict_as_FASTA_file': 'anvio.utils.fasta',
'export_sequences_from_contigs_db': 'anvio.utils.fasta',
'create_fasta_dir_from_sequence_sources': 'anvio.utils.fasta',
'get_db_type_and_variant': 'anvio.utils.database',
'get_db_type': 'anvio.utils.database',
'get_db_variant': 'anvio.utils.database',
'get_required_version_for_db': 'anvio.utils.database',
'get_all_sample_names_from_the_database': 'anvio.utils.database',
'get_all_item_names_from_the_database': 'anvio.utils.database',
'get_bams_and_profiles_txt_as_data': 'anvio.utils.database',
'get_genes_database_path_for_bin': 'anvio.utils.database',
'is_contigs_db': 'anvio.dbinfo',
'is_trnaseq_db': 'anvio.dbinfo',
'is_pan_or_profile_db': 'anvio.dbinfo',
'is_profile_db': 'anvio.dbinfo',
'is_structure_db': 'anvio.dbinfo',
'is_blank_profile': 'anvio.dbinfo',
'is_pan_db': 'anvio.dbinfo',
'is_genome_storage': 'anvio.dbinfo',
'is_genes_db': 'anvio.dbinfo',
'is_kegg_modules_db': 'anvio.dbinfo',
'is_profile_db_merged': 'anvio.dbinfo',
'is_profile_db_and_contigs_db_compatible': 'anvio.dbinfo',
'is_structure_db_and_contigs_db_compatible': 'anvio.dbinfo',
'is_pan_db_and_genomes_storage_db_compatible': 'anvio.dbinfo',
'check_sample_id': 'anvio.utils.validation',
'check_collection_name': 'anvio.utils.validation',
'is_this_name_OK_for_database': 'anvio.utils.validation',
'check_contig_names': 'anvio.utils.validation',
'check_misc_data_keys_for_format': 'anvio.utils.validation',
'is_gene_caller_id': 'anvio.utils.validation',
'is_gene_sequence_clean': 'anvio.utils.validation',
'is_ascii_only': 'anvio.utils.validation',
'get_port_num': 'anvio.utils.network',
'get_next_available_port_num': 'anvio.utils.network',
'is_port_in_use': 'anvio.utils.network',
'download_file': 'anvio.utils.network',
'get_remote_file_content': 'anvio.utils.network',
'get_anvio_news': 'anvio.utils.network',
'download_protein_structure': 'anvio.utils.network',
'open_url_in_browser': 'anvio.utils.network',
'get_predicted_type_of_items_in_a_dict': 'anvio.utils.misc',
'HTMLColorToRGB': 'anvio.utils.misc',
'get_random_colors_dict': 'anvio.utils.misc',
'get_ordinal_from_integer': 'anvio.utils.misc',
'get_f_string_evaluated_by_dict': 'anvio.utils.misc',
'get_time_to_date': 'anvio.utils.misc',
'get_hash_for_list': 'anvio.utils.misc',
'get_filtered_dict': 'anvio.utils.misc',
'convert_numpy_array_to_binary_blob': 'anvio.utils.algorithms',
'convert_binary_blob_to_numpy_array': 'anvio.utils.algorithms',
'add_to_2D_numeric_array': 'anvio.utils.algorithms',
'apply_and_concat': 'anvio.utils.algorithms',
'get_indices_for_outlier_values': 'anvio.utils.algorithms',
'get_list_of_outliers': 'anvio.utils.algorithms',
'get_stretches_for_numbers_list': 'anvio.utils.algorithms',
'merge_stretches': 'anvio.utils.algorithms',
'get_constant_value_blocks': 'anvio.utils.algorithms',
'find_value_index': 'anvio.utils.algorithms',
'get_N50': 'anvio.utils.algorithms',
'split_by_delim_not_within_parens': 'anvio.utils.algorithms',
'get_split_start_stops': 'anvio.utils.algorithms',
'get_split_start_stops_with_gene_calls': 'anvio.utils.algorithms',
'get_split_start_stops_without_gene_calls': 'anvio.utils.algorithms',
'human_readable_file_size': 'anvio.utils.algorithms',
'anvio_hmm_target_term_to_alphabet_and_context': 'anvio.utils.hmm',
'get_pruned_HMM_hits_dict': 'anvio.utils.hmm',
'get_HMM_sources_dictionary': 'anvio.utils.hmm',
'get_attribute_from_hmm_file': 'anvio.utils.hmm',
'sanity_check_hmm_model': 'anvio.utils.hmm',
'sanity_check_pfam_accessions': 'anvio.utils.hmm',
'run_functional_enrichment_stats': 'anvio.utils.statistics',
'get_enriched_groups': 'anvio.utils.statistics',
'get_required_packages_for_enrichment_test': 'anvio.utils.statistics',
'check_R_packages_are_installed': 'anvio.utils.statistics',
'get_values_of_gene_level_coverage_stats_as_dict': 'anvio.utils.statistics',
'get_names_order_from_newick_tree': 'anvio.utils.phylogenetics',
'gen_NEXUS_format_partition_file_for_phylogenomics': 'anvio.utils.phylogenetics',
'ununique_BLAST_tabular_output': 'anvio.utils.anviohelp',
'convert_SSM_to_single_accession': 'anvio.utils.anviohelp',
'get_default_gene_caller': 'anvio.utils.anviohelp',
'get_split_and_contig_names_of_interest': 'anvio.utils.anviohelp',
'get_contigs_splits_dict': 'anvio.utils.anviohelp',
'get_variabile_item_frequencies': 'anvio.utils.anviohelp',
'get_consensus_and_departure_data': 'anvio.utils.anviohelp',
'get_bin_name_from_item_name': 'anvio.utils.anviohelp',
'get_contig_name_to_splits_dict': 'anvio.utils.anviohelp',
'get_variability_table_engine_type': 'anvio.utils.anviohelp',
'CoverageStats': 'anvio.utils.anviohelp',
'gen_gexf_network_file': 'anvio.utils.visualization',
'run_selenium_and_export_svg': 'anvio.utils.visualization',
'compare_times': 'anvio.utils.debug',
'is_all_npm_packages_installed': 'anvio.utils.debug',
}
(latest version in #2465 may be different from this)
For instance export_sequences_from_contigs_db can go into seqeunces and store_dict_as_FASTA_file can go into files. It is really difficult to do this in a way that one is certain that all choices are the best one could make.
This will require over 2,000 changes in the codebase o_O
But I still think it is worth it, and will create a branch to work on this.
Does anyone have any suggestions regarding those functions or should we go with this?
Yep :) Here are my thoughts.
Instead of anviohelp, I think we could eliminate that category entirely by moving its functions to the other modules according to where they most logically fit.
For instance, the following functions that directly access an anvi'o database could go in the new database module:
'get_default_gene_caller': 'anviohelp',
'get_split_and_contig_names_of_interest': 'anviohelp',
'get_bin_name_from_item_name': 'anviohelp',
The following functions work with information from databases indirectly (they don't access the database themselves, but they work with entries/elements from db tables):
'get_contigs_splits_dict': 'anviohelp',
'get_variabile_item_frequencies': 'anviohelp',
'get_consensus_and_departure_data': 'anviohelp',
'get_contig_name_to_splits_dict': 'anviohelp',
These could perhaps also go in database, or perhaps we could put them in an existing module like tableops.py (as static functions rather than in the Table class) or tables/__init__.py (but this one seems a bit too general-purpose for them to fit well here).
Two other functions currently classified under anviohelp are all working with files in some way. So perhaps they should go in the new files module instead.
'ununique_BLAST_tabular_output': 'anviohelp', ## this one is missing a function defline BTW
'get_variability_table_engine_type': 'anviohelp',
For ununique_BLAST_tabular_output, putting it in files would make sense since its related function get_BLAST_tabular_output_as_dict will go in files, too.
That leaves just one loner function in anviohelp, convert_SSM_to_single_accession. It works with a matrix input and specifically relates to anvio/data/SSMs so perhaps we should just put it in anvio/data/SSMs/__init__.py.
The misc functions are a bit more diverse and tricky to categorize, but perhaps we could do a similar thing. For instance, the following functions seem/are generically useful and could go into a new module called utilities:
get_filtered_dictget_hash_for_listget_predicted_type_of_items_in_a_dictget_time_to_dateget_f_string_evaluated_by_dictget_ordinal_from_integer
Finally,
HTMLColorToRGBseems to be called exclusively from anotherutilsfunction,gen_gexf_network_file, which is currently destined for thevisualizationmodule. Perhaps those functions could stay togetherget_random_colors_dictis used bytables/collectionsbut seems related to visualization needs so perhaps it could also go invisualization
Hi @ivagljiva,
Thank you so much for your input. I just realized that I missed your message by minutes before using the latest version of the mapping file, which is now here.
I think implementing your suggestions will be quite straightforward, but it will have to be done manually once everything is merged and master is operational .. simply because refactor_utils_changes.py, a temporary solution in #2465 that automatizes all the edits to the code only knows how to work with vanilla master, and not after an iteration of names changed and committed (that's what I missed by mere minutes, if I want to go back now I would have to revert a commit with thousands of changes, and I don't want to increase the size of our git database hehe). But I promise I will do these. Once the modular structure is in place, these will be simple find/replace operations and quick checks for circular import issues.