microbetag.utils

Utilities supporting file parsing and formatting tasks.

Module Contents

microbetag.utils.get_library_version(library_name)[source]
microbetag.utils.resolve_relative_path(base_dir, file_path)[source]
microbetag.utils.resolve_file_path(base_dir, file_path)[source]
class microbetag.utils.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.JSONEncoder

Custom JSON encoder that handles serialization of Python sets.

This encoder extends the functionality of the standard JSONEncoder to support serializing Python sets. JSON does not have a native representation for sets, so this encoder converts sets to lists before serializing them.

Usage:

When serializing data to JSON using json.dump() or json.dumps(), specify cls=SetEncoder to use this custom encoder.

default(obj)[source]

Override the default method of JSONEncoder to handle serialization of sets.

Notes: If the object is a set, it is converted to a list before serialization. Otherwise, the default behavior of JSONEncoder.default() is used.
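
The behaviour described above can be illustrated with a minimal sketch. The class name SetEncoderSketch and the sample data are assumptions made for this example; it mirrors the documented behaviour but is not necessarily the module's actual code.

```python
import json


class SetEncoderSketch(json.JSONEncoder):
    """Illustrative stand-in for SetEncoder (hypothetical name)."""

    def default(self, obj):
        # Sets have no JSON representation, so emit them as lists.
        if isinstance(obj, set):
            return list(obj)
        # Anything else falls back to the standard behaviour,
        # which raises TypeError for unsupported types.
        return super().default(obj)


# Made-up sample data: one bin mapped to a set of KO identifiers.
data = {"bin_1": {"K00001", "K00002"}}
encoded = json.dumps(data, cls=SetEncoderSketch)
decoded = json.loads(encoded)
```

Because sets are unordered, the order of the items in the resulting JSON list is not guaranteed.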

microbetag.utils.get_files_with_suffixes(directory, suffixes)[source]

Recursively retrieves files from a specified directory and its subdirectories that have extensions matching a given list of suffixes.

Parameters:

directory (str) – The root directory to start the search.

suffixes (list of str) – A list of file suffixes (extensions) to match. Each suffix should include the dot (e.g., '.txt', '.csv').

Returns:

A list of full paths to files that match any of the specified suffixes.

Return type:

list of str

Example:

>>> get_files_with_suffixes('/path/to/directory', ['.txt', '.csv'])
['/path/to/directory/file1.txt', '/path/to/directory/subdir/file2.csv']
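
A recursive search of this kind could be sketched with os.walk, as below. The function name find_files_with_suffixes and the temporary files are assumptions for illustration; the module's own implementation may differ.

```python
import os
import tempfile


def find_files_with_suffixes(directory, suffixes):
    """Hypothetical sketch: collect files whose names end with any suffix."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.endswith(tuple(suffixes)):
                matches.append(os.path.join(root, name))
    return matches


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "subdir"))
    for rel in ("file1.txt", os.path.join("subdir", "file2.csv"), "notes.md"):
        open(os.path.join(tmp, rel), "w").close()
    # Report paths relative to the search root so they are easy to compare.
    found = sorted(
        os.path.relpath(path, tmp)
        for path in find_files_with_suffixes(tmp, [".txt", ".csv"])
    )
```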

microbetag.utils.flatten(list_of_lists: List)[source]

Takes a list of lists and flattens it repeatedly until it returns a single list with all the elements of the initial one.

microbetag.utils.run_until_done(command: str)[source]

Function to run recursively a command until

microbetag.utils.file_exists_and_nonzero(filename: str)[source]

Check if a file exists and its size is nonzero.

Parameters:

filename (str) – The path to the file.

Returns:

True if the file exists and its size is nonzero, False otherwise.

Return type:

bool
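
A check like this can be sketched with os.path, assuming that "exists" means a regular file. The helper name exists_and_nonzero is hypothetical.

```python
import os
import tempfile


def exists_and_nonzero(filename):
    """Hypothetical sketch: True only for an existing, non-empty file."""
    return os.path.isfile(filename) and os.path.getsize(filename) > 0


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.txt")
    filled = os.path.join(tmp, "filled.txt")
    open(empty, "w").close()
    with open(filled, "w") as handle:
        handle.write("K00001\n")
    # One result each for an empty, a non-empty and a missing file.
    results = (
        exists_and_nonzero(empty),
        exists_and_nonzero(filled),
        exists_and_nonzero(os.path.join(tmp, "missing.txt")),
    )
```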

microbetag.utils.split_list(input_list: List, chunk_size: int)[source]

Split a list into sublists of a given size.
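
As an illustration, chunking can be done with list slicing. This sketch assumes the last sublist is simply shorter when the list length is not a multiple of chunk_size; the name split_into_chunks is hypothetical.

```python
def split_into_chunks(input_list, chunk_size):
    """Hypothetical sketch: consecutive slices of at most chunk_size items."""
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


chunks = split_into_chunks(["K1", "K2", "K3", "K4", "K5"], 2)
# chunks == [["K1", "K2"], ["K3", "K4"], ["K5"]]
```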

microbetag.utils.many_to_one_files(dir_with_files, merged_file)[source]

Makes a single file out of all files in a directory by concatenating their rows, one file after the other.
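
A minimal sketch of such a concatenation follows. It assumes plain-text files, processes them in sorted filename order, and writes the merged file outside the input directory; the name concatenate_files and the sample rows are made up for this example.

```python
import os
import tempfile


def concatenate_files(dir_with_files, merged_file):
    """Hypothetical sketch: append the rows of every file to merged_file."""
    with open(merged_file, "w") as out:
        for name in sorted(os.listdir(dir_with_files)):
            path = os.path.join(dir_with_files, name)
            if not os.path.isfile(path):
                continue
            with open(path) as handle:
                for line in handle:
                    # Guarantee one row per line even without a final newline.
                    out.write(line if line.endswith("\n") else line + "\n")


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    samples = (
        ("a.tsv", "bin_1\tcontig_1\tK00001\n"),
        ("b.tsv", "bin_2\tcontig_7\tK00002"),
    )
    for name, text in samples:
        with open(os.path.join(src, name), "w") as handle:
            handle.write(text)
    merged = os.path.join(dst, "merged.tsv")
    concatenate_files(src, merged)
    with open(merged) as handle:
        merged_lines = handle.read().splitlines()
```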

microbetag.utils.ko_list_parser(ko_list: str)[source]

Parses ko_list file into a dict object - based on DiTing

Parameters:

ko_list – path to the ko_list file that comes from the kofam database https://www.genome.jp/ftp/db/kofam/

Returns:

a dictionary mapping knum to threshold and score_type

Return type:

dict
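
The mapping described above could be built roughly as follows. This sketch assumes a tab-separated file with a header row containing the columns knum, threshold and score_type, and a "-" placeholder where no threshold is defined. It reads from a file handle rather than a path, and the values are made up. The name parse_ko_list is hypothetical.

```python
import csv
import io


def parse_ko_list(handle):
    """Hypothetical sketch: map each knum to its threshold and score_type."""
    thresholds = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            threshold = float(row["threshold"])
        except ValueError:
            # Assumed placeholder (e.g. "-") for a KO without a threshold.
            threshold = None
        thresholds[row["knum"]] = {
            "threshold": threshold,
            "score_type": row["score_type"],
        }
    return thresholds


# Made-up rows, for illustration only.
example = io.StringIO(
    "knum\tthreshold\tscore_type\n"
    "K00001\t329.57\tdomain\n"
    "K00002\t-\t-\n"
)
ko_thresholds = parse_ko_list(example)
```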

microbetag.utils.merge_ko(hmmout_dir, output)[source]

Parses the KO<>.<bin>.hmmout files produced by the kegg_annotation() function to create a single 3-column file (output) with the bin_id, the corresponding contig and the KO that was mapped to it. The function then returns a dictionary with the bin ids as the keys and the set of KOs found in each bin as the values.

Parameters:
  • hmmout_dir (str) – path to the .hmmout files

  • output (str) – path/filename to save the output file
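
The returned dictionary (bin id mapped to a set of KOs) can be sketched from rows of the 3-column format. The sketch assumes tab-separated rows in the order bin_id, contig, KO; the helper name kos_per_bin and the rows are made up.

```python
from collections import defaultdict


def kos_per_bin(lines):
    """Hypothetical sketch: group KOs by bin id from 3-column rows."""
    bin_kos = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        bin_id, _contig, ko = line.split("\t")
        # A set keeps each KO once per bin, however many contigs carry it.
        bin_kos[bin_id].add(ko)
    return dict(bin_kos)


rows = [
    "bin_1\tcontig_1\tK00001",
    "bin_1\tcontig_2\tK00001",
    "bin_1\tcontig_2\tK00002",
    "bin_2\tcontig_9\tK00003",
]
bin_to_kos = kos_per_bin(rows)
```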

microbetag.utils.bin_kos_to_file(hmmout_dir, bin_id)[source]

Builds a 3-column file for a bin and removes the KO-specific output files of hmmsearch

Parameters:
  • hmmout_dir – Directory to the hmmout files

  • ko_tmp

microbetag.utils.parse_hmmout(hmmout_file, hmmout_dir)[source]

Parses the output of hmmsearch

Parameters:
  • hmmout_file (str) – Filename of the .hmmout file

  • hmmout_dir (str) – Directory where hmmout_file is located

Returns:

basename (str) – Bin id

gene_id (str) – Gene id

k_number (str) – KEGG ORTHOLOGY term found

microbetag.utils.load_merged_ko_file(merged_ko)[source]

Load the 3-column KEGG annotations file built by merge_ko()

Parameters:

merged_ko (str) – path to the 3-column output file of merge_ko()

Returns:

pivot_df, a presence-absence (1/0) DataFrame where KOs are the rows and bin_ids the columns

Return type:

pd.DataFrame

microbetag.utils.convert_to_json_serializable(obj)[source]

Recursively converts the entries of an object into a JSON-serializable form: a set is converted to a list, the items of a list are converted one by one, and a dictionary keeps its keys while its values are converted.
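
The recursion described above might look like the following sketch. The name to_json_serializable and the sample object are assumptions for illustration, and the real function may handle more types.

```python
import json


def to_json_serializable(obj):
    """Hypothetical sketch of the recursive conversion."""
    if isinstance(obj, set):
        # Sets become lists; their items are converted as well.
        return [to_json_serializable(item) for item in obj]
    if isinstance(obj, list):
        return [to_json_serializable(item) for item in obj]
    if isinstance(obj, dict):
        # Keys are kept as they are; only values are converted.
        return {key: to_json_serializable(value) for key, value in obj.items()}
    return obj


nested = {"bin_1": {"K00001"}, "pairs": [{"K00002"}, "text"]}
converted = to_json_serializable(nested)
encoded_nested = json.dumps(converted)  # no TypeError once sets are gone
```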

microbetag.utils.ensure_flashweave_format(conf)[source]

Build an OTU table that will be in a FlashWeave-based format.

microbetag.utils.ensure_same_namespace_after_fw(conf)[source]

[TODO] This routine can probably be removed, but let's wait. It addresses naming inconsistencies, e.g. D300244:bin_000023 in the abundance table versus D300244.bin_000023 in FlashWeave. Keep the routine in general, along with find_id_differences().

Apparently, conf.network is a FlashWeave network file, hence the 2 skipped rows.

microbetag.utils.extend_complements(complements_json, descrps_path, max_scratch_alt, pathway_complement_percentage, pathway_complements_dir)[source]

Extends pathway complement annotations based on given settings and descriptions.

Parameters:
  • complements_json – Path to the complements JSON file.

  • descrps_path (str) – Path to the KEGG MODULES description file (tab-separated file with no header).

  • max_scratch_alt – Maximum number of alternative complements allowed.

  • pathway_complement_percentage – Maximum allowable percentage of required KOs that must be present.

  • pathway_complements_dir – Directory to save the extended complements JSON file.

  • complements_dict (dict) – Dictionary of complements loaded from a JSON file.

Returns:

Pathway Complementarities in a dictionary to be assigned in the mgg format

Return type:

dict

Builds:

pathway_complements_extended: JSON file with the dictionary returned

microbetag.utils.extend_faprotax(conf)[source]
microbetag.utils.load_phenotypic_traits(conf)[source]

PhenDB-like traits

microbetag.utils.flatten_list(lista, flat_list=[])[source]

Recursive function taking as input a nested list and returning a flattened one. E.g. ['GCF_003252755.1', 'GCF_900638025.1', 'GCF_003252725.1', 'GCF_003253005.1', 'GCF_003252795.1', ['GCF_000210895.1'], ['GCF_000191405.1']] becomes ['GCF_003252755.1', 'GCF_900638025.1', 'GCF_003252725.1', 'GCF_003253005.1', 'GCF_003252795.1', 'GCF_000210895.1', 'GCF_000191405.1'].
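
A recursive flatten can be sketched as below. The signature above shows a mutable default (flat_list=[]). In Python, such a default is created once and shared between calls that omit the argument, so this sketch uses None as the default instead. The name flatten_nested is hypothetical, and this is not necessarily how the module implements it.

```python
def flatten_nested(nested, flat=None):
    """Hypothetical sketch: flatten arbitrarily nested lists."""
    if flat is None:
        # A fresh accumulator per top-level call avoids shared state.
        flat = []
    for item in nested:
        if isinstance(item, list):
            flatten_nested(item, flat)
        else:
            flat.append(item)
    return flat


first = flatten_nested(["GCF_003252755.1", ["GCF_000210895.1"], [["GCF_000191405.1"]]])
second = flatten_nested(["GCF_900638025.1"])
```

With the None default, the second call starts from an empty list rather than continuing from the result of the first.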

microbetag.utils.detect_separator(file_path)[source]

Detect the separator used in a text file, e.g. a comma (,) or a semicolon (;).
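
One simple heuristic for separator detection is to count candidate characters in the first non-empty line and keep the most frequent one. The sketch below assumes tab, comma and semicolon as candidates; the name guess_separator is hypothetical, and the module may use a different approach.

```python
import os
import tempfile


def guess_separator(file_path, candidates=("\t", ",", ";")):
    """Hypothetical sketch: pick the most frequent candidate separator."""
    with open(file_path) as handle:
        first_line = next((line for line in handle if line.strip()), "")
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    # Return None when no candidate occurs at all.
    return best if counts[best] > 0 else None


with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "table.txt")
    with open(table, "w") as handle:
        handle.write("bin_id;contig;ko\nbin_1;contig_1;K00001\n")
    separator = guess_separator(table)
```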

microbetag.utils.find_three_column_format(file_path, delimiter)[source]
microbetag.utils.get_tool_location(software)[source]

Check whether a piece of software is available in the system path or in the alternative location. Returns either the software name itself, which can then be run as is, or the full path to the software if it is found in the alternative location. In both cases, the return value can be used to run the software.
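
The lookup described above can be sketched with shutil.which. The documentation does not say where the alternative location is, so this sketch takes it as an explicit argument. The parameter alternative_dir, the name locate_tool and the tool names are all hypothetical.

```python
import os
import shutil
import stat
import tempfile


def locate_tool(software, alternative_dir):
    """Hypothetical sketch: name if on PATH, else full path in alternative_dir."""
    if shutil.which(software):
        # Found on the system path: the bare name is enough to run it.
        return software
    candidate = os.path.join(alternative_dir, software)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Create a fake executable that is very unlikely to exist on PATH.
    fake_tool = os.path.join(tmp, "hypothetical_tool_xyz")
    with open(fake_tool, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(fake_tool, stat.S_IRWXU)
    located = locate_tool("hypothetical_tool_xyz", tmp)
    expected_location = fake_tool
    missing = locate_tool("another_missing_tool_xyz", tmp)
```

Returning None when the tool is found in neither place is a choice made for this sketch; the documented function does not state what happens in that case.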