microbetag.utils

Utility functions to be used across the microbetag library.

Classes

SetEncoder

Custom JSON encoder that handles serialization of Python sets.

Functions

get_library_version(→ str)

Returns the version of a Python library

resolve_relative_path(→ str)

Resolves a relative file path into an absolute file path based on a given base directory.

resolve_file_path(→ str)

Resolves a file path relative to a given base directory and returns the absolute file path.

mtg_logger(→ logging.getLogger)

Creates and returns a configured logger instance. This logger:

get_files_with_suffixes(→ list[str])

Recursively retrieves files from a specified directory and its subdirectories

safe_literal_eval(value)

Safely evaluates a string that may represent a Python literal (e.g., list, dict, int).

flatten(→ List)

Recursively flattens a nested list into a single-level list.

flatten_list(→ set)

Recursively flattens a nested list and returns a set of unique elements.

run_until_done(command)

Function to run recursively a command until

file_exists_and_nonzero(→ bool)

Check if a file exists and its size is nonzero.

split_list(→ list)

Split a list into sublists of a user-defined size (chunk_size).

many_to_one_files(→ None)

Merges all files in a directory into a single file by concatenating their rows one after the other

ko_list_parser(→ Dict)

Parses ko_list file into a dict object - based on DiTing

merge_ko(→ None)

Parses the KO<>.<bin>.hmmout files produced by the kegg_annotation() function

bin_kos_to_file(→ None)

Builds a 3-col file for a bin and removes the KO-specific output files of hmmsearch

parse_hmmout(→ Tuple[str, str, str])

Parses the output of the hmmsearch to return the sequence id along with the

load_merged_ko_file(→ pandas.DataFrame)

Load the 3-column KEGG annotations file built by merge_ko()

convert_to_json_serializable(→ Any)

Recursively serializes entries of an object

ensure_flashweave_format(→ None)

Build an OTU table that will be in a FlashWeave-based format.

ensure_same_namespace_after_fw(→ None)

Reads FlashWeave edgelist file and tries to map sequence ids of node columns of the edgelist

extend_complements(→ Dict)

Extends pathway complement annotations based on given settings and descriptions.

extend_faprotax(→ Tuple[dict[str, list], list[str]])

Parses the sub tables of the faprotax analysis

load_phenotypic_traits(→ Tuple[Dict[str, Dict[str, ...)

Load phenotrex-based trait files and assign them per genome.

is_any_nan(→ bool)

Checks whether the input value is NaN (Not a Number).

remove_nan_from_list(→ List)

Removes NaN values from a list using is_any_nan().

detect_separator(→ str)

Detects the separator used in a text file, e.g. `,` or `;`.

find_three_column_format(→ tuple[int, Union[None, int]])

Checks if a file is in a three-column format and whether the third column is a float.

get_tool_location(→ str)

Check if a software program is available in the system path or in the alternative location.

Module Contents

microbetag.utils.get_library_version(library_name: str) → str

Returns the version of a Python library

microbetag.utils.resolve_relative_path(base_dir: str, file_path: str) → str

Resolves a relative file path into an absolute file path based on a given base directory.

This function processes a relative file_path (which may contain one or more ../ segments) and resolves it into an absolute path by moving back the corresponding number of directory levels from base_dir. It returns the resulting absolute file path.

Parameters:
  • base_dir (str) – The base directory from which to resolve the relative file_path. This should be an absolute path to a directory.

  • file_path (str) – The relative file path to be resolved. It may contain ../ to navigate up the directory hierarchy.

Returns:

The resolved absolute file path.

Return type:

str

Examples

>>> resolve_relative_path("/home/user/docs", "../files/report.txt")
'/home/user/files/report.txt'

microbetag.utils.resolve_file_path(base_dir: str, file_path: str) → str

Resolves a file path relative to a given base directory and returns the absolute file path.

If the provided file_path is relative, it is resolved using the base_dir. The function handles absolute paths, user directory expansion (e.g., ~), and relative paths (e.g., ../).

Parameters:
  • base_dir (str) – The base directory to resolve relative paths from.

  • file_path (str) – The file path to resolve. It can be absolute, relative, or use ~ for the home directory.

Returns:

The resolved absolute file path.

Return type:

str

Raises:

FileNotFoundError – If the resolved file path does not exist.

Examples

>>> resolve_file_path("/home/user/docs", "~/file.txt")
'/home/user/file.txt'
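
The resolution rules described above can be sketched with the standard library alone. This is a minimal illustration, not microbetag's own implementation; the name resolve_path_sketch is hypothetical, and the order of the checks is an assumption.

```python
import os


def resolve_path_sketch(base_dir: str, file_path: str) -> str:
    """Hypothetical stand-in illustrating the documented behavior."""
    # Expand a leading "~" to the user's home directory.
    expanded = os.path.expanduser(file_path)
    # Relative paths are interpreted against base_dir; absolute ones are kept.
    if not os.path.isabs(expanded):
        expanded = os.path.join(base_dir, expanded)
    # normpath collapses "../" and "./" segments.
    resolved = os.path.normpath(expanded)
    # Mirror the documented FileNotFoundError for missing targets.
    if not os.path.exists(resolved):
        raise FileNotFoundError(f"No such file: {resolved}")
    return resolved
```

Because the existence check comes last, a path that resolves cleanly but points to nothing still raises, which matches the Raises entry above.
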

class microbetag.utils.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.JSONEncoder

Custom JSON encoder that handles serialization of Python sets.

This encoder extends the functionality of the standard JSONEncoder to support serializing Python sets. JSON does not have a native representation for sets, so this encoder converts sets to lists before serializing them.

Usage:

When serializing data to JSON using json.dump() or json.dumps(), specify cls=SetEncoder to use this custom encoder.

default(obj)

Override the default method of JSONEncoder to handle serialization of sets.

Notes

If the object is a set, it is converted to a list before serialization. Otherwise, the default behavior of JSONEncoder.default() is used.
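
For illustration, an encoder with the behavior described above could look roughly like the following. SetEncoderSketch is a hypothetical name used so that the example runs without microbetag installed; the real class may differ in detail.

```python
import json


class SetEncoderSketch(json.JSONEncoder):
    """Hypothetical stand-in: serializes sets as JSON arrays."""

    def default(self, obj):
        # json calls default() only for objects it cannot serialize itself.
        if isinstance(obj, set):
            return list(obj)
        # Anything else falls back to the base class, which raises TypeError.
        return super().default(obj)


# Passing the class via cls= makes json.dumps use it for unknown types.
print(json.dumps({"kos": {"K00001"}}, cls=SetEncoderSketch))
```

Sets are unordered, so the order of the resulting list is not guaranteed when a set holds more than one element.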

microbetag.utils.mtg_logger(filename: str) → logging.getLogger

Creates and returns a configured logger instance. This logger:

  • Logs messages to stdout with colored formatting using colorlog

  • Avoids adding duplicate handlers if called multiple times

  • Uses the given filename as the logger’s name

  • Logs messages with level INFO and above

Parameters:

filename (str) – The filename of the script that the logger will be applied to.

Returns:

The logger instance.
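
The duplicate-handler guard listed above is the least obvious part, so here is a minimal sketch of it. It uses only the standard logging module: the colored formatting that mtg_logger gets from colorlog is left out, and the name make_logger_sketch and the format string are assumptions.

```python
import logging
import sys


def make_logger_sketch(name: str) -> logging.Logger:
    """Hypothetical stand-in showing the handler guard, without colors."""
    # getLogger returns the same object for the same name on every call.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach a handler only once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling make_logger_sketch twice with the same name returns the same logger with a single handler, so each message is printed once.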

microbetag.utils.get_files_with_suffixes(directory: str, suffixes: List[str]) → list[str]

Recursively retrieves files from a specified directory and its subdirectories that have extensions matching a given list of suffixes.

Parameters:
  • directory – The root directory to start the search.

  • suffixes – A list of file suffixes (extensions) to match. Each suffix should include the dot (e.g., ‘.txt’, ‘.csv’).

Returns:

A list of full paths to files that match any of the specified suffixes.

Examples

>>> get_files_with_suffixes('/path/to/directory', ['.txt', '.csv'])
['/path/to/directory/file1.txt', '/path/to/directory/subdir/file2.csv']
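
A recursive search of this kind can be sketched with os.walk. The function name files_with_suffixes_sketch is hypothetical, and the order of the returned paths depends on the file system, which is an assumption about behavior the description does not specify.

```python
import os
from typing import List


def files_with_suffixes_sketch(directory: str, suffixes: List[str]) -> List[str]:
    """Hypothetical stand-in: collect files whose names end with a suffix."""
    matches = []
    # os.walk visits the root directory and every subdirectory below it.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # str.endswith accepts a tuple, so all suffixes are checked at once.
            if name.endswith(tuple(suffixes)):
                matches.append(os.path.join(root, name))
    return matches
```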

microbetag.utils.safe_literal_eval(value: Any)

Safely evaluates a string that may represent a Python literal (e.g., list, dict, int).

This function attempts to parse a string using ast.literal_eval, which only evaluates Python literals (e.g., strings, numbers, tuples, lists, dicts, booleans, and None), avoiding the security risks of eval(). If value is not a string or if evaluation fails, the original value is returned unchanged.

Parameters:

value (Any) – The input to be evaluated. If it’s a string that looks like a literal (e.g., “[1, 2]”), it will be parsed. Otherwise, it’s returned as is.

Returns:

The evaluated literal if successful, or the original value if evaluation fails.

Examples

>>> safe_literal_eval("[1, 2, 3]")
[1, 2, 3]
>>> safe_literal_eval("{'a': 1}")
{'a': 1}
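
The fallback behavior (return the input unchanged when it is not a string or cannot be parsed) can be sketched as follows. The name literal_or_original is hypothetical, and the set of exceptions caught is an assumption.

```python
import ast
from typing import Any


def literal_or_original(value: Any) -> Any:
    """Hypothetical stand-in for the documented fallback logic."""
    # Non-strings are passed through untouched.
    if not isinstance(value, str):
        return value
    try:
        # literal_eval parses literals only; it never executes code.
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid literal: hand back the original string.
        return value
```

With this sketch, a plain word such as "hello" comes back as the string itself, because a bare name is not a literal.
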

microbetag.utils.flatten(list_of_lists: List) → List

Recursively flattens a nested list into a single-level list.

This function handles arbitrarily nested lists and returns a new list containing all the leaf elements in the original order.

Parameters:

list_of_lists – A list that may contain other nested lists.

Returns:

A flat list containing all non-list elements in the original order.

Examples:

>>> flatten([1, [2, [3, 4]], 5])
[1, 2, 3, 4, 5]

microbetag.utils.flatten_list(lst: List, flat_list: List = None) → set

Recursively flattens a nested list and returns a set of unique elements.

This function traverses all nested lists and collects elements into a set, removing any duplicates. The final result is unordered.

Parameters:

  • lst – A list that may contain other nested lists.

  • flat_list – Optional list. Used internally during recursion. Should not be set manually.

Returns:

A set containing all unique elements from the nested list.

Examples:

>>> flatten_list([1, [2, [2, 3]], 4, 1])
{1, 2, 3, 4}

microbetag.utils.run_until_done(command: str)

Function to run recursively a command until

microbetag.utils.file_exists_and_nonzero(filename: str) → bool

Check if a file exists and its size is nonzero.

Parameters:

filename (str) – The path to the file.

Returns:

True if the file exists and its size is nonzero, False otherwise.

Return type:

bool
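
A check like this is typically a one-liner on top of os.path. The sketch below is an assumption about how it could be written, with the hypothetical name exists_and_nonzero_sketch.

```python
import os


def exists_and_nonzero_sketch(filename: str) -> bool:
    """Hypothetical stand-in: True only for an existing, non-empty file."""
    # isfile is evaluated first, so getsize never runs on a missing path.
    return os.path.isfile(filename) and os.path.getsize(filename) > 0
```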

microbetag.utils.split_list(input_list: list, chunk_size: int) → list

Split a list into sublists of a user-defined size (chunk_size).
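
One common way to chunk a list is by slicing in steps of chunk_size, as sketched below. The name split_list_sketch is hypothetical, and letting the last sublist be shorter when the length is not a multiple of chunk_size is an assumption; the description does not say how the remainder is handled.

```python
from typing import Any, List


def split_list_sketch(input_list: List[Any], chunk_size: int) -> List[List[Any]]:
    """Hypothetical stand-in: consecutive slices of at most chunk_size items."""
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


print(split_list_sketch([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```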

microbetag.utils.many_to_one_files(dir_with_files: str, merged_file: str) → None

Merges all files in a directory into a single file by concatenating their rows one after the other

Parameters:
  • dir_with_files – Path of the directory whose files are to be merged

  • merged_file – Path to merged output file
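
Row-wise concatenation of a directory's files can be sketched as below. The name concatenate_files_sketch is hypothetical, and processing files in sorted name order is an assumption. In this sketch merged_file must live outside dir_with_files, otherwise the output would be read back as one of the inputs.

```python
import os


def concatenate_files_sketch(dir_with_files: str, merged_file: str) -> None:
    """Hypothetical stand-in: append the rows of every file to merged_file."""
    with open(merged_file, "w") as out:
        for name in sorted(os.listdir(dir_with_files)):
            path = os.path.join(dir_with_files, name)
            # Skip subdirectories; only regular files are merged.
            if not os.path.isfile(path):
                continue
            with open(path) as handle:
                for line in handle:
                    # Guarantee a newline so rows of two files never fuse.
                    out.write(line if line.endswith("\n") else line + "\n")
```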

microbetag.utils.ko_list_parser(ko_list: str) → Dict

Parses ko_list file into a dict object - based on DiTing

Parameters:

ko_list – Path to the ko_list file that comes from the kofam database (https://www.genome.jp/ftp/db/kofam/)

Returns:

A dictionary mapping knum to threshold and score_type

microbetag.utils.merge_ko(hmmout_dir: str, output: str) → None

Parses the KO<>.<bin>.hmmout files produced by the kegg_annotation() function to create a single 3-column file (output) with the bin_id, the corresponding contig and the KO that was mapped to it. The function then returns a dictionary with the bin ids as the keys and the set of KOs found in each as the value.

Parameters:
  • hmmout_dir – path to the .hmmout files

  • output – Path/filename to save the output file

microbetag.utils.bin_kos_to_file(hmmout_dir: str, bin_id: str) → None

Builds a 3-col file for a bin and removes the KO-specific output files of hmmsearch

Parameters:
  • hmmout_dir – Directory to the hmmout files

  • bin_id – Name of the sequence id under study

microbetag.utils.parse_hmmout(hmmout_file: str, hmmout_dir: str) → Tuple[str, str, str]

Parses the output of the hmmsearch to return the sequence id along with the gene and its corresponding KEGG ORTHOLOGY term as mentioned in the hmmout_file.

Parameters:
  • hmmout_file – Filename of the .hmmout file

  • hmmout_dir – Directory where hmmout_file is located

Returns:

A tuple consisting of

  • basename: Sequence id

  • gene_id: Gene id

  • k_number: KEGG ORTHOLOGY term found

microbetag.utils.load_merged_ko_file(merged_ko: str) → pandas.DataFrame

Load the 3-column KEGG annotations file built by merge_ko()

Parameters:

merged_ko – Path to the 3-column output file of merge_ko()

Returns:

pivot_df, a presence-absence (1/0) DataFrame where KOs are the rows and bin_ids the columns

microbetag.utils.convert_to_json_serializable(obj: Any) → Any

Recursively serializes entries of an object. A set is converted to a list, a list is flattened to its items, and a dictionary keeps its keys while its values get serialized.

Note

This is an essential step both for allowing a jsonified response and for dumping a dictionary as a JSON file.
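
A recursive conversion of this kind might be sketched as follows. The name to_serializable_sketch is hypothetical, and the sketch reads "a list is flattened to its items" as "each item is converted in turn", which is an assumption about the intended behavior.

```python
import json
from typing import Any


def to_serializable_sketch(obj: Any) -> Any:
    """Hypothetical stand-in: make nested sets safe for json.dumps."""
    if isinstance(obj, dict):
        # Keys are kept as they are; only the values are converted.
        return {key: to_serializable_sketch(value) for key, value in obj.items()}
    if isinstance(obj, (set, list, tuple)):
        # Sets become lists; every item is converted recursively.
        return [to_serializable_sketch(item) for item in obj]
    # Scalars (str, int, float, bool, None) are already serializable.
    return obj


print(json.dumps(to_serializable_sketch({"bin_1": {"K00001"}})))
```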

microbetag.utils.ensure_flashweave_format(conf: microbetag.config.Config) → None

Build an OTU table that will be in a FlashWeave-based format.

Note

Saves abundance data to be used with FlashWeave in the output directory.

microbetag.utils.ensure_same_namespace_after_fw(conf: microbetag.config.Config) → None

Reads FlashWeave edgelist file and tries to map sequence ids of node columns of the edgelist to their corresponding ids in the abundance table.

Attention

The need for this first arose with a local data set where sequence ids looked like D300244:bin_000023 in the abundance table, while in the edgelist returned by FlashWeave those ids had been changed to D300244.bin_000023.

# NOTE (Haris Zafeiropoulos, 2025-05-16): After a few changes this behavior changed but I am not sure why. Thus, maybe this step is not necessary anymore and it could be removed. Yet, tests are required.

Note

Apparently, conf.network in this case is in the FlashWeave network format, hence skiprows=2.

microbetag.utils.extend_complements(complements_json: str, descrps_path: str, path_compl_perce: int, path_compl_dir: str) → Dict

Extends pathway complement annotations based on given settings and descriptions.

Parameters:
  • complements_json (str) – Path to the complements JSON file.

  • descrps_path (str) – Path to the KEGG MODULES description file (tab-separated file with no header).

  • path_compl_perce (int) – Maximum allowable percentage of required KOs that must be present.

  • path_compl_dir (str) – Directory to save the extended complements JSON file.

  • complements_dict (dict) – Dictionary of complements loaded from a JSON file.

Returns:

A dictionary with pathway complementarities to be assigned in the MGG format

Note

Here we build pathway_complements_extended.json, a JSON file with the dictionary returned.

microbetag.utils.extend_faprotax(faprotax_sub_tables, sequence_id_column_name) → Tuple[dict[str, list], list[str]]

Parses the sub tables of the faprotax analysis to assign the biological processes related to each sequence id.

Returns:

A tuple consisting of

  • bin_faprotax_traits: A dictionary with the sequence id as key and a list of FAPROTAX traits as value

  • faprotax_traits: A list with the unique set of the FAPROTAX traits found across all taxa of the study

microbetag.utils.load_phenotypic_traits(phen_outdir) → Tuple[Dict[str, Dict[str, str | float]], Set[str]]

Load phenotrex-based trait files and assign them per genome.

Returns:

A tuple consisting of

  • bin_phen_traits: A dictionary with genome id as key and a dictionary as value, with each phenotrex-based trait as key and its presence/absence and score as value

  • phentraits: A set with the traits present

Note

Example of a bin_phen_traits entry:

bin_phen_traits[bin_id][trait_name] = {
    "presence": case["Trait present"],
    "confidence": case["Confidence"],
}

microbetag.utils.is_any_nan(x) → bool

Checks whether the input value is NaN (Not a Number).

It first tries to use numpy.isnan() for numerical or array-like inputs. If that fails (e.g., for non-numeric types), it falls back to checking if the string representation of x is equal to ‘nan’ (case-insensitive).

Returns:

bool
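
The two-step check (numeric test first, string comparison as fallback) can be sketched as below. To keep the example free of third-party imports it swaps numpy.isnan for math.isnan, so array-like inputs are not covered; the name is_nan_sketch is hypothetical.

```python
import math
from typing import Any


def is_nan_sketch(x: Any) -> bool:
    """Hypothetical stand-in using math.isnan in place of numpy.isnan."""
    try:
        # Works for floats and anything convertible to float.
        return math.isnan(x)
    except (TypeError, OverflowError):
        # Non-numeric input: compare its text form, ignoring case.
        return str(x).lower() == "nan"
```

The string fallback is what lets values such as "NaN", read from a text file, be recognized as missing.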

microbetag.utils.remove_nan_from_list(lst: List) → List

Removes NaN values from a list using is_any_nan().

microbetag.utils.detect_separator(file_path: str) → str

Detects the separator used in a text file, e.g. `,` or `;`.

It makes use of the csv.Sniffer and gets a sample of the text based on its size.

Parameters:

file_path – Path to the file to be considered

Returns:

A separator, e.g. “,”
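
A csv.Sniffer-based detection can be sketched as follows. The fixed sample_size, the candidate delimiter set and the name detect_separator_sketch are all assumptions; the description only says that the sample depends on the size of the text.

```python
import csv


def detect_separator_sketch(file_path: str, sample_size: int = 4096) -> str:
    """Hypothetical stand-in: guess the delimiter from the start of a file."""
    # newline="" is the mode the csv module recommends for its inputs.
    with open(file_path, newline="") as handle:
        sample = handle.read(sample_size)
    # Restricting the candidates keeps the sniffer from picking odd characters.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    return dialect.delimiter
```

csv.Sniffer raises csv.Error when it cannot settle on a delimiter, so callers may want to catch that for irregular files.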

microbetag.utils.find_three_column_format(file_path: str, delimiter: str) → tuple[int, None | int]

Checks if a file is in a three-column format and whether the third column is a float. If a row is not in that format, it is skipped and the next one is checked. Once a three-column row is met, its line number is returned; if that is never the case, an Exception is raised.

Parameters:
  • file_path (str) – Path to the file to be checked.

  • delimiter (str) – The delimiter used to separate columns (e.g., '\t' for tab-separated values).

Returns:

(line_number, None) if the third column is a float, (line_number, 0) otherwise.

Return type:

tuple
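
One possible reading of the scan described above is sketched here. Several details are assumptions rather than documented behavior: line numbers are counted from 0, the first three-column row decides the result, and ValueError stands in for the unspecified Exception. The name three_column_start_sketch is hypothetical.

```python
import csv
from typing import Optional, Tuple


def three_column_start_sketch(file_path: str, delimiter: str) -> Tuple[int, Optional[int]]:
    """Hypothetical stand-in: locate the first three-column row."""
    with open(file_path, newline="") as handle:
        for line_number, row in enumerate(csv.reader(handle, delimiter=delimiter)):
            # Rows with any other number of fields are skipped.
            if len(row) != 3:
                continue
            try:
                float(row[2])
            except ValueError:
                # Third column is text, e.g. a header row.
                return line_number, 0
            # Third column is numeric, so the data starts here.
            return line_number, None
    raise ValueError(f"No three-column row found in {file_path}")
```

Under this reading, the pair resembles the skiprows and header arguments of a table reader, though the source does not state how the caller uses it.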

microbetag.utils.get_tool_location(software: str) → str

Check if a software program is available in the system path or in the alternative location.

Returns either the software name itself, which can then be run as is (globally), or the full path to the software if it is found in the alternative location. In both cases, the return value allows running the software.

If the software is not available, it raises a SystemExit() error with a message about the missing software.

Parameters:

software – Name of the software program to be found
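
The lookup order described above can be sketched with shutil.which. The source does not say where the alternative location is, so the sketch takes it as an explicit alternative_dir argument; that parameter and the name tool_location_sketch are hypothetical.

```python
import os
import shutil


def tool_location_sketch(software: str, alternative_dir: str) -> str:
    """Hypothetical stand-in: PATH first, then a fallback directory."""
    # Found on PATH: the bare name is enough to run it.
    if shutil.which(software):
        return software
    # Otherwise look for an executable file in the fallback directory.
    candidate = os.path.join(alternative_dir, software)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # SystemExit with a message ends the program and reports the problem.
    raise SystemExit(f"{software} was not found on PATH or in {alternative_dir}")
```

Either return value can be passed directly as the first element of a subprocess command list.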