microbetag.utils
================

.. py:module:: microbetag.utils

.. autoapi-nested-parse::

   Utility functions to be used across the `microbetag` library.


Classes
-------

.. autoapisummary::

   microbetag.utils.SetEncoder


Functions
---------

.. autoapisummary::

   microbetag.utils.get_library_version
   microbetag.utils.resolve_relative_path
   microbetag.utils.resolve_file_path
   microbetag.utils.mtg_logger
   microbetag.utils.get_files_with_suffixes
   microbetag.utils.safe_literal_eval
   microbetag.utils.flatten
   microbetag.utils.flatten_list
   microbetag.utils.run_until_done
   microbetag.utils.file_exists_and_nonzero
   microbetag.utils.split_list
   microbetag.utils.many_to_one_files
   microbetag.utils.ko_list_parser
   microbetag.utils.merge_ko
   microbetag.utils.bin_kos_to_file
   microbetag.utils.parse_hmmout
   microbetag.utils.load_merged_ko_file
   microbetag.utils.convert_to_json_serializable
   microbetag.utils.ensure_flashweave_format
   microbetag.utils.ensure_same_namespace_after_fw
   microbetag.utils.extend_complements
   microbetag.utils.extend_faprotax
   microbetag.utils.load_phenotypic_traits
   microbetag.utils.is_any_nan
   microbetag.utils.remove_nan_from_list
   microbetag.utils.detect_separator
   microbetag.utils.find_three_column_format
   microbetag.utils.get_tool_location


Module Contents
---------------

.. py:function:: get_library_version(library_name: str) -> str

   Returns the version of a Python library.


.. py:function:: resolve_relative_path(base_dir: str, file_path: str) -> str

   Resolves a relative file path into an absolute file path based on a given base directory.

   This function processes a relative `file_path` (which may contain one or more `../`
   segments) and resolves it into an absolute path by moving back the corresponding
   number of directory levels from `base_dir`. It returns the resulting absolute file path.

   :param base_dir: The base directory from which to resolve the relative `file_path`. This should
                    be an absolute path to a directory.
   :type base_dir: str
   :param file_path: The relative file path to be resolved. It may contain `../` to navigate up the
                     directory hierarchy.
   :type file_path: str

   :returns: The resolved absolute file path.
   :rtype: str

   .. rubric:: Examples

   >>> resolve_relative_path("/home/user/docs", "../files/report.txt")
   '/home/user/files/report.txt'


.. py:function:: resolve_file_path(base_dir: str, file_path: str) -> str

   Resolves a file path relative to a given base directory and returns the absolute file path.

   If the provided `file_path` is relative, it is resolved using the `base_dir`. The function
   handles absolute paths, user directory expansion (e.g., `~`), and relative paths (e.g., `../`).

   :param base_dir: The base directory to resolve relative paths from.
   :type base_dir: str
   :param file_path: The file path to resolve. It can be absolute, relative, or use `~` for the home directory.
   :type file_path: str

   :returns: The resolved absolute file path.
   :rtype: str

   :raises FileNotFoundError: If the resolved file path does not exist.

   .. rubric:: Examples

   >>> resolve_file_path("/home/user/docs", "~/file.txt")
   '/home/user/file.txt'
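
The resolution steps described for `resolve_file_path` (expand `~`, anchor relative paths at `base_dir`, fail if the result does not exist) could look roughly like the sketch below. It is not the library's implementation: `resolve_path_sketch` is a hypothetical name, and the use of `os.path.normpath` to collapse `../` segments is an assumption.

```python
import os
import tempfile


def resolve_path_sketch(base_dir: str, file_path: str) -> str:
    """Hypothetical stand-in: expand `~`, anchor relative paths, check existence."""
    expanded = os.path.expanduser(file_path)
    if not os.path.isabs(expanded):
        # Relative paths (including ones with `../`) are anchored at base_dir.
        expanded = os.path.join(base_dir, expanded)
    resolved = os.path.normpath(expanded)
    if not os.path.exists(resolved):
        raise FileNotFoundError(resolved)
    return resolved


# Demo inside a throwaway directory tree: <tmp>/docs and <tmp>/files/report.txt
with tempfile.TemporaryDirectory() as tmp:
    docs = os.path.join(tmp, "docs")
    files = os.path.join(tmp, "files")
    os.makedirs(docs)
    os.makedirs(files)
    report = os.path.join(files, "report.txt")
    open(report, "w").close()

    resolved = resolve_path_sketch(docs, os.path.join("..", "files", "report.txt"))
    points_to_report = os.path.samefile(resolved, report)

    try:
        resolve_path_sketch(docs, "missing.txt")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```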


.. py:class:: SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

   Bases: :py:obj:`json.JSONEncoder`


   Custom JSON encoder that handles serialization of Python sets.

   This encoder extends the functionality of the standard JSONEncoder to support
   serializing Python sets. JSON does not have a native representation for sets,
   so this encoder converts sets to lists before serializing them.

   Usage:
       When serializing data to JSON using json.dump() or json.dumps(), specify
       cls=SetEncoder to use this custom encoder.

   .. rubric:: References

   - json.JSONEncoder: https://docs.python.org/3/library/json.html#json.JSONEncoder


   .. py:method:: default(obj)

      Override the default method of JSONEncoder to handle serialization of sets.

      .. rubric:: Notes

      If the object is a set, it is converted to a list before serialization.
      Otherwise, the default behavior of JSONEncoder.default() is used.
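
To illustrate the pattern that `SetEncoder` describes, here is a minimal, self-contained sketch of a `json.JSONEncoder` subclass that converts sets to lists. The class name `SetEncoderSketch` is hypothetical, and sorting the set for a deterministic output is an assumption that the source does not state.

```python
import json


class SetEncoderSketch(json.JSONEncoder):
    """Hypothetical stand-in for SetEncoder: serializes sets as lists."""

    def default(self, obj):
        if isinstance(obj, (set, frozenset)):
            # Assumption: sort for a stable output; requires comparable items.
            return sorted(obj)
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


encoded = json.dumps({"bin_1": {"K00002", "K00001"}}, cls=SetEncoderSketch)
```

Passing the encoder through `cls=` is the same mechanism the usage note above refers to for `json.dump()` and `json.dumps()`.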


.. py:function:: mtg_logger(filename: str) -> logging.getLogger

   Creates and returns a configured logger instance. This logger:

   - Logs messages to stdout with colored formatting using `colorlog`
   - Avoids adding duplicate handlers if called multiple times
   - Uses the given `filename` as the logger's name
   - Logs messages with level INFO and above

   :param filename: The filename of the script in which the logger will be used; it becomes the logger's name.

   :returns: The logger instance.
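
The "no duplicate handlers" behavior listed for `mtg_logger` can be sketched with the standard library alone. This is only an illustration: `make_logger_sketch` is a hypothetical name, and a plain `logging.Formatter` stands in for the `colorlog` formatter, whose exact configuration is not shown in the source.

```python
import logging
import sys


def make_logger_sketch(name: str) -> logging.Logger:
    """Hypothetical stand-in: INFO-level stdout logger without duplicate handlers."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # logging.getLogger returns the same object for the same name, so only
    # attach a handler the first time this function is called for that name.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))
        logger.addHandler(handler)
    return logger


first = make_logger_sketch("demo_script")
second = make_logger_sketch("demo_script")
first.info("logged once, even though the factory was called twice")
```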


.. py:function:: get_files_with_suffixes(directory: str, suffixes: List[str]) -> list[str]

   Recursively retrieves files from a specified directory and its subdirectories
   that have extensions matching a given list of suffixes.

   :param directory: The root directory to start the search.
   :param suffixes: A list of file suffixes (extensions) to match.
                    Each suffix should include the dot (e.g., '.txt', '.csv').

   :returns: A list of full paths to files that match any of the specified suffixes.

   .. rubric:: Examples

   >>> get_files_with_suffixes('/path/to/directory', ['.txt', '.csv'])
   ['/path/to/directory/file1.txt', '/path/to/directory/subdir/file2.csv']
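
A recursive suffix search like the one `get_files_with_suffixes` describes can be sketched with `os.walk`. The function name `find_files_sketch` is hypothetical, and returning the matches sorted is an assumption made here only to keep the demo deterministic.

```python
import os
import tempfile
from typing import List


def find_files_sketch(directory: str, suffixes: List[str]) -> List[str]:
    """Hypothetical stand-in: walk a directory tree and keep matching files."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.endswith(tuple(suffixes)):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Demo on a throwaway tree: a.txt, subdir/b.csv and c.log
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "subdir"))
    for rel in ("a.txt", os.path.join("subdir", "b.csv"), "c.log"):
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in find_files_sketch(tmp, [".txt", ".csv"])]
```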


.. py:function:: safe_literal_eval(value: Any)

   Safely evaluates a string that may represent a Python literal (e.g., list, dict, int).

   This function attempts to parse a string using :func:`ast.literal_eval`, which only evaluates
   Python literals (e.g., strings, numbers, tuples, lists, dicts, booleans, and None),
   avoiding the security risks of `eval()`*.
   If `value` is not a string or if evaluation fails,
   the original value is returned unchanged.

   :param value: The input to be evaluated. If it's a string that looks like a literal (e.g., "[1, 2]"),
                 it will be parsed. Otherwise, it's returned as is.
   :type value: Any

   :returns: The evaluated literal if successful, or the original value if evaluation fails.

   .. rubric:: Examples

   >>> safe_literal_eval("[1, 2, 3]")
   [1, 2, 3]

   >>> safe_literal_eval("{'a': 1}")
   {'a': 1}

   .. note::

      * Security risks of `eval`:
      https://www.adventuresinmachinelearning.com/safe-and-secure-eval-in-python-how-to-minimize-security-risks/
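
The fallback behavior described for `safe_literal_eval` (parse strings that hold literals, return everything else unchanged) can be sketched as follows. `literal_or_original` is a hypothetical name, and the set of exceptions caught is an assumption about which failures should be swallowed.

```python
import ast
from typing import Any


def literal_or_original(value: Any) -> Any:
    """Hypothetical stand-in: evaluate literal strings, pass anything else through."""
    if not isinstance(value, str):
        return value
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. a plain identifier): keep the input as is.
        return value


parsed = literal_or_original("{'a': 1}")
untouched = literal_or_original("bin_000023")
passthrough = literal_or_original(42)
```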


.. py:function:: flatten(list_of_lists: List) -> List

   Recursively flattens a nested list into a single-level list.

   This function handles arbitrarily nested lists and returns a new list
   containing all the leaf elements in the original order.

   :param list_of_lists: A list that may contain other nested lists.

   :returns: A flat list containing all non-list elements in the original order.

   .. rubric:: Examples

   >>> flatten([1, [2, [3, 4]], 5])
   [1, 2, 3, 4, 5]
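
The recursion behind `flatten` can be sketched in a few lines. `flatten_sketch` is a hypothetical name; treating only `list` instances (not tuples or other iterables) as nested containers is an assumption based on the description above.

```python
from typing import Any, List


def flatten_sketch(items: List[Any]) -> List[Any]:
    """Hypothetical stand-in: depth-first flattening that preserves order."""
    flat = []
    for item in items:
        if isinstance(item, list):
            # Recurse into nested lists and splice their leaves in place.
            flat.extend(flatten_sketch(item))
        else:
            flat.append(item)
    return flat


flat = flatten_sketch([1, [2, [3, 4]], 5])
# Wrapping the result in set() gives the deduplicated, unordered variant.
unique = set(flatten_sketch([1, [2, [2, 3]], 4, 1]))
```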


.. py:function:: flatten_list(lst: List, flat_list: List = None) -> set

   Recursively flattens a nested list and returns a set of unique elements.

   This function traverses all nested lists and collects elements into a set,
   removing any duplicates. The final result is unordered.

   :param lst: A list that may contain other nested lists.
   :param flat_list: Optional list used internally during recursion. It should not be set manually.

   :returns: A set containing all unique elements from the nested list.

   .. rubric:: Examples

   >>> flatten_list([1, [2, [2, 3]], 4, 1])
   {1, 2, 3, 4}


.. py:function:: run_until_done(command: str)

   Runs a command recursively until it is done.


.. py:function:: file_exists_and_nonzero(filename: str) -> bool

   Check if a file exists and its size is nonzero.

   :param filename: The path to the file.
   :type filename: str

   :returns: True if the file exists and its size is nonzero, False otherwise.
   :rtype: bool
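
A check like `file_exists_and_nonzero` can be expressed with two `os.path` calls. The sketch below uses the hypothetical name `exists_and_nonempty`; whether the library treats directories or symlinks specially is not stated in the source, so `os.path.isfile` is an assumption.

```python
import os
import tempfile


def exists_and_nonempty(filename: str) -> bool:
    """Hypothetical stand-in: True only for an existing file with size > 0."""
    return os.path.isfile(filename) and os.path.getsize(filename) > 0


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.txt")
    full = os.path.join(tmp, "full.txt")
    open(empty, "w").close()
    with open(full, "w") as handle:
        handle.write("K00001\n")
    results = (
        exists_and_nonempty(empty),
        exists_and_nonempty(full),
        exists_and_nonempty(os.path.join(tmp, "missing.txt")),
    )
```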


.. py:function:: split_list(input_list: list, chunk_size: int) -> list

   Splits a list into sublists of a user-defined size (`chunk_size`).
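
Chunking as described for `split_list` is commonly done by slicing in steps of `chunk_size`. The sketch below uses the hypothetical name `chunk_list`; rejecting a non-positive `chunk_size` is an assumption, since the source does not say how such input is handled.

```python
from typing import Any, List


def chunk_list(input_list: List[Any], chunk_size: int) -> List[List[Any]]:
    """Hypothetical stand-in: consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


chunks = chunk_list(["bin_1", "bin_2", "bin_3", "bin_4", "bin_5"], 2)
```

The last sublist is shorter whenever the list length is not a multiple of `chunk_size`.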


.. py:function:: many_to_one_files(dir_with_files: str, merged_file: str) -> None

   Merges all files in a directory into a single file by concatenating their rows one after the other.

   :param dir_with_files: Path of the directory whose files are to be merged
   :param merged_file: Path to the merged output file


.. py:function:: ko_list_parser(ko_list: str) -> Dict

   Parses the `ko_list` file into a dict object (based on DiTing).

   :param ko_list: Path to the `ko_list` file that comes from the kofam database (https://www.genome.jp/ftp/db/kofam/)
   :type ko_list: str

   :returns: A dictionary mapping knum to threshold and score_type


.. py:function:: merge_ko(hmmout_dir: str, output: str) -> None

   Parses the KO<>.<bin>.hmmout files produced by the kegg_annotation() function
   to create a single 3-column file (output) with the bin_id, the corresponding contig
   and the KO that was mapped to it.
   The function then returns a dictionary with the bin ids as the keys and the set of KOs found in each as the value.

   :param hmmout_dir: path to the .hmmout files
   :param output: Path/filename to save the output file


.. py:function:: bin_kos_to_file(hmmout_dir: str, bin_id: str) -> None

   Builds a 3-col file for a bin and removes the KO-specific output files of `hmmsearch`

   :param hmmout_dir: Directory containing the hmmout files
   :param bin_id: Name of the sequence id under study


.. py:function:: parse_hmmout(hmmout_file: str, hmmout_dir: str) -> Tuple[str, str, str]

   Parses the output of hmmsearch to return the sequence id along with
   a gene and its corresponding KEGG ORTHOLOGY term as mentioned in the `hmmout_file`.

   :param hmmout_file: Filename of the .hmmout file
   :param hmmout_dir: Directory where hmmout_file is located

   :returns: A tuple consisting of:

             - basename: Sequence id
             - gene_id: Gene id
             - k_number: KEGG ORTHOLOGY term found
   :rtype: Tuple[str, str, str]


.. py:function:: load_merged_ko_file(merged_ko: str) -> pandas.DataFrame

   Loads the 3-column KEGG annotations file as built by merge_ko().

   :param merged_ko: Path to the 3-column output file of merge_ko()

   :returns: A presence-absence (1/0) DataFrame (`pivot_df`) where KOs are the rows and bin_ids the columns
   :rtype: pandas.DataFrame


.. py:function:: convert_to_json_serializable(obj: Any) -> Any

   Recursively serializes the entries of an object.
   A set is converted to a list, a list is flattened to its items,
   and a dictionary keeps its keys while their values get serialized.

   .. note::

      This is an essential step both for allowing a jsonified response and for being able
      to dump a dictionary as a JSON file.
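
The recursive conversion described for `convert_to_json_serializable` can be sketched as below. `to_serializable` is a hypothetical name, and reading "a list is flattened to its items" as "each item of a list is converted in turn" is an interpretation, not something the source confirms.

```python
import json
from typing import Any


def to_serializable(obj: Any) -> Any:
    """Hypothetical stand-in: make nested sets/lists/dicts safe for json.dumps."""
    if isinstance(obj, dict):
        # Keys are kept; only the values are converted.
        return {key: to_serializable(value) for key, value in obj.items()}
    if isinstance(obj, (set, frozenset, list, tuple)):
        # Sets become lists; items of any container are converted in turn.
        return [to_serializable(item) for item in obj]
    return obj


payload = to_serializable({"bin_1": {"K00001"}, "bin_2": [{"K00002"}, "x"]})
text = json.dumps(payload)
```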


.. py:function:: ensure_flashweave_format(conf: microbetag.config.Config) -> None

   Build an OTU table that will be in a FlashWeave-based format.

   .. note:: Saves abundance data to be used with FlashWeave in the output directory.


.. py:function:: ensure_same_namespace_after_fw(conf: microbetag.config.Config) -> None

   Reads the FlashWeave edgelist file and tries to map the sequence ids in the node columns of the edgelist
   to their corresponding ids in the abundance table.

   .. attention::

      The need for this was first met with a local data set where sequence ids looked like
      D300244:bin_000023 in the abundance table,
      whereas in the edgelist returned by FlashWeave those ids were changed to D300244.bin_000023.
      
      # NOTE (Haris Zafeiropoulos, 2025-05-16):
      After a few changes this behavior changed but I am not sure why.
      Thus, maybe this step is not necessary anymore and it could be removed.
      Yet, tests are required.

   .. note:: Apparently, the conf.network in this case is in the FlashWeave network format, hence the `skiprows=2`.


.. py:function:: extend_complements(complements_json: str, descrps_path: str, path_compl_perce: int, path_compl_dir: str) -> Dict

   Extends pathway complement annotations based on given settings and descriptions.

   :param complements_json: Path to the complements JSON file, from which the dictionary of complements is loaded.
   :type complements_json: str
   :param descrps_path: Path to the KEGG MODULES description file (tab-separated file with no header).
   :type descrps_path: str
   :param path_compl_perce: Maximum allowable percentage of required KOs that must be present.
   :type path_compl_perce: int
   :param path_compl_dir: Directory to save the extended complements JSON file.
   :type path_compl_dir: str

   :returns: A dictionary with pathway complementarities to be assigned in the MGG format

   .. note:: Here we build `pathway_complements_extended.json`, a JSON file with the dictionary returned.


.. py:function:: extend_faprotax(faprotax_sub_tables, sequence_id_column_name) -> Tuple[dict[str, list], list[str]]

   Parses the sub-tables of the FAPROTAX analysis
   to assign the biological processes related to each sequence id.

   :returns: A tuple consisting of:

             - bin_faprotax_traits: A dictionary with the sequence id as key and a list of FAPROTAX traits as value
             - faprotax_traits: A list with the unique set of the FAPROTAX traits found across all taxa of the study
   :rtype: Tuple[dict[str, list], list[str]]


.. py:function:: load_phenotypic_traits(phen_outdir) -> Tuple[Dict[str, Dict[str, Union[str, float]]], Set[str]]

   Loads phenotrex-based trait files and assigns them per genome.

   :returns: A tuple consisting of:

             - bin_phen_traits: A dictionary with genome id as key and a dictionary as value,
               with each phenotrex-based trait as key and its presence/absence
               and score as value
             - phentraits: A set with the traits present
   :rtype: Tuple[Dict[str, Dict[str, Union[str, float]]], Set[str]]

   .. note::

      Example of a `bin_phen_traits`:
      ```
      bin_phen_traits[bin_id][trait_name] = {
          "presence": case["Trait present"],
          "confidence": case["Confidence"],
      }
      ```


.. py:function:: is_any_nan(x) -> bool

   Checks whether the input value is NaN (Not a Number).

   It first tries to use `numpy.isnan()` for numerical or array-like inputs.
   If that fails (e.g., for non-numeric types), it falls back to checking if the string
   representation of `x` is equal to 'nan' (case-insensitive).

   :returns: bool


.. py:function:: remove_nan_from_list(lst: List) -> List

   Removes NaN values from a list using :func:`is_any_nan`.
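
The two-step NaN check described for `is_any_nan` (numeric test first, string comparison as a fallback) and the list filter built on it can be sketched as follows. The names `looks_like_nan` and `drop_nan` are hypothetical, and `math.isnan` is used here as a standard-library stand-in for `numpy.isnan`, so array inputs are not covered by this sketch.

```python
import math
from typing import Any, List


def looks_like_nan(x: Any) -> bool:
    """Hypothetical stand-in: numeric NaN test with a string fallback."""
    try:
        return math.isnan(x)
    except (TypeError, ValueError, OverflowError):
        # Non-numeric input: compare its string form, ignoring case.
        return str(x).strip().lower() == "nan"


def drop_nan(items: List[Any]) -> List[Any]:
    """Keep only the items that do not look like NaN."""
    return [item for item in items if not looks_like_nan(item)]


cleaned = drop_nan(["K00001", float("nan"), "NaN", 3.5, None])
```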


.. py:function:: detect_separator(file_path: str) -> str

   Detects the separator used in a text file, e.g. a tab (``\t``), `,` or `;`.

   It makes use of :class:`csv.Sniffer` and takes a sample of the text based on its size.

   :param file_path: Path to the file to be considered

   :returns: A separator, e.g. ","
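
Separator detection with `csv.Sniffer`, as mentioned for `detect_separator`, can be sketched like this. `sniff_separator` is a hypothetical name; the fixed `sample_size` and the restricted candidate set `",;\t"` are assumptions, since the source only says that the sample depends on the file size.

```python
import csv
import os
import tempfile


def sniff_separator(file_path: str, sample_size: int = 4096) -> str:
    """Hypothetical stand-in: guess the delimiter from the start of the file."""
    with open(file_path, newline="") as handle:
        sample = handle.read(sample_size)
    # Restricting the candidates avoids picking characters inside the ids.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    return dialect.delimiter


with tempfile.TemporaryDirectory() as tmp:
    edgelist = os.path.join(tmp, "edgelist.tsv")
    with open(edgelist, "w") as handle:
        handle.write("bin_1\tbin_2\t0.81\nbin_2\tbin_3\t-0.42\nbin_1\tbin_3\t0.15\n")
    separator = sniff_separator(edgelist)
```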


.. py:function:: find_three_column_format(file_path: str, delimiter: str) -> tuple[int, Union[None, int]]

   Checks if a file is in a three-column format and whether the third column is a float.
   If a row does not match, it is skipped and the next row is checked for the 3-column format.
   Once a matching row is found, its line number is returned; if no row ever matches, an Exception is raised.

   :param file_path: Path to the file to be checked.
   :type file_path: str
   :param delimiter: The delimiter used to separate columns (e.g., ``\t`` for tab-separated values).
   :type delimiter: str

   :returns: (line_number, None) if the third column is a float, (line_number, 0) otherwise.
   :rtype: tuple
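
One possible reading of `find_three_column_format` is sketched below: scan for the first line that splits into three fields, then report whether its third field parses as a float. `first_three_column_line` is a hypothetical name, and both the zero-based line numbering and the interpretation of the second tuple element (`None` for a numeric third column, `0` otherwise) are assumptions drawn from the return description above.

```python
import os
import tempfile
from typing import Optional, Tuple


def first_three_column_line(file_path: str, delimiter: str) -> Tuple[int, Optional[int]]:
    """Hypothetical stand-in: locate the first three-field line of a file."""
    with open(file_path) as handle:
        for line_number, line in enumerate(handle):
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) != 3:
                continue  # Not three columns yet: skip this row.
            try:
                float(fields[2])
            except ValueError:
                return line_number, 0  # Third field is text, e.g. a header.
            return line_number, None  # Third field is numeric.
    raise ValueError(f"No three-column line found in {file_path}")


with tempfile.TemporaryDirectory() as tmp:
    commented = os.path.join(tmp, "commented.edgelist")
    with open(commented, "w") as handle:
        handle.write("# comment\n# another comment\nnode_a\tnode_b\tweight\nbin_1\tbin_2\t0.8\n")
    plain = os.path.join(tmp, "plain.edgelist")
    with open(plain, "w") as handle:
        handle.write("bin_1\tbin_2\t0.8\n")
    with_header = first_three_column_line(commented, "\t")
    without_header = first_three_column_line(plain, "\t")
```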


.. py:function:: get_tool_location(software: str) -> str

   Check if a software program is available in the system path or in the alternative location.

   Returns either the software name itself, which can then be run as is (globally),
   or the full path to the software if it is found in the alternative location.
   In both cases, the return value allows running the software.

   If the software is not available, it raises a SystemExit() error with a message about the missing software.

   :param software: Name of the software program to be found
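
The lookup order described for `get_tool_location` (system path first, then an alternative location, otherwise exit) can be sketched with `shutil.which`. `locate_tool` and its `alt_dir` argument are hypothetical: the source does not say where the alternative location is or how it is configured, and this sketch only checks that a file with that name exists there.

```python
import os
import shutil
import tempfile


def locate_tool(software: str, alt_dir: str) -> str:
    """Hypothetical stand-in: prefer PATH, fall back to alt_dir, else exit."""
    if shutil.which(software) is not None:
        return software  # Callable globally by its bare name.
    candidate = os.path.join(alt_dir, software)
    if os.path.isfile(candidate):
        return candidate  # Callable through its full path.
    raise SystemExit(f"{software} was not found on PATH or in {alt_dir}")


with tempfile.TemporaryDirectory() as alt_dir:
    fake_tool = os.path.join(alt_dir, "mtg_sketch_fake_tool")
    open(fake_tool, "w").close()
    found_in_alt = locate_tool("mtg_sketch_fake_tool", alt_dir) == fake_tool

    try:
        locate_tool("mtg_sketch_missing_tool", alt_dir)
        missing_exits = False
    except SystemExit:
        missing_exits = True
```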