microbetag.utils
================

.. py:module:: microbetag.utils

.. autoapi-nested-parse::

   Utility functions to be used across the `microbetag` library.


Classes
-------

.. autoapisummary::

   microbetag.utils.SetEncoder


Functions
---------

.. autoapisummary::

   microbetag.utils.get_library_version
   microbetag.utils.resolve_relative_path
   microbetag.utils.resolve_file_path
   microbetag.utils.mtg_logger
   microbetag.utils.get_files_with_suffixes
   microbetag.utils.safe_literal_eval
   microbetag.utils.flatten
   microbetag.utils.flatten_list
   microbetag.utils.run_until_done
   microbetag.utils.file_exists_and_nonzero
   microbetag.utils.split_list
   microbetag.utils.many_to_one_files
   microbetag.utils.ko_list_parser
   microbetag.utils.merge_ko
   microbetag.utils.bin_kos_to_file
   microbetag.utils.parse_hmmout
   microbetag.utils.load_merged_ko_file
   microbetag.utils.convert_to_json_serializable
   microbetag.utils.ensure_flashweave_format
   microbetag.utils.ensure_same_namespace_after_fw
   microbetag.utils.extend_complements
   microbetag.utils.extend_faprotax
   microbetag.utils.load_phenotypic_traits
   microbetag.utils.is_any_nan
   microbetag.utils.remove_nan_from_list
   microbetag.utils.detect_separator
   microbetag.utils.find_three_column_format
   microbetag.utils.get_tool_location


Module Contents
---------------

.. py:function:: get_library_version(library_name: str) -> str

   Returns the version of a Python library.


.. py:function:: resolve_relative_path(base_dir: str, file_path: str) -> str

   Resolves a relative file path into an absolute file path based on a given base directory.

   This function processes a relative `file_path` (which may contain one or more `../` segments) and resolves it into an absolute path by moving back the corresponding number of directory levels from `base_dir`. It returns the resulting absolute file path.

   :param base_dir: The base directory from which to resolve the relative `file_path`. This should be an absolute path to a directory.
   :type base_dir: str
   :param file_path: The relative file path to be resolved. It may contain `../` to navigate up the directory hierarchy.
   :type file_path: str

   :returns: The resolved absolute file path.
   :rtype: str

   .. rubric:: Examples

   >>> resolve_relative_path("/home/user/docs", "../files/report.txt")
   '/home/user/files/report.txt'


.. py:function:: resolve_file_path(base_dir: str, file_path: str) -> str

   Resolves a file path relative to a given base directory and returns the absolute file path.

   If the provided `file_path` is relative, it is resolved using the `base_dir`. The function handles absolute paths, user directory expansion (e.g., `~`), and relative paths (e.g., `../`).

   :param base_dir: The base directory to resolve relative paths from.
   :type base_dir: str
   :param file_path: The file path to resolve. It can be absolute, relative, or use `~` for the home directory.
   :type file_path: str

   :returns: The resolved absolute file path.
   :rtype: str

   :raises FileNotFoundError: If the resolved file path does not exist.

   .. rubric:: Examples

   >>> resolve_file_path("/home/user/docs", "~/file.txt")
   '/home/user/file.txt'


.. py:class:: SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

   Bases: :py:obj:`json.JSONEncoder`

   Custom JSON encoder that handles serialization of Python sets.

   This encoder extends the functionality of the standard JSONEncoder to support serializing Python sets. JSON does not have a native representation for sets, so this encoder converts sets to lists before serializing them.

   Usage: When serializing data to JSON using json.dump() or json.dumps(), specify cls=SetEncoder to use this custom encoder.

   .. rubric:: References

   - json.JSONEncoder: https://docs.python.org/3/library/json.html#json.JSONEncoder

   .. py:method:: default(obj)

      Override the default method of JSONEncoder to handle serialization of sets.

      .. rubric:: Notes

      If the object is a set, it is converted to a list before serialization. Otherwise, the default behavior of JSONEncoder.default() is used.


.. py:function:: mtg_logger(filename: str) -> logging.getLogger

   Creates and returns a configured logger instance.

   This logger:

   - Logs messages to stdout with colored formatting using `colorlog`
   - Avoids adding duplicate handlers if called multiple times
   - Uses the given `filename` as the logger's name
   - Logs messages with level INFO and above

   :param filename: The filename of the script to which the logger will be applied.

   :returns: The logger instance.


.. py:function:: get_files_with_suffixes(directory: str, suffixes: List[str]) -> list[str]

   Recursively retrieves files from a specified directory and its subdirectories that have extensions matching a given list of suffixes.

   :param directory: The root directory to start the search.
   :param suffixes: A list of file suffixes (extensions) to match. Each suffix should include the dot (e.g., '.txt', '.csv').

   :returns: A list of full paths to files that match any of the specified suffixes.

   .. rubric:: Example

   >>> get_files_with_suffixes('/path/to/directory', ['.txt', '.csv'])
   ['/path/to/directory/file1.txt', '/path/to/directory/subdir/file2.csv']


.. py:function:: safe_literal_eval(value: Any)

   Safely evaluates a string that may represent a Python literal (e.g., list, dict, int).

   This function attempts to parse a string using :func:`ast.literal_eval`, which only evaluates Python literals (e.g., strings, numbers, tuples, lists, dicts, booleans, and None), avoiding the security risks of `eval()`\*. If `value` is not a string or if evaluation fails, the original value is returned unchanged.

   :param value: The input to be evaluated. If it's a string that looks like a literal (e.g., "[1, 2]"), it will be parsed. Otherwise, it's returned as is.
   :type value: Any

   :returns: The evaluated literal if successful, or the original value if evaluation fails.

   .. rubric:: Examples

   >>> safe_literal_eval("[1, 2, 3]")
   [1, 2, 3]
   >>> safe_literal_eval("{'a': 1}")
   {'a': 1}

   .. note:: \* Security risks of `eval`: https://www.adventuresinmachinelearning.com/safe-and-secure-eval-in-python-how-to-minimize-security-risks/


.. py:function:: flatten(list_of_lists: List) -> List

   Recursively flattens a nested list into a single-level list.

   This function handles arbitrarily nested lists and returns a new list containing all the leaf elements in the original order.

   :param list_of_lists: A list that may contain other nested lists.

   :returns: A flat list containing all non-list elements in the original order.

   .. rubric:: Examples

   >>> flatten([1, [2, [3, 4]], 5])
   [1, 2, 3, 4, 5]


.. py:function:: flatten_list(lst: List, flat_list: List = None) -> set

   Recursively flattens a nested list and returns a set of unique elements.

   This function traverses all nested lists and collects elements into a set, removing any duplicates. The final result is unordered.

   :param lst: A list that may contain other nested lists.
   :param flat_list: Optional list. Used internally during recursion. Should not be set manually.

   :returns: A set containing all unique elements from the nested list.

   .. rubric:: Examples

   >>> flatten_list([1, [2, [2, 3]], 4, 1])
   {1, 2, 3, 4}


.. py:function:: run_until_done(command: str)

   Function to run recursively a command until


.. py:function:: file_exists_and_nonzero(filename: str) -> bool

   Check if a file exists and its size is nonzero.

   :param filename: The path to the file.
   :type filename: str

   :returns: True if the file exists and its size is nonzero, False otherwise.
   :rtype: bool


.. py:function:: split_list(input_list: list, chunk_size: int) -> list

   Split a list into sublists of a user-defined size (`chunk_size`).


.. py:function:: many_to_one_files(dir_with_files: str, merged_file: str) -> None

   Builds a single file from all files in a directory by concatenating their rows one after the other.

   :param dir_with_files: Path of the directory whose files are to be merged
   :param merged_file: Path to the merged output file


.. py:function:: ko_list_parser(ko_list: str) -> Dict

   Parses the ko_list file into a dict object; based on DiTing.

   :param ko_list: Path to the `ko_list` file that comes from the kofam database https://www.genome.jp/ftp/db/kofam/
   :type ko_list: str

   :returns: A dictionary mapping knum to threshold and score_type


.. py:function:: merge_ko(hmmout_dir: str, output: str) -> None

   Parses the KO<>..hmmout files produced by the kegg_annotation() function to create a single 3-column file (output) with the bin_id, the corresponding contig and the KO that was mapped to it. The function then returns a dictionary with the bin ids as the keys and the set of KOs found in each as the value.

   :param hmmout_dir: Path to the .hmmout files
   :param output: Path/filename to save the output file


.. py:function:: bin_kos_to_file(hmmout_dir: str, bin_id: str) -> None

   Builds a 3-column file for a bin and removes the KO-specific output files of `hmmsearch`.

   :param hmmout_dir: Directory containing the hmmout files
   :param bin_id: Name of the sequence id under study


.. py:function:: parse_hmmout(hmmout_file: str, hmmout_dir: str) -> Tuple[str, str, str]

   Parses the output of hmmsearch to return the sequence id along with a gene and its corresponding KEGG ORTHOLOGY term as mentioned in the `hmmout_file`.

   :param hmmout_file: Filename of the .hmmout file
   :param hmmout_dir: Directory where hmmout_file is located

   :returns: A tuple consisting of

             - basename: Sequence id
             - gene_id: Gene id
             - k_number: KEGG ORTHOLOGY term found


.. py:function:: load_merged_ko_file(merged_ko: str) -> pandas.DataFrame

   Loads the 3-column KEGG annotations file as built by merge_ko().

   :param merged_ko: Path to the 3-column output file of merge_ko()

   :returns: pivot_df, a presence-absence (1/0) DataFrame where KOs are the rows and bin_ids the columns
   :rtype: pandas.DataFrame


.. py:function:: convert_to_json_serializable(obj: Any) -> Any

   Recursively serializes the entries of an object.

   A set is converted to a list, a list is flattened to its items, and a dictionary keeps its keys while its values get serialized.

   .. note:: This is an essential step both for allowing a jsonified response and for dumping a dictionary as a JSON file.


.. py:function:: ensure_flashweave_format(conf: microbetag.config.Config) -> None

   Build an OTU table that will be in a FlashWeave-based format.

   .. note:: Saves abundance data to be used with FlashWeave in the output directory.


.. py:function:: ensure_same_namespace_after_fw(conf: microbetag.config.Config) -> None

   Reads the FlashWeave edgelist file and tries to map the sequence ids in the node columns of the edgelist to their corresponding ids in the abundance table.

   .. attention::

      The need for this was first met with a local data set where sequence ids were like D300244:bin_000023 in the abundance table, while in the edgelist returned by FlashWeave those ids were changed to D300244.bin_000023.

      # NOTE (Haris Zafeiropoulos, 2025-05-16): After a few changes this behavior changed but I am not sure why. Thus, maybe this step is not necessary anymore and it could be removed. Yet, tests are required.

   .. note:: Apparently, the conf.network in this case is in the format of FlashWeave networks, hence the `skiprows=2`.


.. py:function:: extend_complements(complements_json: str, descrps_path: str, path_compl_perce: int, path_compl_dir: str) -> Dict

   Extends pathway complement annotations based on given settings and descriptions.

   :param complements_json: Path to the complements JSON file.
   :param descrps_path: Path to the KEGG MODULES description file (tab-separated file with no header).
   :type descrps_path: str
   :param path_compl_perce: Maximum allowable percentage of required KOs that must be present.
   :param path_compl_dir: Directory to save the extended complements JSON file.
   :param complements_dict: Dictionary of complements loaded from a JSON file.
   :type complements_dict: dict

   :returns: A dictionary with pathway complementarities to be assigned in the MGG format

   .. note:: Here we build `pathway_complements_extended.json`, a JSON file with the dictionary returned.


.. py:function:: extend_faprotax(faprotax_sub_tables, sequence_id_column_name) -> Tuple[dict[str, list], list[str]]

   Parses the sub tables of the faprotax analysis to assign the biological processes related to each sequence id.

   :returns: A tuple consisting of

             - bin_faprotax_traits: A dictionary with the sequence id as key and a list of FAPROTAX traits as value
             - faprotax_traits: A list with the unique set of the FAPROTAX traits found across all taxa of the study


.. py:function:: load_phenotypic_traits(phen_outdir) -> Tuple[Dict[str, Dict[str, Union[str, float]]], Set[str]]

   Loads phenotrex-based trait files and assigns them per genome.

   :returns: A tuple consisting of

             - bin_phen_traits: A dictionary with the genome id as key and a dictionary as value, with each phenotrex-based trait as key and its presence/absence and confidence score as value
             - phentraits: A set with the traits present

   .. note::

      Example of a `bin_phen_traits`:

      ```
      bin_phen_traits[bin_id][trait_name] = {
          "presence": case["Trait present"],
          "confidence": case["Confidence"],
      }
      ```


.. py:function:: is_any_nan(x) -> bool

   Checks whether the input value is NaN (Not a Number).

   It first tries to use `numpy.isnan()` for numerical or array-like inputs.
   If that fails (e.g., for non-numeric types), it falls back to checking if the string representation of `x` is equal to 'nan' (case-insensitive).

   :returns: True if the input is NaN, False otherwise.
   :rtype: bool


.. py:function:: remove_nan_from_list(lst: List) -> List

   Removes NaN values from a list using :func:`is_any_nan`.


.. py:function:: detect_separator(file_path: str) -> str

   Detects the separator used in a text file, e.g. `\t`, `,`, `;`, etc. It makes use of :class:`csv.Sniffer` and gets a sample of the text based on its size.

   :param file_path: Path to the file to be considered

   :returns: A separator, e.g. ","


.. py:function:: find_three_column_format(file_path: str, delimiter: str) -> tuple[int, Union[None, int]]

   Checks if a file is in a three-column format and whether the third column is a float. If a row does not match, it is skipped and the next row is checked for the 3-column format. Once a matching row is met, its line number is returned; if that is never the case, an Exception is raised.

   :param file_path: Path to the file to be checked.
   :type file_path: str
   :param delimiter: The delimiter used to separate columns (e.g., '\t' for tab-separated values).
   :type delimiter: str

   :returns: (line_number, None) if the third column is a float, (line_number, 0) otherwise.
   :rtype: tuple


.. py:function:: get_tool_location(software: str) -> str

   Checks if a software program is available in the system path or in the alternative location. Returns either the software name itself, which can then be run as is (globally), or the full path to the software if it is found in the alternative location. In both cases, the return value allows running the software. If the software is not available, it raises a SystemExit() error with a message about the missing software.

   :param software: Name of the software program to be found
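
The lookup order described for `get_tool_location` can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `find_tool` and its `alt_dir` argument are hypothetical, since the documentation above does not say where the alternative location is or how it is configured.

```python
import os
import shutil


def find_tool(software: str, alt_dir: str) -> str:
    """Return a runnable name or path for `software`, or exit if it is missing.

    Hypothetical helper: `alt_dir` stands in for the undocumented
    "alternative location" mentioned in the text above.
    """
    # First choice: the tool is on the system PATH, so its bare name is runnable.
    if shutil.which(software):
        return software
    # Fallback: look for an executable file inside the alternative directory.
    candidate = os.path.join(alt_dir, software)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Neither location has the tool: stop with a message naming it.
    raise SystemExit(f"Required software not found: {software}")
```

Either return value can be passed directly as the first element of a `subprocess.run` argument list, which is the property the description above relies on.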