microbetag.utils¶
Utilities supporting file parsing and formatting tasks.
Module Contents¶
- class microbetag.utils.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶
Bases:
json.JSONEncoder
Custom JSON encoder that handles serialization of Python sets.
This encoder extends the functionality of the standard JSONEncoder to support serializing Python sets. JSON does not have a native representation for sets, so this encoder converts sets to lists before serializing them.
- Usage:
When serializing data to JSON using json.dump() or json.dumps(), specify cls=SetEncoder to use this custom encoder.
References
json.JSONEncoder: https://docs.python.org/3/library/json.html#json.JSONEncoder
- default(obj)[source]¶
Override the default method of JSONEncoder to handle serialization of sets.
Notes
If the object is a set, it is converted to a list before serialization. Otherwise, the default behavior of JSONEncoder.default() is used.
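A minimal sketch of the behaviour described above, assuming the encoder only needs to turn sets into lists. The class below is an illustrative stand-in written for this example, not the module's source; sorting the set is an extra assumption made here so that the output is deterministic.

```python
import json


class SetEncoder(json.JSONEncoder):
    """Illustrative stand-in: serialize Python sets as JSON arrays."""

    def default(self, obj):
        if isinstance(obj, set):
            # Sets are unordered; sorting is an assumption of this sketch
            # that keeps the serialized output stable between runs.
            return sorted(obj)
        # Any other unsupported type falls through to the base class,
        # which raises TypeError.
        return super().default(obj)


payload = {"bin_1": {"K00001", "K00002"}}
encoded = json.dumps(payload, cls=SetEncoder)
print(encoded)
```

Without `cls=SetEncoder`, the same `json.dumps()` call raises a `TypeError`, because sets are not JSON serializable by default.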
- microbetag.utils.get_files_with_suffixes(directory, suffixes)[source]¶
Recursively retrieves files from a specified directory and its subdirectories that have extensions matching a given list of suffixes.
- Parameters:
directory (str) – The root directory to start the search.
suffixes (list of str) – A list of file suffixes (extensions) to match. Each suffix should include the dot (e.g., '.txt', '.csv').
- Returns:
A list of full paths to files that match any of the specified suffixes.
- Return type:
list of str
- Example:
>>> get_files_with_suffixes('/path/to/directory', ['.txt', '.csv'])
['/path/to/directory/file1.txt', '/path/to/directory/subdir/file2.csv']
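The recursive search described above could be sketched with `os.walk`. The helper name `find_files_with_suffixes` is hypothetical and the body is an assumption about one reasonable approach, not the module's actual implementation.

```python
import os


def find_files_with_suffixes(directory, suffixes):
    """Hypothetical helper: collect files under `directory` whose names
    end with any of the given suffixes."""
    wanted = tuple(suffixes)  # str.endswith accepts a tuple of suffixes
    matches = []
    # os.walk visits the root directory and every subdirectory below it.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(wanted):
                matches.append(os.path.join(root, name))
    return matches
```

The order of the returned paths follows the directory traversal order, so callers that need a stable order may want to sort the result.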
- microbetag.utils.flatten(list_of_lists: List)[source]¶
Takes a list of lists and flattens it, returning a single list that contains all the elements of the original.
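As an illustration of flattening, here is a short recursive sketch. The name `flatten_nested` is hypothetical, and handling arbitrarily deep nesting is an assumption; the module's function may only handle a fixed depth.

```python
def flatten_nested(items):
    """Hypothetical helper: return a flat list with every non-list
    element of `items`, in their original order."""
    flat = []
    for item in items:
        if isinstance(item, list):
            # Recurse into sublists and append their elements in place.
            flat.extend(flatten_nested(item))
        else:
            flat.append(item)
    return flat


print(flatten_nested([[1, 2], [3, [4, 5]], 6]))  # prints [1, 2, 3, 4, 5, 6]
```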
- microbetag.utils.file_exists_and_nonzero(filename: str)[source]¶
Check if a file exists and its size is nonzero.
- Parameters:
filename (str) – The path to the file.
- Returns:
True if the file exists and its size is nonzero, False otherwise.
- Return type:
bool
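A check of this kind can be written with the standard library alone. The sketch below uses the hypothetical name `exists_and_nonzero` and is only one possible way to implement the documented behaviour.

```python
import os


def exists_and_nonzero(filename):
    """Hypothetical helper: True only for an existing regular file
    whose size is greater than zero bytes."""
    # os.path.isfile is checked first, so os.path.getsize is never
    # called on a path that does not exist.
    return os.path.isfile(filename) and os.path.getsize(filename) > 0
```

Such a check is useful before parsing tool output, where an empty file usually means the upstream step produced no results.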
- microbetag.utils.split_list(input_list: List, chunk_size: int)[source]¶
Split a list into sublists of size chunk_size.
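Chunking a list can be sketched with slicing. The name `chunk_list` is hypothetical, and letting the last sublist be shorter when the length is not a multiple of `chunk_size` is an assumption of this sketch.

```python
def chunk_list(input_list, chunk_size):
    """Hypothetical helper: split `input_list` into consecutive
    sublists of `chunk_size` elements (the last one may be shorter)."""
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


print(chunk_list([1, 2, 3, 4, 5], 2))  # prints [[1, 2], [3, 4], [5]]
```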
- microbetag.utils.many_to_one_files(dir_with_files, merged_file)[source]¶
Makes a single file out of all files in a directory by concatenating them, with the rows of each file placed one after the other.
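The concatenation could look roughly like the sketch below. The name `concatenate_files`, the alphabetical file order, and the newline handling are all assumptions made for illustration; the sketch also assumes `merged_file` lives outside `dir_with_files`.

```python
import os


def concatenate_files(dir_with_files, merged_file):
    """Hypothetical helper: write the rows of every regular file in
    `dir_with_files` into `merged_file`, one file after the other."""
    names = sorted(os.listdir(dir_with_files))  # assumed order: alphabetical
    with open(merged_file, "w") as out:
        for name in names:
            path = os.path.join(dir_with_files, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            with open(path) as handle:
                for line in handle:
                    # Make sure a file without a trailing newline does not
                    # glue its last row to the first row of the next file.
                    out.write(line if line.endswith("\n") else line + "\n")
```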
- microbetag.utils.ko_list_parser(ko_list: str)[source]¶
Parses ko_list file into a dict object - based on DiTing
- Parameters:
ko_list – path to the ko_list file that comes from the kofam database https://www.genome.jp/ftp/db/kofam/
- Returns:
a dictionary mapping knum to threshold and score_type
- Return type:
dict
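To illustrate the returned mapping, here is a sketch that assumes the ko_list file is tab-separated with a header row containing the columns `knum`, `threshold` and `score_type`. The function name `parse_ko_list` and the nested-dict layout of the values are assumptions for this example, not the module's actual structure.

```python
import csv


def parse_ko_list(ko_list):
    """Hypothetical helper: map each knum to its threshold and score_type.

    Assumes a tab-separated file whose header includes the columns
    'knum', 'threshold' and 'score_type'.
    """
    ko_dict = {}
    with open(ko_list, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Values are kept as strings, so a missing or non-numeric
            # threshold does not raise during parsing.
            ko_dict[row["knum"]] = {
                "threshold": row["threshold"],
                "score_type": row["score_type"],
            }
    return ko_dict
```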
- microbetag.utils.merge_ko(hmmout_dir, output)[source]¶
Parses the KO<>.<bin>.hmmout files produced by the kegg_annotation() function to create a single 3-column file (output) with the bin_id, the corresponding contig and the KO that was mapped to it. The function then returns a dictionary with the bin ids as keys and the set of KOs found in each bin as values.
- Parameters:
hmmout_dir (str) – path to the .hmmout files
output (str) – path/filename to save the output file
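The returned dictionary (bin id mapped to its set of KOs) can be illustrated by reading a 3-column file back in. This sketch assumes a tab-separated file with the columns bin_id, contig and KO and no header; the name `bins_to_kos` is hypothetical.

```python
from collections import defaultdict


def bins_to_kos(three_col_file):
    """Hypothetical helper: build {bin_id: set of KOs} from a
    tab-separated file with the columns bin_id, contig, KO."""
    bin_kos = defaultdict(set)
    with open(three_col_file) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # ignore blank or malformed rows
            bin_id, _contig, ko = fields
            # A set keeps each KO once per bin, however many contigs carry it.
            bin_kos[bin_id].add(ko)
    return dict(bin_kos)
```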
- microbetag.utils.bin_kos_to_file(hmmout_dir, bin_id)[source]¶
Builds a 3-column file for a bin and removes the KO-specific output files of hmmsearch.
- Parameters:
hmmout_dir – Directory to the hmmout files
ko_tmp
- microbetag.utils.parse_hmmout(hmmout_file, hmmout_dir)[source]¶
Parses the output of the hmmsearch
- Parameters:
hmmout_file (str) – Filename of the .hmmout file
hmmout_dir (str) – Directory where hmmout_file is located
- Return basename (str):
Bin id
- Return gene_id (str):
Gene id
- Return k_number (str):
KEGG ORTHOLOGY term found
- microbetag.utils.load_merged_ko_file(merged_ko)[source]¶
Load the 3-column KEGG annotations file built by merge_ko().
- Input:
merged_ko (str): path to the 3-column output file of merge_ko()
- Returns:
a presence-absence (1/0) df where KOs are the rows and bin_ids the columns
- Return type:
pivot_df (pd.DataFrame)
- microbetag.utils.convert_to_json_serializable(obj)[source]¶
Recursively converts the entries of an object into JSON-serializable types: a set is converted to a list, a list has each of its items converted, and a dictionary keeps its keys while its values are converted.
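The recursion described above can be sketched as follows. The name `to_json_serializable` is hypothetical, and passing every other type through unchanged is an assumption of this sketch.

```python
import json


def to_json_serializable(obj):
    """Hypothetical helper: recursively replace sets with lists so the
    result can be passed to json.dumps()."""
    if isinstance(obj, set):
        return [to_json_serializable(item) for item in obj]
    if isinstance(obj, list):
        return [to_json_serializable(item) for item in obj]
    if isinstance(obj, dict):
        # Keys are kept as they are; only the values are converted.
        return {key: to_json_serializable(value) for key, value in obj.items()}
    return obj  # assumed to be serializable already (str, int, float, ...)


converted = to_json_serializable({"bin_1": {"K00001"}, "nested": [{"K00002"}, 3]})
print(json.dumps(converted))
```

Unlike a custom encoder class, this approach produces a plain Python structure that can also be inspected or modified before it is written out.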
- microbetag.utils.ensure_flashweave_format(conf)[source]¶
Build an OTU table that will be in a FlashWeave-based format.
- microbetag.utils.ensure_same_namespace_after_fw(conf)[source]¶
[TODO] Can be removed, but let's wait. Addresses identifier inconsistencies, e.g. D300244:bin_000023 in the abundance table versus D300244.bin_000023 in FlashWeave. Keep the routine in general, along with find_id_differences().
Apparently, conf.network is a FlashWeave network file, hence the first 2 rows are skipped.
- microbetag.utils.extend_complements(complements_json, descrps_path, max_scratch_alt, pathway_complement_percentage, pathway_complements_dir)[source]¶
Extends pathway complement annotations based on given settings and descriptions.
- Parameters:
complements_json – Path to the complements JSON file.
descrps_path (str) – Path to the KEGG MODULES description file (tab-separated file with no header).
max_scratch_alt – Maximum number of alternative complements allowed.
pathway_complement_percentage – Maximum allowable percentage of required KOs that must be present.
pathway_complements_dir – Directory to save the extended complements JSON file.
complements_dict (dict) – Dictionary of complements loaded from a JSON file.
- Returns:
Pathway Complementarities in a dictionary to be assigned in the mgg format
- Return type:
dict
- Builds:
pathway_complements_extended: JSON file with the dictionary returned
- microbetag.utils.flatten_list(lista, flat_list=[])[source]¶
Recursive function that takes a nested list as input and returns a flattened one. E.g. ['GCF_003252755.1', 'GCF_900638025.1', 'GCF_003252725.1', 'GCF_003253005.1', 'GCF_003252795.1', ['GCF_000210895.1'], ['GCF_000191405.1']] becomes ['GCF_003252755.1', 'GCF_900638025.1', 'GCF_003252725.1', 'GCF_003253005.1', 'GCF_003252795.1'].
- microbetag.utils.detect_separator(file_path)[source]¶
Detect the separator used in a text file, e.g. `,` or `;`.
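One way to detect a separator is the standard library's `csv.Sniffer`. The sketch below is an assumption about a possible approach, not the module's implementation; the name `sniff_separator` and the candidate set (comma, semicolon, tab) are hypothetical.

```python
import csv


def sniff_separator(file_path, candidates=",;\t"):
    """Hypothetical helper: guess the column separator of a text file.

    Raises csv.Error if no candidate separates the sample consistently.
    """
    with open(file_path, newline="") as handle:
        sample = handle.read(4096)  # the first few kilobytes are enough
    # Restrict the guess to the candidate characters listed above.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

The detected character can then be passed on to a reader, for example as the `sep` argument of a table parser.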
- microbetag.utils.get_tool_location(software)[source]¶
Check whether a software tool is available on the system path or in the alternative location. Returns either the software name itself, which can then be run as is, or the full path to the software if it is found in the alternative location. In both cases, the return value can be used to run the software.
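The lookup described above can be sketched with `shutil.which`. The name `locate_tool`, the explicit `alt_dir` argument, and returning `None` when the tool is found nowhere are assumptions of this example; the source does not say where the alternative location is or what happens when the tool is missing.

```python
import os
import shutil


def locate_tool(software, alt_dir):
    """Hypothetical helper: return something that can be executed.

    `alt_dir` stands in for the alternative location, which the
    documentation does not specify.
    """
    if shutil.which(software):
        return software  # on the system path: the bare name is runnable
    candidate = os.path.join(alt_dir, software)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate  # full path to the executable in the fallback dir
    return None  # assumption: signal "not found" with None
```

The return value can be used directly as the first element of a command list, for example `subprocess.run([locate_tool("hmmsearch", alt_dir), "-h"])`, after checking that it is not `None`.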