.. using custom genomes¶
This is the main versioning scheme for microbetag.
The version of this tutorial is in line with the one of the documentation page.
Note
This tutorial is for users that have some basic experience working on a terminal.
Contrary to previous cases, this scenario is not performed from within the Cytoscape App.
The user needs to run microbetag first on their computing environment (personal computer, HPC etc.),
and then load the returned annotated network to Cytoscape.
In the Cytoscape App tutorial,
our sequences were already taxonomically assigned before running microbetag,
and their taxonomies were mapped to representative GTDB genomes.
microbetag then used these genomes for the annotation steps.
However, in the case of shotgun metagenomics, one may end up with their own bins,
while further refinement of the latter can lead to Metagenome-Assembled Genomes (MAGs).
Also, one may already have genomes for the communities under study from earlier studies.
In this case, such local genomes can be used directly for the annotation steps of microbetag.
Yet, this requires computing resources and time much higher than those that our web-server can support.
In this tutorial, we will use a very small number of bins (7) to showcase the various steps microbetag implements.
Yet, it still takes more than a couple of hours to go through all the different steps on a personal computer.
In our experience, memory (RAM) requirements should not be a challenge;
memory would be an issue only with really large networks.
microbetag is more often than not thread-limited, i.e.
it needs computing power to go through the annotation steps.
Important
INPUT FILES USED IN THIS TUTORIAL
A complete example of running microbetag locally using the modelseedpy library for GEM reconstruction with all the intermediate files produced can be found in the
dev_io_microbetag folder of the user-bins
branch on the GitHub repo.
In the initial run, there are only 3 input files:
the config.yml file, which allows you to set all the relevant parameters for microbetag to run
an abundance table (following the format of the Cytoscape app tutorial) called thirty_Samples.tsv
its corresponding edge list (edgelist.csv)
Remember!
The config file is mandatory.
Hint
This tutorial primarily covers the case where your starting input is an abundance table
along with a list of genomes or bins.
Alternatively, you may run microbetag with the same list of genomes but, this time,
with a precomputed network file and a two-column taxonomy file,
where the first column contains sequence IDs and the second contains their corresponding taxonomies.
For such a case, you may check an example on microbetag’s GitHub.
Input and config.yml files¶
The config.yml file is rather important as it is the one that allows you to set your microbetag run.
A number of the parameters there correspond to tools that are invoked,
while others have to do with alternative routes that microbetag can follow for the annotation of the network.
Carefully read the description of each argument before setting a value.
Here, we highlight some of them.
abundance_table_file: path to your abundance table; the abundance table needs to follow the instructions for any abundance table to be used with microbetag, i.e., sequence identifier in the first column, sample names in the first row and a 7-level taxonomy in the last column; of course, you may provide the output of the microbetag preprocessing step as an abundance table.
sc_input_type: this is a key parameter for running microbetag locally; based on whether you have already annotated your genomes (either using other software or from previous runs of microbetag), you can use different input files as the starting point for getting the seed complementarities. The sequence_files_for_reconstructions parameter is strongly related to this.
For example, if you already have GEMs reconstructed based on your genomes, you may set sc_input_type to models and then provide the folder name with your GEMs in the sequence_files_for_reconstructions parameter (e.g. my_xmls). Likewise, if you do not have GEMs but you already have RAST annotations, you may set sc_input_type to proteins_faa and give the path to those in the sequence_files_for_reconstructions parameter.
seed_complementarity: since this is the most time- and resource-consuming step, the user may choose not to go for it. By setting this to False, none of the steps for GEMs reconstruction or seed complementarity inference will be performed.
flashweave_args: all the arguments under this umbrella term are related to how FlashWeave will perform; check the FAQs and the FlashWeave GitHub repo for more.
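To illustrate how these parameters depend on each other, here is a minimal Python sketch that checks a configuration already loaded into a dictionary (for instance with PyYAML's yaml.safe_load). The check_config helper is hypothetical and not part of microbetag; it only relies on the parameter names and values mentioned above, and the actual template may accept further values.

```python
from pathlib import Path


def check_config(cfg):
    """Return a list of human-readable problems found in a config dict.

    Hypothetical helper: it only knows the parameters discussed in this
    tutorial, not the full microbetag configuration template.
    """
    problems = []

    # The abundance table is only checked when a path has been given.
    table = cfg.get("abundance_table_file")
    if table and not Path(table).is_file():
        problems.append(f"abundance_table_file not found: {table!r}")

    # A typo such as "Fasle" is read by a YAML parser as a string,
    # not as the boolean False, so we insist on a real boolean here.
    sc = cfg.get("seed_complementarity")
    if not isinstance(sc, bool):
        problems.append(f"seed_complementarity should be True/False, got {sc!r}")

    if sc is True:
        seq_dir = cfg.get("sequence_files_for_reconstructions")
        if not seq_dir or not Path(seq_dir).is_dir():
            problems.append(
                f"sequence_files_for_reconstructions is not a directory: {seq_dir!r}"
            )
        elif cfg.get("sc_input_type") == "models" and not any(Path(seq_dir).glob("*.xml")):
            # With sc_input_type set to models, the folder should hold GEMs.
            problems.append(f"no .xml GEMs found in {seq_dir!r}")
    return problems
```

An empty list means that none of these basic checks failed; it does not guarantee that the run itself will succeed.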
Warning
Make sure that the template configuration file you are using is compatible with
the version of microbetag you have installed.
Hint
A YAML configuration file is required only in case you need to run the whole microbetag pipeline.
If you need to run independent tasks, you can either build pseudo microbetag.Config classes
or even directly use the microbetag features you wish to.
The example cases we cover under the tests folder on GitHub come with their input/output data, which you will find in the
test_data folder.
Have a look at the README
for the cases covered, and feel free to consult their corresponding pseudo-config classes or their corresponding
YAML configuration files.
Output¶
In the config.yml file, you can specify the output_directory.
Here we discuss the folders and the files you will find under the output_directory.
We do not always follow the order in which the files are generated.
The annotated .cx2 network file¶
The main output file of microbetag, the microbetag-annotated network called mtag_net_<timestamp>.cx2, can be found in the output_directory you set in your config.yml file.
This is the file you need to load in Cytoscape; then,
after enabling the MGG visual style and the MGG results panel, you can investigate your annotated network!
This file is in .cx2 format.
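Before loading the network in Cytoscape, a quick sanity check can be useful. The sketch below assumes the standard CX2 layout, i.e. a JSON list of aspect objects such as {"nodes": [...]} and {"edges": [...]}; the function names are ours, not part of microbetag.

```python
import json
from pathlib import Path


def latest_network(output_directory):
    """Return the most recently modified mtag_net_*.cx2 file, or None."""
    candidates = sorted(
        Path(output_directory).glob("mtag_net_*.cx2"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def count_aspects(cx2_path):
    """Count nodes and edges in a CX2 file (assumed: a JSON list of aspects)."""
    with open(cx2_path) as handle:
        aspects = json.load(handle)
    counts = {"nodes": 0, "edges": 0}
    for aspect in aspects:
        for name in counts:
            # An aspect may be split into several fragments, so we sum them.
            counts[name] += len(aspect.get(name, []))
    return counts
```

For example, count_aspects(latest_network("my_output_directory")) returns a dictionary with the number of nodes and edges, where my_output_directory stands for whatever you set as output_directory.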
FAPROTAX¶
A folder called faprotax is created, containing a subfolder called sub_tables
and a file called functional_otu_table.tsv with the sum of the abundances of the taxa found with a specific process in each sample.
In the sub_tables folder, there is a file for each process, listing the genomes/bins found
to be related to that process and their relative abundance per sample.
For example, the aerobic_nitrite_oxidation.txt looks like:
| record | seqId | sample1 | sample2 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 | sample11 | sample12 | sample13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Nitrobacter;s__Nitrobacter sp001896955 | bin_32 | 43 | 10 | 56 | 73 | 9 | 58 | 54 | 46 | 9 | 40 | 42 | 81 |
| d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Nitrobacter;s__Nitrobacter sp001897285 | bin_223 | 47 | 87 | 69 | 64 | 25 | 95 | 40 | 71 | 16 | 78 | 52 | 40 |
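To summarise such a per-process file, one could sum the abundances of all genomes/bins per sample. The following is a minimal sketch that assumes the files in sub_tables are tab-separated with the record and seqId columns shown above; the delimiter is an assumption, so adjust it if your files differ.

```python
import csv


def abundance_per_sample(handle, delimiter="\t"):
    """Sum the abundances of all bins in a FAPROTAX sub-table, per sample."""
    totals = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        for column, value in row.items():
            # The first two columns describe the bin, not a sample.
            if column in ("record", "seqId"):
                continue
            totals[column] = totals.get(column, 0.0) + float(value)
    return totals
```

It can be used as abundance_per_sample(open("aerobic_nitrite_oxidation.txt")), with the path pointing into your own sub_tables folder.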
genome-based trait predictions¶
train.genotype file: this is the output of the phenotrex program annotating your genomes/bins with COG families using the latest
predictions folder: within this folder, a file with the predictions for presence/absence of each trait in each of our genomes, along with a confidence score, is available. For example, the symbiont.prediction.tsv file in our test case looks like:
| # Trait: Symbiont | | |
|---|---|---|
| Identifier | Trait present | Confidence |
| bin_101.fa | NO | 0.643 |
| bin_151.fa | NO | 0.6395 |
| bin_19.fa | NO | 0.869 |
| bin_38.fa | NO | 0.8678 |
| bin_41.fa | NO | 0.7842 |
| bin_45.fa | YES | 0.7954 |
| bin_48.fa | NO | 0.8545 |
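If you want to list the bins predicted to carry a trait, a prediction file like the one above can be filtered with a few lines of Python. This sketch assumes a tab-separated file whose first line is a "# Trait: ..." comment followed by the Identifier / Trait present / Confidence header; the positive_bins name and the 0.5 default threshold are ours.

```python
import csv


def positive_bins(handle, min_confidence=0.5):
    """Return identifiers predicted as YES with at least min_confidence."""
    # Skip comment lines such as "# Trait: Symbiont" before parsing.
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    ident = header.index("Identifier")
    present = header.index("Trait present")
    confidence = header.index("Confidence")
    return [
        row[ident]
        for row in rows
        if row and row[present] == "YES" and float(row[confidence]) >= min_confidence
    ]
```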
ORFs¶
microbetag invokes prodigal to extract Open Reading Frames (ORFs).
It creates a folder called ORFs in the output_directory and for each genome/bin it returns 3 files:
.gbk: Genbank-like format (for more information check here)
.faa: the reading frames as amino acid sequences
.ffn: the reading frames as nucleic acid sequences
Note
SKIP THE ORFs PREDICTION (prodigal) STEP
If you have already calculated the ORFs of your genomes before you start using microbetag, you can create a folder within your output_directory called ORFs and move all your .faa files there.
This way, microbetag will be using those instead of running prodigal.
Your .faa files should look like this:
>c_000000001749_1 # 2 # 913 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.479
DDSKIHQLGWDAFQAGTKVAKEEGLYGAGQDLLSDAFSGNVKGLGPAVAELSFEERPSEP
FLFFMADKTEPGAYNLPFYLSYADPMYNPGLMLSPKMGKGFVFTVMDVENTENDRIIELT
TPEDIYDLACLLRDNGRFVVESIRSAKTGETTAVCSTTRLNKIAGEYVGKDDPVALARVQ
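Staging precomputed ORFs can be scripted. The sketch below creates the ORFs folder inside the output_directory and copies (rather than moves, to keep the originals untouched) every .faa file whose first line looks like a FASTA header; stage_faa_files is a hypothetical helper, not a microbetag function.

```python
import shutil
from pathlib import Path


def stage_faa_files(source_dir, output_directory):
    """Copy .faa files into <output_directory>/ORFs and return their names."""
    orfs_dir = Path(output_directory) / "ORFs"
    orfs_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for faa in sorted(Path(source_dir).glob("*.faa")):
        with open(faa) as handle:
            first_line = handle.readline()
        if not first_line.startswith(">"):
            continue  # does not look like a FASTA file, leave it out
        shutil.copy2(faa, orfs_dir / faa.name)
        copied.append(faa.name)
    return copied
```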
KEGG annotations¶
microbetag makes use of the hmmsearch tool and the kofam_database profiles to check which KOs are present in each of your genomes.
In the output_directory, microbetag creates a folder called KEGG_annotations and there it builds a folder called hmmout, where it keeps all the 24,728 .hmmout files for each genome.
Once all the .hmmout files are there for all the genomes/bins under study, microbetag builds a file called ko_merged.txt based on the DiTing implementation, that looks like this:
| bin_id | contig_id | ko_term |
|---|---|---|
| bin_41 | SCN18_26_2_15_R1_F_scaffold_115_57 | K07586 |
| bin_48 | SCN18_26_2_15_R4_B_scaffold_93_80 | K08086 |
| bin_41 | SCN18_26_2_15_R1_F_scaffold_206_63 | K03503 |
This file is the main component for microbetag to proceed with the pathway complementarity step.
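To get a feeling for what ko_merged.txt contains, one can collect the set of KO terms found in each bin. This is a minimal sketch assuming three tab-separated columns (bin, contig, KO term) as in the table above; whether the real file carries a header line is not assumed, since lines without a valid KO identifier are simply skipped.

```python
import re
from collections import defaultdict

# KO identifiers look like K07586: the letter K followed by five digits.
KO_PATTERN = re.compile(r"^K\d{5}$")


def kos_per_bin(handle, delimiter="\t"):
    """Map each bin id to the set of KO terms annotated on its contigs."""
    kos = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) < 3 or not KO_PATTERN.match(fields[2]):
            continue  # header, blank or malformed line
        kos[fields[0]].add(fields[2])
    return dict(kos)
```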
Tip
SKIP THE KEGG ANNOTATION (HMMSEARCH) STEP
If you already have .hmmout profiles, either from analyses performed before using microbetag or from previous microbetag runs on your genomes, you can create a folder called hmmout within the KEGG_annotations folder of your output_directory and move all the .hmmout profiles of your bins there.
An .hmmout file looks like:
root@8649bd465c24:/data/microbetag_local/KEGG_annotations/hmmout# more K00005.bin_101.hmmout
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
c_000000006615_9 - K00005 - 1.2e-98 327.2 0.0 1.4e-98 327.0 0.0 1.0 1 0 0 1 1 1 1 # 14663 # 15772 # 1 # ID=74_9;partial=00;start_type=TTG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.449
#
# Program: hmmsearch
# Version: 3.4 (Aug 2023)
# Pipeline mode: SEARCH
# Query file: /microbetag/microbetagDB/ref-dbs/kofam_database/profiles/K00005.hmm
# Target file: /data/microbetag_local/ORFs/bin_101.faa
# Option settings: hmmsearch -o /dev/null --tblout /data/microbetag_local/KEGG_annotations/hmmout/K00005.bin_101.hmmout -T 324.27 --cpu 1 /microbetag/microbetagDB/ref-dbs/kofam_database/profiles/K00005.hmm /data/microbetag_local/ORFs/bin_101.faa
# Current dir: /microbetag
# Date: Mon Aug 5 11:06:08 2024
# [ok]
If you already have the ko_merged.txt file, you only need to add a copy of it in the KEGG_annotations folder (the hmmout files are not necessary in this case) and microbetag will use it directly, skipping the hmmsearch step.
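The .hmmout files are written with the --tblout option of hmmsearch, so they are whitespace-delimited tables with comment lines starting with "#". As a sketch of how the hits in such a file could be read (this is not how microbetag or DiTing parse them), one may use:

```python
def parse_tblout(handle):
    """Return the hits of an hmmsearch --tblout file as a list of dicts."""
    hits = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # header, footer and blank lines carry no hits
        # The table has 18 fixed columns followed by a free-text description,
        # so we split at most 18 times to keep the description in one piece.
        fields = line.split(maxsplit=18)
        hits.append(
            {
                "target": fields[0],          # ORF identifier
                "query": fields[2],           # KO profile name
                "evalue": float(fields[4]),   # full-sequence E-value
                "score": float(fields[5]),    # full-sequence bit score
            }
        )
    return hits
```

A file without any non-comment line yields an empty list, meaning that the KO profile had no hit in that bin.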
Hint
COMPUTING TIME, RESOURCES AND STORAGE
Running microbetag locally using your own genomes/bins/MAGs can take significant computing time and resources.
In this tutorial, we use only a small number of bins, which has almost no biological significance.
What we want to accomplish here is to make sure that you can run microbetag locally.
Indicatively, using a Linux machine and allocating 2 CPUs, it took more than 1 hour for the KO annotation of just those 7 bins.
Running the complete workflow could take up to 6-7 hours, depending on the approach you choose to reconstruct your GEMs.
carveme tends to run more smoothly and faster, since it does not require a RAST connection (see following paragraph).
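Since the KO annotation is a long-running step, it can help to monitor its progress. With 24,728 .hmmout files per genome, the 7 bins of this tutorial lead to 7 × 24,728 = 173,096 files in the hmmout folder. The sketch below counts the files present per bin, assuming the <KO>.<bin>.hmmout naming seen in the example above (e.g. K00005.bin_101.hmmout); hmmout_progress is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def hmmout_progress(hmmout_dir, n_profiles=24728):
    """Map each bin to the fraction of expected .hmmout files already present."""
    suffix = ".hmmout"
    per_bin = Counter()
    for path in Path(hmmout_dir).glob("*" + suffix):
        # Drop the leading KO identifier, then the trailing extension.
        _ko, _, rest = path.name.partition(".")
        per_bin[rest[: -len(suffix)]] += 1
    return {bin_id: count / n_profiles for bin_id, count in per_bin.items()}
```

A value of 1.0 for a bin means that all the expected files for that bin are in place.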
GEMs already available¶
In this case, you may use your GEMs directly for the seed complementarities inference by setting:
sc_input_type as models
sequence_files_for_reconstructions pointing to the directory with the .xml files
genre_reconstruction_with can be left blank or set to any value; it will not be considered
PhyloMInt post process¶
After microbetag performs PhyloMInt, it post-processes the seed and the non-seed sets as initially returned, by:
removing compounds from seed sets that are related to environmental metabolites that can be produced in several ways within the cell.
removing from non seed sets compounds that cannot be produced in any other way than from entering the cell from the environment.
The numbers of this post-processing step are recorded in the log.tsv file, which can be found under the seed_complementarity folder and looks like this:
| model_id | environmental_initial_seeds | non_environmental_initial_seeds | total_initial_seeds | updated_seeds | initial_non_seeds | updated_non_seeds |
|---|---|---|---|---|---|---|
| bin_151 | 173 | 57 | 230 | 230 | 879 | 879 |
| bin_38 | 177 | 54 | 231 | 231 | 1211 | 1211 |
| bin_101 | 152 | 51 | 203 | 203 | 845 | 845 |
This post process step is necessary since we use a complete medium to gapfill the model.
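To see at a glance how much the post-processing changed each model, the log.tsv columns can be compared pairwise. A minimal sketch, assuming a tab-separated file with the column names shown above (seed_changes is our own helper name):

```python
import csv


def seed_changes(handle):
    """Return, per model, how many seeds and non-seeds were removed."""
    changes = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        changes[row["model_id"]] = {
            "seeds_removed": int(row["total_initial_seeds"]) - int(row["updated_seeds"]),
            "non_seeds_removed": int(row["initial_non_seeds"]) - int(row["updated_non_seeds"]),
        }
    return changes
```

For the three models listed above, both differences are 0, i.e. the post-processing left their seed and non-seed sets unchanged.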
Note
One may come up with alternative gap-filling procedures that minimise the number of missed potential cross-feedings while, at the same time, not over-predicting such cases.