Introduction

ShinyBioHEAT (Biodetection of High Evolutionary Action targets) is designed to identify phenotype driven gene in adapted bacteria. Three reference genomes are supported currently including E. coli MG1655 (RefSeq: NC_000913.3), E. coli REL606 (RefSeq: NC_012967.1), and B. subtilis 168 (RefSeq: NC_000964.3). It contains three main modules:

  • Driver Gene Analysis:
    Two orthogonal approaches, EA (Evolutionary Action) integration and a Frequency-based method, are used to identify phenotype driven genes. EA integration compares the EA distribution between mutations for each gene in the evolve strains and a mutation background, then prioritize genes that have more impactful mutations (Marciano et al., 2022). The frequency-based method ranks the genes based on mutation count and gene length. See details below.

  • Quick EA search:
    Evolutionary Action (EA) scores predicts the functional impact of protein coding mutations (see details below). This module allows user to quickly search the EA score of any given mutations in selected reference genome.

  • Structure Viewer:
    Mapping mutation profile and EA/ET scores to AlphaFold structures.

Evolutionary Action (EA)

In brief, Evolutionary Action (EA) models the phenotypic impact of a given amino acid change comparing to its reference protein sequence (Katsonis and Lichtarge, 2014). It is defined as:

$EA = \Delta\phi = \nabla{f} \cdot \Delta\gamma$

Here $f$ is the fitness landscape, and $\nabla{f}$ is the gradient of the fitness landscape. EA or $\Delta\phi$ is the fitness response triggered by a small change (mutation) in the genotype $\Delta\gamma$. In practice, for a single point coding mutation from amino acid $X$ to $Y$ at a sequence position $i$, $\nabla{f}$ is approximated by the evolutionary importance of position $i$, given by the Evolutionary Trace (ET) algorithm (Lichtarge et al., 1996). And the magnitude of the substitution $\Delta\gamma$ is approximated by amino acids substitutions log odds tables. EA has a value between 0-100, where larger values suggest higher functional impact.

EA integration

Genes under selection during evolution will accumulate more mutations with high functional impact. The fitness impact of a given set of mutations (in this case, mutations inside a given gene) can be evaluated by integrating the EA scores of these mutations. EA_KS and EA_sum can be used to approximate EA integration.

In the EA_KS approach, mutated genes are ranked by their EA mutational impact profile with a non-parametric Kolmogorov-Smirnov (KS) test against a mutation background, usually random mutations or mutations that occur without the selection of interest. The driver genes are expected to have a bised EA distribution towards high EA scores.

For the EA_sum approach, EA scores for all coding mutations observed in a gene across samples are summed and compared to the expected values from the same mutation background as EA_KS. The expected EA_sum values are calculated as avg_EA $\times$ expected mutation count, where avg_EA is the average EA score of all mutations in the mutation background and the expected mutation count is determined by the mutation count in the samples and gene length. The driver genes are expected to have higher EA_sum than non-driver genes.

A frequency-based method

A frequency-based analysis is performed based on the assumption that the probability of x mutations occurring in a protein with given length $l$, follows a Poisson distribution with $\lambda = l \times m$, where $m$ is the average mutation rate in each dataset. The frequency p-value for each gene was calculated by $p = P[X{\geq}x]$.






Driver gene analysis

Step 1. Upload data

Mutation data can be input from variant call format (VCF) files or substitution (SUB) files. One file per sample. The VCF or SUB files should be generated using the supporte reference genomes: E. coli K-12 MG1655 (RefSeq: NC_000913.3), E. coli REL606 (RefSeq: NC_012967.1) and B. subtilis 168 (RefSeq: NC_000964.3). A warning will show up if there are entries in the VCF/SUB files that do not match to the reference genome. SUB files should contain two columns, input_id and SUB, that are separated by comma (csv file), where input_id stores the locus_tags (e.g. b4112) or gene names (e.g. basS), and SUB stores the amino acid substitution (e.g. C84R).

Mutations in the founder strains (strains that are sequenced before selection) are less likely to contribute to the adapted phenotype. Thus, they are subtracted out from the evolve strains during the analysis.

Step 2: Get background distribution

EA integration compares the EA distribution between mutations in the evolve strains and a mutation background (mutations that are not under selection). This background can be generated by simulating random mutations in the reference genome, or by gathering mutations occurred when passing reference without selective pressure. Using more than 1000 mutations as background distribution is recommended.

Figure 1. (A) The EA distribution of randomly simulated coding mutations decays exponentially and is biased towards low impact EA scores. Dotted line shows the exponential fit for EA scores of non-synonymous mutations. (B) In contrast, the exponential decay pattern in EA distribution does not hold for mutations in driver genes. Those mutations are biased towards high EA. Dotted line shows EA distribution for simulated mutation background.

Step 3. Run analysis

After running the analysis, the gene rankings by EA_KS, EA_sum and Frequency method will be returned as a table and a gene ranking plot. The axes of the gene ranking plot can be adjusted. When selecting a specific gene in the table, that gene will be highlighted in the ranking plot. In addition, the mutations in that gene in the evolve strains will be returned, and their EA distribution will be plotted.

Driver genes for a given phenotype tend to function in the same biological pathway. To evaluate the functional relationship of the top ranked genes, STRING PPI enrichment analysis (Szklarczyk et al., 2021) can be run for the top predictions utilizing STRING API. Significant clustering is expected for top predictions. By default, top 5% of the mutated genes are used. Currently, E. coli REL606 is not a supported organism in STRING DB. Thus STRING analysis with E. coli REL606 will be performed using MG1655 PPI network.

Step 4. Overlapping genes

Genes that are predicted at the top by all methods are more likely to be drivers. Venn diagram is generated for the top genes from all three methods. Click on the Venn diagram to highlight overlapping genes. STRING analysis can be run with the highlighted genes.

Figure 2. (A) Venn diagram showing the overlap of the top predicted driver genes by different methods for colistin resistance. (B) Genes that are ranked high by different methods cluster significantly in STRING. Genes, e.g. waaQ, that are connected to known drivers are worth further testing.

Quick EA search

Use quick EA search to check the functional impact (EA) of specific mutations of interest.

Download

All tables in the app can be filtered or sorted. Use the "CSV" or "Excel" button on the top of the tables to download the data. Note that only what is shown in the table will be downloaded/copied. To downloaded/copied the entire list, select "show all entries" first.

The entire MG1655 EA dataset can be downloaded through the download button in the Quick EA search tab.






Mapping to protein structure

Mutations in driver genes tend to cluster together in protein structure. The Structure Viewer allows users to map values below to AlphaFold (Jumper et al., 2021) structures.

  • Evolutionary Trace (ET):
    Estimates the evolutionary importance of a residue position in a protein.

  • AlphaFold pLDDT:
    The structure prediction accuracy score from AlphaFold.

  • Mutation count and sumEA:
    Estimates the overall mutational/EA burden on a residue position.

Mutation information can be imported from evolve strains (Driver Gene Analysis, step 1) or Quick EA Search.

The raw values and color code for the structure is available in the Color table. A Pymol session file is available to reproduce the coloring locally in Pymol.

Figure 3. Visualizing waaQ mutations that are observed in colistin resistant E. coli strains. These mutations are predicted to be high impactful by EA, and are clustered around functional important sites (ET), suggesting waaQ might contribute to colistin resistance.





Acknowledgement

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under BAA-17-01, contract #2019-19071900001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government.


Contacts

If you have any questions or would like to suggest other reference genomes, please contact us at lichtarge@bcm.edu and chen.wang@bcm.edu. You could also raise an issue on our GitHub page.


References

Marciano DC, Wang C, Hsu TK, Bourquard T, Atri B, Nehring RB, Abel NS, Bowling EA, Chen TJ, Lurie PD, Katsonis P, Rosenberg SM, Herman C, Lichtarge O.
Evolutionary action of mutations reveals antimicrobial resistance genes in Escherichia coli.
Nat Commun. 2022 Jun 9;13(1):3189. [link]

Katsonis P, Lichtarge O.
A formal perturbation equation between genotype and phenotype determines the Evolutionary Action of protein-coding variations on fitness.
Genome Res. 2014 Dec;24(12):2050-8. [link]

Lichtarge O, Bourne HR, Cohen FE.
An evolutionary trace method defines binding surfaces common to protein families.
J Mol Biol. 1996 Mar 29;257(2):342-58. [link]

Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, Jensen LJ, von Mering C.
The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets.
Nucleic Acids Res. 2021 Jan 8;49(D1):D605-D612. doi: 10.1093/nar/gkaa1074. Erratum in: Nucleic Acids Res. 2021 Oct 11;49(18):10800. [link]

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D.
Highly accurate protein structure prediction with AlphaFold.
Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15. [link]





Search EA for missense mutations


Gene id and substitution combination should be entered to query the Evolutionary Action (EA). One entry per line. Either locus tag or gene name can be used as gene id. We recommend the using the locus tag to avoid the ambiguity of gene name. Subsitution should follow the format as M1C. Gene id and subsitution should be seperated with space or tab. EA scores are scaled within each protein from 0 to 100, with 100 as the most impactful mutation on protein function.




ET

AlphaFold pLDDT

sumEA

number of unique mutations