In brief, Evolutionary Action (EA) models the phenotypic impact of a given amino acid change relative to the reference protein sequence (Katsonis and Lichtarge, 2014). It is defined as:
$EA = \Delta\phi = \nabla{f} \cdot \Delta\gamma$
Here $f$ is the fitness landscape and $\nabla{f}$ is its gradient. EA, or $\Delta\phi$, is the fitness response triggered by a small change (mutation) in the genotype, $\Delta\gamma$. In practice, for a single point coding mutation from amino acid $X$ to $Y$ at sequence position $i$, $\nabla{f}$ is approximated by the evolutionary importance of position $i$, given by the Evolutionary Trace (ET) algorithm (Lichtarge et al., 1996), and the magnitude of the substitution $\Delta\gamma$ is approximated by amino acid substitution log-odds tables. EA scores range from 0 to 100, where larger values suggest higher functional impact.
Genes under selection during evolution are expected to accumulate more mutations with high functional impact. The fitness impact of a given set of mutations (in this case, mutations inside a given gene) can be evaluated by integrating the EA scores of these mutations. EA_KS and EA_sum are two ways to approximate this EA integration.
In the EA_KS approach, mutated genes are ranked by comparing their EA mutational impact profile against a mutation background with a non-parametric Kolmogorov-Smirnov (KS) test. The background usually consists of random mutations or mutations that occur without the selection of interest. Driver genes are expected to have an EA distribution biased towards high EA scores.
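The EA_KS comparison can be sketched as follows. This is a minimal illustration, not the app's implementation: it assumes a one-sided two-sample KS statistic (testing whether a gene's EA scores are shifted towards higher values than the background) and uses the standard large-sample approximation for the p-value. The function name `ks_high_shift` and the toy scores are hypothetical.

```python
import math
from bisect import bisect_right

def ks_high_shift(gene_ea, background_ea):
    """One-sided two-sample KS sketch (assumed variant, see lead-in).

    Returns (D, approx_p), where D is the largest amount by which the
    background ECDF exceeds the gene ECDF. A large D means the gene's
    EA scores are shifted towards high values.
    """
    gene = sorted(gene_ea)
    bg = sorted(background_ea)
    n, m = len(gene), len(bg)
    d = 0.0
    # The empirical CDFs only change at observed values.
    for x in set(gene) | set(bg):
        f_gene = bisect_right(gene, x) / n
        f_bg = bisect_right(bg, x) / m
        d = max(d, f_bg - f_gene)
    # Large-sample one-sided approximation; rough for small samples.
    approx_p = math.exp(-2.0 * d * d * n * m / (n + m))
    return d, approx_p

# Toy data: a uniform background and one gene hit by high-EA mutations.
background = list(range(100))
d_stat, p_value = ks_high_shift([90, 95, 99, 85, 92], background)
```

Genes would then be ranked by this p-value, smallest first. For genes with only a few mutations the approximation is coarse, so an exact KS routine from a statistics library may be preferable.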
For the EA_sum approach, EA scores for all coding mutations observed in a gene across samples are summed and compared to the expected values from the same mutation background as EA_KS. The expected EA_sum values are calculated as avg_EA $\times$ expected mutation count, where avg_EA is the average EA score of all mutations in the mutation background and the expected mutation count is determined by the mutation count in the samples and gene length. The driver genes are expected to have higher EA_sum than non-driver genes.
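A minimal sketch of the expected EA_sum calculation follows. The text only states that the expected mutation count depends on the mutation count in the samples and on gene length; the proportional formula used here (total mutations scaled by the gene's share of the total coding length) is an assumption, and all names and numbers are hypothetical.

```python
def expected_ea_sum(background_ea, total_mutations, gene_length, total_length):
    """avg_EA x expected mutation count for one gene.

    Assumption: mutations fall on genes in proportion to their length,
    so expected count = total_mutations * gene_length / total_length.
    """
    avg_ea = sum(background_ea) / len(background_ea)
    expected_count = total_mutations * gene_length / total_length
    return avg_ea * expected_count

def observed_ea_sum(gene_ea):
    """Sum of EA scores of all coding mutations observed in the gene."""
    return sum(gene_ea)

# Toy numbers: background average EA of 40, one mutation expected in the gene.
expected = expected_ea_sum([20, 40, 60], total_mutations=100,
                           gene_length=300, total_length=30000)
observed = observed_ea_sum([80, 90])
```

In this toy case the observed EA_sum (170) is well above the expected value (40), which is the pattern expected for a driver gene.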
A frequency-based analysis is performed under the assumption that the number of mutations $X$ occurring in a protein of length $l$ follows a Poisson distribution with $\lambda = l \times m$, where $m$ is the average mutation rate in each dataset. The frequency p-value for a gene with $x$ observed mutations is calculated as $p = P[X \geq x]$.
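The Poisson upper-tail probability can be computed directly from its definition, as in this minimal sketch. The function name and the example rate are hypothetical; the formula is $P[X \geq x] = 1 - \sum_{k=0}^{x-1} e^{-\lambda}\lambda^k / k!$.

```python
import math

def poisson_upper_tail(x, lam):
    """Return P[X >= x] for X ~ Poisson(lam)."""
    if x <= 0:
        return 1.0
    term = math.exp(-lam)      # P[X = 0]
    cdf = term
    for k in range(1, x):      # accumulate P[X = 1] ... P[X = x - 1]
        term *= lam / k
        cdf += term
    return max(0.0, 1.0 - cdf)

# Toy example: protein of length 300, average mutation rate 0.001 per residue.
lam = 300 * 0.001
p_one_or_more = poisson_upper_tail(1, lam)
p_three_or_more = poisson_upper_tail(3, lam)
```

Because the result is obtained as one minus the cumulative probability, very small p-values lose precision; a survival-function routine from a statistics library avoids this.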
Mutations in the founder strains (strains that are sequenced before selection) are less likely to contribute to the adapted phenotype. Thus, they are subtracted out from the evolve strains during the analysis.
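The subtraction step amounts to a set difference, sketched below. Representing each mutation as a `(gene, substitution)` tuple is an assumption made for illustration; the app's internal representation is not described here.

```python
def subtract_founder(evolved_mutations, founder_mutations):
    """Drop mutations that were already present in the founder strains."""
    founder_set = set(founder_mutations)
    return [mut for mut in evolved_mutations if mut not in founder_set]

# Hypothetical mutations as (gene, substitution) tuples.
founder = [("geneA", "A23V")]
evolved = [("geneA", "A23V"), ("geneA", "G100D"), ("geneB", "M1C")]
remaining = subtract_founder(evolved, founder)
```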
EA integration compares the EA distribution between mutations in the evolve strains and a mutation background (mutations that are not under selection). This background can be generated by simulating random mutations in the reference genome, or by collecting mutations that arose while passaging the reference strain without selective pressure. We recommend using more than 1000 mutations as the background distribution.
After running the analysis, the gene rankings by the EA_KS, EA_sum and Frequency methods are returned as a table and a gene ranking plot. The axes of the gene ranking plot can be adjusted. Selecting a gene in the table highlights it in the ranking plot. In addition, the mutations in that gene in the evolve strains are returned, and their EA distribution is plotted.
Driver genes for a given phenotype tend to function in the same biological pathway. To evaluate the functional relationship of the top-ranked genes, STRING PPI enrichment analysis (Szklarczyk et al., 2021) can be run for the top predictions using the STRING API. Significant clustering is expected for top predictions. By default, the top 5% of the mutated genes are used. E. coli REL606 is currently not a supported organism in the STRING database, so STRING analysis for E. coli REL606 is performed using the MG1655 PPI network.
Genes that are ranked at the top by all methods are more likely to be drivers. A Venn diagram is generated for the top genes from all three methods. Click on the Venn diagram to highlight overlapping genes. STRING analysis can be run with the highlighted genes.
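The selection of top genes, their overlap across methods, and the STRING request can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, rounding the top 5% up to a whole gene is a guess, and the code only builds the request URL for STRING's `ppi_enrichment` endpoint without sending it. The species value 511145 is the NCBI taxonomy ID of E. coli K-12 MG1655.

```python
import math
from urllib.parse import urlencode

def top_fraction(ranked_genes, fraction=0.05):
    """Return the top `fraction` of a best-first ranking (at least one gene)."""
    k = max(1, math.ceil(len(ranked_genes) * fraction))
    return ranked_genes[:k]

def overlap(*gene_lists):
    """Genes present in every list, i.e. the centre of the Venn diagram."""
    return sorted(set.intersection(*(set(g) for g in gene_lists)))

def string_enrichment_url(genes, species=511145):
    """Build (but do not send) a STRING PPI enrichment request URL.

    STRING expects multiple identifiers separated by a carriage return.
    """
    params = urlencode({"identifiers": "\r".join(genes), "species": species})
    return "https://string-db.org/api/tsv/ppi_enrichment?" + params

# Hypothetical rankings of 40 mutated genes from the three methods.
ranking_ks = [f"g{i}" for i in range(40)]
ranking_sum = ["g1", "g0"] + [f"g{i}" for i in range(2, 40)]
ranking_freq = ["g1", "g7"] + [f"g{i}" for i in range(40) if i not in (1, 7)]

shared = overlap(top_fraction(ranking_ks),
                 top_fraction(ranking_sum),
                 top_fraction(ranking_freq))
url = string_enrichment_url(["thrA", "thrB"])
```

The resulting URL can be fetched with any HTTP client; the response reports whether the submitted genes have more interactions among themselves than expected by chance.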
Use quick EA search to check the functional impact (EA) of specific mutations of interest.
All tables in the app can be filtered or sorted. Use the "CSV" or "Excel" button at the top of a table to download the data. Note that only what is shown in the table will be downloaded or copied. To download or copy the entire list, select "show all entries" first.
The entire MG1655 EA dataset can be downloaded through the download button in the Quick EA search tab.
Mutation information can be imported from evolve strains (Driver Gene Analysis, step 1) or Quick EA Search.
The raw values and color code for the structure are available in the Color table. A PyMOL session file is available to reproduce the coloring locally in PyMOL.
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under BAA-17-01, contract #2019-19071900001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government.
If you have any questions or would like to suggest other reference genomes, please contact us at lichtarge@bcm.edu and chen.wang@bcm.edu. You could also raise an issue on our GitHub page.
To query the Evolutionary Action (EA), enter a gene id and a substitution, one entry per line. Either the locus tag or the gene name can be used as the gene id. We recommend using the locus tag to avoid the ambiguity of gene names. The substitution should follow the format M1C (reference amino acid, position, substituted amino acid). The gene id and the substitution should be separated by a space or a tab. EA scores are scaled within each protein from 0 to 100, with 100 indicating the most impactful mutation on protein function.
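The expected input format can be checked with a small parser such as the sketch below. The function name is hypothetical and the pattern is an assumption: it accepts any single uppercase letter for the reference and substituted residues, which may be looser or stricter than what the app accepts.

```python
import re

# Reference residue, 1-based position, substituted residue, e.g. "M1C".
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_query(text):
    """Parse 'gene_id<space or tab>substitution' lines into tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split()  # splits on spaces and tabs
        if len(parts) != 2:
            raise ValueError(f"expected 'gene_id substitution', got: {line!r}")
        gene_id, substitution = parts
        match = SUBSTITUTION.match(substitution)
        if match is None:
            raise ValueError(f"substitution not in M1C format: {substitution!r}")
        ref, position, alt = match.groups()
        entries.append((gene_id, ref, int(position), alt))
    return entries

# Illustrative query with one locus tag and one gene name.
parsed = parse_query("b0002 M1C\nthrA\tA23V\n")
```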