Introduction

This web tool is designed to identify the genes that drive phenotypes in lab-evolved E. coli. It uses two orthogonal approaches:

Evolutionary Action (EA)

In brief, Evolutionary Action (EA) models the phenotypic impact of a given amino acid change relative to the reference protein sequence. It is defined as:

$EA = \Delta\phi = \nabla{f} \cdot \Delta\gamma$

Here $f$ is the fitness landscape and $\nabla{f}$ is its gradient. EA, or $\Delta\phi$, is the fitness response triggered by a small change (mutation) in the genotype, $\Delta\gamma$. In practice, for a single point coding mutation from amino acid $X$ to $Y$ at sequence position $i$, $\nabla{f}$ is approximated by the evolutionary sensitivity of position $i$, given by the Evolutionary Trace (ET) algorithm, and the magnitude of the substitution $\Delta\gamma$ is approximated by amino acid substitution log-odds tables. EA scores range from 0 to 100, where larger values suggest higher functional impact.

EA integration

Genes under selection during evolution are expected to accumulate more mutations with high functional impact. Thus, according to EA theory, the fitness impact of a given set of mutations can be evaluated by integrating the EA scores of these mutations. EA_KS and EA_sum can be used to approximate EA integration.

In the EA_KS approach, mutated genes are ranked by comparing their EA mutational impact profile against the in silico random mutation background with a non-parametric Kolmogorov–Smirnov (KS) test.
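As a rough illustration of the comparison behind EA_KS, the sketch below computes a two-sample KS statistic between the EA scores observed in one gene and a simulated background. The function name `ks_statistic` is hypothetical, and the two-sided form of the statistic is an assumption; the text above does not state which variant of the test the tool runs or how the resulting p-values are ranked.

```python
from bisect import bisect_right

def ks_statistic(gene_ea_scores, background_ea_scores):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Assumption: a two-sided statistic; the tool's exact test variant is not
    specified in this documentation.
    """
    a = sorted(gene_ea_scores)
    b = sorted(background_ea_scores)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those
    # points is enough to find the maximum difference.
    for v in set(a) | set(b):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Illustrative values only: a gene whose mutations all have high EA scores,
# compared against a uniform-looking background.
gene = [90, 95, 99]
background = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
d_stat = ks_statistic(gene, background)
```

A larger statistic means the gene's EA distribution departs more strongly from the random background. A library routine such as `scipy.stats.ks_2samp` returns both the statistic and a p-value.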

In the EA_sum approach, the EA scores of all codon mutations observed in a gene are summed and compared to the value expected from a random distribution of mutations. The expected background EA_sum is calculated as EAavg $\times$ expected mutation count, where EAavg is the average EA score of all mutations in the in silico simulation and the expected mutation count is determined by the mutation count in the sample and the gene length.
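The arithmetic of the EA_sum comparison can be sketched as follows. All names are hypothetical. The sketch assumes the expected mutation count scales with the gene's share of the total coding length, which is one plausible reading of "determined by the mutation count in the sample and gene length" and not a confirmed detail of the tool.

```python
def ea_sum_observed_and_expected(gene_ea_scores, gene_length,
                                 total_mutations, total_coding_length,
                                 background_ea_scores):
    """Return (observed EA_sum, expected background EA_sum) for one gene."""
    observed = sum(gene_ea_scores)
    # EAavg: mean EA score over all simulated random mutations.
    ea_avg = sum(background_ea_scores) / len(background_ea_scores)
    # Assumption: mutations fall on a gene in proportion to its length.
    expected_count = total_mutations * gene_length / total_coding_length
    expected = ea_avg * expected_count
    return observed, expected

# Illustrative numbers only.
observed, expected = ea_sum_observed_and_expected(
    gene_ea_scores=[90, 80],
    gene_length=300,
    total_mutations=100,
    total_coding_length=30000,
    background_ea_scores=[50, 50, 50, 50],
)
```

An observed sum well above the expected value suggests that the gene carries more functional impact than random mutation would produce.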

Frequency-based method

Frequency-based analyses assume that the probability of $x$ mutations occurring in a protein of length $l$ follows a Poisson distribution with $\lambda = l \times m$, where $m$ is the average mutation rate in each dataset. The frequency p-value for each gene is calculated as $p = P[X \geq x]$.
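The Poisson tail probability above can be computed directly from its definition, $P[X \geq x] = 1 - \sum_{k=0}^{x-1} e^{-\lambda} \lambda^k / k!$. The sketch below uses only the standard library. The function and parameter names are hypothetical, and the mutation rate is assumed to be expressed per unit of protein length.

```python
import math

def frequency_pvalue(x, protein_length, mutation_rate):
    """P[X >= x] for X ~ Poisson(lambda = protein_length * mutation_rate)."""
    lam = protein_length * mutation_rate
    if x <= 0:
        return 1.0  # every count is >= 0
    # Accumulate P[X <= x - 1] term by term: e^{-lam} * lam^k / k!
    term = math.exp(-lam)  # k = 0 term
    cdf = term
    for k in range(1, x):
        term *= lam / k
        cdf += term
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - cdf)

# Illustrative values: a 1000-residue protein with an average rate of
# 0.001 mutations per residue, so lambda = 1.
p_one_or_more = frequency_pvalue(1, 1000, 0.001)
p_two_or_more = frequency_pvalue(2, 1000, 0.001)
```

Smaller p-values indicate genes mutated more often than their length alone would predict.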


Instructions

Step 1. Upload data

Input files should be single-sample VCF files generated against the reference genome of E. coli K-12 MG1655 (NCBI: U00096.3). A warning is shown if any entry in a VCF file does not match the reference genome. Mutations present in the founder strains are removed from the mutations in the evolved strains.
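The founder filtering described above amounts to a set difference between variant calls. The sketch below is a minimal illustration, not the tool's implementation: the function names are hypothetical, and matching variants on (CHROM, POS, REF, ALT) is an assumption.

```python
def read_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    variants = set()
    for line in vcf_lines:
        # Skip blank lines, "##" meta lines, and the "#CHROM" header line.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several comma-separated alternate alleles.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants

def remove_founder_variants(evolved_lines, founder_lines):
    """Keep only variants seen in the evolved strain but not in the founder."""
    return read_variants(evolved_lines) - read_variants(founder_lines)

# Illustrative records only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
founder = ["##fileformat=VCFv4.2", header,
           "U00096.3\t100\t.\tA\tG\t.\tPASS\t."]
evolved = ["##fileformat=VCFv4.2", header,
           "U00096.3\t100\t.\tA\tG\t.\tPASS\t.",
           "U00096.3\t2500\t.\tC\tT\t.\tPASS\t."]
new_mutations = remove_founder_variants(evolved, founder)
```

Here the variant at position 100 is shared with the founder and dropped, leaving only the variant at position 2500.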

Step 2. Simulate random mutations

Using more than 1000 random mutations as the background distribution is recommended.

Step 3. Evolutionary Action and frequency analyses

After the analyses are run, the gene rankings by the EA_KS, EA_sum and Frequency methods are returned as a table and a gene ranking plot. The axes of the gene ranking plot can be adjusted. Selecting a gene in the table highlights that gene in the ranking plot. In addition, the mutations in that gene in the evolved strains are returned, and their EA distribution is plotted.

STRING analysis can be run on the top-ranked genes to identify functional pathways that are under selection. By default, the top 5% of the mutated genes are submitted to the STRING database.

Quick EA search

Use quick EA search to check the functional impact (EA) of specific mutations of interest.

Download

All tables in the app can be filtered or sorted. Use the "CSV" or "Excel" button at the top of a table to download the data. Note that only the rows currently shown in the table are downloaded or copied. To download or copy the entire list, select "show all entries" first.

The entire MG1655 EA dataset can be downloaded through the download button in the Quick EA search tab.


Acknowledgement

This research is based upon work supported [in part] by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under BAA-17-01, contract #2019-19071900001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government.


Simulation settings


Advanced settings

Gene rankings plot settings


Search EA for a mutation


Enter a gene id and substitution combination to query the Evolutionary Action (EA), one entry per line. Either the locus tag or the gene name can be used as the gene id. We recommend using the locus tag to avoid the ambiguity of gene names. The substitution should follow the format M1C. The gene id and substitution should be separated by a space or tab. Examples are shown in the submission box.
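The expected query format can be checked with a short parser like the sketch below. It is illustrative only: the function name is hypothetical, the gene ids in the example are placeholders, and restricting substitutions to the 20 standard amino acid letters is an assumption about what the tool accepts.

```python
import re

# Assumption: reference residue, 1-based position, substituted residue,
# using the 20 standard one-letter amino acid codes (e.g. M1C).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")

def parse_ea_queries(text):
    """Parse 'gene_id<space or tab>substitution' lines into tuples."""
    queries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split()  # splits on spaces or tabs
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected 'gene_id substitution'")
        gene_id, substitution = parts
        match = SUBSTITUTION.match(substitution)
        if match is None:
            raise ValueError(f"line {line_number}: bad substitution {substitution!r}")
        ref, position, alt = match.groups()
        queries.append((gene_id, ref, int(position), alt))
    return queries

# Placeholder entries: one locus-tag style id and one gene-name style id.
parsed = parse_ea_queries("b0001 M1C\nthrA\tS2L\n")
```

Each parsed tuple holds the gene id, the reference residue, the position, and the substituted residue.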




Download EA for all proteins