### Introduction

This Shiny app is designed to identify phenotype driven gene in adapted E. coli. It contains three main modules:

• Driver Gene Analysis:
Two orthogonal approaches, EA (Evolutionary Action) integration and a Frequency-based method, are used to identify phenotype driven genes. EA integration compares the EA distribution between mutations for each gene in the evolve strains and a mutation background, then prioritize genes that have more impactful mutations (Marciano et al., 2022). The frequency-based method ranks the genes based on mutation count and gene length. See details below.

• Quick EA search:
Evolutionary Action (EA) scores predicts the functional impact of protein coding mutations (see details below). This module allows user to quickly search the EA score of any given mutations in the E. coli MG1655 genome.

• Structure Viewer:
Mapping mutation profile and EA/ET scores to AlphaFold structures.

#### Evolutionary Action (EA)

In brief, Evolutionary Action (EA) models the phenotypic impact of a given amino acid change comparing to its reference protein sequence (Katsonis and Lichtarge, 2014). It is defined as:

$EA = \Delta\phi = \nabla{f} \cdot \Delta\gamma$

Here $f$ is the fitness landscape, and $\nabla{f}$ is the gradient of the fitness landscape. EA or $\Delta\phi$ is the fitness response triggered by a small change (mutation) in the genotype $\Delta\gamma$. In practice, for a single point coding mutation from amino acid $X$ to $Y$ at a sequence position $i$, $\nabla{f}$ is approximated by the evolutionary importance of position $i$, given by the Evolutionary Trace (ET) algorithm (Lichtarge et al., 1996). And the magnitude of the substitution $\Delta\gamma$ is approximated by amino acids substitutions log odds tables. EA has a value between 0-100, where larger values suggest higher functional impact.

#### EA integration

Genes under selection during evolution will accumulate more mutations with high functional impact. The fitness impact of a given set of mutations (in this case, mutations inside a given gene) can be evaluated by integrating the EA scores of these mutations. EA_KS and EA_sum can be used to approximate EA integration.

In the EA_KS approach, mutated genes are ranked by their EA mutational impact profile with a non-parametric Kolmogorov-Smirnov (KS) test against a mutation background, usually random mutations or mutations that occur without the selection of interest. The driver genes are expected to have a bised EA distribution towards high EA scores.

For the EA_sum approach, EA scores for all coding mutations observed in a gene across samples are summed and compared to the expected values from the same mutation background as EA_KS. The expected EA_sum values are calculated as avg_EA $\times$ expected mutation count, where avg_EA is the average EA score of all mutations in the mutation background and the expected mutation count is determined by the mutation count in the samples and gene length. The driver genes are expected to have higher EA_sum than non-driver genes.

#### A frequency-based method

A frequency-based analysis is performed based on the assumption that the probability of x mutations occurring in a protein with given length $l$, follows a Poisson distribution with $\lambda = l \times m$, where $m$ is the average mutation rate in each dataset. The frequency p-value for each gene was calculated by $p = P[X{\geq}x]$.

### Driver gene analysis

Mutation data can be input from variant call format (VCF) files or substitution (SUB) files. One file per sample. The VCF or SUB files should be generated using the reference genome of E. coli K-12 MG1655 (NCBI: U00096.3). A warning will show up if there are entries in the VCF/SUB files that do not match to the reference genome. SUB files should contain two columns, input_id and SUB, that are separated by comma (csv file), where input_id stores the locus_tags (e.g. b4112) or gene names (e.g. basS), and SUB stores the amino acid substitution (e.g. C84R).

Mutations in the founder strains (strains that are sequenced before selection) are less likely to contribute to the adapted phenotype. Thus, they are subtracted out from the evolve strains during the analysis.

#### Step 2: Get background distribution

EA integration compares the EA distribution between mutations in the evolve strains and a mutation background (mutations that are not under selection). This background can be generated by simulating random mutations in the MG1655 genome, or by gathering mutations occurred when passing MG1655 without selective pressure. Using more than 1000 mutations as background distribution is recommended.

#### Step 3. Run analysis

After running the analysis, the gene rankings by EA_KS, EA_sum and Frequency method will be returned as a table and a gene ranking plot. The axes of the gene ranking plot can be adjusted. When selecting a specific gene in the table, that gene will be highlighted in the ranking plot. In addition, the mutations in that gene in the evolve strains will be returned, and their EA distribution will be plotted.

Driver genes for a given phenotype tend to function in the same biological pathway. To evaluate the functional relationship of the top ranked genes, STRING PPI enrichment analysis (Szklarczyk et al., 2021) can be run for the top predictions utilizing STRING API. Significant clustering is expected for top predictions. By default, top 5% of the mutated genes are used.

#### Step 4. Overlapping genes

Genes that are predicted at the top by all methods are more likely to be drivers. Venn diagram is generated for the top genes from all three methods. Click on the Venn diagram to highlight overlapping genes. STRING analysis can be run with the highlighted genes.

#### Quick EA search

Use quick EA search to check the functional impact (EA) of specific mutations of interest.

All tables in the app can be filtered or sorted. Use the "CSV" or "Excel" button on the top of the tables to download the data. Note that only what is shown in the table will be downloaded/copied. To downloaded/copied the entire list, select "show all entries" first.

### Mapping to protein structure

Mutations in driver genes tend to cluster together in protein structure. The Structure Viewer allows users to map values below to E. coli K-12 MG1655 structures from AlphaFold (Jumper et al., 2021).

• Evolutionary Trace (ET):
Estimates the evolutionary importance of a residue position in a protein.

• AlphaFold pLDDT:
The structure prediction accuracy score from AlphaFold.

• Mutation count and sumEA:
Estimates the overall mutational/EA burden on a residue position.

Mutation information can be imported from evolve strains (Driver Gene Analysis, step 1) or Quick EA Search.

The raw values and color code for the structure is available in the Color table. A Pymol session file is available to reproduce the coloring locally in Pymol.

### Acknowledgement

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under BAA-17-01, contract #2019-19071900001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government.

### References

Marciano DC, Wang C, Hsu TK, Bourquard T, Atri B, Nehring RB, Abel NS, Bowling EA, Chen TJ, Lurie PD, Katsonis P, Rosenberg SM, Herman C, Lichtarge O.
Evolutionary action of mutations reveals antimicrobial resistance genes in Escherichia coli.
Nat Commun. 2022 Jun 9;13(1):3189. [link]

Katsonis P, Lichtarge O.
A formal perturbation equation between genotype and phenotype determines the Evolutionary Action of protein-coding variations on fitness.

Lichtarge O, Bourne HR, Cohen FE.
An evolutionary trace method defines binding surfaces common to protein families.
J Mol Biol. 1996 Mar 29;257(2):342-58. [link]

Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, Doncheva NT, Legeay M, Fang T, Bork P, Jensen LJ, von Mering C.
The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets.
Nucleic Acids Res. 2021 Jan 8;49(D1):D605-D612. doi: 10.1093/nar/gkaa1074. Erratum in: Nucleic Acids Res. 2021 Oct 11;49(18):10800. [link]

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, <U+017D><U+00ED>dek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D.
Highly accurate protein structure prediction with AlphaFold.
Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15. [link]

### Search EA for missense mutations

Gene id and substitution combination should be entered to query the Evolutionary Action (EA). One entry per line. Either locus tag or gene name can be used as gene id. We recommend the using the locus tag to avoid the ambiguity of gene name. Subsitution should follow the format as M1C. Gene id and subsitution should be seperated with space or tab. EA scores are scaled within each protein from 0 to 100, with 100 as the most impactful mutation on protein function.