Pedel revamp

In our paper (Firth & Patrick, 2005), we assumed a Poisson distribution to determine the fraction of sequences in an epPCR library that contain exactly 0, 1, 2, 3, ... mutations, given the mean number of mutations, m, per sequence.
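As a minimal sketch of the Poisson assumption described above, the fraction of sequences carrying exactly x mutations can be computed as follows. The function name `poisson_fraction` is hypothetical and is not taken from the PEDEL source.

```python
from math import exp, factorial

def poisson_fraction(x, m):
    """Fraction of sequences with exactly x mutations, given mean m (Poisson)."""
    return exp(-m) * m**x / factorial(x)

# Fractions of the library with 0, 1, 2 and 3 mutations when m = 4
fractions = [poisson_fraction(x, 4.0) for x in range(4)]
```

The fractions over all x sum to 1, and the x = 0 term, exp(-m), is the unmutated fraction of the library.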

Since publication of Firth & Patrick (2005), however, Drummond et al. (2005) have revisited the pioneering work of Sun (1995) and provided experimental evidence in support of his more accurate equation describing the distribution of the number of mutations per sequence. This 'PCR distribution' takes into account the number of PCR thermal cycles, ncycles, and the PCR efficiency, eff (i.e. the probability that any particular sequence is duplicated in a given PCR cycle). We have therefore now included the PCR distribution as an optional alternative to the Poisson distribution in PEDEL.

For large m, small ncycles or low eff, the PCR distribution is broader than the Poisson distribution. For small m, large ncycles and high eff, the PCR distribution approximates the Poisson distribution. In a 'typical' epPCR (e.g. ncycles = 30, eff = 0.6, m = 4), the estimates of the total number of distinct sequences in a library obtained with the two distributions usually agree to within 5%, though the sub-library statistics can show more variation.
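The PCR distribution itself is not written out on this page. The sketch below assumes the form associated with Sun (1995) and Drummond et al. (2005), in which the number of duplication events k in a sequence's history is binomially weighted and the mutation count is Poisson with mean k × lam; this form, and the names `pcr_pmf` and `lam`, are assumptions for illustration and should be checked against those papers. Under this assumed form the variance is m + m^2/(ncycles × eff), which is consistent with the broadening described above.

```python
from math import comb, exp, lgamma, log

def poisson_pmf(x, mean):
    """Poisson probability of x events; handles mean = 0 explicitly."""
    if mean == 0.0:
        return 1.0 if x == 0 else 0.0
    return exp(x * log(mean) - mean - lgamma(x + 1))

def pcr_pmf(x, m, ncycles, eff):
    """Assumed PCR distribution: binomial mixture of Poisson distributions."""
    # Per-duplication mutation rate chosen so that the overall mean equals m
    lam = m * (1.0 + eff) / (ncycles * eff)
    total = 0.0
    for k in range(ncycles + 1):
        # Weight of sequences whose history contains k duplication events
        weight = comb(ncycles, k) * eff**k / (1.0 + eff)**ncycles
        total += weight * poisson_pmf(x, k * lam)
    return total

# 'Typical' epPCR from the text: ncycles = 30, eff = 0.6, m = 4
unmutated_pcr = pcr_pmf(0, 4.0, 30, 0.6)
unmutated_poisson = poisson_pmf(0, 4.0)
```

With these parameters the assumed PCR distribution has variance 4 + 16/18, compared with 4 for the Poisson distribution with the same mean.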

If you know ncycles and eff, then we recommend that you use the PCR distribution instead of the Poisson distribution. Drummond et al. (2005) use the formula d = ncycles × eff, where d is the number of doublings, so that eff = d ÷ ncycles. For example, if you start with 10^9 identical parent sequences and amplify them in an epPCR to 10^15 sequences, then you have had about d = 20 doublings (10^9 × 2^20 ~= 10^15). However, the d = ncycles × eff formula is not correct. Each cycle multiplies the number of sequences by (1+eff) on average, so the correct formula is 2^d = (1+eff)^ncycles, and the efficiency is given by eff = 2^(d/ncycles) - 1 (PCR efficiency calculator).
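The efficiency calculation can be sketched as below. The function names are hypothetical, and ncycles = 30 is borrowed from the 'typical' epPCR example as an assumption, since the amplification example does not state a cycle number.

```python
from math import log2

def doublings(n_start, n_end):
    """Number of doublings d needed to amplify n_start sequences to n_end."""
    return log2(n_end / n_start)

def pcr_efficiency(d, ncycles):
    """Efficiency from 2^d = (1 + eff)^ncycles, i.e. eff = 2^(d/ncycles) - 1."""
    return 2.0 ** (d / ncycles) - 1.0

d = doublings(1e9, 1e15)            # about 19.9 doublings
eff = pcr_efficiency(d, 30)         # about 0.585
eff_linear = d / 30                 # d = ncycles x eff formula gives about 0.664
```

For these numbers the d = ncycles × eff formula overestimates the efficiency by roughly 0.08.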

References

Drummond D.A., Iverson B.L., Georgiou G., Arnold F.H. (2005). Why high-error-rate random mutagenesis libraries are enriched in functional and improved proteins, J. Mol. Biol., 350, 806-816.

Firth A.E., Patrick W.M. (2005). Statistics of protein library construction, Bioinformatics, 21, 3314-3315.

Sun F. (1995). The polymerase chain reaction and branching processes, J. Comput. Biol., 2, 63-86.

PEDEL-AA x = 0, 1 and 2 sub-library statistics

For x = 0, 1 and 2, we calculate the expected number of distinct variants, Cx, precisely. This calculation includes variants with multiple nucleotide substitutions in the same codon.

The total number of each of the 64 codon types in the input sequence is calculated. The 64 x 20 matrix of probabilities for each codon type mutating to each amino acid type is calculated using the input nucleotide substitution matrix and the input substitution rate.

For x = 0, 1 and 2 there are, respectively, Vx_2 = 1, 19N and 361N(N-1)/2 total possible variants (i.e. N!/[x!(N-x)!] × 19^x, where N is the length of the input sequence in codons). The probability of the input sequence mutating to each of these possible variants is calculated and renormalized by the respective probability sum P0, P1 or P2 (where Px = Sum_{v_i in Vx_2} P(v_i)) to give the normalized probabilities Pn(v_i) of the different variants within the respective sub-libraries Lx, rather than within the whole library. The probability of a particular variant v_i being present in the relevant sub-library Lx is given by 1 - exp(-Pn(v_i) × Lx). These probabilities are summed over all possible variants using the codon counts. Computationally, this is very fast for x = 0, 1 and 2, but can take a few minutes for x = 3; hence the 'exact' calculation is not used on the webserver for x >= 3. The sizes of the sub-libraries Lx are determined separately for the Poisson and PCR distributions as described in the notes on the PEDEL-AA algorithms, thus resulting in separate Cx estimates for the different distributions.
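The counting and summation steps above can be sketched as follows. This is an illustration only: it loops over an explicit list of variant probabilities rather than using the codon-count shortcut that the server relies on, and the function names are hypothetical.

```python
from math import comb, exp

def n_variants(n_codons, x):
    """Number of possible variants with exactly x amino acid substitutions."""
    return comb(n_codons, x) * 19**x

def expected_distinct(variant_probs, sublibrary_size):
    """Expected number of distinct variants Cx in a sub-library of size Lx.

    variant_probs holds the unnormalized probabilities P(v_i) of the variants;
    they are renormalized by their sum Px before use.
    """
    px = sum(variant_probs)
    return sum(1.0 - exp(-(p / px) * sublibrary_size) for p in variant_probs)

# N = 10 codons: 1, 190 and 16245 variants for x = 0, 1 and 2
counts = [n_variants(10, x) for x in range(3)]
```

For a fixed sub-library size, unequal variant probabilities give a smaller expected number of distinct variants than equal probabilities do.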

Ideally, for x >= 3, we will enter the Cx ~ Lx region. In this case all the individual Cx estimates, and the estimated total number of distinct variants in the library C = C0 + C1 + C2 + ..., will be fairly good. A warning is printed in the 'notes' column if there are any x >= 3 values for which the Cx ~ Lx approximation may not apply, in which case Cx is estimated with the formula Cx ~ Vx_1(1-exp(-Lx/Vx_1)) (i.e. ignoring, in these particular sub-libraries, any variants of type Vx_2).
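A small sketch of the fallback formula Cx ~ Vx_1(1-exp(-Lx/Vx_1)) shows how it reduces to Cx ~ Lx when the sub-library is small relative to the number of possible variants. The function names are hypothetical, and the default threshold of 0.1 is taken from the 'Lx < 0.1 Vx_1' criterion used on this page.

```python
from math import expm1

def cx_equiprobable(lx, vx1):
    """Expected distinct variants when all vx1 variants are equally likely."""
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for very small t
    return vx1 * -expm1(-lx / vx1)

def in_cx_approx_lx_region(lx, vx1, threshold=0.1):
    """True when the 'Lx < 0.1 Vx_1' criterion is met."""
    return lx < threshold * vx1

# Small sub-library relative to the variant space: Cx is very close to Lx
cx_small = cx_equiprobable(100.0, 1e6)
# Sub-library much larger than the variant space: Cx saturates at Vx_1
cx_large = cx_equiprobable(1e9, 1000.0)
```

At the boundary Lx = 0.1 Vx_1 the formula gives Cx of about 0.95 Lx, so within the criterion Cx ~ Lx holds to within about 5% under the equiprobable assumption.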

The 'Lx < 0.1 Vx_1' criterion for deciding when to use the 'Cx ~ Lx' approximation is sometimes inaccurate, and can be refined as follows.

First consider a single nucleotide substitution in a single codon. There are 9 possible mutated codons. An amino acid substitution that is encoded by only one of the 9 codons, and that requires a transversion, has a probability of only 1 in 15 (assuming a transition:transversion ratio of 3): if p is the probability of a given transversion, then 3p is the probability of a given transition, and the total probability of the 9 mutated codons is 6(p) + 3(3p) = 15p.

For example, if the parent codon is GGG (Gly), then the 9 single-nucleotide-substitution codons are

codon	amino acid	relative probability
AGG	Arg	3p
CGG	Arg	p
TGG	Trp	p
GAG	Glu	3p
GCG	Ala	p
GTG	Val	p
GGA	Gly	3p
GGC	Gly	p
GGT	Gly	p

AA	Probabilities given the codon mutates	Probabilities given the amino acid mutates
Gly	5/15	(wild-type)
Arg	4/15	4/10
Glu	3/15	3/10
Trp	1/15	1/10
Ala	1/15	1/10
Val	1/15	1/10
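The two tables above can be reproduced with a short enumeration. This is a sketch under the same assumption as the text (each transition weighted 3 relative to each transversion); the names are hypothetical, and the standard genetic code is assumed.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def single_substitution_probs(codon, ts_weight=3):
    """Amino acid probabilities given that the codon has one substitution."""
    weights = defaultdict(int)
    for pos, old in enumerate(codon):
        for new in BASES:
            if new == old:
                continue
            mutant = codon[:pos] + new + codon[pos + 1:]
            w = ts_weight if (old, new) in TRANSITIONS else 1
            weights[CODON_TABLE[mutant]] += w
    total = sum(weights.values())
    return {aa: Fraction(w, total) for aa, w in weights.items()}

def nonsynonymous_probs(codon, ts_weight=3):
    """Probabilities given that the amino acid changes (stops count as '*')."""
    probs = single_substitution_probs(codon, ts_weight)
    wild = CODON_TABLE[codon]
    rest = 1 - probs.get(wild, 0)
    return {aa: p / rest for aa, p in probs.items() if aa != wild}

given_codon = single_substitution_probs("GGG")   # Gly 5/15, Arg 4/15, ...
given_aa = nonsynonymous_probs("GGG")            # Arg 4/10, Glu 3/10, ...
```

Exact fractions are used so that the results can be compared directly with the tabulated values.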

The 'Lx < 0.1 Vx_1' criterion assumes that all of the single-nucleotide-substitution non-synonymous amino acid substitutions are equiprobable, i.e. 1 in 5 in the above example, but in general represented by the reciprocal of the 'A' factor described in the above section, where typically A ~ 5.8. In fact, the most common single-nucleotide-substitution amino acid substitution (GGG -> Arg) is 4 times as likely as the rarest (GGG -> Trp, Ala or Val). In cases where some nucleotide substitutions (as defined by the 4 x 4 nucleotide substitution matrix) are particularly rare, the probability difference between the rarest and the most common single-nucleotide-substitution amino acid substitutions at a given site can be much greater.

The 'Lx < 0.1 Vx_1' criterion for being in the 'Cx ~ Lx' region is basically to make sure that there are enough variants in Vx to 'absorb' all Lx sub-library members so that (within a small error) at most one sub-library member is equal to any given variant in Vx. In practice, it doesn't matter what the probability of the rarest variants is. What matters for the 'Cx ~ Lx' approximation is that the mean frequency in Lx of the most common variant is < 0.1. In fact the mean frequency of the most common variant in Lx, which we denote by Rx, is easy to calculate for x = 0, 1, 2, ..., 20, ..., and is shown in the PEDEL-AA output table of sub-library statistics.

Using these Rx values, the 'Lx < 0.1 Vx_1' criterion would be replaced with the criterion 'Rx < 0.1'. In practice this means that if, in the table of sub-library statistics, there are Rx values > 0.1, for which the 'Cx ~ Lx' approximation has been used (i.e. x >= 3 and Lx < 0.1 Vx_1), then the particular corresponding Cx values may be overestimates. A warning and html link are given in the table of sub-library statistics whenever this occurs.
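The refined criterion can be sketched as follows, reading Rx as the sub-library size Lx multiplied by the normalized probability of the most likely variant. That reading, and the function names, are assumptions for illustration rather than the PEDEL-AA implementation.

```python
def most_common_variant_frequency(variant_probs, lx):
    """Rx: expected number of copies of the most likely variant in Lx."""
    px = sum(variant_probs)          # normalize within the sub-library
    return lx * max(variant_probs) / px

def cx_may_be_overestimated(rx, threshold=0.1):
    """True when the 'Rx < 0.1' criterion fails."""
    return rx > threshold

# Relative weights 4, 3, 1, 1, 1 as in the GGG example; one sub-library member
rx = most_common_variant_frequency([4, 3, 1, 1, 1], 1.0)
flag = cx_may_be_overestimated(rx)
```

Here Rx is 0.4, so the 'Cx ~ Lx' approximation would be flagged even though only one of the five possible variants can be present.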

Property	Volles & Lansbury	Firth & Patrick (Poisson)	Firth & Patrick (PCR)
Truncations (%)	15	15.6
# Full-length clones	3.1 x 10^6	3.18 x 10^6
Protein mutation freq. per aa	0.016	0.0160
Mean # mutations per protein	2.1	2.12
Unmutated sequences (%)*	14	10.1	14.0
# of unique proteins	1.3 x 10^6	1.32 x 10^6	1.29 x 10^6
# of unique point mutations	1990	1989
# of unique single point mutations	1566	1618	1618

PedelAA

Programme for Estimating Diversity in Error-prone PCR Libraries (amino acid version)

