Programme for Estimating Diversity in Error-prone PCR Libraries


Given a library of L sequences, comprising variants of a sequence of N nucleotides into which random point mutations have been introduced, we wish to calculate the expected number of distinct sequences in the library. (The calculation typically assumes L > 10, N > 5, and a mean number of mutations per sequence m < 0.1 x N.)

Example

Saab-Rincon et al. (2001, Protein Eng., 14, 149-155) constructed a library of 5 million clones with a single round of epPCR on a 700 bp gene. Sequencing 10 of these clones indicated an error rate of 3-4 nucleotide substitutions per daughter sequence. Entering L = 5000000, N = 700 and m = 3.5 into the base PEDEL server page, and clicking 'Calculate', shows that the expected number of distinct sequences in the library is 4.153 x 10^6, or about 4.2 million.
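
The calculation in this example can be approximated offline. The following is a minimal sketch, not the server's own code: it assumes that mutations per sequence are Poisson distributed with mean m, that the number of possible variants with x substitutions is Vx = C(N, x) x 3^x, and that the expected number of distinct sequences in each sub-library is roughly Vx x (1 - exp(-Lx/Vx)). The function name expected_distinct is hypothetical; see Patrick et al. (2003) for the algorithm PEDEL actually uses.

```python
import math

def expected_distinct(L, N, m):
    """Sketch: expected number of distinct sequences in a library of L
    sequences of length N with a Poisson mean of m substitutions each."""
    total = 0.0
    px = math.exp(-m)  # Poisson probability of x = 0 substitutions
    vx = 1.0           # possible variants with x substitutions (float; may reach inf)
    for x in range(N + 1):
        if x > 0:
            px *= m / x                  # Poisson recurrence P(x) = P(x-1) * m / x
            vx *= 3.0 * (N - x + 1) / x  # Vx = C(N, x) * 3^x, built up step by step
        lx = L * px                      # expected size of the sub-library
        if lx < 1e-8 * vx:
            cx = lx                      # Vx >> Lx: essentially no duplication
        else:
            cx = -vx * math.expm1(-lx / vx)  # Vx * (1 - exp(-Lx / Vx))
        total += cx
    return total

# Parameters from the Saab-Rincon et al. example above
print(f"{expected_distinct(5_000_000, 700, 3.5):.3e}")
```

Under these assumptions the result should land close to the figure quoted above; small differences from the server output would reflect differences in the approximations used.
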

If you follow the link to 'detailed statistics' and, once again, enter L = 5000000, N = 700 and m = 3.5 and click 'Calculate', you get a breakdown of library statistics for each of the sub-libraries comprising all those daughter sequences with exactly x base substitutions (x = 0, 1, 2, 3, ...).

For example, the first line of the table shows that Px = 3.02% of the library (i.e. Lx = 1.51 x 10^5 daughter sequences) have x = 0 base substitutions (i.e. they are identical to the parent sequence). The total number of possible variants with 0 base substitutions is, of course, Vx = 1 (just the parent sequence) and the total number of distinct sequences with 0 base substitutions present in the library is, similarly, Cx = 1. The completeness of the x = 0 sub-library is Cx/Vx = 100%. The redundancy of this sub-library, i.e. wasted duplication, is Lx - Cx = 1.51 x 10^5.
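
A table of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the server's implementation: Px is taken from a Poisson distribution with mean m, Vx = C(N, x) x 3^x, and Cx is approximated as Vx x (1 - exp(-Lx/Vx)). The function name sublibrary_table and the max_x parameter are hypothetical.

```python
import math

def sublibrary_table(L, N, m, max_x=10):
    """Sketch: per-sub-library statistics for x = 0 .. max_x substitutions."""
    rows = []
    px, vx = math.exp(-m), 1.0
    for x in range(min(max_x, N) + 1):
        if x > 0:
            px *= m / x                  # Poisson recurrence
            vx *= 3.0 * (N - x + 1) / x  # Vx = C(N, x) * 3^x
        lx = L * px                      # expected sub-library size
        # Expected distinct sequences; fall back to Lx when Vx >> Lx
        cx = lx if lx < 1e-8 * vx else -vx * math.expm1(-lx / vx)
        rows.append({"x": x, "Px": px, "Lx": lx, "Vx": vx, "Cx": cx,
                     "completeness": cx / vx, "redundancy": lx - cx})
    return rows

for row in sublibrary_table(5_000_000, 700, 3.5, max_x=5):
    print("x={x:d}  Px={Px:.2%}  Lx={Lx:.3g}  Vx={Vx:.3g}  Cx={Cx:.3g}  "
          "Cx/Vx={completeness:.2%}  Lx-Cx={redundancy:.3g}".format(**row))
```

The x = 0 row reproduces the values discussed above (Px of about 3.02%, Vx = 1, Cx = 1); rows for larger x depend on the approximation chosen for Cx.
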

You also have the option to plot these data by following the 'Plot this data' link. Choose the statistic to plot and whether or not to use a log scale on the y-axis. For example, a plot of Px or Lx gives a Poisson distribution. A plot of Vx shows how the number of possible variants increases very rapidly as the number of base substitutions is increased. A plot of Cx shows how the expected number of distinct sequences in the sub-libraries initially increases (limited by the number of possible variants, Vx) and then decreases (limited by the size of the sub-library, Lx). A plot of Lx - Cx shows the extent of wasted duplication in the lower x-value sub-libraries.

Returning to the base PEDEL server page, you can follow links to plot the expected number of distinct sequences in a library for a range of mutation rates, library sizes or sequence lengths. The third option probably won't be very useful, but the first two will help you to decide what library size to aim for in order to obtain a given diversity, and what mutation rate to use to maximize the diversity for a given library size.

For example, follow the 'mutation rates' link, enter L = 5000000, N = 700 and m = 0.2 - 20, and click 'Calculate'. From the plot, you can see that the expected number of distinct sequences increases rapidly with m until m ~ 5, and then levels off with < 10% redundancy in the library. On the other hand, if you chose m ~ 1.5, then the library would be about 60% redundant. After selecting an optimal mutation rate m, you can go back to the 'detailed statistics' page to check the expected completeness of the x = 0, 1, 2, 3, ... sub-libraries.
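
The trend described here can be explored with a short sweep over mutation rates. This is a sketch under the same kind of assumptions as the Poisson model itself (Poisson-distributed substitutions, Vx = C(N, x) x 3^x, and Cx approximated as Vx x (1 - exp(-Lx/Vx))); the function name expected_distinct and the chosen grid of m values are illustrative, not part of PEDEL.

```python
import math

def expected_distinct(L, N, m):
    """Sketch: expected number of distinct sequences under a Poisson model."""
    total, px, vx = 0.0, math.exp(-m), 1.0
    for x in range(N + 1):
        if x > 0:
            px *= m / x                  # Poisson recurrence
            vx *= 3.0 * (N - x + 1) / x  # Vx = C(N, x) * 3^x
        lx = L * px
        # Vx * (1 - exp(-Lx / Vx)), or simply Lx when Vx >> Lx
        total += lx if lx < 1e-8 * vx else -vx * math.expm1(-lx / vx)
    return total

L, N = 5_000_000, 700
for m in (0.2, 0.5, 1.5, 3.5, 5, 10, 20):
    c = expected_distinct(L, N, m)
    # Redundancy here means the fraction of the library that is duplicated
    print(f"m = {m:>4}: distinct = {c:.3e}, redundancy = {1 - c / L:.1%}")
```

Redundancy is defined in this sketch as 1 - C/L, where C is the expected number of distinct sequences; that definition is an assumption chosen to match the percentages quoted above.
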

Caveats

PEDEL uses a generic Poisson model of sequence mutations. There are several simplifications that you should be aware of:

  • All base substitutions are assumed to be equally likely. In reality, under error-prone conditions, the polymerase favours some substitutions over others, which reduces the expected number of distinct sequences relative to the PEDEL predictions. This is not as big an issue as you might expect. Using the notation from the 'detailed statistics' page (see link on base PEDEL server page), bias matters little when the number of possible variants Vx is much greater than the sub-library size Lx (i.e. large x values), since there are then so many possible variants that there is little duplication within the sub-library even if the bias is strong. Conversely, if Lx is much greater than Vx (i.e. small x values) then, unless the bias is very strong, nearly all the possible variants will still be sampled. Note that it is now possible to produce unbiased libraries by using sequential PCR amplifications with two different polymerases that have opposite substitution biases.
  • Amplification bias is inherent to the PCR process used to produce epPCR libraries: any mutation introduced in an early PCR cycle will be present in a significant fraction of the final library. In practice, researchers use a variety of techniques to reduce amplification bias, e.g. reducing the number of epPCR cycles and combining a number of individual libraries. For example, one might start with 10^9 identical parent sequences; amplify them in an epPCR to 10^15 sequences; and, after ligation and transformation of E. coli, end up with a library of 10^7 sequences. Any mutation propagated by amplification bias would have a maximum frequency of only about 1 in 10^9, so it would not show up in the final library.
  • During the PCR cycles, different parent sequences may be amplified a different number of times. However, empirically, the end result is a library with a Poisson distribution of mutations (e.g. Cadwell R.C., Joyce G.F., 1992, Randomization of genes by PCR mutagenesis, PCR Methods Appl., 2, 28-33). But see also this note.
  • Any biases in library construction will decrease the actual number of distinct variants represented in the library. In such cases, PEDEL provides the user with a useful upper bound on the diversity present in the library.
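
To illustrate the first caveat above, the sketch below compares the expected number of distinct variants for uniform and biased sampling within a single sub-library. It uses the identity that, for independent draws, the expected number of distinct variants is the sum over variants of 1 - (1 - p_i)^Lx. The bias pattern (half of the variants four times as likely as the rest) and the function name are purely hypothetical and are not measured polymerase biases; Vx = 2100 corresponds to single substitutions in a 700 bp gene.

```python
import math

def expected_distinct_weighted(weights, draws):
    """Expected number of distinct variants after `draws` independent samples,
    where variant i has probability weights[i] / sum(weights).
    Assumes at least two variants, each with positive weight."""
    total = sum(weights)
    # 1 - (1 - p)^draws, written with log1p/expm1 for numerical stability
    return sum(-math.expm1(draws * math.log1p(-w / total)) for w in weights)

V = 2100                                   # possible variants with x = 1 for N = 700
uniform = [1.0] * V
biased = [4.0] * (V // 2) + [1.0] * (V // 2)  # hypothetical 4:1 bias

for draws in (50, 2000, 500_000):          # Lx << Vx, Lx ~ Vx, Lx >> Vx
    u = expected_distinct_weighted(uniform, draws)
    b = expected_distinct_weighted(biased, draws)
    print(f"Lx = {draws:>7}: uniform = {u:8.1f}, biased = {b:8.1f}")
```

In this toy setting the two estimates nearly coincide when Lx is much smaller or much larger than Vx, and differ most when Lx and Vx are comparable, which is consistent with the uniform-substitution estimate acting as an upper bound.
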
Please refer to Patrick W.M., Firth A.E., Blackburn J.M., 2003, User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries, Protein Eng., 16, 451-457 for further discussion of PEDEL.

A good review of the sources of bias in epPCR (and other directed evolution protocols) can be found in Neylon C., 2004, Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution, Nucleic Acids Res., 32, 1448-1459.

