Gene Conservation Laboratory
Statistics Program for Analyzing Mixtures (SPAM) Software

SPAM Sampling Zeros


What is a sampling zero? Why should you worry about them?

A sampling zero occurs when an allele that is present in the population is not detected in the baseline sample, thereby producing a population allele frequency estimate of zero. This is in contrast to a structural zero, which occurs when an allele is absent from the population (and hence is an informative indicator for mixture estimation). Unfortunately, given just a baseline sample of alleles, we have no way to distinguish sampling zeros from structural zeros.
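How easily a sampling zero arises can be sketched with a short calculation. Assuming the n gene copies in a baseline sample are independent random draws from the population (an assumption made here for illustration, not a statement about any particular sampling design), an allele with true frequency p goes undetected with probability (1 - p)^n. The function name and the example frequencies are hypothetical.

```python
def prob_sampling_zero(p: float, n_gene_copies: int) -> float:
    """Chance that an allele with true frequency p is absent from a sample
    of n independent gene copies (assumes simple random sampling)."""
    return (1.0 - p) ** n_gene_copies


# 100 diploid individuals = 200 gene copies (hypothetical baseline sample)
for p in (0.001, 0.005, 0.01, 0.02):
    print(f"true frequency {p:.3f}: P(sampling zero) = {prob_sampling_zero(p, 200):.3f}")
```

Under this assumption, an allele at a true frequency of 0.01 is still missed in about 13% of baseline samples of 100 diploids.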

Who cares? You should.

  1. The occurrence of sampling zeros increases with marker polymorphism. These 'missed' low frequency alleles don't, individually, account for much of the allele population (i.e., only a small percentage of the population has a specific allele). However, taken together, 'missed' alleles may account for a decent portion of the allele population (i.e., sum the frequencies of all the 'missed' alleles at a locus). This proportion can increase quite rapidly when one considers multiple loci (Reynolds, J. H. In preparation).
  2. Conditional maximum likelihood estimation for mixed stock analysis (which is what SPAM does) treats sampling zeros as structural zeros. If Population 1's baseline says that an allele has zero probability of occurring in Population 1, then any mixture genotype that contains that allele has zero probability of having originated in Population 1. This can lead to severe underestimation bias in mixture estimates, due to sampling zeros alone, WHEN one is using highly polymorphic loci and/or a large number of loci of moderate to high polymorphism (Reynolds, J. H. In preparation). In contrast, this isn't much of a problem when using low polymorphism loci such as allozymes.
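The effect described in point 2 can be sketched as follows. This is a minimal illustration, not SPAM's code: it assumes Hardy-Weinberg genotype proportions within a population and independence among loci, and every function and variable name is hypothetical. Locus1 uses the sample allele frequencies from the binning example in this FAQ; Locus2 is invented.

```python
def genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of a diploid genotype at one locus."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2
    return 2 * freqs[a] * freqs[b]


def multilocus_prob(genotypes, baseline):
    """Product of single-locus probabilities (assumes independent loci)."""
    prob = 1.0
    for locus, genotype in genotypes.items():
        prob *= genotype_prob(genotype, baseline[locus])
    return prob


pop1 = {
    "Locus1": {"a": 0.75, "b": 0.15, "c": 0.07, "d": 0.03, "e": 0.00},
    "Locus2": {"x": 0.50, "y": 0.50},  # invented frequencies
}
pop2 = {
    "Locus1": {"a": 0.30, "b": 0.60, "c": 0.03, "d": 0.00, "e": 0.07},
    "Locus2": {"x": 0.40, "y": 0.60},  # invented frequencies
}

# A mixture fish carrying allele e, which was not observed in Population 1
fish = {"Locus1": ("a", "e"), "Locus2": ("x", "y")}

print("P(genotype | Population 1) =", multilocus_prob(fish, pop1))
print("P(genotype | Population 2) =", multilocus_prob(fish, pop2))
```

A single zero frequency at one locus forces the whole multilocus probability for Population 1 to zero, no matter how well the fish matches that population at every other locus.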

What can you do about it? (in no particular order of preference)

  1. take bigger baseline samples.
  2. bin alleles to reduce or eliminate zero allele frequencies.

    For example, consider the following sample allele frequencies for two populations:

    Allele          a     b     c     d     e
    Population 1    0.75  0.15  0.07  0.03  0.00
    Population 2    0.30  0.60  0.03  0.00  0.07

    One approach to alleviating zeros is to combine, or bin, low frequency alleles so as to eliminate observed zero frequencies:

    Allele          a     b     c     d & e
    Population 1    0.75  0.15  0.07  0.03
    Population 2    0.30  0.60  0.03  0.07

    At the moment, the literature contains little theoretical guidance for binning. The best approach will depend on the mutation model most appropriate to the marker at hand. For example, assuming a stepwise mutation model suggests binning 'neighboring' alleles (for example, see Beacham et al. 1996 and references therein). The infinite alleles model presents a less structured situation that may be more amenable to binning algorithms driven strictly by observed allele frequency magnitude.

    Another alternative is to bin indirectly, as it were, by conducting a multivariate data reduction on the observed allele frequencies across populations. For example, principal components analysis has been used to reduce the dimensionality of the genotype space and, in effect, eliminate or ameliorate zero allele frequency issues as a byproduct (Beacham et al. 1998). It is worth noting that the originators no longer seem to use this method, relying instead on binning of neighboring alleles. As yet, there is no published discussion or comparison of the many competing (ad hoc) methods of binning.
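The direct binning step in the d & e example can be sketched in a few lines. This is a minimal illustration with hypothetical names; the choice of which alleles to combine is supplied by hand here and is not derived from any mutation model.

```python
def bin_alleles(freqs, bins):
    """Sum the frequencies of the original alleles assigned to each bin.

    freqs: dict mapping allele label -> sample frequency
    bins:  dict mapping new bin label -> list of original allele labels
    """
    return {label: sum(freqs[a] for a in members) for label, members in bins.items()}


pop1 = {"a": 0.75, "b": 0.15, "c": 0.07, "d": 0.03, "e": 0.00}
pop2 = {"a": 0.30, "b": 0.60, "c": 0.03, "d": 0.00, "e": 0.07}

# The same bins must be applied to every population in the baseline
bins = {"a": ["a"], "b": ["b"], "c": ["c"], "d & e": ["d", "e"]}

binned1 = bin_alleles(pop1, bins)
binned2 = bin_alleles(pop2, bins)
print(binned1)
print(binned2)
```

Whatever binning rule is chosen, mixture genotypes have to be recoded with the same bins before analysis so that baseline and mixture alleles still correspond.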

  3. [IMPORTANT CHANGE] Adjust your baseline frequency files (*.bse) to give an absolute count of at least 1 to every allele. Unfortunately, the relative frequency adjustments previously suggested in this FAQ do not necessarily alleviate the sampling zero issue by themselves. Why not? SPAM converts relative baseline allele frequencies into absolute frequencies when reading in the baseline. The conversion process basically rounds (relative_freq times observed_sample_size) DOWN to the nearest integer. That means if you use an adjustment that gives an allele an absolute frequency of less than 1, then SPAM will internally convert that adjusted frequency back to zero.

    You can still employ one of the relative frequency adjustments listed below, BUT ONLY if accompanied by a simultaneous increase in the baseline sample size given in the baseline frequency file so that every allele has an absolute frequency of at least 1. Note that, in effect, this amounts to three choices:

    (a) add one to every observed absolute allele frequency and change the baseline sample size in the baseline frequency file to n+k, where k is the known number of alleles for the locus of interest.
    (b) add one to every observed absolute allele frequency of zero and change the baseline sample size to n+a, a < k, where a is the number of known alleles for the locus of interest that were not observed in this population's baseline sample.
    (c) use one of the relative frequency adjustments below and change the baseline sample size to n', where n' > n and is of sufficient size that every allele has an absolute frequency of at least one. The relevant n's are listed along with each adjustment. The examples below show that this is not an acceptable method (seriously overinflates precision).

    Note that choice (b) distorts the allele frequency distribution the most, choice (a) attempts to balance distortion of the overall allele frequency distribution against the potentially large impact on the baseline resampling of choice (c), and choice (c) distorts the allele frequency distribution the least but may require an implicit assumption of much more precisely known baseline allele frequencies (as n' will likely be much larger than n). One could try all three methods and compare the results to explore the impact of the adjustment choice.
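The two count-based choices can be sketched directly on absolute allele counts. This is a minimal illustration with hypothetical function names. The counts correspond to Population 1 in the binning example, assuming a hypothetical baseline of 100 diploids, and n is counted in gene copies here.

```python
def add_one_to_all(counts):
    """Add 1 to every allele count; the total grows from n to n + k."""
    adjusted = [c + 1 for c in counts]
    return adjusted, sum(adjusted)


def add_one_to_zeros(counts):
    """Add 1 only to unobserved alleles; the total grows from n to n + a."""
    adjusted = [c if c > 0 else 1 for c in counts]
    return adjusted, sum(adjusted)


# Alleles a..e for Population 1: 0.75, 0.15, 0.07, 0.03, 0.00 of 200 gene copies
counts = [150, 30, 14, 6, 0]

for name, adjust in (("add one to all", add_one_to_all),
                     ("add one to zeros", add_one_to_zeros)):
    adjusted, total = adjust(counts)
    rel = [round(c / total, 4) for c in adjusted]
    print(f"{name}: counts={adjusted}, sample size={total}, rel. freq.={rel}")
```

Running both adjustments on the same baseline and comparing the resulting mixture estimates is one way to explore how much the adjustment choice matters.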

    • Adjust all allele frequencies in every population to RelFreq_adj = (f + 1/k) / (n+1) (Titterington et al. 1981), where f is the observed number of allele copies in the population baseline sample, k is the observed number of different allele types in the collection of population baseline samples, and n is the number of non-missing gene copies in the population baseline sample (e.g., 2 x number of individuals in sample, if you have no missing data and are dealing with diploids).

      In order to guarantee each allele a minimum absolute frequency of 1 in SPAM, using this adjustment requires setting n' (the baseline sample size reported in the *.bse file) to n' = ceiling[k x (n+1) / (number of doses per individual)], where 'ceiling[y]' means smallest integer as big as or bigger than y.
      For example, assume you sampled 100 diploids for a 3 allele marker.
      Then you need to report an n' of at least ceiling[(3 x (201)) / 2] = ceiling[603/2] = ceiling[301.5] = 302. If k = 40, then n' = 4020.
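The arithmetic above can be checked with a short sketch. It assumes, as the n' formula implies, that the absolute count is the relative frequency times the reported sample size times the number of doses, rounded down; this mimics the rounding described in this FAQ and is not SPAM's actual code. Function names are hypothetical, and exact fractions are used so the rounding is not disturbed by floating-point error.

```python
import math
from fractions import Fraction


def adjusted_rel_freq(f, k, n):
    """RelFreq_adj = (f + 1/k) / (n + 1), kept as an exact fraction."""
    return Fraction(f * k + 1, k * (n + 1))


def min_reported_sample_size(k, n, doses=2):
    """n' = ceiling[k x (n + 1) / doses]."""
    return math.ceil(Fraction(k * (n + 1), doses))


def implied_count(rel_freq, reported_n, doses=2):
    """Absolute count after rounding down (assumed conversion rule)."""
    return math.floor(rel_freq * reported_n * doses)


n = 200  # gene copies from 100 diploids
for k in (3, 40):
    zero_allele = adjusted_rel_freq(0, k, n)  # allele not seen in the baseline
    n_prime = min_reported_sample_size(k, n)
    print(f"k={k}: n'={n_prime}, "
          f"count with original 100 = {implied_count(zero_allele, 100)}, "
          f"count with n' = {implied_count(zero_allele, n_prime)}")
```

With the original sample size of 100 individuals the adjusted zero allele rounds back down to a count of 0; only the inflated n' keeps it at 1.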

      Benefits:
      Relative allele frequency adjustment magnitude (1/k term) varies with allelic diversity of locus under consideration. Note that adjustment is at most 0.5/(n+1) (two alleles).
      Equal treatment of all allele frequencies.

      Drawbacks:
      Equal treatment of allele frequencies (underestimation bias is not equivalent for higher frequency alleles as it is for lower frequency alleles).
      Assumption of extreme precision in baseline allele frequency estimates due to magnitude of n'.

    • Rannala and Mountain (1997) provide a Bayesian allele frequency adjustment for use in calculating genotype probabilities. The allele frequency adjustment differs depending on whether the genotype being estimated is homozygous or heterozygous and so cannot properly be used to adjust the baseline allele frequencies for use in SPAM. Note that given the very low probability of observing a homozygote with the missing alleles, one could consider employing the following heterozygote genotype probability from eqn. 9 (for X_ijm = hg) in Rannala and Mountain 1997:
      RelFreq_adj = sqrt(2) x (f + 1/k) / sqrt((n+2) x (n+1))
      where f is the observed number of allele copies in the population baseline sample, k is the observed number of different allele types in the collection of population baseline samples, and n is the number of non-missing gene copies in the population baseline sample (e.g., 2 x number of individuals in sample, if you have no missing data and are dealing with diploids).

      In order to guarantee each allele a minimum absolute frequency of 1 in SPAM, using this adjustment requires setting n' (the baseline sample size reported in the *.bse file) to n' = ceiling[k x sqrt((n+2) x (n+1)) / {sqrt(2) x (number of doses per individual)}], where 'ceiling[y]' means smallest integer as big as or bigger than y.
      For example, assume you sampled 100 diploids for a 3 allele marker.
      Then you need to report an n' of at least ceiling[3 x sqrt(202 x 201) / {sqrt(2) x 2}] = ceiling[3 x sqrt(40602)/(2^1.5)] = ceiling[sqrt(9 x 40602 / 8)] = ceiling[sqrt(45677.25)] = ceiling[213.7...] = 214. If k = 40, then n' = ceiling[2849.6...] = 2850.
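The same check for the heterozygote-based adjustment can be sketched as follows. The function names are hypothetical, and the code only evaluates the two formulas given above; it does not reproduce anything from SPAM itself.

```python
import math


def rm_heterozygote_rel_freq(f, k, n):
    """RelFreq_adj = sqrt(2) x (f + 1/k) / sqrt((n + 2) x (n + 1))."""
    return math.sqrt(2) * (f + 1 / k) / math.sqrt((n + 2) * (n + 1))


def min_reported_sample_size(k, n, doses=2):
    """n' = ceiling[k x sqrt((n + 2) x (n + 1)) / (sqrt(2) x doses)]."""
    return math.ceil(k * math.sqrt((n + 2) * (n + 1)) / (math.sqrt(2) * doses))


n = 200  # gene copies from 100 diploids
for k in (3, 40):
    print(f"k={k}: adjusted frequency of an unobserved allele = "
          f"{rm_heterozygote_rel_freq(0, k, n):.6f}, "
          f"n' = {min_reported_sample_size(k, n)}")
```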

      Benefits:
      Relative allele frequency adjustment magnitude varies with allelic diversity of locus under consideration. Note that adjustment is at most sqrt(2) x 0.5 / sqrt((n+2) x (n+1)), or roughly 0.71/(n+1) (two alleles).
      Equal treatment of all allele frequencies.

      Drawbacks:
      Equal treatment of allele frequencies (underestimation bias is not equivalent for higher frequency alleles as it is for lower frequency alleles).
      Assumption of impossibility of homozygotes for missing alleles.
      Assumption of extreme precision in baseline allele frequency estimates due to magnitude of n'.

  4. Use Bayesian mixture analysis methods. Two Bayesian models for estimating baseline allele frequency distributions were introduced in Version 3.7 of SPAM: (1) Rannala-Mountain and (2) Pella-Masuda. Both models use Dirichlet posterior distributions but with different priors. Rannala and Mountain (1997) use an equal-probability prior distribution for the alleles at a locus, with mean frequency equal to one over the number of distinct alleles. That is, all alleles at a locus are assumed to be equally abundant in all stocks before the baseline samples become available. Pella and Masuda (2001) use a pseudo-Bayes method to determine the prior distribution of alleles at a locus. The baseline center, or unweighted arithmetic mean of the allele frequencies among stocks at a locus, is used as the mean of the prior distribution. In both models, the mean of the baseline posterior is an average of the observed allele frequencies and the prior means. As a result, all posterior means for the allele frequencies are positive, so that the absence of an allele from a stock's baseline sample implies only that the allele is rare and was missed in sampling, not that it is nonexistent.

    After the baseline allele frequencies are estimated as described above, the mixture analysis can proceed under the conditional maximum likelihood (CML) scheme (as in SPAM 3.7) or continue under the Bayesian context as shown by Pella and Masuda (2001).
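How a Dirichlet prior removes zeros can be sketched as follows. This is a generic Dirichlet-multinomial posterior mean, not SPAM's implementation: the prior weight of 1 and all names are assumptions chosen for illustration, and the counts come from the binning example with a hypothetical baseline of 100 diploids per population.

```python
def posterior_mean(counts, prior_mean, prior_weight=1.0):
    """Posterior mean allele frequencies under a Dirichlet prior whose
    parameters are prior_weight x prior_mean (prior_weight is assumed)."""
    n = sum(counts)
    return [(f + prior_weight * m) / (n + prior_weight)
            for f, m in zip(counts, prior_mean)]


def rel_freq(counts):
    n = sum(counts)
    return [c / n for c in counts]


pop1_counts = [150, 30, 14, 6, 0]   # alleles a..e, 200 gene copies
pop2_counts = [60, 120, 6, 0, 14]
k = len(pop1_counts)

# Equal-probability prior mean: 1/k for every allele
equal_prior = [1 / k] * k

# Baseline center: unweighted mean of allele frequencies among stocks
center_prior = [(p + q) / 2
                for p, q in zip(rel_freq(pop1_counts), rel_freq(pop2_counts))]

print("equal prior:  ", [round(x, 4) for x in posterior_mean(pop1_counts, equal_prior)])
print("center prior: ", [round(x, 4) for x in posterior_mean(pop1_counts, center_prior)])
```

With either prior mean, allele e receives a small positive frequency in Population 1 even though it was never observed there, and the adjusted frequencies still sum to one.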

References

Beacham, T. D., L. Margolis, and R. J. Nelson. 1998. A comparison of methods of stock identification for sockeye salmon (Oncorhynchus nerka) in Barkley Sound, British Columbia. NPAFC Bulletin 1: 227 - 239.

Beacham, T. D., R. E. Withler, and T. A. Stevens. 1996. Stock identification of chinook salmon (Oncorhynchus tshawytscha) using minisatellite DNA variation. Can. Jrnl. Fisheries and Aquatic Sciences 53: 380 - 394.

Paetkau, D., W. Calvert, I. Stirling, and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology 4:347-354.

Paetkau, D., L. P. Waits, P. L. Clarkson, L. Craighead, and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943-1957.

Pella, J., and M. Masuda. 2001. Bayesian methods for analysis of stock mixtures from genetic characters. Fishery Bulletin 99: 151-167.

Rannala, B., and J. L. Mountain. 1997. Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197-9201.

Reynolds, J. H. In preparation. Microsatellites, rare alleles, and mixed stock analysis: statistical concerns and suggestions.

Titterington, D. M., G. D. Murray, L. S. Murray, D. J. Spiegelhalter, A. M. Skene, J. D. Habbema, and G. J. Gelpke. 1981. Comparison of discrimination techniques applied to a complex data set of head injured patients (with discussion). J. Royal Statist. Soc., Series A. 144: 145 - 175.