Ploidy Inference

This vignette will cover infering ploidy of samples from NGS data for any levels of ploidy considered. If you just want to compare 2n and 3n, you can either use the method described here or the binomial likelihood based method described by Delomas (2019) https://doi.org/10.1111/1755-0998.13073 available in the function tripsAndDip within this package.

First, we load the tripsAndDipR package.

library(tripsAndDipR)

Now, you would load you read count data, but here we are just going to quickly simulate some data for two individuals, one 4n and one 5n.

# this data is simulated to be quick to produce, not to be a very good method of simulating NGS data
set.seed(7)
refCounts4n <- c(rbinom(6, 100, .01), rbinom(6, 100, .25), 
              rbinom(6, 100, .5), rbinom(6, 100, .75), rbinom(6, 100, .99))
refCounts5n <- c(rbinom(5, 100, .01), rbinom(5, 100, .2), 
              rbinom(5, 100, .4), rbinom(5, 100, .6), rbinom(5, 100, .8), rbinom(5, 100, .99))
refCounts <- rbind(refCounts4n, refCounts5n)
altCounts <- rbind(100 - refCounts4n, 100 - refCounts5n)
rownames(altCounts) <- rownames(refCounts) <- c("sample_4n", "sample_5n")
colnames(refCounts) <- colnames(altCounts) <- paste0("Locus_", 1:30)
refCounts[,1:5]
#>           Locus_1 Locus_2 Locus_3 Locus_4 Locus_5
#> sample_4n       4       1       0       0       0
#> sample_5n       1       1       0       0       1
altCounts[,1:5]
#>           Locus_1 Locus_2 Locus_3 Locus_4 Locus_5
#> sample_4n      96      99     100     100     100
#> sample_5n      99      99     100     100      99

Now we have two matrices, with rows being samples and columns being loci. One matrix, refCounts, has the counts of the reference allele and altCounts has the counts of the alternate allele. If a sample has no counts for a given locus, then both matrices will have 0 for that sample/locus.

Now, let’s say our samples could be 4n, 5n, or 6n. To use a beta-binomial mixture model with a uniform noise component, and assuming no allelic bias and all sequencing error rates of .01,

fp <- funkyPloid(refCounts, altCounts, ploidy = c(4,5,6), model = "BB_noise")
fp
#>         Ind Loci    LLR_4    LLR_5    LLR_6
#> 1 sample_4n   30 0.000000 2.964398 5.055826
#> 2 sample_5n   30 2.748234 0.000000 1.608657

Note that you can also have AIC and BIC values returned by adding the argument IC = TRUE for funkyPloid. The returned dataframe has the sample name, the number of loci with 1 or more read, and the log-likelihood ratios between the most likely model and the given model. The most likely model will have an LLR of 0 (we see that the most likely model was correct for both in this example). For strategies to deal with uncertainty and quantitatively evaluate the LLR’s, please see the manuscript describing the implemented method (currently submitted, vignette will be updated). If you want to eliminate the noise component or use the binomial mixture model, you can change the model argument. For using different values of allelic bias and sequencing error, see the h and eps arguments.

If you are interested in the actual parameter values of the fit models, not just the LLRs, you can

model_4n <- genoProps(refCounts, altCounts, ploidy = 4, model = "BB_noise")
model_5n <- genoProps(refCounts, altCounts, ploidy = 5, model = "BB_noise")
model_6n <- genoProps(refCounts, altCounts, ploidy = 6, model = "BB_noise")
# let's take a look at one of the ploidies
model_4n
#>         Ind Loci numIter       LLH     ref_0     ref_1     ref_2     ref_3
#> 1 sample_4n   30      29 -118.7384 0.1999973 0.1999721 0.2000308 0.1999733
#> 2 sample_5n   30      57 -124.5906 0.1666574 0.2459947 0.1796873 0.2411130
#>       ref_4        noise      tau_0        tau_1      tau_2      tau_3 tau_4
#> 1 0.1999977 2.866455e-05 0.01173965 0.0008656362 0.00000010 0.00000010 1e-07
#> 2 0.1663408 2.068465e-04 0.00000010 0.0242929061 0.04692081 0.05115934 1e-07

The columns are:

Ind: sample name
Loci: number of loci with 1 or more reads
numIter: the number of iterations of the EM algorithm
LLH: the log-likelihood of the model
ref_x: the weight of the component representing x copies of the reference allele
noise: the weight of the noise component
tau_x: the value of \(\tau\) (the overdisperison parameter) for the state of x copies of the reference allele

You can optionally have AIC and BIC calculated by adding the argument IC = TRUE for genoProps.