This vignette will cover infering ploidy of samples from NGS
data for any levels of ploidy considered. If you just want to compare 2n
and 3n, you can either use the method described here or the binomial
likelihood based method described by Delomas (2019) https://doi.org/10.1111/1755-0998.13073 available in the
function tripsAndDip
within this package.
First, we load the tripsAndDipR package.
Now, you would load you read count data, but here we are just going to quickly simulate some data for two individuals, one 4n and one 5n.
# this data is simulated to be quick to produce, not to be a very good method of simulating NGS data
set.seed(7)
refCounts4n <- c(rbinom(6, 100, .01), rbinom(6, 100, .25),
rbinom(6, 100, .5), rbinom(6, 100, .75), rbinom(6, 100, .99))
refCounts5n <- c(rbinom(5, 100, .01), rbinom(5, 100, .2),
rbinom(5, 100, .4), rbinom(5, 100, .6), rbinom(5, 100, .8), rbinom(5, 100, .99))
refCounts <- rbind(refCounts4n, refCounts5n)
altCounts <- rbind(100 - refCounts4n, 100 - refCounts5n)
rownames(altCounts) <- rownames(refCounts) <- c("sample_4n", "sample_5n")
colnames(refCounts) <- colnames(altCounts) <- paste0("Locus_", 1:30)
refCounts[,1:5]
#> Locus_1 Locus_2 Locus_3 Locus_4 Locus_5
#> sample_4n 4 1 0 0 0
#> sample_5n 1 1 0 0 1
altCounts[,1:5]
#> Locus_1 Locus_2 Locus_3 Locus_4 Locus_5
#> sample_4n 96 99 100 100 100
#> sample_5n 99 99 100 100 99
Now we have two matrices, with rows being samples and columns being
loci. One matrix, refCounts
, has the counts of the
reference allele and altCounts
has the counts of the
alternate allele. If a sample has no counts for a given locus, then both
matrices will have 0 for that sample/locus.
Now, let’s say our samples could be 4n, 5n, or 6n. To use a beta-binomial mixture model with a uniform noise component, and assuming no allelic bias and all sequencing error rates of .01,
fp <- funkyPloid(refCounts, altCounts, ploidy = c(4,5,6), model = "BB_noise")
fp
#> Ind Loci LLR_4 LLR_5 LLR_6
#> 1 sample_4n 30 0.000000 2.964398 5.055826
#> 2 sample_5n 30 2.748234 0.000000 1.608657
Note that you can also have AIC and BIC values returned by adding the
argument IC = TRUE
for funkyPloid
. The
returned dataframe has the sample name, the number of loci with 1 or
more read, and the log-likelihood ratios between the most likely model
and the given model. The most likely model will have an LLR of 0 (we see
that the most likely model was correct for both in this example). For
strategies to deal with uncertainty and quantitatively evaluate the
LLR’s, please see the manuscript describing the implemented method
(currently submitted, vignette will be updated). If you want to
eliminate the noise component or use the binomial mixture model, you can
change the model
argument. For using different values of
allelic bias and sequencing error, see the h
and
eps
arguments.
If you are interested in the actual parameter values of the fit models, not just the LLRs, you can
model_4n <- genoProps(refCounts, altCounts, ploidy = 4, model = "BB_noise")
model_5n <- genoProps(refCounts, altCounts, ploidy = 5, model = "BB_noise")
model_6n <- genoProps(refCounts, altCounts, ploidy = 6, model = "BB_noise")
# let's take a look at one of the ploidies
model_4n
#> Ind Loci numIter LLH ref_0 ref_1 ref_2 ref_3
#> 1 sample_4n 30 29 -118.7384 0.1999973 0.1999721 0.2000308 0.1999733
#> 2 sample_5n 30 57 -124.5906 0.1666574 0.2459947 0.1796873 0.2411130
#> ref_4 noise tau_0 tau_1 tau_2 tau_3 tau_4
#> 1 0.1999977 2.866455e-05 0.01173965 0.0008656362 0.00000010 0.00000010 1e-07
#> 2 0.1663408 2.068465e-04 0.00000010 0.0242929061 0.04692081 0.05115934 1e-07
The columns are:
You can optionally have AIC and BIC calculated by adding the argument
IC = TRUE
for genoProps
.