Load the numericalized marker matrix and genomic relationship matrix we made in Population Structure example:
load("Tuson.Rdata")
##Prepare phenotype data
Get the raw phenotypic data, skip the first 51 rows of the file, which have metadata
pheno = read.csv("doi_10.5061_dryad.8f64f__v1/teixeira_etal_rawdata.csv", skip = 51, header = T)
head(pheno)
## site site_no lat lon plot rep block set check gen_mean gen_slope
## 1 madiWI 1 43.05 -89.53 1735 1 1 1 NA C 4
## 2 madiWI 1 43.05 -89.53 1736 1 1 1 NA C 4
## 3 madiWI 1 43.05 -89.53 1737 1 1 1 NA C 4
## 4 madiWI 1 43.05 -89.53 1738 1 1 1 NA F 10
## 5 madiWI 1 43.05 -89.53 1739 1 1 1 NA E 8
## 6 madiWI 1 43.05 -89.53 1740 1 1 1 NA D 6
## family entry stand pd dta R.dtagdd0630sa dts R.dtsgdd0630sa asi
## 1 99 842_1 NA 20/05/2009 85 981.5000 87 1013.9444 2
## 2 32 176_7 NA 20/05/2009 85 981.5000 88 1030.5556 3
## 3 125 847_3 NA 20/05/2009 90 1060.8889 89 1045.0278 -1
## 4 256 871_3 NA 20/05/2009 81 920.3333 80 902.3333 -1
## 5 223 864_6 NA 20/05/2009 84 968.5000 83 955.6944 -1
## 6 160 854_10 NA 20/05/2009 84 968.5000 87 1013.9444 3
## R.asigdd0630sa plh erh peh tsw spl dta_min4 dta_min4_o
## 1 32.44444 258 168 90.0 NA NA 85 85
## 2 49.05556 266 185.2 80.8 NA NA 85 85
## 3 -15.86111 284 183 101.0 NA NA 90 90
## 4 -18.00000 267 156 111.0 NA NA 81 81
## 5 -12.80556 265 141 124.0 NA NA 84 84
## 6 45.44444 256 147 109.0 NA NA 84 84
## R.dtagdd0630sa_min4 R.dtagdd0630sa_min4_o dts_min4 dts_min4_o
## 1 981.5000 981.5000 87 87
## 2 981.5000 981.5000 88 88
## 3 1060.8889 1060.8889 89 89
## 4 920.3333 920.3333 80 80
## 5 968.5000 968.5000 83 83
## 6 968.5000 968.5000 87 87
## R.dtsgdd0630sa_min4 R.dtsgdd0630sa_min4_o asi_min4 asi_min4_o
## 1 1013.9444 1013.9444 2 2
## 2 1030.5556 1030.5556 3 3
## 3 1045.0278 1045.0278 -1 -1
## 4 902.3333 902.3333 -1 -1
## 5 955.6944 955.6944 -1 -1
## 6 1013.9444 1013.9444 3 3
## R.asigdd0630sa_min4 R.asigdd0630sa_min4_o plh_min4 plh_min4_o erh_min4
## 1 32.44444 32.44444 258 258 168
## 2 49.05556 49.05556 266 266 185.2
## 3 -15.86111 -15.86111 284 284 183
## 4 -18.00000 -18.00000 267 267 156
## 5 -12.80556 -12.80556 265 265 141
## 6 45.44444 45.44444 256 256 147
## erh_min4_o peh_min4 peh_min4_o tsw_min4 tsw_min4_o spl_min4 spl_min4_o
## 1 168 90.0 90.0 NA NA NA NA
## 2 185.2 80.8 80.8 NA NA NA NA
## 3 183 101.0 101.0 NA NA NA NA
## 4 156 111.0 111.0 NA NA NA NA
## 5 141 124.0 124.0 NA NA NA NA
## 6 147 109.0 109.0 NA NA NA NA
Notice that this file does not have genotype IDs that match the marker matrix. Instead it has gen_slope and entry names that if pasted together could match the marker data IDs.
head(pheno[c("gen_slope", "entry")])
## gen_slope entry
## 1 4 842_1
## 2 4 176_7
## 3 4 847_3
## 4 10 871_3
## 5 8 864_6
## 6 6 854_10
Here is an example of some of the sample IDs in geno.num:
tail(rownames(K))
## [1] "C10_877_5" "C10_877_8" "C10_878_1" "C10_878_2" "C10_878_4" "C10_878_5"
Let’s make a new column sampleID to match the rownames of geno.num
pheno = pheno %>% mutate(sampleID = paste0("C", gen_slope, "_", entry))
Most of these should have a match in the rownames of geno.num
head(pheno$sampleID %in% rownames(geno.num))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Some of the entries in the field experiment are check hybrids, so they will not be included in row.names of geno.num:
unique(pheno$sampleID)[! unique(pheno$sampleID) %in% row.names(geno.num)]
## [1] "CNA_CML254" "CNA_B97" "C6_853_4" "C6_853_1" "CNA_Ki14"
## [6] "C4_840_1" "CNA_B73"
These are check lines and a few families from the experiment that failed genotyping, so this look OK.
Now we need to summarize the field data for trait ‘dts’ (days to silking) to get a mean value for each entry across environments and field blocks to use in our genomic prediction models. We want adjusted means, or best linear unbiased estimators (BLUEs) to account for the differences in field effects due to blocks and environments in the face of missing data resulting in unbalanced data.
Let’s fit the linear model:
dts = mu + site + rep(site) + block(rep:site) + entry + entry:site + residual
We will fit site, rep, and block as random effects and genotypes as fixed effects
dts.mod = lmer("dts ~ 1 + (1|site) + (1|rep:site) + (1|block:rep:site) + sampleID + (1|sampleID:site)", pheno)
A quick check on the variance components…note that lmer by default reports ‘standard deviations’, i.e., the SQUARE ROOTS of the variance components:
summary(dts.mod)$varcor
## Groups Name Std.Dev.
## sampleID:site (Intercept) 2.51329
## block:rep:site (Intercept) 0.62546
## rep:site (Intercept) 0.57371
## site (Intercept) 10.46608
## Residual 2.27659
Get the BLUEs using emmeans
dts.blues = summary(emmeans(dts.mod, "sampleID"))
## Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
## To enable adjustments, add the argument 'pbkrtest.limit = 4565' (or larger)
## [or, globally, 'set emm_options(pbkrtest.limit = 4565)' or larger];
## but be warned that this may result in large computation time and memory use.
## Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
## To enable adjustments, add the argument 'lmerTest.limit = 4565' (or larger)
## [or, globally, 'set emm_options(lmerTest.limit = 4565)' or larger];
## but be warned that this may result in large computation time and memory use.
Summarize the means by Cycle population
dts.blues = dts.blues %>%
mutate(Cycle = str_extract(sampleID, "^C[0-9]+"),
Cycle = ifelse(is.na(Cycle), "Check", Cycle)) %>%
rename(dts = emmean)
dts.blues %>% group_by(Cycle) %>% summarize(Mean_DTS = mean(dts))
## # A tibble: 7 x 2
## Cycle Mean_DTS
## <chr> <dbl>
## 1 C0 89.9
## 2 C10 70.3
## 3 C2 82.1
## 4 C4 77.1
## 5 C6 74.3
## 6 C8 72.7
## 7 Check 85.5
Unfortunately, they are out of order, but you can see that the days to silking reduced by several days between each two generations of selection.
One other quick thing to check - what is the heritability of the mean values? We want to know this becacuse it helps to understand how high the genomic prediction accuracy can be. We can get at that by fitting a separate model where sampleID is a random effect and then constructing a ratio of genetic variance to other variances.
dts.mod.rand = lmer("dts ~ 1 + (1|site) + (1|rep:site) + (1|block:rep:site) + (1|sampleID) + (1|sampleID:site)", pheno)
varcomps = as.data.frame(VarCorr(dts.mod.rand))
Vg = varcomps[varcomps$grp == "sampleID", "vcov"]
Vge = varcomps[varcomps$grp == "sampleID:site", "vcov"]
Verr = varcomps[varcomps$grp == "Residual", "vcov"]
The heritability of mean values involves the residual variance divided by the number of plots per entry and the GxE variance divided by the number of sites each entry was tested at. The data are not perfectly balanced, but it’s pretty close, so we can just get the average values of these to use in the heritability calculation:
pheno2 = pheno %>% filter(is.na(check)) #drop the checks
pheno2 %>% group_by(sampleID) %>%
summarize(N = n()) %>% summary()
## sampleID N
## Length:297 Min. : 9.00
## Class :character 1st Qu.:18.00
## Mode :character Median :18.00
## Mean :17.27
## 3rd Qu.:18.00
## Max. :18.00
We’ll use a value of 17 in our calculation (should really use the harmonic mean, but this is for quick demonstration purposes).
pheno2 %>% group_by(site, sampleID) %>%
summarize(r = n()) %>%
group_by(sampleID) %>%
summarize(e = n()) %>%
summary()
## `summarise()` has grouped output by 'site'. You can override using the `.groups` argument.
## sampleID e
## Length:297 Min. :9
## Class :character 1st Qu.:9
## Mode :character Median :9
## Mean :9
## 3rd Qu.:9
## Max. :9
All of the entries were evaluated in exactly 9 environments.
So, we want to estimate H = Vg/(Vg + Vge/9 + Verr/17). This is approximate, for formal publication purposes, there are better ways to do this. See the R inti package for some nice functions to do it properly.
Vg/(Vg + (Vge/9) + (Verr/17))
## [1] 0.9725352
The mean heritability is VERY high! So we should be able to get accurate genomic predictions for this trait.
##GBLUP model
We can fit a linear mixed model to the trait data like this:
\[Y_i = \mu + G_{i} + \epsilon_{i}\] Now Y is the vector of the BLUEs and we model the genotype (sampleID) effects as random. This model looks weird because how can we fit both a genotype effect AND a residual effect for that genotype with no information on replicated observations in the BLUEs? If you fit this model assuming the default IID variance structure among the G effects (each genotype is an independent effect), it will not work because there is no way to separate G from \(\epsilon\).
The trick about the GBLUP model is that we can make this model work even without replicated data because we can move away from the assumption that all of the \(G_{i}\) are independent to a new model where the \(G_{i}\) have covariances proportional to their realized additive genomic relationships:
\[G_{i} \sim N(0, \mathbf{K}\sigma^{2}_{A})\] Recall that the K matrix elements indicate the additive genetic relationships between pairs of individuals. Those additive relationships times the additive variance equal the additive genetic covariance for each pair. So, we can fit a mixed model that includes the constraint that the matrix of genotypic effect covariances is proportional to the K matrix, and now the residual effects for each genotype reflect the difference between the observed BLUE and the effect estimated under this model (in other words, the deviation of the phenotype mean from the additive genetic value, which could include experimental error effects, non-additive genetic effects, true additive effects that are not modelled well by the K matrix, and so forth).
The GBLUP model gets its name from Genomic Best Linear Unbiased Predictors, which are analogous to fixed effect BLUEs but are the random effect predictions. Notice that this model can be computationally efficient because we have already summarized thousands of marker genotype scores into a matrix that has dimensions equal to the number of individuals (much much less than the number of markers).
Does this model make sense? It does make sense if the trait is controlled by many genes, each with equal (and therefore, small) effects. In this way, each marker provides equal information on the overall genomic relationships being modelled. If you have lots of markers distributed randomly throughout the genome, you will capture the true underlying relationships reasonably well. In practice, this model in fact performs robustly for most quantitative traits. You can ‘train’ the model on one set of individuals with observed phenotypes, then predict the values of other genotypes that are included in the relationship matrix even although they do not have phenotypic observations. The model will predict these ‘new’ individuals as having phenotypes similar to the individuals in the training set that they are most related to.
Let’s fit a basic GBLUP model with the mixed models sommer package. Use the BLUEs as ‘y’ values (dropping the checks and any other entries not included in the K matrix). Let’s also mask the trait data for a random 20% of the lines, get their genomic predictions, and compare those to observed means.
set.seed(1)
dts.blues2 = dts.blues %>% filter(sampleID %in% row.names(K)) %>%
mutate(sampleID = as.character(sampleID))
train = sample(dts.blues2$sampleID, round(0.8*nrow(dts.blues2)))
test = dts.blues2$sampleID[!dts.blues2$sampleID %in% train]
dts.blues2 = dts.blues2 %>%
mutate(dts.train = ifelse(sampleID %in% train, dts, NA))
sommer.mod <- mmer(dts.train ~ 1,
random = ~vs(sampleID, Gu=K),
data = dts.blues2)
## Adding additional levels of Gu in the model matrix of 'sampleID'
## iteration LogLik wall cpu(sec) restrained
## 1 -84.474 15:46:6 0 1
## 2 -28.1732 15:46:6 0 1
## 3 -28.1732 15:46:6 0 1
## 4 -28.1732 15:46:6 0 1
#get the GBLUPs and subset to only the test set
gblups = sommer.mod$U$`u:sampleID`[[1]] #these are deviations from the overall mean (so centered on zero)
gblups.test = data.frame(sampleID = names(gblups[test]), dts.gblup = gblups[test])
observed.test = right_join(dts.blues2, gblups.test, by = "sampleID")
Before we evaluate the test set predictions, first we can look at the model fit within the training set itself. Here are the genetic and residual variance components:
sommer.mod$sigma
## $`u:sampleID`
## dts.train
## dts.train 18.16546
##
## $units
## dts.train
## dts.train 0
You can see that the model fits the training set observations ‘perfectly’. There is no error variance. This is not typical, but due to the high heritability and good marker coverage of this example. The main point is that even when we have a perfect fit within the training set, we should not expect that the model will predict the held-out test individuals as well.
Here is the correlation of observed and predicted values within the test set:
cor(observed.test[c("dts", "dts.gblup")])
## dts dts.gblup
## dts 1.0000000 0.8496666
## dts.gblup 0.8496666 1.0000000
It’s pretty good!
Plot the observed values against their predictions:
plot(observed.test[c("dts.gblup", "dts")])
If you want to get ‘marginal predictions’ in terms of actual days to flowering, we can add on the intercept from the model fit, as the gblups are deviations from that value.
mu = sommer.mod$Beta$Estimate
observed.test$dts.gblup = observed.test$dts.gblup + mu
plot(observed.test[c("dts.gblup", "dts")])
There is a small difference between the predicted mean and the observed mean in the training set
print("Mean observed:")
## [1] "Mean observed:"
print(mean(observed.test$dts))
## [1] 76.74506
print("Mean predicted:")
## [1] "Mean predicted:"
print(mean(observed.test$dts.gblup))
## [1] 77.0893
and also in the dispersion of the predicted vs observed values:
lm(dts ~ dts.gblup, data = observed.test)
##
## Call:
## lm(formula = dts ~ dts.gblup, data = observed.test)
##
## Coefficients:
## (Intercept) dts.gblup
## -9.224 1.115
The slope of the regression is > 1, indicating that the observed values have a bit more variation than the predicted values due to shrinkage in the predictions. For the purposes of selection, the prediction ability (the correlation between predicted and observed values) is what matters, but for other purposes like actually predicting the absolute value of performance, the bias in the intercept and slope of the predictions can have a big impact.
We can also fit a model that includes an effect for each individual marker:
\[Y_i = \mu + \sum_{k=1}^n m_{ik}\beta_k + \epsilon_{i}\] where, now we are estimating the effects of each marker k (as \(\beta_k\)) and then for each individual, we sum over all markers the product of its numerical marker score (like 0, 1, or 2) times the marker effect.
Like the GBLUP model, at first glance this model seems badly overfitted. How can we model the effects of thousands of markers when we only have a few hundred observations? With ordinary least squares, we cannot.
The solution is to use ridge regression, which imposes ‘regularization’ on the marker effects so that when they are combined: their combined variation is forced to be no larger than the observed genotype variation. What happens is that their effect estimates (\(\beta_k\)) get ‘squished’ strongly toward zero. We are not trying to get good estimates of individual marker effects, instead, we are trying to get good estimates of the sum of their combined effects, and the ridge regression model allows this to happen.
We won’t go into any detail here, but the ridge regression model is widely used in machine learning and predictive modelling, so it’s good to read up on and learn more about. Here we will just demonstrate its use in the rrBLUP package in R. The idea behind the genomic predictions using this model is that we train the model on a set of individuals with both marker data and phenotypes to estimate the \(\beta_k\) effects for each marker. Then we can predict the values in a new set of individuals based on their known marker scores times these previously estimate \(\beta_k\) effects. That will give us the ‘ridge regresion BLUPs’.
geno.num2 = geno.num[dts.blues2$sampleID,]
rr.mod = mixed.solve(dts.blues2$dts.train, Z = geno.num2)
Here is the estimated variance for each marker:
rr.mod$Vu
## [1] 0.005678617
Since the model fits all markers with equal variance, we can relate this to the overall genetic variance as the product of the variance of marker scores (\(m^2_{ik}\)) times the variance of marker effects (\(\beta^2_{k}\)) summed over markers:
#average marker variance
marker.var = apply(geno.num2, 2, var)
sum(marker.var)*rr.mod$Vu
## [1] 18.30662
Compare this to the training set genetic variance from the GBLUP model:
sommer.mod$sigma$`u:sampleID`
## dts.train
## dts.train 18.16546
They are very very close.
Here is the residual variance in training set from rrBLUP:
rr.mod$Ve
## [1] 3.872823e-07
Effectively, this is zero, as we saw for the GBLUP model as well.
Compare the observed to rrBLUP predicted values in test set
betas = rr.mod$u
rr.preds = geno.num[test,]%*%betas
rr.preds = data.frame(sampleID = rownames(rr.preds), dts.rrBLUP = rr.preds)
observed.test = merge(observed.test, rr.preds, by = "sampleID")
cor(observed.test[c("dts", "dts.rrBLUP")])
## dts dts.rrBLUP
## dts 1.0000000 0.8496666
## dts.rrBLUP 0.8496666 1.0000000
Compare the GBLUPS to the rrBLUPs:
cor(observed.test[c("dts.gblup", "dts.rrBLUP")])
## dts.gblup dts.rrBLUP
## dts.gblup 1 1
## dts.rrBLUP 1 1
The GBLUPs and the rrBLUPs in the test set are perfectly correlated. The two models are doing the same thing in different ways. GBLUP estimates individual effects based on their marker relationships, whereas rrBLUP estimates marker effects constrained to have variance equal to the observed genotypic variance.
(by the way you can also fit a simple GBLUP model in rrBLUP as well…but I showed the sommer package because it has more flexibility for more complex mixed models so is worth knowing about)
##Bayesian Prediction Models The GBLUP and rrBLUP models both assume that all markers have equally small effects. This assumption makes these models relatively easy to compute, and in practice they tend to be generally good models so are widely used.
However, you may have a trait/population where you have good reason to think that the genetic control of the trait may be influenced by a mixture of some large-effect genes in addition to small-effect polygenes. The more a trait is influenced by large-effect variants, the farther away we move from the assumption that all genes have equal effects, and we want some way to include differences in QTL effects in our models.
The situation is pretty hopeless from a regular least squares or mixed models framework, we just don’t have enough information to reliably estimate a large number of marker effects simultaneously. This is a situation where Bayesian analysis becomes very useful. A Bayesian model posits a prior distribution of marker effects (which could include equal effects, but can also include a mixture of large and small effects). Given this prior distribution, one can take a random sample of effects from the distribution, and once that specific sample of effects is ‘known’, then the observed data can be fit to that model and all of the marker effects estimated. It’s analogous to how the ridge regression model allows us to estimate all of the marker effects by constraining how much total variation they can explain. Here, we constrain the marker effects by how much total variation they can explain and by forcing their effects to match the sampled distribution.
Then, if we repeat this sampling from the prior distribution many thousands of times, we can summarize over the many iterations of the analysis to get a posterior distribution of effect estimates. Markers that regularly get fitted with higher weights than others are probably picking up the effects of nearby large-effect QTL and modeling them more accurately than a ridge regression model. The details of Bayesian analysis are quite complicated, so you want to be careful fitting these models unless you have some idea of what you are doing, but the BGLR package at least makes the computations easy for you, and the documentation is excellent (https://github.com/gdlc/BGLR-R and https://www.genetics.org/content/198/2/483).
There are many different prior distributions that can be used (https://doi.org/10.1534/genetics.112.147983), and it pays to think about what is reasonable for your particular situation. Here, we will use the Bayes C pi model. The idea here is that there is a proportion of markers (pi) that will have zero effect on the trait, and the remaining markers will explain the observed variation, and some of those effects can be larger than others. This assumption results in fitting some markers with larger effects (by pushing many markers to zero effect, the remaining markers can ‘explain’ more variation). This model requires us to specify what the prior distribution of pi is, and of course we don’t actually know that, but we can choose priors with low confidence, so that the model explores a wide range of distributions and therefore is not overly influenced by our prior specification.
We are going to fit the BayesC model with default prior distribution.
probIn is the probability that a marker gets included in the model, default = 0.5 Counts refers to the confidence in the prior distribution, default = 10, which indicates not very much confidence. Something like 200 would put a lot more weight on the prior distribution, meaning that as the model iterates, it’s going to stick closely to the probIn value.
You can change the values from the defaults and see what happens. Also, in practice, it’s better to run more iterations, try nIter = 20000 and burnIn = 5000. That will take longer, of course.
bayesc.mod = BGLR(y = dts.blues2$dts.train,
ETA = list(list(X = geno.num2, model = 'BayesC')),
nIter = 5000,
burnIn = 1000,
saveAt = 'bc_',
verbose = FALSE)
First let’s plot the posterior distribution of squared marker effects:
bHat<- bayesc.mod$ETA[[1]]$b
plot(bHat^2, ylab='Estimated Squared-Marker Effect',
type='o',cex=.5,col=4,main='Marker Effects')
Let’s check the markers with largest + effects:
head(bHat[order(bHat, decreasing = T)])
## PZE.101214474 SYN32894 SYN15434 SYN5809 PZE.106064564
## 0.04025512 0.03970109 0.03821018 0.03652601 0.03622947
## PZE.105027872
## 0.03584673
head(bHat[order(bHat)])
## PZE.102181199 SYN19549 SYN29958 PZE.107076850 SYN14623
## -0.04350868 -0.04239603 -0.03879730 -0.03821735 -0.03788042
## PZE.102038453
## -0.03768078
The largest marker variance is on chromosome 2 (PZE.102…), and you can see it as the highest peak on the plot of marker variances against their index.
How well did this model predict the test set?
preds.bayesc = bayesc.mod$yHat
names(preds.bayesc) = dts.blues2$sampleID
test.bayesc = preds.bayesc[test]
test.bayesc = data.frame(sampleID = names(test.bayesc), dts.bayesc = test.bayesc)
observed.test = merge(observed.test, test.bayesc, by = "sampleID")
cor(observed.test[c("dts", "dts.bayesc")])
## dts dts.bayesc
## dts 1.0000000 0.8636056
## dts.bayesc 0.8636056 1.0000000
It’s a little bit better than the GBLUP model, and this is a case where we know the genetic architecture is a mix of some large effect variants plus polygenic effects.
##Effect of population structure on genomic predictions So far in this example, we have used a random partition of the families into test and training sets. This is fine for getting started and evaluating models and data. But it may not reflect how you want to use genomic prediction in reality. In a real breeding program, we often have historical data on previous generations of a breeding program and want to use that information to predict a future generation. Otherwise, we may have to rely on collecting training phenotypes on our current populations, which we know will be effective but will cost significantly more time.
Let’s model a more realistic situation where we use Cycles 0 - 8 as our training set and ask how well that training set predicts the last generation of lines (cycle 10). We are going to re-define train and test vectors to do this.
train = dts.blues2[dts.blues2$Cycle != "C10", "sampleID"]
test = dts.blues2[dts.blues2$Cycle == "C10", "sampleID"]
print("Training set size C0 - C8:")
## [1] "Training set size C0 - C8:"
print(length(train))
## [1] 239
print("Test set size C10:")
## [1] "Test set size C10:"
print(length(test))
## [1] 55
The training set size is still around 80%. But now we are hiding all individuals from C10 from the training set.
dts.blues2 = dts.blues2 %>%
mutate(dts.train = ifelse(sampleID %in% train, dts, NA))
sommer.mod <- mmer(dts.train ~ 1,
random = ~vs(sampleID, Gu=K),
data = dts.blues2)
## Adding additional levels of Gu in the model matrix of 'sampleID'
## iteration LogLik wall cpu(sec) restrained
## 1 -86.7154 15:46:53 0 1
## 2 -36.3036 15:46:53 0 1
## 3 -36.3036 15:46:53 0 1
## 4 -36.3036 15:46:53 0 1
#get the GBLUPs and subset to only the test set
gblups = sommer.mod$U$`u:sampleID`[[1]] #these are deviations from the overall mean (so centered on zero)
gblups.test = data.frame(sampleID = names(gblups[test]), dts.gblup = gblups[test])
observed.test = right_join(dts.blues2, gblups.test, by = "sampleID")
cor(observed.test[c("dts", "dts.gblup")])
## dts dts.gblup
## dts 1.0000000 0.2931627
## dts.gblup 0.2931627 1.0000000
Wow! You can see that we really killed the prediction ability by not included individuals from the last cycle when we predict that last cycle. This is a very important aspect that you need to remember: the training set MUST be representative of the individuals you want to perform selection on!
mu = sommer.mod$Beta$Estimate
observed.test$dts.gblup = observed.test$dts.gblup + mu
plot(observed.test[c("dts.gblup", "dts")])
print("Mean observed:")
## [1] "Mean observed:"
print(mean(observed.test$dts))
## [1] 70.28471
print("Mean predicted:")
## [1] "Mean predicted:"
print(mean(observed.test$dts.gblup))
## [1] 72.73037
And we overestimate the mean a bit more in this case. Remember that the C10 population was about 2 days earlier than C8, our model is not capturing all of that change very well.