Data are taken from Wisser et al 2019. Genetics https://www.genetics.org/content/213/4/1479
34k SNP array markers on ~381 outbred individual plants sampled from different generations of selection for early flowering in a tropical maize population.
https://datadryad.org/stash/dataset/doi:10.5061/dryad.q573n5tdt
From the README:
Quality controlled genotype data (45,718 variant sites) for samples from Hallauer’s Tusón. Tab delimited file. A header row is included. The first column (“label”) lists sample names, the second column (“popdata”) lists the generation to which each sample belongs, and the remaining columns correspond to the genotype data. Unphased diploid genotype calls are recorded in the following format: 1/1. Variant encoding: 1=A, 2=C, 3=G, 4=T, 5=deletion, 6=insertion (5 and 6 are used for the ZmCCT10_CACTA locus). Missing genotype calls are encoded as NA.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
path = "Q:/My Drive/Teaching/Teaching_videos"
geno = read.table(file.path(path, "doi_10.5061_dryad.q573n5tdt__v2", "HT381_QC47518_gtype.txt.gz"), header = T, sep = '\t')
What does this data set look like?
#str(geno)
geno[1:3,1:6]
## label popdata PZE.101000060 PZE.101000088 PZE.101000083 PZE.101000108
## 1 C0_062314_002 0 2/1 <NA> 2/3 3/1
## 2 C0_062314_003 0 2/2 3/3 2/2 3/3
## 3 C0_062314_004 0 2/1 3/3 2/2 3/3
Notice that the first column is a label for the individual, and the second column indicates which generation of selection (sub-population) it belongs to.
There are numerous ways to filter and manipulate SNP data sets. bcftools (http://www.htslib.org/doc/1.0/bcftools.html) has many functions that can very efficiently filter SNP data sets in the standard variant call format (VCF files, https://github.com/samtools/hts-specs). It runs on Linux and is good to know if you will be working with SNP data sets.
TASSEL software (https://www.maizegenetics.net/tassel) is also very flexible and has many powerful tools for manipulating and analyzing SNP data in various formats, including VCF and HapMap. It has a GUI with point-and-click menus that make getting started easy, but the drawback is that you may forget what commands you used to manipulate a data set, making your work non-reproducible. Writing analysis scripts is therefore highly recommended, so you can later describe accurately what you did, and also easily modify the methods or apply the same approaches to other data sets. One option is rTASSEL (https://github.com/maize-genetics/rTASSEL), which allows you to write R scripts that execute TASSEL analyses.
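As a flavor of what rTASSEL scripting looks like, here is a minimal sketch; it assumes rTASSEL and its Java dependencies are installed, the function names are taken from the rTASSEL documentation, and the file path is hypothetical:
# library(rTASSEL)
# startLogger() # rTASSEL logs its operations to a file
# tasGeno = readGenotypeTableFromPath(path = "my_genotypes.vcf")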
This example data set is a bit trickier because it’s not in a standard VCF or hapmap format. This also happens pretty frequently, so it’s also good to have some capability for writing general data handling scripts in R or Python (or whatever language). Here we show some basic ways of manipulating a data set with R.
Let’s do some strong filtering against missing data, dropping lots of markers, mainly because I want a smaller marker data set for demonstration.
missing = colSums(is.na(geno[,-c(1:2)]))/nrow(geno) # proportion of missing calls per marker, skipping the two label columns
hist(missing)
keep.columns = missing == 0 # keep only markers with no missing data
sum(keep.columns)
## [1] 15955
geno2 = geno[, c(FALSE, FALSE, keep.columns)] # the two FALSEs drop the label and popdata columns
dim(geno2)
## [1] 381 15955
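As an aside, for a real analysis you would usually allow some missingness rather than requiring none. A sketch, where the 10% cutoff and the name keep.loose are arbitrary choices of mine:
keep.loose = missing < 0.10 # keep markers with fewer than 10% missing calls
sum(keep.loose) # how many markers would survive this gentler filter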
Numericalize the genotypes to minor allele counts: 0, 1, or 2. Here is a hacky DIY way to do it. It will fail if there are more than 3 genotypic classes per locus, so first we make a helper function to identify loci with exactly 3 classes; we will keep only those before numericalizing.
classes3 = function(x){
  length(table(x)) == 3 # TRUE if the marker has exactly 3 genotypic classes
}
keep.columns2 = apply(geno2, 2, classes3)
geno3 = geno2[keep.columns2]
dim(geno3)
## [1] 381 11864
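As a quick sanity check (my addition), we can tabulate how many genotypic classes the dropped markers had; they should have 1 class (monomorphic), 2 classes (one class absent from the sample), or 4+ classes (more than two alleles):
table(apply(geno2[, !keep.columns2], 2, function(x) length(table(x))))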
Now create and apply a function to numericalize the genotype calls
numericalize = function(x){
  geno.freqs = as.matrix(table(x)) # counts of each genotype class at this locus
  labels = row.names(geno.freqs)
  split.labels = strsplit(labels, "/", fixed = T) # split "a/b" calls into the two alleles
  which.het = sapply(split.labels, FUN = function(x) x[1] != x[2])
  het.label = labels[which.het] # the heterozygous class
  homoz.freqs = geno.freqs[rownames(geno.freqs) != het.label,]
  homoz.labels = labels[labels != het.label]
  minor.label = homoz.labels[which.min(homoz.freqs)] # the rarer homozygote carries the minor allele
  major.label = homoz.labels[homoz.labels != minor.label]
  translator = c(0, 1, 2) # major homozygote = 0, het = 1, minor homozygote = 2
  names(translator) = c(major.label, het.label, minor.label)
  as.numeric(translator[x]) # look up each genotype call and return its count
}
Try the function on the first two columns
checkit = apply(geno3[,1:2], 2, FUN = numericalize)
checkit[1:15,]
## PZE.101000108 PZE.101000167
## [1,] 1 1
## [2,] 0 0
## [3,] 0 0
## [4,] 0 1
## [5,] 0 0
## [6,] 0 1
## [7,] 0 1
## [8,] 0 0
## [9,] 0 1
## [10,] 0 0
## [11,] 0 0
## [12,] 0 1
## [13,] 0 0
## [14,] 0 2
## [15,] 0 0
Compare to original data
as.matrix(geno3[1:15,1:2])
## PZE.101000108 PZE.101000167
## 1 "3/1" "3/1"
## 2 "3/3" "3/3"
## 3 "3/3" "3/3"
## 4 "3/3" "3/1"
## 5 "3/3" "3/3"
## 6 "3/3" "3/1"
## 7 "3/3" "3/1"
## 8 "3/3" "3/3"
## 9 "3/3" "3/1"
## 10 "3/3" "3/3"
## 11 "3/3" "3/3"
## 12 "3/3" "3/1"
## 13 "3/3" "3/3"
## 14 "3/3" "1/1"
## 15 "3/3" "3/3"
Now apply the function to all of the columns
geno.num = apply(geno3, 2, FUN = numericalize)
str(geno.num)
## num [1:381, 1:11864] 1 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:11864] "PZE.101000108" "PZE.101000167" "PZE.101000169" "PZE.101000431" ...
Note that the column names represent the marker names. We should also re-attach the individual sample IDs as row names of this matrix:
row.names(geno.num) = geno$label
Compare the genotype frequencies of the original and recoded scores for the 6th SNP (which is a little weird because it has an excess of heterozygotes; it’s always good to check these ‘edge cases’):
table(geno.num[, 6])
##
## 0 1 2
## 150 189 42
table(geno3[, 6])
##
## 1/1 1/3 3/3
## 150 189 42
One more check is to verify that we have not introduced any NAs during the recoding
any(is.na(geno.num))
## [1] FALSE
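One more optional cross-check, a sketch I added with is.het as a hypothetical helper: the total number of heterozygous calls should be identical under both codings.
is.het = function(x){ # TRUE where the two alleles of a call differ
  alleles = strsplit(x, "/", fixed = TRUE)
  sapply(alleles, function(a) a[1] != a[2])
}
sum(geno.num == 1) == sum(apply(geno3, 2, is.het)) # expect TRUE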
Notice some nice properties of this numericalized data matrix:
1. Since the genotypic scores are now COUNTS of minor alleles in each individual, the minor allele frequencies are just the mean of the counts divided by two.
2. The correlations among samples are a measure of the genomic relationships among individuals. With proper scaling, we can estimate the realized additive genetic relationships among the individuals.
3. The correlations among markers indicate similarity of information from marker pairs. This can be related to linkage disequilibrium for highly inbred populations or for gametic data, but not directly for outbred populations.
We provide examples of each of these properties below:
maf = colMeans(geno.num)/2
hist(maf)
This histogram is another good check on recoding our data. The maximum observed minor allele frequency is 0.5. If minor allele frequencies > 0.5 were observed, it would mean that we coded the minor allele incorrectly.
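We can turn that visual check into one line of code (my addition):
stopifnot(all(maf <= 0.5)) # errors if any marker was recoded incorrectly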
We can also compute minor allele frequencies by each generation separately and use that information to find markers that have changed allele frequency over generations:
generations = geno[,2]
maf0 = colMeans(geno.num[generations == 0,])/2
maf10 = colMeans(geno.num[generations == 10,])/2
maf.dif = maf10 - maf0
hist(maf0)
Notice something here: we have no MAF > 0.5 in the full data set, but when we split the data set by generation and recalculate the minor allele frequencies, some of them are > 0.5. This can happen because the definition of the minor allele depends on the sample. If you change the sample or change the reference population, then which allele is minor can change.
Check the histogram of maf after 10 cycles of selection for earlier flowering:
hist(maf10)
Here is the histogram of differences in allele frequency between cycle 10 and cycle 0 across SNPs:
hist(maf.dif)
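A natural follow-up (my addition) is to list the markers whose frequencies changed the most; these are candidates for having responded to selection, although drift also moves allele frequencies:
head(sort(abs(maf.dif), decreasing = TRUE), 10) # ten largest absolute changes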
The correlations among individuals represent genomic relationships. Individuals with more similar marker profiles have higher correlation, indicating they share more alleles:
ind.cor = cor(t(geno.num))
dim(ind.cor)
## [1] 381 381
The individuals are sorted by generation, so if allele frequencies changed over generations, we expect to see somewhat higher relationships within than among generations (a sign of population structure):
library(RColorBrewer)
heatmap(ind.cor, Rowv = NA, Colv = NA, symm = T, col = colorRampPalette(brewer.pal(8, "Blues"))(8))
It’s pretty subtle, but you can see a little extra correlation among individuals in the first generation (lower left; the individuals are ordered by generation). There are also some pairs of individuals within that first generation that are more distant than expected. This was a surprise until we learned that the initial generation was created by open-pollination of several accessions that was not random, so there was assortative mating, and you can see that sub-structure within generation 0 here.
The covariances among the individuals, scaled by the summed expected heterozygosity of the markers, give the so-called VanRaden relationship matrix (https://www.sciencedirect.com/science/article/pii/S0022030208709901), which is scaled to the additive genetic covariances among the individuals. For example, on average we expect outbred full-sibs to have an additive relationship of 0.5, but individual pairs of full-sibs vary around that average (I am a little more closely related to some of my brothers than others, by chance). To get the scaling right we need to center the markers first. Here we compute the realized additive relationship matrix with matrix algebra, so you can see the calculations involved, but you can also use TASSEL or a package like AGHmatrix to compute the matrix from a SNP data set without having to numericalize it first.
Z = scale(geno.num, center = T, scale = F) # center each marker at its mean
ZZpr = Z%*%t(Z) # cross-products among individuals
denom = 2*sum(maf*(1 - maf)) # sum of expected marker heterozygosities
K = ZZpr/denom
#attach the sample IDs as row names
rownames(K) = geno$label
colnames(K) = geno$label
K[1:5, 1:5]
## C0_062314_002 C0_062314_003 C0_062314_004 C0_062314_005
## C0_062314_002 1.43550210 -0.055409128 0.02564260 0.055270358
## C0_062314_003 -0.05540913 1.486575726 -0.10074673 -0.006097087
## C0_062314_004 0.02564260 -0.100746735 0.71430211 0.010870568
## C0_062314_005 0.05527036 -0.006097087 0.01087057 0.990505663
## C0_062314_006 -0.02621705 0.675172302 -0.11531939 0.001212625
## C0_062314_006
## C0_062314_002 -0.026217050
## C0_062314_003 0.675172302
## C0_062314_004 -0.115319390
## C0_062314_005 0.001212625
## C0_062314_006 1.468684204
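As an aside, here is a hedged sketch of the AGHmatrix route mentioned above (this assumes the AGHmatrix package and its Gmatrix() function; check that package’s documentation for the genotype coding and missing-data conventions it expects):
# library(AGHmatrix)
# K.vr = Gmatrix(SNPmatrix = geno.num, method = "VanRaden")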
Some pairs of individuals can have negative additive relationships, indicating that they are LESS related than expected by random chance.
Diagonal elements of this matrix estimate 1 + F, where F is the genomic inbreeding coefficient for each individual:
F.vals = diag(K) - 1
hist(F.vals)
You can see a few individuals look pretty inbred, and a few have negative inbreeding coefficients (meaning they are more heterozygous than expected under Hardy-Weinberg equilibrium). Interpretation of these values is a bit tricky if the individuals were not sampled from a common outbreeding population, as we show below.
We can check how inbreeding has changed over generations:
df.F = data.frame(generations, F.vals)
names(df.F)[1] = "Gen"
df.F %>% group_by(Gen) %>%
summarize(MeanF = mean(F.vals))
## # A tibble: 6 x 2
## Gen MeanF
## <int> <dbl>
## 1 0 0.166
## 2 2 -0.0726
## 3 4 -0.0677
## 4 6 0.0245
## 5 8 0.0377
## 6 10 0.0742
Recall that according to breeding records, the individuals were derived from random mating within generations, so we expect something close to HWE (with F = 0) within each generation. When you pool data across multiple HWE populations with different allele frequencies, you will observe F > 0 in the meta-population due to differentiation (Fst). As an example, compare what happens when we estimate the relationship matrix ONLY within the generation 10 individuals:
geno10 = geno.num[generations == 10,]
Z10 = scale(geno10, center = T, scale = F)
ZZpr10 = Z10%*%t(Z10)
maf10 = colMeans(geno10)/2
denom10 = 2*sum(maf10*(1 - maf10))
K10 = ZZpr10/denom10
K10[1:5, 1:5]
## C10_869_10 C10_869_4 C10_869_5 C10_869_6 C10_869_7
## C10_869_10 0.939407791 -0.006279884 0.01581446 0.004359564 0.081695291
## C10_869_4 -0.006279884 0.956273327 -0.02257240 0.024762399 -0.049421715
## C10_869_5 0.015814464 -0.022572399 0.92833858 -0.077086483 -0.077025875
## C10_869_6 0.004359564 0.024762399 -0.07708648 0.888155516 -0.003023585
## C10_869_7 0.081695291 -0.049421715 -0.07702588 -0.003023585 0.908580391
mean(diag(K10)) - 1
## [1] -0.02036216
Compare this value, very close to zero and based ONLY on generation 10 individuals, to the meta-population mean inbreeding estimate near 0.07 for these same samples.
Linkage disequilibrium is the non-random association of alleles at different loci. It can occur between unlinked loci, so it’s not a good name, but it has stuck, as there tends to be a general relationship of higher LD between more tightly linked loci. In fact, one way that LD can arise is simply through population structure: loci that have different allele frequencies in different sub-populations will be in LD when data are combined across the sub-populations.
The linkage disequilibrium correlation for a pair of markers depends on their frequency IN GAMETES (or ‘haplotypes’):
\(\rho_{AB} = \frac{p_{AB} - p_{A}p_{B}}{\sqrt{p_{A}(1-p_{A})p_{B}(1-p_{B})}}\)
The problem is that \(p_{AB}\) is the frequency of the A-B gamete, and we don’t know that frequency directly when we have genotypic data. A doubly heterozygous individual has genotype AaBb, but we don’t know if it’s composed of the AB/ab pair of haplotypes or the Ab/aB pair. The simple correlation computed from the numericalized genotype values will be inflated by doubly heterozygous genotypes (a ‘1-1’ type score, which will contribute positively to the correlation).
LD is easy to estimate for highly inbred lines (where each individual carries two copies of the same gamete type) or if you have phased haplotype data, in which case you can just estimate the correlations between numericalized markers.
But if we have a more general outbred population with extensive heterozygosity, a simple correlation estimate of LD will be inflated compared to the true gametic correlation: individuals heterozygous at both loci have numerical values of 1 at both, which contributes to a positive correlation, whereas in reality we don’t actually know if the gametes that created that individual are AB/ab (counting toward a positive correlation between alleles A and B) or Ab/aB (counting toward a negative correlation!).
Unfortunately, we don’t have a great solution to this problem for outbred populations. TASSEL software handles it by simply ignoring the heterozygotes (they don’t contribute toward the correlation estimate) and computing the correlation only with individuals that are homozygous at both loci. That avoids the bias problem but doesn’t use the data efficiently. There are maximum likelihood estimation procedures as alternatives (https://www.nature.com/articles/hdy199655), but they are computationally slow when applied to all marker pairs.
The correlations of the columns are the correlations of the marker numerical scores. Unfortunately, this is NOT interpretable in terms of linkage disequilibrium (LD) when you have heterozygotes. Here, for example, we take the markers from chromosome 10 (most of which have names starting with PZE.110) and compute their correlations:
chr10.ids = grep("^PZE.110", colnames(geno.num))
chr10 = geno.num[,chr10.ids[1]:ncol(geno.num)] # markers are ordered by genome position and chr 10 is last, so take everything from the first chr-10 marker to the end
chr10.r = cor(chr10)
chr10.r2 = chr10.r^2 #get the r-squared values
heatmap(chr10.r2, Rowv = NA, Colv = NA)
These markers are ordered by position; you can see the small blocks of correlation around the diagonal indicating more similarity in marker scores for tightly linked pairs.
For comparison, now we set all heterozygotes to missing and then compute the correlation matrix using only pairwise complete observations. The elements of this correlation matrix can be squared to get an estimate of the LD r^2 value that is typically reported.
chr10.nohets = chr10
chr10.nohets[chr10.nohets == 1] = NA
chr10.r.ld = cor(chr10.nohets, use = "pairwise.complete.obs")
## Warning in cor(chr10.nohets, use = "pairwise.complete.obs"): the standard
## deviation is zero
chr10.r2.ld = chr10.r.ld^2 #get the r-squared values
heatmap(chr10.r2.ld, Rowv = NA, Colv = NA, labRow = NA, labCol = NA)
This doesn’t look too different from the pattern we saw when computing the correlation matrix using the heterozygote values. But you can see that the two methods give somewhat different correlation estimates:
SNP.sample = colnames(chr10)[1:5]
chr10.r[SNP.sample, SNP.sample]
## PZE.110000036 SYN17514 PZE.110000347 PZE.110000531
## PZE.110000036 1.0000000 -0.21201883 -0.15354640 0.25554017
## SYN17514 -0.2120188 1.00000000 -0.07226135 -0.08286960
## PZE.110000347 -0.1535464 -0.07226135 1.00000000 -0.08024991
## PZE.110000531 0.2555402 -0.08286960 -0.08024991 1.00000000
## PZE.110000733 0.5222881 -0.01461615 -0.04837540 0.57216343
## PZE.110000733
## PZE.110000036 0.52228807
## SYN17514 -0.01461615
## PZE.110000347 -0.04837540
## PZE.110000531 0.57216343
## PZE.110000733 1.00000000
chr10.r.ld[SNP.sample, SNP.sample]
## PZE.110000036 SYN17514 PZE.110000347 PZE.110000531
## PZE.110000036 1.0000000 -0.272243520 -0.204320299 0.249517027
## SYN17514 -0.2722435 1.000000000 -0.049470370 -0.015466130
## PZE.110000347 -0.2043203 -0.049470370 1.000000000 -0.007777084
## PZE.110000531 0.2495170 -0.015466130 -0.007777084 1.000000000
## PZE.110000733 0.5778437 0.009228258 -0.019391192 0.834593153
## PZE.110000733
## PZE.110000036 0.577843652
## SYN17514 0.009228258
## PZE.110000347 -0.019391192
## PZE.110000531 0.834593153
## PZE.110000733 1.000000000
Here is the distribution of the differences between the elements of the correlation matrices ignoring hets and including hets:
summary(c(chr10.r.ld - chr10.r))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.67 -0.04 0.02 0.02 0.07 0.99 49914
Well, that’s a surprise. I expected the correlations that ignore the heterozygotes to tend to be less positive than the correlations including the heterozygotes, but actually they are a little more positive. I guess that is happening by chance, since we are dropping lots of values and these particular LD correlation estimates are probably poorly estimated because of the reduced sample size. What do I know?
We can visualize population structure using the principal components of the marker data:
pcs.cov = princomp(t(geno.num), cor = F, scores = T) # transposed so that individuals are the 'variables'; their loadings then locate the individuals on the PCs
The pcs.cov object includes information on the variation associated with each principal component and the PC scores associated with each individual. Here are the standard deviations associated with the first 10 PCs
pcs.cov$sdev[1:10]
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## 5.222369 2.077072 1.824224 1.537824 1.337209 1.223739 1.151957 1.079716
## Comp.9 Comp.10
## 1.037596 1.003286
Let’s scale this to the percent of total variation associated with each PC (you can also get this from summary(pcs.cov), but that prints out all the PCs):
total.var = sum(pcs.cov$sdev^2)
(pcs.cov$sdev^2)[1:10]/total.var
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## 0.203984268 0.032267442 0.024889599 0.017687844 0.013373954 0.011200543
## Comp.7 Comp.8 Comp.9 Comp.10
## 0.009925076 0.008719272 0.008052261 0.007528547
So, the first PC explains 20% of the total marker variation. We might reasonably hypothesize that it mostly represents allele frequency changes among generations due to selection and drift. Let’s plot the individuals according to their loadings on the first two PCs, colored by generation of selection, to see how the PCs correspond to the different selection generations. First, create a data frame with the information on generations along with the PCs. This is a surprisingly painful operation, since the loadings object within the princomp object can’t be directly converted to a data frame. Here’s the plan:
1. force the loadings object to a numeric matrix - although it looks a lot like a numeric matrix, it isn’t :(
2. coerce the matrix to a data frame
PC.df = data.frame(matrix(as.numeric(pcs.cov$loadings), attributes(pcs.cov$loadings)$dim, dimnames=attributes(pcs.cov$loadings)$dimnames))
PC.df[, c("individual", "Pop")] = geno[,c("label", "popdata"), ]
PC.df = PC.df %>% mutate(Pop = factor(Pop))
Plot the individuals colored by Pop on the first two PCs
ggplot(PC.df, aes(x = Comp.1, y = Comp.2)) +
geom_point(aes(colour= Pop)) +
xlab("PC1 (20% of variation)") +
ylab("PC2 (3% of variation)")
A few things are clear from this graph:
The variation WITHIN a population is greatest for Cycle 0: see how it is spread most widely in both directions. The variation in Cycle 10 is similar for PC1, but much more compressed for PC2; PC2 really seems to mostly capture variation within Cycle 0 rather than between cycles. You can clearly see the differences between populations, with C10 narrowed for PC2 and pushed negative for PC1 on average compared to C0.
These are only two axes, and collectively they explain only about 23% of the total variation, so there is much additional variation, probably mostly within populations, that we are not seeing on this graph.
MDS/PCoA is a similar idea to PCA in that we want to represent the relationships among the individuals in just a few dimensions. MDS works by finding coordinates that best reproduce the pairwise distances between the individuals in a few dimensions, say two dimensions so we can plot them nicely. This is similar to the problem of representing the distances between points on Earth using a 2-dimensional map: the actual points are in three dimensions on the planet and we cannot represent the distances exactly in two dimensions, but with a good projection we can come close. Here we have the SIMILARITY between individuals measured by their marker covariances, and these similarities are in 381 dimensions. We want to convert the similarities to distances and then compress to two dimensions.
Euclidean distance: \(d(x, y) = \sqrt{var(x) + var(y) - 2cov(x, y)}\)
We can get this from the marker covariance matrix we already computed above, where the diagonal elements are the marker ‘variances’ of the individuals and the off-diagonal elements are the covariances.
marker.vars = diag(K) #vector of length 381
row.vars = marker.vars%*%t(rep(1,381)) #381*1 vector times 1*381 vector of ones = 381*381 matrix
str(row.vars)
## num [1:381, 1:381] 1.436 1.487 0.714 0.991 1.469 ...
row.vars is a matrix with the variance of each individual repeated across the columns for each row:
row.vars[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.4355021 1.4355021 1.4355021 1.4355021 1.4355021
## [2,] 1.4865757 1.4865757 1.4865757 1.4865757 1.4865757
## [3,] 0.7143021 0.7143021 0.7143021 0.7143021 0.7143021
## [4,] 0.9905057 0.9905057 0.9905057 0.9905057 0.9905057
## [5,] 1.4686842 1.4686842 1.4686842 1.4686842 1.4686842
We can transpose this to get a similar matrix with the individual variances repeated down the rows for each column. And if we add the two matrices together we get var(x) + var(y) for each element K(x,y). Then we just need to subtract 2K (= 2*Cov(x,y) for each element) and take the square root of every element to get the distance matrix needed:
gen.dist = sqrt(row.vars + t(row.vars) - 2*K)
str(gen.dist)
## num [1:381, 1:381] 0 1.74 1.45 1.52 1.72 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:381] "C0_062314_002" "C0_062314_003" "C0_062314_004" "C0_062314_005" ...
## ..$ : chr [1:381] "C0_062314_002" "C0_062314_003" "C0_062314_004" "C0_062314_005" ...
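Equivalently (a cross-check I added, with gen.dist2 as my own name for the result), the same distances can be computed directly as Euclidean distances between rows of the centered marker matrix, rescaled by sqrt(denom), because K = ZZ'/denom:
gen.dist2 = as.matrix(dist(Z))/sqrt(denom)
# all.equal(gen.dist, gen.dist2) should be TRUE up to floating-point error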
The diagonal elements should all be zeroes (distance of individual x with itself is zero):
gen.dist[1:5,1:5]
## C0_062314_002 C0_062314_003 C0_062314_004 C0_062314_005
## C0_062314_002 0.000000 1.741521 1.448627 1.521666
## C0_062314_003 1.741521 0.000000 1.549958 1.577744
## C0_062314_004 1.448627 1.549958 0.000000 1.297331
## C0_062314_005 1.521666 1.577744 1.297331 0.000000
## C0_062314_006 1.719483 1.266853 1.553585 1.567407
## C0_062314_006
## C0_062314_002 1.719483
## C0_062314_003 1.266853
## C0_062314_004 1.553585
## C0_062314_005 1.567407
## C0_062314_006 0.000000
Compare this to the original covariance matrix. The smallest distance observed in the gen.dist matrix is between individuals 2 and 5:
K[1:5,1:5]
## C0_062314_002 C0_062314_003 C0_062314_004 C0_062314_005
## C0_062314_002 1.43550210 -0.055409128 0.02564260 0.055270358
## C0_062314_003 -0.05540913 1.486575726 -0.10074673 -0.006097087
## C0_062314_004 0.02564260 -0.100746735 0.71430211 0.010870568
## C0_062314_005 0.05527036 -0.006097087 0.01087057 0.990505663
## C0_062314_006 -0.02621705 0.675172302 -0.11531939 0.001212625
## C0_062314_006
## C0_062314_002 -0.026217050
## C0_062314_003 0.675172302
## C0_062314_004 -0.115319390
## C0_062314_005 0.001212625
## C0_062314_006 1.468684204
And, in agreement, the genetic covariance between individuals 2 and 5 is the greatest off-diagonal value in the original matrix.
Now we can get the multidimensional scaling coordinates for two dimensions (k = 2)
mds.coord = cmdscale(gen.dist, eig = T, k=2)
str(mds.coord)
## List of 5
## $ points: num [1:381, 1:2] 0.0716 -0.3504 0.3166 -0.2844 -0.3705 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:381] "C0_062314_002" "C0_062314_003" "C0_062314_004" "C0_062314_005" ...
## .. ..$ : NULL
## $ eig : num [1:381] 17.93 12.89 8.9 6.78 5.94 ...
## $ x : NULL
## $ ac : num 0
## $ GOF : num [1:2] 0.0774 0.0774
The list component “points” is a 381 * 2 matrix that has the coordinates in the first two dimensions for each individual.
Add the dimension points to the data frame and plot the individuals in 2-D space, colored by population
PC.df[,c("mds1", "mds2")] = mds.coord$points
ggplot(PC.df, aes(x = mds1, y = mds2)) +
geom_point(aes(colour= Pop)) +
xlab("Dimension 1") +
ylab("Dimension 2")
Again, you can see the greater diversity of C0 from its wider spread in the plot, and the narrowing and divergence of subsequent cycles. Typically there is no need to do both PCA and MDS plots; pick the method you like and use it.
Another way to evaluate the potential groupings of the individuals is via cluster analysis. There are lots of methods; k-means and hierarchical clustering (which is really a big group of methods) are common. Let’s do k-means clustering and specify that we want to group individuals into 6 clusters, since we know a priori that we have 6 subpopulations. The cluster analysis is based on the original marker data matrix or the covariance matrix (but not the distance matrix):
set.seed(1)
kclust = kmeans(K, 6)
str(kclust)
## List of 9
## $ cluster : Named int [1:381] 2 6 1 3 6 6 1 6 5 1 ...
## ..- attr(*, "names")= chr [1:381] "C0_062314_002" "C0_062314_003" "C0_062314_004" "C0_062314_005" ...
## $ centers : num [1:6, 1:381] 0.0532 0.4542 -0.0047 -0.0594 -0.028 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "1" "2" "3" "4" ...
## .. ..$ : chr [1:381] "C0_062314_002" "C0_062314_003" "C0_062314_004" "C0_062314_005" ...
## $ totss : num 1172
## $ withinss : num [1:6] 60.5 21.1 18.9 157.5 278.5 ...
## $ tot.withinss: num 667
## $ betweenss : num 505
## $ size : int [1:6] 48 17 14 86 170 46
## $ iter : int 4
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
Note that this result shows that 505/1172 = 43% of the variation is between clusters, so most of the variation is WITHIN clusters. The clusters are a bit distinct but not very homogeneous (and we should have guessed that from the PC and MDS plots above).
Let’s look at the assignment of individuals to cluster numbers (which are arbitrary)
PC.df$cluster = kclust$cluster
table(PC.df[,c("Pop", "cluster")])
## cluster
## Pop 1 2 3 4 5 6
## 0 29 17 14 0 7 38
## 2 17 0 0 2 29 8
## 4 2 0 0 3 50 0
## 6 0 0 0 9 45 0
## 8 0 0 0 29 27 0
## 10 0 0 0 43 12 0
You can see that these clusters don’t correspond very well to the selection populations: C0 is split among five of the clusters, and cluster 5 contains individuals from every generation. In general, cluster analysis is a bit crude, as it is trying to coerce the continuous variation we see in the graphs above into discrete groups.
Structure and fastStructure software estimate probabilities that individuals belong to specific population sub-groups. These methods work by first proposing a number of sub-groups, k. Given that value, the probability that each individual belongs to each sub-group is computed; the probability can also be interpreted as the proportion of that individual’s genome derived from the sub-group. Unfortunately, the number of sub-groups is not known a priori, so the user must decide what the best number is. Typically, users re-run the analysis for a range of k values, get the likelihood value for each k (which typically increases for larger values of k), and then pick a value where the change in likelihood slows down. It can be a bit arbitrary. And it’s computationally intensive, requiring standalone software. But it makes nice graphical representations of population structure:
Figure 2 from Wisser et al 2019
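If you want something in a similar spirit without standalone software, you can scan k in the k-means analysis above and look for an ‘elbow’ where the total within-cluster sum of squares stops dropping quickly. This is just a k-means analog of the likelihood scan, not Structure itself:
set.seed(1)
wss = sapply(1:10, function(k) kmeans(K, k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")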
Let’s save the numericalized and filtered genotypic data set for later use, and also the genomic relationship matrix, which is useful for both association mapping and genomic prediction analyses.
#save files as comma separated flat text files
write.csv(geno.num, file = "Q:/My Drive/Teaching/Teaching_videos/Tuson_geno_num.csv",
row.names = T, quote = F)
write.csv(K,file = "Q:/My Drive/Teaching/Teaching_videos/Tuson_Kmat.csv",
row.names = T, quote = F )
Read them back in with:
geno.num2 = as.matrix(read.csv("Q:/My Drive/Teaching/Teaching_videos/Tuson_geno_num.csv", header = T, row.names = 1))
K2 = as.matrix(read.csv("Q:/My Drive/Teaching/Teaching_videos/Tuson_Kmat.csv", header = T, row.names = 1))
Save both objects into a single Rdata file:
save(geno.num, K, file = "Q:/My Drive/Teaching/Teaching_videos/Tuson.Rdata")
which can be opened in another R session using:
load("Q:/My Drive/Teaching/Teaching_videos/Tuson.Rdata")
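An alternative I sometimes prefer is saveRDS()/readRDS(), which stores one object per file and lets you pick the variable name when reading it back (the file name here is my choice):
saveRDS(K, file.path(path, "Tuson_Kmat.rds"))
K2 = readRDS(file.path(path, "Tuson_Kmat.rds"))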