Now that the human genome has been sequenced, we can study single nucleotide polymorphisms (SNPs) across the whole genome. Studies of this kind are called genome-wide association studies (GWAS). The goal of a GWAS is to identify associations between SNPs and diseases. Unfortunately, making these associations accurately is challenging. Some have proposed that current biostatistical analysis paradigms need to be more holistic in order to capture the true complexity of the genotype-phenotype relationship (Moore, Asselbergs, and Williams, 2010). In this essay I review some of the advantages and limitations of GWAS pertaining to sample size, variant frequency, and data interpretation.
GWAS is useful for finding genotype-phenotype relationships associated with common, complex diseases (Ghazani, 2017). For common diseases, it is relatively easy to meet the large sample size that GWAS requires. The cost of running a GWAS is reasonable because sequencing is inexpensive and large data sets can be mined computationally. GWAS relies on the bi-allelic assumption, which is reasonable because only 1-3% of our genome has random copy number variants. Many studies have been performed, and their data banked, for different diseases, but researchers need to examine the methodology used in order to determine whether the reported genotype-phenotype associations are valid. If cases and controls are drawn from similar distributions, a statistically significant association is much more trustworthy, which is important to consider when evaluating other researchers' conclusions about genotype-phenotype association.
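To make the role of sample size and the bi-allelic assumption concrete, here is a minimal sketch of a single-SNP allelic association test. It is an illustration under stated assumptions, not the method of any study cited here: the function name `allelic_chi_square`, the allele counts, and the use of the conventional genome-wide significance threshold of 5e-8 are all choices made for this example.

```python
import math

def allelic_chi_square(case_counts, control_counts):
    """Allelic 2x2 chi-square test for one bi-allelic SNP.

    Each argument is a (minor_allele_count, major_allele_count) tuple.
    Returns (chi-square statistic, p-value with 1 degree of freedom, odds ratio).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio

# Conventional genome-wide threshold (an assumption, not from the cited sources).
GENOME_WIDE_ALPHA = 5e-8

# Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles per group).
small = allelic_chi_square((600, 1400), (500, 1500))
# Same allele frequencies with ten times as many participants.
large = allelic_chi_square((6000, 14000), (5000, 15000))

for label, (chi2, p, odds) in (("n=2,000", small), ("n=20,000", large)):
    print(f"{label}: chi2={chi2:.2f} p={p:.2e} OR={odds:.2f} "
          f"genome-wide significant={p < GENOME_WIDE_ALPHA}")
```

In this made-up example the odds ratio is identical in both runs, but only the larger sample clears the genome-wide threshold, which is why common diseases, where large samples are easy to collect, suit GWAS well.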
Under certain conditions GWAS fails to identify new susceptibility loci for some diseases, even when a study has a very large sample size (Moore, Asselbergs, and Williams, 2010). GWAS performs poorly at determining genotype-phenotype associations for rare diseases because it is difficult to obtain the large sample size required. Overlap between the discovery and validation samples is another potential pitfall. The discovery sample is where SNPs are selected and their effects estimated, whereas the validation sample is an independent sample with known phenotypes. A related pitfall is that prediction accuracy will be overestimated if the validation sample is more closely related to the discovery sample than the target sample is (Wray et al., 2013). Lastly, Wray et al. (2013) proposed that population stratification can inflate accuracy when the discovery and validation samples share a stratification that does not match the stratification of the target sample. Researchers should pay close attention to this stratification issue to ensure the validity of genotype-phenotype association conclusions. Another important limitation of GWAS is its reliance on standard logistic regression. Moore et al. (2010) suggested more advanced algorithms, including data mining with machine learning methods such as decision trees and random forests. These algorithms are intended to help illuminate the relationship among DNA sequence variation, environmental exposure, and variation in disease susceptibility.
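The discovery and validation overlap pitfall described above can be illustrated with a small simulation. This is a rough sketch under simplifying assumptions, not a reproduction of the analysis in Wray et al. (2013): the genotypes are random, the phenotype is pure noise with no true SNP effect, and the sample sizes, the number of selected SNPs, and all function names are hypothetical choices for this example.

```python
import random

def simulate(n_people, n_snps, rng):
    """Random genotypes (0/1/2 minor-allele counts) and an unrelated phenotype."""
    # choice((0, 1, 1, 2)) gives genotype frequencies 0.25 / 0.5 / 0.25.
    genotypes = [[rng.choice((0, 1, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n_people)]
    phenotype = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    return genotypes, phenotype

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def top_snp_weights(genotypes, phenotype, n_top):
    """Select the n_top SNPs most correlated with the phenotype.

    Returns (snp_index, weight) pairs, using the correlation as the weight.
    """
    effects = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        effects.append((correlation(column, phenotype), j))
    effects.sort(key=lambda e: abs(e[0]), reverse=True)
    return [(j, r) for r, j in effects[:n_top]]

def score(genotypes, weights):
    """Weighted sum of selected genotypes for each person."""
    return [sum(r * row[j] for j, r in weights) for row in genotypes]

rng = random.Random(1)  # fixed seed so the run is repeatable
disc_g, disc_y = simulate(200, 300, rng)    # discovery sample
indep_g, indep_y = simulate(200, 300, rng)  # independent validation sample

weights = top_snp_weights(disc_g, disc_y, n_top=20)

# "Validation" that reuses the discovery sample versus a truly independent one.
r_overlap = correlation(score(disc_g, weights), disc_y)
r_independent = correlation(score(indep_g, weights), indep_y)
print(f"overlapping validation r = {r_overlap:.2f}")
print(f"independent validation r = {r_independent:.2f}")
```

Because no SNP truly affects the phenotype here, any apparent accuracy in the overlapping case comes from selecting SNPs and estimating their effects on the same people used for evaluation. The independent sample should show a correlation near zero.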
GWAS is a powerful method for determining genotype-phenotype associations, but researchers must keep in mind its limitations regarding sample size, variant frequency, and data interpretation in order to reach valid conclusions about association.
Ghazani, A. A. (2017). Introduction to Genomics [PowerPoint slides]. https://canvas.harvard.edu/courses/35084/files/5290181?module_item_id=354933
Moore, J. H., Asselbergs, F. W., & Williams, S. M. (2010). Bioinformatics challenges for genome-wide association studies. Bioinformatics, 26(4), 445-455. doi:10.1093/bioinformatics/btp713
Wray, N. R., Yang, J., Hayes, B. J., Price, A. L., Goddard, M. E., & Visscher, P. M. (2013). Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics, 14, 507-515. doi:10.1038/nrg3457