Machine Learning for Personalized Medicine

Marie-Curie Action: "Initial Training Networks"

Scientific lecture by Florence Demenais: Statistical Genetics

Statistical genetics is a field of research concerned with the development and application of statistical methods to decipher the genetic mechanisms underlying diseases (and traits) and to characterize the other factors (environmental, lifestyle factors…) that may modulate their effect on disease. It also includes the implementation of these methods in computer programs, which are usually made available to the scientific community. Over the last decade, statistical genetics has undergone a drastic paradigm shift, from a mostly theoretical approach to a heavily data-oriented discipline in which the massive amounts of genetic data generated by new high-throughput genotyping and sequencing technologies allow researchers to explore new scientific hypotheses.

Over the past decades, major advances have been made in identifying the mutations responsible for monogenic diseases, and exome sequencing has recently accelerated gene discovery for these diseases. However, known mutations often display incomplete penetrance, and the genetic and environmental factors modulating their expression remain largely unknown. The characterization of the genetic component of multifactorial diseases, which are frequent in the population (e.g. cancers, cardiovascular diseases, asthma and allergic diseases, neuro-psychiatric diseases…), was rather limited until the beginning of this century. The discovery of novel genes involved in these complex diseases has increased dramatically over the last five years, as advances in high-throughput genotyping technologies have made large-scale genome-wide association studies (GWAS) possible. However, many genes remain to be discovered.

Two main study designs are used to study the genetics of human diseases: family studies and case-control studies. Family studies have been widely used for the study of monogenic diseases and, to some extent, for multifactorial diseases. Family-based methods can be either model-based (i.e. requiring the specification of the genotype/phenotype relationship through a mathematical model) or model-free (no model specification required). Model-based methods are particularly suited to the study of monogenic diseases or Mendelian entities of multifactorial diseases; they allow gene mapping (linkage analysis) as well as estimation of the gene effect on disease (penetrance) and identification of the factors (age, sex, environmental factors) that modulate the penetrance. Model-free methods, which have mainly been used for multifactorial diseases, include the affected sib-pair method for linkage analysis, which detects the chromosomal regions (loci) linked to disease, and the Transmission Disequilibrium Test (TDT) and its extensions, which identify the genetic variants associated with disease.
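As an illustration of the model-free approach, the basic form of the TDT can be sketched in a few lines. The sketch assumes the textbook McNemar-type statistic for a biallelic marker: among heterozygous parents of affected children, it compares how often the allele of interest is transmitted versus not transmitted. The function name and the counts in the usage note are hypothetical, chosen only for illustration; the extensions of the TDT mentioned above are not covered.

```python
import math


def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Basic TDT for one biallelic marker (illustrative sketch).

    transmitted:   heterozygous parents who passed the allele of interest
                   to their affected child
    untransmitted: heterozygous parents who passed the other allele
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    n = transmitted + untransmitted
    if n == 0:
        raise ValueError("at least one heterozygous parent is required")
    # McNemar-type statistic: (b - c)^2 / (b + c)
    chi2 = (transmitted - untransmitted) ** 2 / n
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With hypothetical counts of 60 transmissions and 40 non-transmissions, `tdt(60, 40)` gives a statistic of 4.0. Under the null hypothesis each allele is transmitted with probability 1/2, so equal counts give a statistic of 0.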

The case-control study design has been predominantly used in GWAS of complex diseases to conduct association mapping in very large samples (tens or hundreds of thousands of affected and unaffected subjects). These GWAS have led to the identification of many genetic variants associated with numerous diseases and traits (compiled in the NHGRI GWAS catalog at http://www.genome.gov/gwastudies/, which currently includes association results for 11,335 SNPs). However, these variants explain only part of the genetic component of disease, and the causal variants are still mostly unknown. An increasing number of methods are being proposed to mine the massive amounts of data generated by these GWAS, together with the various sorts of biological information stored in public databases. These methods attempt to address complex mechanisms such as gene-gene and gene-environment interactions, genetic heterogeneity, pleiotropy… New methods are also being proposed to examine the whole spectrum of genetic variation (as made accessible by new sequencing technology) and to integrate statistical approaches and biological information, as well as various types of “omics approaches” (genomics, transcriptomics, epigenomics…), towards a systems biology approach to disease. Increasingly, this research needs to be conducted in a multidisciplinary setting and within the framework of many collaborations at the European and international levels.
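To make the case-control association mapping described above concrete, here is a minimal sketch of a single-SNP allelic test. It assumes the simplest possible analysis, a Pearson chi-square test on a 2x2 table of allele counts in cases and controls with an accompanying odds ratio. Real GWAS pipelines typically use regression models with covariates and correct for multiple testing, which this sketch omits. The function name and the counts in the usage note are hypothetical.

```python
import math


def allelic_association(case_a: int, case_other: int,
                        control_a: int, control_other: int
                        ) -> tuple[float, float, float]:
    """Allelic chi-square test for one SNP (illustrative sketch).

    Arguments are allele counts: copies of allele A and of the other
    allele, in cases and in controls.
    Returns (chi-square statistic with 1 df, p-value, odds ratio).
    """
    a, b, c, d = case_a, case_other, control_a, control_other
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0 or b * c == 0:
        raise ValueError("table has an empty margin or undefined odds ratio")
    # Pearson chi-square for a 2x2 table, without continuity correction
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio
```

With hypothetical counts of 60 and 40 alleles in cases and 40 and 60 in controls, `allelic_association(60, 40, 40, 60)` gives a statistic of 8.0 and an odds ratio of 2.25.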

In this lecture, I will present an overview of the statistical methods that have been proposed in statistical genetics over time. I will address the new challenges we have to face to disentangle the molecular mechanisms underlying complex diseases and to progress towards personalized medicine.  
