Abdulaimma, B (2019) High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms. Doctoral thesis, Liverpool John Moores University.
Abstract
The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide, and the disease is now regarded as the seventh leading cause of mortality. Understanding the underlying causes of T2D is therefore high on government agendas. The condition is a multifactorial disorder with a complex aetiology: it emerges from the convergence of genetics, the environment and diet, and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of the genetic determinants of common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single- and multi-locus analysis, such as logistic regression, have contributed little to understanding the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits; however, it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases arise from the contributions of many interacting genetic variants, yet detecting epistatic interactions and understanding the architecture of pathogenesis underlying complex human disorders remains a significant challenge. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. The framework comprises traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification.
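The point about non-linear epistatic interactions can be illustrated with a minimal sketch. It assumes a hypothetical, fully synthetic two-locus model (not data or code from the thesis) in which an individual is a case exactly when one of two loci, but not both, carries the minor allele. The names `phenotype`, `case_rate` and `population` are illustrative only. Under this rule neither SNP shows any marginal association with the trait, so a test that inspects one SNP at a time finds nothing, even though the pair determines the outcome completely.

```python
from itertools import product


def phenotype(carrier_a, carrier_b):
    """Toy epistatic rule (an assumption): case iff exactly one locus is a carrier."""
    return carrier_a ^ carrier_b


# One individual per combination of carrier status (0 = non-carrier, 1 = carrier),
# i.e. both loci have carrier frequency 0.5 and are independent.
population = [(a, b, phenotype(a, b)) for a, b in product((0, 1), repeat=2)]


def case_rate(rows):
    """Fraction of cases among the given individuals."""
    return sum(y for _, _, y in rows) / len(rows)


# Marginal view: case rate conditional on a single SNP.
marginal_a = {g: case_rate([r for r in population if r[0] == g]) for g in (0, 1)}
marginal_b = {g: case_rate([r for r in population if r[1] == g]) for g in (0, 1)}

# Joint view: case rate conditional on both SNPs together.
joint = {(a, b): y for a, b, y in population}

print("marginal A:", marginal_a)
print("marginal B:", marginal_b)
print("joint     :", joint)
```

In this toy setting the marginal case rate is identical for carriers and non-carriers at each locus, while the joint table separates cases from controls perfectly. Real GWAS data are far noisier, and the sketch is not a claim about the form of the interactions in the T2D data.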
Quality control procedures are conducted to exclude genetic variants and individuals that do not meet pre-specified criteria. Logistic association analysis under an additive genetic model, adjusted for the genomic control inflation factor, is then conducted. SNPs that pass a p-value threshold of 10⁻² are retained, resulting in 6609 SNPs (features); this removes statistically improbable SNPs and helps minimise the computational requirements needed to process all SNPs. The 6609 SNPs are used for epistatic analysis through hidden layers with progressively fewer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of the deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The best results were obtained using 2500 compressed hidden units (AUC = 94.25%). However, classification performance when using 300 compressed neurons remains reasonable (AUC = 80.78%). These results are promising: using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes.
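The genomic-control adjustment and p-value filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis pipeline: it assumes each SNP already has a 1-degree-of-freedom chi-square association statistic, estimates the inflation factor as the median observed statistic divided by the median of a 1-df chi-square distribution (about 0.4549), and divides every statistic by that factor before converting to a p-value. Clamping the factor at 1, the SNP identifiers, and the statistic values are all hypothetical choices made for the example.

```python
import math
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Genomic control lambda: observed median statistic over the expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a 1-df chi-square statistic: P(Z^2 > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))


def select_snps(chi2_by_snp, p_threshold=1e-2):
    """Return (lambda, {snp: adjusted p}) for SNPs whose adjusted p is below the threshold."""
    # Assumption: never deflate statistics, so lambda is clamped at 1.
    lam = max(1.0, inflation_factor(list(chi2_by_snp.values())))
    selected = {}
    for snp, stat in chi2_by_snp.items():
        p_adjusted = chi2_1df_pvalue(stat / lam)
        if p_adjusted < p_threshold:
            selected[snp] = p_adjusted
    return lam, selected


# Hypothetical statistics for five made-up SNP identifiers.
example_stats = {"snp_a": 0.1, "snp_b": 0.5, "snp_c": 0.91, "snp_d": 2.0, "snp_e": 15.0}
lam, selected = select_snps(example_stats)
print(f"lambda = {lam:.3f}")
print("retained SNPs:", sorted(selected))
```

The retained SNPs would then serve as the input features for the stacked autoencoders. In practice this step is usually delegated to an established GWAS toolkit rather than hand-written code.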
Item Type: | Thesis (Doctoral) |
---|---|
Additional Information: | Approved SD 21/05/19. Embargo requested: 2 years from 21/05/19 to 21/05/21 |
Uncontrolled Keywords: | T2D, GWAS, Deep Learning, High-Dimensional Data |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; R Medicine > R Medicine (General) |
Divisions: | Computer Science & Mathematics |
Date Deposited: | 21 May 2019 08:24 |
Last Modified: | 03 Jan 2023 17:00 |
DOI or ID number: | 10.24377/LJMU.t.00010723 |
Supervisors: | Fergus, P |
URI: | https://researchonline.ljmu.ac.uk/id/eprint/10723 |