Whole Genome Sequencing Data and Machine Learning Are Used to Identify Individuals from Predicted Traits


Researchers from Human Longevity, Inc. (HLI) have published a study in which individual faces and other physical traits were predicted using whole genome sequencing data and machine learning. The ability to predict human physical characteristics and demographic information from genomic data poses challenges for privacy and data de-identification in personalized medicine.

The study offers innovative approaches for forensics and has significant implications for de-identification, data privacy, and adequately informed consent. The study authors conclude that considerably more public deliberation is required as more and more genomes are produced and stored in public databases.

The IRB-approved study involved 1,061 ethnically diverse participants (aged 18–82 years), whose genomes were sequenced to a minimum depth of 30x. The researchers also collected phenotype data from participants, including age, height, weight, eye and skin color, voice samples, and 3D facial images.

The researchers succeeded in predicting sex, skin color, and eye color, but they had difficulty predicting other, more complex genetic traits. Even though their predictive models were efficient, large cohorts were required for reliable prediction. The researchers developed an innovative algorithm, known as a maximum entropy algorithm, to find the optimal combination of predictive models for matching whole-genome sequencing data with phenotypic and demographic data. On average, the algorithm correctly identified 8 out of 10 participants of different ethnic backgrounds, and 5 out of 10 African American or European participants.
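To make the "8 out of 10" style of result concrete, the following is a minimal sketch of how such a matching accuracy could be estimated. It is not the study's maximum entropy algorithm: it swaps in a plain nearest-neighbor comparison, and the function names (`distance`, `pick_match`, `select_at_n`), the numeric trait vectors, and the pool size of 10 are all illustrative assumptions rather than details from the study.

```python
import math
import random


def distance(predicted, observed):
    """Euclidean distance between two equal-length trait vectors."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))


def pick_match(predicted, pool):
    """Return the index of the pool entry closest to the predicted traits."""
    return min(range(len(pool)), key=lambda i: distance(predicted, pool[i]))


def select_at_n(predictions, observations, pool_size=10, trials=1000, seed=0):
    """Estimate how often the true individual is picked out of a random pool.

    predictions[i]  : traits predicted from genome i (hypothetical input)
    observations[i] : traits actually measured for person i (hypothetical input)
    """
    rng = random.Random(seed)
    n = len(predictions)
    hits = 0
    for _ in range(trials):
        # Draw a random pool of people, then choose one of them as the target.
        pool_ids = rng.sample(range(n), pool_size)
        target_pos = rng.randrange(pool_size)
        pool = [observations[i] for i in pool_ids]
        # A hit means the target's genome-based prediction lands closest
        # to the target's own measured traits.
        guess = pick_match(predictions[pool_ids[target_pos]], pool)
        if guess == target_pos:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data only: three numeric traits per person, plus noisy predictions.
    observed = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
    predicted = [[t + rng.gauss(0, 0.5) for t in person] for person in observed]
    print(select_at_n(predicted, observed))
```

A returned value of 0.8 would correspond to identifying 8 out of 10 individuals in pools of this size. The noisier the trait predictions, the lower the value, which is consistent with the article's point that hard-to-predict traits limit identification.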

"We set out to do this study to prove that your genome codes for everything that makes you, you. This is clearly a proof of concept with a limited cohort, but we believe that as we increase the numbers of people in this study and the HLI database to hundreds of thousands, we will be able to accurately predict all that can be predicted from individuals’ genomes," said Venter, the co-founder of HLI.

Venter noted that the public and the scientific community are not sufficiently focused on the need for safeguards and policies that protect individual privacy in genomic data, and he called for better technical solutions, in-depth analysis, and continued discussion.

The study shows that imaging techniques are an efficient way to screen the traits of a large number of people. Machine learning enables comprehensive, automated data interpretation and plays a significant role in scientific discovery.