In biomedical research, plant breeding, and countless other endeavors, geneticists are on the hunt for the specific genes responsible for disease susceptibility, yield, and other traits of interest. Mostly, they're looking for needles in the enormous haystack that is the genome of an organism.

As a frame of reference, the human genome is made up of 3.2 billion base pairs and an estimated 30,000 genes. Where do geneticists even start?

For the past 15 years, many have relied on genome-wide association studies (GWAS).

"I view a GWAS as a way to reduce the size of the haystack into genomic regions that potentially could contain causal mutations underlying a trait," says Alex Lipka, assistant professor of biometry in the Department of Crop Sciences at the University of Illinois and author of a new Heredity study expanding the scope of GWAS.

Genomic Region

To run a GWAS, scientists conduct computationally intensive statistical analyses to scour the genetic code for differences. Specific variations in DNA, called markers, that exhibit the highest degree of statistical association are thought to be near genes that make vital contributions to the trait. Sometimes, these associated markers are clustered together in a particular region of the genome, narrowing the haystack.
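As a rough illustration of the marker-by-marker scan described above, the sketch below simulates a small dataset and scores each marker by its squared correlation with the trait. Everything in it is an assumption made for illustration: the sample size, the number of markers, the position and effect of the causal marker, and the use of squared Pearson correlation as the association score. Real GWAS software works on far larger data and uses more careful statistical models.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def simulate(n_individuals=500, n_markers=50, causal=17, effect=1.0, seed=1):
    """Hypothetical data: genotypes coded 0/1/2, one causal marker plus noise."""
    rng = random.Random(seed)
    genotypes = [
        [rng.choice((0, 1, 2)) for _ in range(n_markers)]
        for _ in range(n_individuals)
    ]
    trait = [effect * row[causal] + rng.gauss(0, 1) for row in genotypes]
    return genotypes, trait


def single_marker_scan(genotypes, trait):
    """Score each marker on its own: squared correlation with the trait."""
    n_markers = len(genotypes[0])
    scores = []
    for j in range(n_markers):
        column = [row[j] for row in genotypes]
        scores.append(pearson_r(column, trait) ** 2)
    return scores


genotypes, trait = simulate()
scores = single_marker_scan(genotypes, trait)
# The marker with the highest score is the best candidate region.
top_marker = max(range(len(scores)), key=scores.__getitem__)
print("strongest association at marker", top_marker)
```

In this toy setting, the scan only has to find one strong signal among 50 markers; the real difficulty comes from millions of markers, correlated neighbors, and small effects.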

Lipka says the approach has been used in a wide variety of organisms to identify significant genes contributing to key traits, but it falls short in detecting small-effect genes or gene interactions, a phenomenon known as epistasis, that may be just as important.

"The state-of-the-art statistical approach for GWAS is to test one marker at a time for the strength of its association with the trait," he says. "If you think about the true genetic underpinnings of a trait, it's not just one gene controlling things. Multiple genes contribute to phenotypic variation in an additive manner and are epistatically interacting with one another.

"What we try to do in our study is to explore the use of a statistical approach that is more biologically accurate. Not only are we finding statistical models that include multiple markers at a time, but we also find multiple two-way interaction effects at a time."
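To make the idea of additive plus two-way interaction effects concrete, here is a minimal sketch that fits two models to simulated data: one with additive marker terms only, and one that adds a product term for the interaction. It is not the researchers' procedure. The two markers, their effect sizes, the -1/0/1 genotype coding, and the plain least-squares fit are all assumptions chosen for illustration.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(columns, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


def residual_ss(columns, beta, y):
    """Residual sum of squares for a fitted model."""
    total = 0.0
    for i, yi in enumerate(y):
        fitted = sum(b * col[i] for b, col in zip(beta, columns))
        total += (yi - fitted) ** 2
    return total


# Hypothetical data: two markers coded -1/0/1, one additive effect (0.5)
# and one epistatic (interaction) effect (1.0), plus random noise.
rng = random.Random(7)
n = 1000
g_a = [rng.choice((-1, 0, 1)) for _ in range(n)]
g_b = [rng.choice((-1, 0, 1)) for _ in range(n)]
trait = [0.5 * a + 1.0 * a * b + rng.gauss(0, 1) for a, b in zip(g_a, g_b)]

intercept = [1.0] * n
interaction = [a * b for a, b in zip(g_a, g_b)]

additive_cols = [intercept, g_a, g_b]
full_cols = [intercept, g_a, g_b, interaction]

beta_additive = least_squares(additive_cols, trait)
beta_full = least_squares(full_cols, trait)

rss_additive = residual_ss(additive_cols, beta_additive, trait)
rss_full = residual_ss(full_cols, beta_full, trait)
print("additive-only RSS:", round(rss_additive, 1))
print("with interaction RSS:", round(rss_full, 1))
```

The additive-only model cannot account for the product term, so it leaves more unexplained variation. A genome-wide search would also have to decide which of the many possible marker pairs to include, which is where the computational cost comes from.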

The researchers wanted to see if their new approach, which they call SPAEML, could accurately detect the underpinnings of simulated traits with genetic sources similar to Alzheimer's disease in humans and flower structure in corn; these traits have already been described to some extent in the scientific literature.

Using custom-built software, which they have made freely available to other researchers, and large computers at the National Center for Supercomputing Applications, the team tested whether SPAEML could detect the genetic signals underlying the simulated traits in the dataset.