This is a collaborative project with Ph.D. candidate Tomas Gonzalez Zarzar (Twitter: @tomszar) from the Anthropology Department at The Pennsylvania State University.
In population genetics, clustering individuals into subpopulations helps answer evolutionary questions and describe processes such as genetic drift, migration, mutation, and selection. Two types of approaches are currently used to infer population structure from genetic data: model-based and distance-based. Model-based approaches are limited by the need to assume a priori the number of expected subpopulations, together with the expectation that those subpopulations are at Hardy-Weinberg equilibrium (HWE). Distance-based approaches, on the other hand, identify clusters from genetic distances or genetic similarities. Nonetheless, these methods also have drawbacks: the results depend on the distance measure used, and it is difficult to assess the significance of the resulting clusters, among other issues (Greenbaum, Templeton, & Bar-David, 2016).
Clustering nodes into groups has been extensively studied. Nonetheless, population genetics has been confronted with the misclassification of individuals based on race or geographical location. This misclassification has limited the testable hypotheses regarding the evolution and variation patterns of the human genome, gene flow events, and the geographic distribution of human variation (National Research Council (US) Committee on Human Genome Diversity, 1997). Moreover, selecting appropriate measures of genetic distance between individuals, together with the exponential increase in the size of population genomic datasets, has created several challenges for population classification (Greenbaum et al., 2016). To overcome these limitations, Greenbaum et al. proposed defining genetic structure in terms of network theory, which could help answer questions in genomics that the previous methods handle poorly (Greenbaum et al., 2016). Moreover, a recent study by Han et al. showed that identity by descent (IBD) is more accurate than genotypes for identifying population groups and population structure (Han et al., 2017).
Here we estimate communities, or subpopulations, from a genetic similarity matrix based on identity by descent (IBD), applying different thresholds to test whether the estimated communities are consistent with their geographic distribution.
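The thresholding idea can be illustrated with a minimal sketch. It is not the project's actual pipeline: the IBD values and region labels below are made-up toy data, the function names are hypothetical, and connected components of the thresholded network are used as a deliberately simple stand-in for a proper community detection algorithm.

```python
from collections import Counter, deque


def threshold_graph(similarity, threshold):
    """Link every pair of individuals whose similarity is >= threshold."""
    n = len(similarity)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Group nodes by breadth-first search (stand-in for community detection)."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(members))
    return components


def region_purity(components, regions):
    """Fraction of individuals that share the majority region of their group."""
    matched = 0
    for members in components:
        counts = Counter(regions[i] for i in members)
        matched += counts.most_common(1)[0][1]
    return matched / len(regions)


# Hypothetical pairwise IBD sharing for five individuals (toy values only).
ibd = [
    [1.00, 0.30, 0.25, 0.02, 0.01],
    [0.30, 1.00, 0.28, 0.03, 0.02],
    [0.25, 0.28, 1.00, 0.06, 0.01],
    [0.02, 0.03, 0.06, 1.00, 0.20],
    [0.01, 0.02, 0.01, 0.20, 1.00],
]
# Hypothetical geographic labels, one per individual.
regions = ["A", "A", "A", "B", "B"]

# Recompute the groups at several thresholds and compare them with geography.
results = {}
for t in (0.05, 0.10, 0.29):
    groups = connected_components(threshold_graph(ibd, t))
    results[t] = (groups, region_purity(groups, regions))
    print(t, groups, results[t][1])
```

In this toy example, a low threshold merges everyone into a single group, an intermediate one recovers two groups that match the region labels, and a high one fragments the network into mostly isolated individuals. That sensitivity is the reason for comparing several thresholds rather than fixing one in advance.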