Zhou, Wenbin [1], Soghigian, John [2], Xiang, Jenny [3].

Phylogenomics of Hamamelis and Castanea with 353 Angiosperm genes and RAD-seq data – A proposed approach for cleaning the paralogs.

Target enrichment and RAD-seq are two reduced-representation genomic methods widely used for phylogenomic studies across a breadth of taxa, but few studies have compared the congruence and conflict between these methods for the same taxonomic groups. The recently developed Angiosperm 353 universal probe kit has shown tremendous potential for phylogenomic study across a wide range of taxonomic levels. This kit targets 353 nuclear genes and also recovers flanking intron sequences. Several pipelines have been developed for the analysis of target enrichment data; among them, HybPiper, PHYLUCE, and SECAPR are widely used. However, users often make no additional effort to assess the impact of potential paralogous sequences in the constructed data sets on phylogenetic inference. Most pipelines use sequence identity (80%-85% similarity to targets as the cut-off) to distinguish orthologs from putative paralogs. This cut-off may be insufficient for removing paralogs in studies of closely related species or of lineages with a history of genome duplication. Alternatively, the presence of shared heterozygous sites at a given locus in a high percentage of taxa may be a good indication of paralogy at that locus. While comparing the results of target enrichment and RAD-seq in our study of Hamamelis and Castanea, we developed an approach to improve paralog detection in Angiosperm 353 data by expanding existing pipelines to recover the heterozygous sites at each locus, calculate the percentage of taxa sharing heterozygous sites at a given locus, and then remove loci that exceed a user-specified threshold of heterozygosity. We compared phylogenies inferred from datasets generated by conventional analyses using HybPiper with those generated by our revised pipeline, and further compared both with results from analyses of RAD-seq data. Our results showed substantial differences in support for some nodes, branch lengths, and divergence time estimates, although the overall topologies were largely identical. Many more putative paralogs were detected by our pipeline than by HybPiper (48 vs. 11 in Castanea, and 27 vs. 2 in Hamamelis). Our study suggests that the erroneous signal from the minority of paralogous gene sequences derived from Angiosperm 353 probes may not be a concern for inferring phylogenetic relationships within a genus, but these sequences can lower nodal support by adding conflicting signal to the dataset and can skew downstream phylogeny-based analyses that take branch length information into account, such as divergence time estimation. Therefore, we highly recommend using our pipeline for paralogous gene filtering of enrichment data.
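
The filtering step described above (score each locus by the percentage of taxa sharing heterozygous sites, then drop loci above a user-specified threshold) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes that each locus is available as an alignment in which heterozygous sites are encoded with two-base IUPAC ambiguity codes, and that a site counts as "shared" when at least two taxa are heterozygous at the same column. The function names, the `min_taxa` parameter, and the default threshold of 0.5 are hypothetical choices made for illustration only.

```python
# Illustrative sketch only: names, the "shared" definition, and the default
# threshold are assumptions, not part of the published pipeline.

HET_CODES = set("RYSWKM")  # two-base IUPAC ambiguity codes


def shared_het_fraction(alignment, min_taxa=2):
    """Return the fraction of taxa that are heterozygous at one or more
    columns where at least `min_taxa` taxa carry a heterozygous code.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    """
    taxa = list(alignment)
    if not taxa:
        return 0.0
    length = len(alignment[taxa[0]])
    sharing = set()
    for col in range(length):
        het_taxa = [t for t in taxa if alignment[t][col].upper() in HET_CODES]
        if len(het_taxa) >= min_taxa:
            sharing.update(het_taxa)
    return len(sharing) / len(taxa)


def filter_loci(loci, threshold=0.5):
    """Split loci into kept and flagged (putative paralog) sets.

    loci: dict mapping locus name -> alignment dict.
    A locus is flagged when its shared-heterozygosity fraction exceeds
    `threshold`. Both returned dicts map locus name -> fraction.
    """
    kept, flagged = {}, {}
    for name, alignment in loci.items():
        frac = shared_het_fraction(alignment)
        if frac > threshold:
            flagged[name] = frac
        else:
            kept[name] = frac
    return kept, flagged


# Toy data: locusA has a heterozygous site shared by three of four taxa,
# locusB has a single heterozygous site private to one taxon.
toy_loci = {
    "locusA": {"t1": "ACGR", "t2": "ACGR", "t3": "ACGR", "t4": "ACGA"},
    "locusB": {"t1": "ACGT", "t2": "ACRT", "t3": "ACGT", "t4": "ACGT"},
}
kept, flagged = filter_loci(toy_loci, threshold=0.5)
print("kept:", kept)
print("flagged:", flagged)
```

In this toy input, locusA is flagged because three of its four taxa share a heterozygous site, while locusB is kept because its only heterozygous site occurs in a single taxon. Lowering the threshold makes the filter stricter and removes more loci.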

1 - North Carolina State University, Plant & Microbial Biology, Box 7612, 100 Derieux Place Gardner Hall 2115, Raleigh, NC, 27695, United States
2 - North Carolina State University, Entomology and Plant Pathology, Gardner Hall, Raleigh, NC, 27695, United States
3 - North Carolina State University, Plant & Microbial Biology, Campus Box 7612, Gardner Hall 2115, Raleigh, NC, 27695, United States

Target Enrichment
Angiosperm 353

Presentation Type: Oral Paper
Session: PHYL4, Phylogenomics IV
Location: Virtual/Virtual
Date: Friday, July 31st, 2020
Time: 4:15 PM
Number: PHYL4006
Abstract ID: 890
Candidate for Awards: None

Copyright © 2000-2020, Botanical Society of America. All rights reserved