Introduction:
Many of us are familiar with the classic conceptualization of genetic disorders as disorders that run in families, often illustrated in large, multi-generation pedigrees. However, the widespread application of sequencing technologies as a diagnostic tool for rare disorders has now rewritten the textbook. One of the ways it has done so is by demonstrating that many rare disorders are caused not by variants that are passed on from generation to generation, but by variants that can pop up spontaneously in an affected individual, which we call de novo variants. These are variants that are absent from most of the parents' tissues, and (usually) arise during gametogenesis. In a recent large-scale study of developmental disorders in the UK, about 75% of individuals with a genetic diagnosis had a pathogenic de novo variant—highlighting how central de novo status can be for diagnosis.
Why this matters:
The standard way of identifying de novo variants involves sequencing of both parents, as well as of the affected proband (a "trio"). However, clinical geneticists often encounter scenarios where it is not feasible to obtain samples from both biological parents, which can create significant access and equity issues for some families. In these settings, uncertainty about whether a candidate variant is de novo can directly translate into missed diagnoses.
What we studied:
This is the challenge that we set out to address with duoNovo: identifying de novo variants without having to sequence both parents. duoNovo critically relies on long-read sequencing, and in particular its ability to facilitate the reconstruction of haplotypes (blocks of variants that are present together on the same homologous chromosome). The core idea is quite simple. Once we have reconstructed haplotypes for the proband and the available parent, we can then look for variants that are present in the proband and absent in the parent, within a haplotype that is otherwise identical between the two of them (up to expected sequencing and phasing noise). Such haplotype similarity is evidence that the proband inherited this haplotype from the available parent. Therefore, we can tell that in such a case the variant of interest is de novo, without needing to know the other parent's genotype.
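To make the core idea concrete, here is a minimal toy sketch in Python. It is not duoNovo's actual implementation or API (duoNovo itself is an R package); the function name `classify_candidate`, the dictionary representation of a haplotype, and the thresholds `max_mismatch_rate` and `min_sites` are all hypothetical choices made for illustration. The sketch assumes haplotypes have already been reconstructed for one haplotype block, and that a small mismatch allowance can stand in for sequencing and phasing noise.

```python
from typing import Dict, Tuple

# Toy model (an assumption, not duoNovo's data format): one haplotype within a
# phased block is a mapping from genomic position to the allele it carries.
Haplotype = Dict[int, str]


def mismatch_count(hap_a: Haplotype, hap_b: Haplotype, skip_pos: int) -> Tuple[int, int]:
    """Count allele mismatches over sites phased in both haplotypes,
    ignoring the candidate position itself. Returns (mismatches, sites compared)."""
    shared = (set(hap_a) & set(hap_b)) - {skip_pos}
    mismatches = sum(hap_a[p] != hap_b[p] for p in shared)
    return mismatches, len(shared)


def classify_candidate(
    proband_haps: Tuple[Haplotype, Haplotype],
    parent_haps: Tuple[Haplotype, Haplotype],
    pos: int,
    alt: str,
    max_mismatch_rate: float = 0.0,  # hypothetical tolerance for noise
    min_sites: int = 3,              # hypothetical minimum evidence
) -> str:
    """Return 'de_novo', 'not_de_novo', or 'uncertain' for one candidate variant."""
    # The candidate must sit on exactly one proband haplotype.
    carriers = [h for h in proband_haps if h.get(pos) == alt]
    if len(carriers) != 1:
        return "uncertain"
    carrier = carriers[0]

    # If the sequenced parent carries the allele, it cannot be called de novo.
    if any(h.get(pos) == alt for h in parent_haps):
        return "not_de_novo"

    # Look for a parental haplotype that matches the carrier haplotype
    # everywhere except at the candidate position.
    for parent_hap in parent_haps:
        if pos not in parent_hap:
            continue  # the parent must be genotyped at the candidate site
        mismatches, n_sites = mismatch_count(carrier, parent_hap, pos)
        if n_sites >= min_sites and mismatches / n_sites <= max_mismatch_rate:
            return "de_novo"

    # The carrier haplotype was probably inherited from the unsequenced parent.
    return "uncertain"


# Toy data: the proband's first haplotype matches the parent's first haplotype
# at every site except position 300, where only the proband carries "T".
parent = (
    {100: "A", 200: "G", 300: "A", 400: "C", 500: "T"},
    {100: "C", 200: "T", 300: "A", 400: "T", 500: "G"},
)
proband = (
    {100: "A", 200: "G", 300: "T", 400: "C", 500: "T"},
    {100: "G", 200: "A", 300: "A", 400: "G", 500: "A"},
)
print(classify_candidate(proband, parent, pos=300, alt="T"))  # de_novo
```

Note the third outcome: when the variant sits on the haplotype that does not match the available parent, the sketch returns "uncertain" rather than guessing, because that haplotype presumably came from the parent who was not sequenced.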
What we found:
We showed that this idea works very well in practice, and generally has a very low error rate. Importantly, duoNovo has near-perfect accuracy among variants absent from gnomAD, which are enriched for rare pathogenic alleles and are thus the most relevant in practice. We extensively benchmarked duoNovo, showing that it performs as expected in a range of scenarios. One issue related to its practical application is the fact that its sensitivity is biologically constrained; a large fraction of de novo variants are transmitted from the father's germline, whereas most single-parent families are single-mother families. Again, however, long-read sequencing comes to the (partial) rescue. We showed that we can use siblings, when available, as surrogates for the missing father, significantly boosting the sensitivity of duoNovo from mother-proband duos by leveraging paternal haplotypes shared across siblings. Finally, we showed duoNovo's ability to lead to new diagnoses by applying it to a cohort of undiagnosed duos. Among 63 duos, duoNovo led to a diagnosis in two. It is worth noting that these were duos that had already undergone extensive prior testing and manual scrutiny. The new diagnoses found are thus a testament to duoNovo's ability to yield information critical for diagnostic purposes in challenging cases.
Limitations:
duoNovo requires long-read sequencing, which is currently more expensive than short-read sequencing and not available as a clinical test. As mentioned, its sensitivity is biologically constrained; the majority of de novo variants are transmitted from the father's germline, whereas most single-parent families are single-mother families. More broadly, performance will depend on factors that affect haplotype reconstruction (such as haplotype block length and variant density).
Takeaway and call to action:
duoNovo shows that long reads are useful not only for detecting variants that are undetectable with short reads, but also for interpreting variants that short reads detect just as well yet are hard to classify as pathogenic or not because their de novo status is uncertain. It promises to increase the diagnostic yield of genetic testing for single-parent families, making genetic testing more equitable. duoNovo is available as an R package, and we hope it will prove broadly useful. We would love to receive feedback from the community! Send us your real-world use cases, edge cases, and failure modes to guide further improvements.
TL;DR:
While it is clear that de novo variants cause many rare genetic disorders, their detection currently requires full trio sequencing. duoNovo uses haplotypes derived from long-read sequencing to identify de novo variants using only one biological parent, with a very low error rate, improving access to the most accurate testing for more patients.