A tool for phylogenetic analysis on a pandemic scale


Phylogenetics is an analytical tool that rapidly analyzes genomic data to provide valuable information about the evolution and spread of a pathogen, thus enabling public health officials and governments to respond in a timely manner.

During the coronavirus disease 2019 (COVID-19) pandemic, phylogenetics, like many other pre-pandemic tools, proved inadequate for the massive scale of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequencing data deposited in online databases since 2020.

Study: Phylogenetics on a pandemic scale. Image Credit: majcot / Shutterstock.com

About the study

In a recent study posted to the bioRxiv* preprint server, researchers developed a phylogenetic software package that incorporates several pandemic-specific optimization and parallelization techniques. The package includes four programs: UShER, matOptimize, RIPPLES, and matUtils.

To construct a comprehensive SARS-CoV-2 phylogeny, SARS-CoV-2 genome sequence data were collected from large online databases such as the Global Initiative on Sharing All Influenza Data (GISAID) and GenBank. The GenBank sequence MN908947.3 was used as the reference for rooting the tree and for calling variants in individual samples. For the experiments, sampling-date metadata were used to derive two subtrees: a 100K-sample tree and a 1M-sample tree.

All experiments in the study were carried out on the Google Cloud Platform (GCP) for easy reproducibility. Because the phylogenetic package is memory efficient, CPU-optimized E2 instances could be used for it.

Memory-optimized instances were used instead for some competing tools, with an iso-cost comparison ensuring that the hourly cost stayed roughly the same for both instance types. Strong and weak scaling analyses were performed for UShER, matOptimize, and RIPPLES using the 1M-sample tree and e2-highcpu-32 instances, varying the number of instances from 2 to 32.
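To relate the instance counts to the vCPU counts quoted in the results, the short sketch below multiplies instances by vCPUs per instance. It assumes that each e2-highcpu-32 instance provides 32 vCPUs, as the machine-type name indicates, and that the instance count was doubled at each step, which the article does not state.

```python
# Assumption: an e2-highcpu-32 instance provides 32 vCPUs.
VCPUS_PER_INSTANCE = 32


def total_vcpus(num_instances: int) -> int:
    """Total vCPUs across a cluster of identical instances."""
    return num_instances * VCPUS_PER_INSTANCE


# Hypothetical doubling schedule between the stated bounds of 2 and 32.
instance_counts = [2, 4, 8, 16, 32]
vcpu_counts = [total_vcpus(n) for n in instance_counts]
print(vcpu_counts)  # [64, 128, 256, 512, 1024]
```

Under these assumptions, the largest configuration corresponds to the 1,024 vCPUs mentioned in the scaling results.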

Innovative optimizations performed in (A) UShER, (B) matOptimize and (C) RIPPLES for phylogenetic placement, tree optimization and recombination detection, respectively. The left part shows a representative illustration of previous approaches and the right part illustrates the approach used in the new tools.

UShER, matOptimize and RIPPLES performance results

The speedup analysis highlighted the magnitude of the improvement in runtime and peak memory that this phylogenetic package achieves over state-of-the-art tools. For phylogenetic placement, UShER achieved a 1,439-fold speedup and a 1,300-fold improvement in memory efficiency compared to IQ-TREE2, placing 1,000 new samples on the 100K-sample tree in just 15.4 seconds using 92 MB of RAM.
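Phylogenetic placement means attaching a new sample to an existing tree rather than rebuilding the tree from scratch. The sketch below illustrates one simple parsimony-style placement rule: attach the sample to the node whose accumulated mutations differ least from the sample's. It is a toy illustration under simplifying assumptions (mutations treated as opaque labels relative to the reference, no repeated mutations at a site), not UShER's implementation; the tree, the mutation labels, and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical toy tree: node -> (parent, mutations on the branch into node).
TREE: Dict[str, Tuple[Optional[str], FrozenSet[str]]] = {
    "root": (None, frozenset()),
    "n1": ("root", frozenset({"C241T", "A23403G"})),
    "n2": ("n1", frozenset({"G28881A"})),
    "n3": ("root", frozenset({"C8782T"})),
}


def genotype(node: str) -> FrozenSet[str]:
    """Mutations accumulated on the path from the root down to `node`."""
    muts = set()
    current: Optional[str] = node
    while current is not None:
        parent, branch_muts = TREE[current]
        muts |= branch_muts
        current = parent
    return frozenset(muts)


def placement_cost(sample: FrozenSet[str], node: str) -> int:
    """Mutations needed on a new branch from `node` to the sample.

    Counts mutations the sample has but the node lacks, plus mutations
    the node has that the sample would need to revert.
    """
    return len(sample ^ genotype(node))


def place(sample: FrozenSet[str]) -> Tuple[str, int]:
    """Return the (node, cost) pair with the lowest placement cost."""
    return min(((n, placement_cost(sample, n)) for n in TREE),
               key=lambda pair: pair[1])


new_sample = frozenset({"C241T", "A23403G", "G28881A", "T100C"})
print(place(new_sample))  # ('n2', 1)
```

Because only the candidate attachment points are scored and the rest of the tree is left untouched, this kind of placement avoids re-inferring the whole phylogeny for each new sample.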

For tree optimization, matOptimize completed its run in just over an hour, and its tree remained more parsimonious than the one TNT produced even after 24 hours. For recombination detection, placing a new sample on the 1M-sample tree using UShER and flagging it as recombinant using RIPPLES took 35.65 seconds on average, fast enough to monitor the virus for recombination in real time.
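To make "more parsimonious" concrete: a tree's parsimony score is the minimum number of mutations needed to explain the observed sequences, and a lower score is better. The sketch below computes that score for a single site with Fitch's algorithm on a small binary tree. It is a textbook illustration, not matOptimize's implementation; the nested-tuple tree encoding and the example trees are assumptions made for this example.

```python
from typing import Set, Tuple, Union

# A tree is either a leaf (the observed base) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree) -> Tuple[Set[str], int]:
    """Return (candidate states at this node, mutations required below it)."""
    if isinstance(tree, str):  # leaf: the observed base is the only state
        return {tree}, 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    shared = left_states & right_states
    if shared:  # children can agree on a state: no extra mutation
        return shared, left_cost + right_cost
    # children disagree: take the union and count one mutation
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree: Tree) -> int:
    """Minimum number of mutations at this site for the given topology."""
    return fitch(tree)[1]


# The same four observed bases arranged under two different topologies.
tree_a = (("A", "A"), ("G", "G"))
tree_b = (("A", "G"), ("A", "G"))
print(parsimony_score(tree_a), parsimony_score(tree_b))  # 1 2
```

Here `tree_a` explains the same data with fewer mutations than `tree_b`, so it is the more parsimonious topology; a tree optimizer searches for rearrangements that lower this score summed over all sites.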

When placing 100,000 new samples on the 1M-sample tree, UShER maintained a strong scaling efficiency of over 85% up to 512 vCPUs, after which it dropped to 72.6% at 1,024 vCPUs.

For matOptimize, strong scaling efficiency deteriorated rapidly with increasing parallelism. For example, with 1,024 vCPUs, matOptimize took only 11.5 minutes to complete, with the parallel search phase taking 7.5 minutes in total and less than 1.5 minutes per iteration.

The authors predict a sharp improvement in scaling efficiency as the tree grows. RIPPLES achieved a strong scaling efficiency of over 80%, the highest of any program, for the complete detection of recombinants in the 1M-sample tree at all levels of parallelism. In the weak scaling analysis, all tools showed a weak scaling efficiency greater than 70%.
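The two efficiency measures can be sketched as follows, using the standard definitions: strong scaling keeps the problem size fixed while adding vCPUs, and weak scaling grows the problem in proportion to the vCPUs. The function names and the runtimes are hypothetical and are not taken from the study.

```python
def strong_scaling_efficiency(base_vcpus: int, base_time: float,
                              vcpus: int, time: float) -> float:
    """Fixed problem size: ideal runtime shrinks in proportion to vCPUs."""
    speedup = base_time / time
    ideal_speedup = vcpus / base_vcpus
    return speedup / ideal_speedup


def weak_scaling_efficiency(base_time: float, time: float) -> float:
    """Problem size grows with vCPUs: ideal runtime stays constant."""
    return base_time / time


# Hypothetical runtimes in seconds, for illustration only.
strong = strong_scaling_efficiency(base_vcpus=64, base_time=1600.0,
                                   vcpus=1024, time=125.0)
weak = weak_scaling_efficiency(base_time=90.0, time=120.0)
print(f"{strong:.0%} {weak:.0%}")  # 80% 75%
```

In the strong scaling case, 16 times as many vCPUs yield a 12.8-fold speedup, giving 80% efficiency; perfect scaling would be 100% in both measures.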

Conclusion

The present study addressed the unmet needs imposed by the COVID-19 pandemic and developed a phylogenetic package for comprehensive phylogenetic analyses of SARS-CoV-2. Phylogenetics has been crucial for the genomic surveillance of SARS-CoV-2 and its variants, as well as for their identification and naming, thus supporting their potential relevance in epidemiological studies.

This tool therefore helps to estimate the reproduction number (R0) of SARS-CoV-2 or of a particular variant. Additionally, phylogenetics may establish transmission links between seemingly unrelated SARS-CoV-2 infections.

Of all the programs in the phylogenetic package, UShER and RIPPLES have shown the potential to allow individual research laboratories to incorporate their SARS-CoV-2 genomic sequences into a global phylogeny, uncover evidence of recombination from a massive search space, and obtain a response in real time. RIPPLES could also be used in a high-performance computing (HPC) environment to detect recombination events across the full SARS-CoV-2 phylogeny within hours. With matUtils, it was possible to quickly interrogate and visualize the massive SARS-CoV-2 phylogenies.

Taken together, these tools have shown the potential to empower the global scientific community to study the evolution and transmission of SARS-CoV-2 at extraordinary scale, resolution and speed.

*Important Notice

bioRxiv publishes preliminary scientific reports that are not peer reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

