Press Release
Johns Hopkins APL Cracks the Viral Vernacular With Machine Learning for Language
Some people can finish each other’s sentences. Scientists at the Johns Hopkins Applied Physics Laboratory (APL) in Laurel, Maryland, believe a similar kind of prediction may be possible for emerging viral strains.
APL data scientist Daniel Berman and virologist Jared Evans, both in the Lab’s Research and Exploratory Development Department, are applying machine-learning techniques that have been used to crack deep problems in language to the effort of defending against new flu strains before they emerge. Think predictive text, but for viruses instead of words.
“Genomic data has the ability to connect an organism’s genome sequence with its ecological behavior,” explained Berman, who, along with Evans, also teamed with former APL bioinformaticians Craig Howser and Tom Mehoke. “Much like words represent different concepts in language, different patterns within genomic sequences can represent different organism behaviors.
“In the case of influenza, knowing the genome sequence of a strain allows us to defend against that strain by creating a vaccine that targets those ecological behaviors.”
Such techniques have seldom been employed in biology, but Berman and his colleagues believe influenza is particularly amenable to them, because the global surveillance efforts that track its evolution generate high-resolution genomic datasets for analysis.
“One of the hardest parts about applying these methods to biology is that, with language, a sentence might contain a hundred or so characters — but a single gene within a genome can contain thousands of base pairs,” Howser said. “To represent all of that information mathematically in a meaningful way is challenging.”
Their machine-learning model, called MutaGAN (from “mutagen” and “generative adversarial network”), aims to reduce the dimensionality of complex information without sacrificing too much meaning. Think of a three-dimensional sphere reduced to a flat circle, or a sequence of events in time depicted as points along a humble line segment. If successful, the model would analyze existing viral strains and predict new ones with a much higher degree of precision than is currently possible.
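To give a sense of scale for the representation problem Howser describes, here is a minimal sketch of one common way to turn a genetic sequence into numbers, known as one-hot encoding. It is an illustrative baseline only and is not a description of how MutaGAN represents its input; the function names `one_hot` and `encoded_size` are hypothetical. The two lengths used are the approximate figures Berman cites for an influenza gene and a coronavirus genome.

```python
# Illustrative only: one-hot encoding is a common baseline for sequence data.
# It is an assumption for this sketch, not MutaGAN's documented input format.
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Return one 4-element row per base, with a 1 in the column for that base."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in NUCLEOTIDES:  # ambiguous bases such as N stay all-zero
            row[NUCLEOTIDES.index(base)] = 1
        rows.append(row)
    return rows


def encoded_size(length):
    """Count the values needed to one-hot encode a sequence of this length."""
    return length * len(NUCLEOTIDES)


print(one_hot("ACGT"))
print(encoded_size(1700))   # a gene of a little over 1,700 base pairs
print(encoded_size(30000))  # a genome of around 30,000 base pairs
```

Even in this simple scheme, a single gene becomes thousands of numbers, which is why reducing that representation to something compact without losing its meaning is a central part of the problem.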
“The people who are currently doing the best job at prediction can look at the problem in terms of clusters of possible viruses,” Howser said. “And they can do that with decent accuracy, but you can’t use that approach to predict at the single-nucleotide resolution, on the level of specific genetic changes. What they do is like predicting the general movement of the stock market versus predicting what will happen with a single stock.”
Looking forward, the team hopes to improve the quality and quantity of data they’re able to feed into MutaGAN, thereby improving the quality of predictions the model can make. Such precision would make it possible to create a better, more specific flu vaccine each year, targeted to the most prevalent strains.
“The result would be far fewer deaths and hospitalizations,” Berman said.
While the project began years before the current COVID‑19 pandemic, Berman hopes that, in the long term, MutaGAN can be applied to predicting coronavirus mutations. But significant challenges remain.
“The sequences for COVID are roughly 20 times longer than what we had been working with for influenza — instead of looking at a gene of a little over 1,700 base pairs, we’re dealing with an entire genome that has around 30,000,” Berman said. “Hopefully that’s something we’ll be able to solve as we improve our techniques.”