zhaowusoft inc.
The largest AI model in the field of biology has been released, capable of writing DNA on demand
August 5, 2024

On February 19, the AI biology model Evo 2, developed by researchers from the U.S. Arc Institute, chip manufacturer NVIDIA, Stanford University, and other institutions, was officially released. The model is available online to researchers worldwide, who can also download its source code, training data, and parameters for free.

According to a public announcement from the U.S. Arc Institute, Evo 2 builds on its predecessor, Evo 1, and is now the largest AI model in the field of biology. Evo 1 was trained on sequences from 80,000 bacterial and archaeal genomes and viruses, while Evo 2 was trained on more than 128,000 genomes comprising 9.3 trillion nucleotides. These models enable machines to "read, write, and think" in the language of nucleotides.

According to Nature, scientists have in recent years developed increasingly powerful "protein language models," such as ESM-3, built by former researchers at Meta, the American internet company. These models, trained on millions of protein sequences, have been used to predict protein structures and to design novel proteins, including gene-editing tools and fluorescent molecules.

Unlike these models, Evo 2 was trained not only on the "coding sequences" that direct protein synthesis but also on non-coding DNA, which regulates when and where genes are active.

Eukaryotic genomes are generally longer and more complex than prokaryotic ones. Within a gene, coding segments are interspersed with non-coding ones, and non-coding regulatory DNA may lie far from the genes it regulates. To handle this complexity, Evo 2 is designed to learn DNA sequence patterns across spans of up to one million bases.

To test the model's ability to analyze complex genomes, Patrick Hsu's team at the U.S. Arc Institute used Evo 2 to predict the effects of known mutations in BRCA1, a gene linked to breast cancer. In these tests, Evo 2 was more than 90% accurate in predicting which mutations were benign and which were potentially pathogenic.

“In assessing whether variations in the coding region are pathogenic, it performs close to the best biological AI models, reaching top-tier levels,” Hsu stated. He added that Evo 2 could help make sense of difficult-to-interpret variants in patients' genomes.
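For readers unfamiliar with the metric, the accuracy figure can be read as the share of variants whose predicted class matches the known class. The short Python sketch below illustrates only that calculation. The function name `accuracy` and the ten labels are invented for illustration; they are not BRCA1 data and are not part of Evo 2 or its evaluation code.

```python
def accuracy(predicted, known):
    """Return the fraction of variants whose predicted label equals the known label."""
    if len(predicted) != len(known):
        raise ValueError("label lists must have the same length")
    if not known:
        raise ValueError("need at least one labelled variant")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)


# Made-up labels for ten hypothetical variants (NOT real BRCA1 results).
known = ["benign", "pathogenic", "benign", "benign", "pathogenic",
         "pathogenic", "benign", "pathogenic", "benign", "benign"]
predicted = ["benign", "pathogenic", "benign", "pathogenic", "pathogenic",
             "pathogenic", "benign", "pathogenic", "benign", "benign"]

if __name__ == "__main__":
    # Nine of the ten toy predictions agree with the known labels.
    print(f"accuracy = {accuracy(predicted, known):.0%}")
```

In this toy set, one benign variant is misclassified as pathogenic, so nine of ten predictions match.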

Additionally, the model can be used to design new biological tools or therapies, saving considerable time and research funds typically spent on cell or animal experiments, and accelerating drug development by identifying the genetic causes of human diseases.

Yunha Wang, a computational biologist at Tatta Bio, a U.S.-based biotech company, believes that Evo 2 might excel at applying patterns learned from bacterial and archaeal genomes to the design of new human proteins.

"AI tools like protein language models have sparked a revolution in biological design," said Brian Hie, a computational biologist at Stanford University. He and his colleagues hope to use AI models to map entire cells, and they look forward to breakthroughs with genome models like Evo 2.

The public announcement also noted that, because of potential ethical and safety risks, the researchers excluded pathogens that infect humans and other complex organisms from the Evo 2 training dataset and ensured that the model does not return useful answers to queries about these pathogens.


Source: https://news.sciencenet.cn//htmlnews/2025/2/539061.shtm