[Introduction] This three-year-old startup has, for the first time, used a deep learning language model to synthesize new proteins that do not exist in nature, including enzymes resembling those found in egg white, igniting fresh excitement in protein design.
The application of artificial intelligence has greatly accelerated research in protein engineering.
Recently, a fledgling startup in Berkeley, Calif., made striking progress once again.
Scientists used ProGen, a protein language model built on deep learning principles similar to ChatGPT's, to design functional artificial proteins with AI for the first time.
Paper address: https://www.nature.com/articles/s41587-022-01618-2
This experiment also shows that natural language processing, although developed for reading and writing text, can learn some of the fundamental principles of biology.
Technology comparable to the Nobel Prize
The researchers say the new technique could become even more powerful than directed evolution, the Nobel Prize-winning protein design technique.
"It will energize the 50-year-old field of protein engineering by accelerating the development of new proteins that can be used for nearly everything from therapeutics to degrading plastics."
The company, Profluent, founded by a former Salesforce AI research chief, has secured $9 million in seed funding to build an integrated wet lab and to recruit machine learning scientists and biologists. Mining proteins from nature, or adjusting proteins to perform a desired function, used to be painstaking work. Profluent's goal is to make this process effortless, and it has now shown that it can.
Ali Madani, founder and CEO of Profluent, said in an interview that the company has designed several families of proteins. These proteins function as well as exemplar natural proteins and are highly active enzymes. What makes the task so difficult is that it was done in a zero-shot fashion, meaning no multiple rounds of optimization were performed and no wet-lab data were provided. Yet the final designs are highly active proteins of a kind that would normally take hundreds of years to evolve.
ProGen, based on a language model
As a kind of deep neural network, a conditional language model can not only generate semantically and grammatically correct, novel, and diverse natural-language text, but can also use input control tags to steer properties such as style and topic. Following the same idea, the researchers developed today's protagonist, ProGen, a 1.2-billion-parameter conditional protein language model. Specifically, ProGen is built on the Transformer architecture, models interactions between residues through a self-attention mechanism, and can generate diverse artificial protein sequences across protein families according to the input control tags.
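To make the architecture concrete, here is a minimal sketch, in PyTorch, of what a conditional protein language model of this kind looks like: control tags are prepended to the amino-acid tokens, and a causal Transformer decoder predicts the next residue. The tag names, vocabulary, and model dimensions are illustrative assumptions, not values taken from the ProGen paper.

```python
# A minimal sketch (not the actual ProGen code) of a conditional protein
# language model: control tags (e.g. a protein-family label) are prepended
# to the amino-acid tokens, and a causal Transformer decoder predicts the
# next token. The tags, vocabulary, and dimensions are illustrative.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CONTROL_TAGS = ["<lysozyme_family_1>", "<lysozyme_family_2>"]  # hypothetical tags
vocab = {tok: i for i, tok in enumerate(list(AMINO_ACIDS) + CONTROL_TAGS + ["<bos>", "<eos>"])}

class ConditionalProteinLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids; a causal mask enforces left-to-right attention
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)  # next-token logits

@torch.no_grad()
def generate(model, control_tag, max_new_tokens=300, temperature=1.0):
    """Autoregressively sample a protein sequence conditioned on a control tag."""
    ids = torch.tensor([[vocab[control_tag], vocab["<bos>"]]])
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        if next_id.item() == vocab["<eos>"]:
            break
        ids = torch.cat([ids, next_id], dim=1)
    inv = {i: t for t, i in vocab.items()}
    return "".join(inv[i] for i in ids[0, 2:].tolist())  # drop the tag and <bos>

model = ConditionalProteinLM(len(vocab))
model.eval()
print(generate(model, "<lysozyme_family_1>"))  # untrained model: random residues
```

The real model is of course vastly larger (1.2 billion parameters) and trained on hundreds of millions of sequences; the point of the sketch is only to show how conditioning on a control tag slots into ordinary autoregressive generation.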
Generating artificial proteins with conditional language models
To create the model, the researchers fed it the amino acid sequences of 280 million different proteins and let it "digest" them over a period of several weeks. They then fine-tuned the model with an additional 56,000 sequences from five lysozyme families, together with contextual information about those proteins. ProGen's algorithm is similar to GPT-3.5, the model behind ChatGPT: it learns the ordering of amino acids in proteins and its relationship to protein structure and function. In no time, the model was generating a million sequences. The researchers selected 100 to test, based on how similar they were to natural protein sequences and how natural their amino-acid "syntax" and "semantics" appeared. Of these, 66 produced chemical reactions similar to those of the natural proteins that destroy bacteria in egg whites and saliva. In other words, these new AI-generated proteins can also kill bacteria.
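As a rough illustration of this generate-then-filter step (a simplification for this article, not the authors' released pipeline), the sketch below reuses the hypothetical model, generate, and vocab from the previous snippet: it samples many candidate sequences and keeps those the model itself scores as most "natural", using the average per-residue log-likelihood as a stand-in ranking signal.

```python
# A simplified generate-then-filter loop (illustration only), reusing the
# hypothetical `model`, `generate`, `vocab`, and `AMINO_ACIDS` defined above.
# Candidates are ranked by the model's own average per-residue log-likelihood,
# a rough proxy for how natural the amino-acid "syntax" looks to the model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_log_likelihood(model, control_tag, sequence):
    """Average log p(residue | previous residues, control tag) under the model."""
    ids = [vocab[control_tag], vocab["<bos>"]] + [vocab[aa] for aa in sequence]
    tokens = torch.tensor([ids])
    logits = model(tokens)[:, :-1, :]       # predictions for positions 1..n
    targets = tokens[:, 1:]                 # the tokens those positions should be
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, 1:].mean().item()  # skip the <bos> prediction step

def generate_and_rank(model, control_tag, n_candidates=1000, n_keep=100):
    candidates = [generate(model, control_tag) for _ in range(n_candidates)]
    # Drop degenerate samples that emitted special tokens (common for an untrained model).
    clean = [s for s in candidates if s and all(ch in AMINO_ACIDS for ch in s)]
    scored = [(mean_log_likelihood(model, control_tag, s), s) for s in clean]
    scored.sort(reverse=True)               # highest likelihood first
    return [s for _, s in scored[:n_keep]]

top_sequences = generate_and_rank(model, "<lysozyme_family_1>", n_candidates=50, n_keep=5)
```

In the actual study the candidates were also screened for similarity to natural sequences before 100 of them were synthesized and tested in the wet lab.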
Applicability of conditional language models to other protein systems
Moreover, for a highly evolved natural protein, even a single small mutation can be enough to make it stop working. Yet in another round of screening, the researchers found that AI-generated enzymes sharing as little as 31.4 percent sequence identity with any known protein still showed comparable activity and similar structures.
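Sequence identity, as used above, is simply the fraction of matching residues between two aligned sequences. The snippet below is a simplified illustration that assumes the two sequences have already been aligned to equal length (real comparisons would first compute a pairwise alignment); the example sequences are made up.

```python
# A simplified percent-identity calculation between two already-aligned
# protein sequences. Real analyses would run a pairwise alignment first;
# the fragments below are hypothetical and only illustrate the arithmetic.
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions with identical residues over the aligned length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    matches = sum(a == b and a != "-" for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

generated = "MKALIV-LLAAGSTQA"   # hypothetical aligned AI-generated fragment
natural   = "MKAL-VALLTAGSAQA"   # hypothetical aligned natural fragment
print(f"{percent_identity(generated, natural):.1f}% identity")
```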
Protein Design, Entering a New Era
As you can see, ProGen works in a similar way to ChatGPT. ChatGPT can pass MBA and bar exams and write college papers by learning from massive amounts of data, while ProGen learned how to generate new proteins by learning the grammar of how amino acids combine to form the 280 million existing proteins.
In the interview, Madani said, "Just as ChatGPT learns English and other human languages, we are learning the language of biology and proteins."
"The performance of the artificially designed proteins is much better than that of proteins inspired by evolution," said James Fraser, one of the authors of the paper and professor of bioengineering and therapeutic sciences at the University of California, San Francisco School of Pharmacy. "The language model is learning all aspects of evolution, but it is different from the normal evolutionary process. We now have the ability to tune the generation of these characteristics to achieve specific effects: for example, making an enzyme incredibly thermostable, having it prefer acidic environments, or keeping it from interacting with other proteins."
As early as 2020, Salesforce Research developed ProGen. It is based on natural language processing techniques originally used to generate English text. From that earlier work, the researchers learned that AI systems can teach themselves grammar, the meanings of words, and the other basic rules that keep writing well ordered.
"When you train sequence-based models on a large amount of data, they become very good at learning structure and rules," said Dr. Nikhil Naik, director of AI research at Salesforce Research and senior author of the paper. "They learn which words can co-occur and how they can be combined."
"Now we have proved that ProGen can generate new proteins, and we have made our work public, so everyone can build on it."
Although lysozyme is a very small protein, it can still contain up to about 300 amino acids. With 20 possible amino acids at each position, that gives 20^300 possible combinations: more than the number of humans who have ever lived, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe. Given these near-infinite possibilities, it is remarkable how readily ProGen can design effective enzymes.
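For a sense of scale, a quick back-of-the-envelope check is shown below; the population, sand-grain, and atom counts are commonly cited rough estimates rather than figures from the article, and only the orders of magnitude matter.

```python
# Back-of-the-envelope comparison of the lysozyme sequence space with the
# article's yardsticks. The estimates are commonly cited rough figures
# (not from the paper); only the orders of magnitude matter.
from math import log10

sequence_space = 300 * log10(20)   # log10 of 20**300, about 10**390
humans_ever    = log10(1e11)       # roughly 100 billion people ever born
sand_grains    = log10(7.5e18)     # rough estimate of grains of sand on Earth
atoms_universe = log10(1e80)       # atoms in the observable universe

yardstick = humans_ever + sand_grains + atoms_universe
print(f"20^300            ~ 10^{sequence_space:.0f}")
print(f"humans*sand*atoms ~ 10^{yardstick:.0f}")   # about 10^110, dwarfed by 10^390
```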
"The ability to generate functional proteins from scratch out of the box shows that we are entering a new era of protein design," said Dr. Ali Madani, founder of Profluent Bio and former research scientist at Salesforce Research. "This is a versatile new tool available to all protein engineers, and we look forward to seeing it applied therapeutically." At the same time, researchers continue to improve ProGen, trying to break through more limitations and challenges. One of them is that it is very data dependent. "We've explored ways to improve sequence design by adding structure-based information," Naik said, "and we're also looking at how to improve model generation when you don't have a lot of data on a particular protein family or domain. " It’s worth noting that there are startups trying similar techniques, like Cradle, and Generate Biomedicines from biotech incubator Flagship Pioneering, though none of these studies have yet been peer-reviewed.
References:
https://endpts.com/exclusive-profluent-debuts-to-design-proteins-with-machine-learning-in-bid-to-move-past-ai-sprinkled-on-top/
https://www.newscientist.com/article/2356597-ai-has-designed-bacteria-killing-proteins-from-scratch-and-they-work/
https://www.sciencedaily.com/releases/2023/01/230126124330.htm
(Source: Xinzhiyuan, Biological Island, Frontiers of Life Sciences)