Advances include faster and more robust protein structure prediction, higher-quality models of protein/protein interactions, and support for large-scale screening and improved de novo protein design for therapeutics
OpenFold, a non-profit artificial intelligence (AI) research consortium, announced the release of two new tools: 1) SoloSeq, which integrates a new protein Large Language Model (LLM) with its OpenFold structure prediction software, and 2) OpenFold-Multimer, which creates higher-quality models of protein/protein complexes than OpenFold alone. The new SoloSeq model, built on Amazon Web Services (AWS), is the first fully open source integrated protein LLM/structure prediction AI tool, and the first such model released with its training code, which other organizations can use to fine-tune it or to train new models on their own proprietary data. This will enable new science that is not possible with existing closed-source models. The new Multimer code is the first fully open source training code system for generating protein/protein structures. The work was led by Prof. Mohammed AlQuraishi at Columbia University with Sachin Kadyan, Kevin Zhu, Christina Floristean, Dingquan Yu, Gustaf Ahdritz and Jennifer Wei.
“OpenFold-Multimer and SoloSeq are particularly useful for designed proteins that don’t exist in nature. These are the tools that we need to cure diseases,” said Brian Weitzner, Ph.D., Director of Computational and Structural Biology at Outpace and co-founder of OpenFold. “OpenFold’s commitment to open science includes releasing training code and data sets, making these tools as accessible as possible to the community in order to accelerate scientific advances and facilitate further improvements to these powerful tools.”
The first new model, SoloSeq, eliminates the need for a separate pre-computation step before running OpenFold, making the calculation on average more than 10x faster with nearly identical accuracy. Standard OpenFold and AlphaFold2 (AF2) require a Multiple Sequence Alignment (MSA) step, in which databases of known protein sequences are searched for sequences similar to the input; the resulting alignment is effectively a summary of information about related proteins already identified in nature. In SoloSeq, the LLM has already analyzed most known protein sequences, so it can produce that evolutionary summary nearly instantaneously. This is much like ChatGPT’s ability to “write” new text from a prompt by drawing on similar patterns in its enormous training corpus.
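To make the substitution concrete, here is a minimal sketch (Python, using Meta AI’s publicly released fair-esm package; SoloSeq’s own LLM and feature pipeline may differ) of how a single forward pass through a protein LLM produces per-residue embeddings that can stand in for MSA-derived features:

```python
import torch
import esm  # pip install fair-esm

# Load a pretrained protein language model (ESM-2, 650M parameters).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# A single query sequence -- no database search, no MSA construction.
data = [("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Per-residue embeddings from the final layer; these summarize the
# evolutionary signal that an MSA step would otherwise have to supply.
embeddings = out["representations"][33]  # shape: (1, seq_len + 2, 1280)
```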
SoloSeq’s new LLM-integrated architecture gives it a number of advantages over OpenFold and AF2, taking full advantage of the progress in Transformer-based language models since Google Brain’s 2017 “Attention Is All You Need” paper:
- Over 4x faster on natural protein inputs, with slightly lower accuracy; very useful for large-scale screens where speed is critical.
- Accepts non-natural proteins as input, such as artificially designed de novo proteins from systems like ProteinMPNN or RFDiffusion from the Baker lab at the University of Washington Institute for Protein Design. These are not handled well by MSA-based systems such as AF2 (see the sketch after this list).
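The screening and de novo points go together: designed sequences have few or no natural homologs, so an MSA search returns little, while an LLM embedding is still well defined and throughput is bounded by GPU forward passes rather than database lookups. The sketch below illustrates this (Python; `fold_from_embedding` is a hypothetical placeholder for the structure-prediction call, not an actual OpenFold API):

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def fold_from_embedding(emb: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the SoloSeq structure module; a real call
    # would return predicted atomic coordinates for one sequence.
    return torch.zeros(emb.shape[0], 3)

# De novo designed candidates: no homolog search is needed, so a large
# screen is just a batch of forward passes.
candidates = [
    ("design_001", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("design_002", "GSHMLEDPVELTDENVKAKIQDKEGIPPDQQRL"),
]
_, _, tokens = batch_converter(candidates)
with torch.no_grad():
    reps = model(tokens, repr_layers=[33])["representations"][33]

for (name, seq), emb in zip(candidates, reps):
    coords = fold_from_embedding(emb[1:len(seq) + 1])  # strip BOS/EOS positions
    print(name, coords.shape)
```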
Previous work on protein LLMs, primarily out of Meta AI, has produced a series of valuable standalone LLMs and the ESMFold structure model, which pairs ESM-family LLMs with OpenFold for structure prediction. SoloSeq is differentiated from this prior work in several ways:
- The first LLM-based structure predictor to enable template-based modeling
- The entire set of code and weights is open source, along with, for the first time, training code that enables any organization to fine-tune the models on its own proprietary data or to retrain an entirely new set of weights (a minimal sketch follows this list).
- Detailed evaluation of training dynamics, convergence as a function of the amount of training, and other model behavior
- Experiments designed to “poke” the in-model weights for protein energy estimation suggest that SoloSeq has learned an extremely robust approximation of the implicit protein energy function.
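As a rough illustration of what released training code makes possible, here is a deliberately minimal PyTorch fine-tuning skeleton. Everything named in it is an illustrative placeholder: OpenFold’s actual training code has its own entry points, data pipeline, and structure-specific losses (such as FAPE).

```python
import torch

class ToyStructureHead(torch.nn.Module):
    """Hypothetical stand-in for a structure module fed by LLM embeddings."""
    def __init__(self, emb_dim: int = 1280):
        super().__init__()
        self.proj = torch.nn.Linear(emb_dim, 3)  # per-residue xyz coordinates

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.proj(emb)

model = ToyStructureHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()  # stand-in for structure losses such as FAPE

# Synthetic "proprietary" batch: per-residue embeddings plus reference
# coordinates (8 proteins, 120 residues each).
emb = torch.randn(8, 120, 1280)
coords = torch.randn(8, 120, 3)

for step in range(10):  # the fine-tuning loop itself
    optimizer.zero_grad()
    loss = loss_fn(model(emb), coords)
    loss.backward()
    optimizer.step()
```

The point of the release is that this loop, normally hidden behind inference-only distributions, is now something any organization can run against the real model and its own structures.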
The second new model, OpenFold-Multimer, is the first fully open source protein/protein complex modeling toolkit to be released with its training code included. It follows DeepMind’s important AF2-Multimer work, which showed that fully retraining a multimer-specific model can achieve better structural accuracy than the single-chain model alone. The new OF-Multimer code now enables users not just to create new structures, but also to retrain or fine-tune on proprietary data.
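Mechanically, a complex prediction starts with the chains of the target supplied together. The short sketch below (Python) writes a two-chain input in the multi-sequence FASTA format conventional for AF2-style multimer pipelines; chain names and sequences are illustrative, and OpenFold-Multimer’s exact input handling may differ.

```python
# Two chains of a hypothetical complex, written as a multi-sequence FASTA --
# the conventional input format for AF2-style multimer pipelines.
chains = {
    "target_chainA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "target_chainB": "GSHMLEDPVELTDENVKAKIQDKEGIPPDQQRL",
}
with open("complex.fasta", "w") as handle:
    for name, sequence in chains.items():
        handle.write(f">{name}\n{sequence}\n")
```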
Yih-En Andrew Ban, Ph.D., COO at Arzeda and co-founder of OpenFold said, “The fully open source nature of OpenFold-Multimer and SoloSeq continues OpenFold’s commitment to bringing deep learning capabilities to the entire life science community. The ability for both industry and academia to train and fine-tune custom models on their own data sets, as opposed to only running predictions, is a much needed feature that ensures everyone can make the most of these architectures as they are used to bring life science innovations into the world.”
Christina Taylor, Ph.D., Computational Molecular Design Lead at Bayer Crop Science, said, “OpenFold-Multimer and SoloSeq, key additions to the OpenFold Consortium Portfolio, deliver the first open-source protein-complex modeling and LLM-based protein folding tools. The open-source models will enable continued high-speed innovation in BioAI and revolutionary product development in the pharmaceutical and agriculture industries.”
“The release of OpenFold-SoloSeq will enable new workflows unaddressed by AF2 or ESMFold, in particular the ability to rapidly and more reliably compute the relative fitness and ranking of decoys to protein sequences,” added Professor AlQuraishi. “Additionally, OpenFold-Multimer will enable training of new OpenFold variants specifically targeting multimeric complexes and assemblies.”
SOURCE: BusinessWire