Computer Science

Why the Best Cures are Hidden in Mud

Structure-aware search allows AI to mine billions of unknown proteins to find specific functions, bypassing the limits of traditional sequence alignment.

Sankalp
Sankalp
Engineering Lead
Why the Best Cures are Hidden in Mud

The vast majority of life on Earth is invisible. In every handful of soil, every drop of ocean water, and every swab of the human gut, there are billions of microbes locked in a constant, unseen arms race. To survive, these microbes have evolved a staggering array of chemical weapons, shields, and tools. They produce enzymes that can break down complex toxins, proteins that can capture heavy metals, and antibiotics that can kill competing bacteria. This microbial "dark matter" contains the answers to many of our most pressing medical and industrial problems.

For decades, the challenge wasn't finding the data, it was reading it. Through a process called metagenomics, scientists can scoop up a bucket of mud, extract all the DNA, and sequence it. This produces a massive, chaotic database of trillions of genetic letters belonging to thousands of unknown species. But having the code is useless if you don't know what it does. If a researcher wants to find a new enzyme that breaks down plastic, they can't just type "plastic-eating protein" into a database. They have to rely on sequence alignment tools to find something that looks similar to the few plastic-eating enzymes we already know.

The problem is that evolution is wildly creative. Two organisms might evolve completely different genetic sequences to solve the exact same problem. Traditional tools like BLAST (Basic Local Alignment Search Tool) work by comparing the "spelling" of the sequences. If a novel protein in the mud shares less than 30% of its letters with a known protein in the database, BLAST is mathematically blind to it. We have been trying to search the library of life using a spell-checker, missing millions of profound discoveries simply because nature used a different vocabulary.

Imagine possessing the genetic code for a revolutionary new antibiotic hidden within a metagenomic sample of deep-sea mud. Because the sequence shares almost zero similarity with penicillin or any other known drug, traditional search tools report it as "unknown function." The AI, however, ignores the spelling entirely. It maps the protein into a "Biological Latent Space" and finds that its three-dimensional shape matches a known toxin binder perfectly. The "Metagenomic Haystack" is solved not by searching for the needle's name, but by searching for its shape.

The axiom of molecular biology is that "structure dictates function." Even if the amino acid sequence diverges wildly over millions of years of evolution, the physical 3D shape of the protein must remain stable to perform its job. If we can search by shape rather than sequence, we can find these distant evolutionary cousins.

This was historically impossible because predicting and comparing 3D shapes was too computationally slow. The release of the AlphaFold database, containing 200 million predicted structures, provided the map, but researchers still needed a search engine fast enough to query it. Van Kempen et al. (2023) solved this with the introduction of Foldseek. Foldseek translates the complex 3D architecture of a protein into a 1D sequence of "geometric tokens." It basically assigns an alphabet to the angles and distances between atoms.

By comparing these structural tokens, Foldseek is 10,000 times faster than previous 3D alignment tools. More importantly, it successfully identifies distant evolutionary relationships that sequence-based tools completely miss. It allows researchers to throw an unknown protein from the mud against the entire known universe of biology and ask, "Have we ever seen anything that folds like this?"

The Biological Latent Space

As an architectural observation, when AI models process millions of these sequences and structures, they create a mathematical map known as a Latent Space. In this high-dimensional map, every protein is assigned a coordinate where the "distance" between two points represents functional similarity rather than sequence identity. Proteins that carry oxygen cluster in one corner; proteins that cut DNA cluster in another.

Functional Search is the ability to navigate this space. If a chemical company has an industrial enzyme that works but breaks down at high temperatures, they can take its coordinate in the latent space and ask the AI to "search the neighborhood" for proteins that belong to extremophile bacteria (organisms that live in boiling vents). The AI retrieves candidates that perform the exact same function but have evolved the necessary physical stability. This shift from keyword matching to coordinate navigation is what allows us to find the "unfindable" in the tree of life.

The Functional Divergence Trap

The primary failure mode of this approach is "Functional Divergence." While structure is a much better proxy for function than sequence, it is not infallible. A structure-aware search might identify two proteins that look perfectly identical in their overall 3D scaffolding. The AI flags them as functionally identical.

However, in the physical world, they perform opposite chemical tasks. This occurs because the function of an enzyme often relies on just two or three specific atoms deep inside a microscopic "active site" pocket. If evolution changed those three specific atoms, the entire chemistry of the protein flips, even if the overall 3D shape remains identical. The AI, looking at the macro-structure in the latent space, misses the microscopic deviation that defines the chemistry.

Functional search has given us the tools to mine the vast, unlabelled wilderness of biological dark matter. We are no longer limited by the vocabulary of the proteins we already know. But as we search by shape, we must remember that biology is a science of microscopic exceptions. A perfect structural match is a profound clue, but the final truth of its function can only be proven in the wet chemistry of a lab.

Insight

Van Kempen et al. (2023) introduced Foldseek, demonstrating that searching by 3D structural embeddings is 10,000x faster than traditional tools and detects distant evolutionary relationships.

Frequently Asked Questions

How is functional search different from a Google search?+
A standard search looks for keywords or exact matches. Functional search looks for 'meaning' and 'capability.' It can find two proteins that look completely different but perform the exact same job in a cell.
What can you find with functional search?+
Researchers use it to find enzymes that can break down plastic, proteins that can act as sensors for toxins, or new antibiotic candidates hidden in the genomes of obscure bacteria.

Join the EulerFold community

Track progress and collaborate on roadmaps with students worldwide.

🐢

Discussion

0

Join the discussion

Sign in to share your thoughts and technical insights.

Loading insights...

Recommended Readings

The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.

Technical explainers on AI, research, and modern engineering.

Follow us