Fine-tuning a protein model to find new antibiotics

The code and the ranked candidate list are on GitHub. This is the why, the method, and the one experiment that failed in a way worth keeping.

Antimicrobial resistance is projected to cause 10 million deaths a year by 2050. The pipeline of new antibiotics is thin because discovering them the normal way is slow and expensive. Antimicrobial peptides are one of the more interesting alternatives. They attack bacterial membranes directly, they evolve fast, and bacteria rarely build resistance to them. The catch is the search space. There are vastly more possible peptide sequences than anyone can test in a lab, so most candidates never get looked at.

This is exactly the kind of problem a protein language model should help with. Screen computationally, rank by probability, and hand the lab a short list instead of a haystack. I wanted to see how far I could get with one rented GPU and an afternoon, so I fine-tuned Meta's ESM-2 and pointed it at unlabeled bacterial sequences from NCBI.

The setup

ESM-2 is a transformer pretrained on 250 million protein sequences. It already understands the "grammar" of proteins, so the job is not to teach it biology from scratch. It is to add a small classifier on top and nudge the existing representation toward the antimicrobial question. That is what LoRA is for: train a tiny set of adapter weights instead of the whole 650M-parameter model.

ESM-2 (esm2_t33_650M_UR50D) ── pretrained on 250M proteins
    │  frozen. we don't retrain the backbone
    ▼
LoRA adapters ── r=16, alpha=32, on the query + value
    │  projections only. ~tiny fraction of params trained
    ▼
linear head ── 1280-dim embedding → class logits
    │
    ▼
HuggingFace Trainer, FP16
RunPod A6000 (48GB) · ~2 hours for 10 epochs on 80k peptides
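
To put a rough number on "tiny fraction", here is a back-of-envelope count. It is a sketch, not output from the training run. It assumes the published shape of esm2_t33_650M_UR50D (33 layers, hidden size 1280), square query and value projections, and the standard LoRA cost of r × (d_in + d_out) extra weights per adapted matrix. The 650M total is the nominal figure from the model name, and the linear head is left out.

```python
# Back-of-envelope LoRA parameter count for the setup described above.
# Assumptions (not taken from the training code): 33 transformer layers and
# hidden size 1280 for esm2_t33_650M_UR50D, square query/value projections,
# and a nominal 650M backbone parameters.

NUM_LAYERS = 33
HIDDEN = 1280
RANK = 16
ADAPTED_PER_LAYER = 2  # query and value projections
BACKBONE_PARAMS = 650_000_000  # nominal, from the model name


def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


per_matrix = lora_params_per_matrix(HIDDEN, HIDDEN, RANK)
lora_total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
fraction = lora_total / BACKBONE_PARAMS

print(f"per adapted matrix: {per_matrix:,}")
print(f"all adapters:       {lora_total:,}")
print(f"share of backbone:  {fraction:.2%}")
```

Under those assumptions the adapters come to about 2.7 million trainable weights, roughly 0.4% of the backbone.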

What worked

Trained directly on a binary "is this an AMP or not" task using the GenPept-Curated-2025 dataset, the model hit 88.3% F1 with 86.8% accuracy on an 80/20 split. That dataset is built to be leakage-free, meaning it has no sequence-homology overlap between train and test, so the score is not inflated by the model recognizing near-duplicates it already saw. For a couple hours of compute on a rented card, a reliable binary AMP detector is a genuinely useful result.
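
One common way to get that no-homology-overlap property is to split by cluster rather than by sequence. The sketch below illustrates the idea and is not necessarily how GenPept-Curated-2025 was built. It assumes cluster IDs were already assigned by a homology clustering tool, and `cluster_split`, the peptide IDs, and the cluster IDs are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_split(cluster_of, test_fraction=0.2, seed=0):
    """Split sequence IDs so that no cluster spans both train and test.

    cluster_of maps sequence ID -> cluster ID (hypothetical input, e.g. the
    result of a homology clustering step run beforehand).
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    target = test_fraction * len(cluster_of)
    train, test = [], []
    for cid in cluster_ids:
        # Whole clusters go to test until it reaches the target size,
        # then every remaining cluster goes to train.
        bucket = test if len(test) < target else train
        bucket.extend(clusters[cid])
    return train, test


# Toy example: 10 peptides in 5 clusters of 2 (made-up IDs).
cluster_of = {f"pep{i}": f"c{i // 2}" for i in range(10)}
train_ids, test_ids = cluster_split(cluster_of, test_fraction=0.2)

train_clusters = {cluster_of[s] for s in train_ids}
test_clusters = {cluster_of[s] for s in test_ids}
print(len(train_ids), len(test_ids), train_clusters & test_clusters)
```

Because whole clusters move together, the split can land slightly off the requested 80/20 when cluster sizes are uneven. That is the price of keeping near-duplicates on one side.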

Then I let it loose. The detector screened 1,980 unlabeled bacterial peptide sequences pulled from NCBI RefSeq, with anything already annotated as an AMP excluded so the model could only surface novel candidates. The top 100 by probability are ranked in the repo. The top hit scored 0.785, and the top ten land between 0.768 and 0.785. Every one of them is an uncharacterized bacterial protein with no existing AMP annotation. That is the actual deliverable: a short, ranked list a wet lab could start testing.
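
The screening step boils down to "drop anything already annotated, sort by probability, keep the top k". Here is a minimal sketch of that logic. The `Candidate` record, the accession strings, and the scores are invented for illustration and are not from the repo.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    accession: str          # placeholder ID, not a real RefSeq accession
    amp_probability: float  # model's predicted probability of being an AMP
    annotated_amp: bool     # True if the record already carries an AMP annotation


def rank_novel(candidates, top_k=100):
    """Keep unannotated sequences only, highest predicted probability first."""
    novel = [c for c in candidates if not c.annotated_amp]
    return sorted(novel, key=lambda c: c.amp_probability, reverse=True)[:top_k]


# Made-up scores to show the behavior.
scored = [
    Candidate("seq_A", 0.41, False),
    Candidate("seq_B", 0.93, True),   # already annotated, so excluded
    Candidate("seq_C", 0.77, False),
    Candidate("seq_D", 0.12, False),
]

shortlist = rank_novel(scored, top_k=2)
for rank, cand in enumerate(shortlist, start=1):
    print(rank, cand.accession, f"{cand.amp_probability:.3f}")
```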

The experiment that failed

Before the binary detector, I trained on ESCAPE, a different dataset with four mechanism labels: antibacterial, antifungal, antiviral, antiparasitic. The headline accuracy looked great at 97%, but the per-class F1 told the real story.

class	baseline F1 (untrained 8M)	trained F1 (650M + LoRA)
antibacterial	6.8%	81.0%
antifungal	3.7%	40.5%
antiviral	0.7%	31.0%
antiparasitic	0.0%	0.0%

Antibacterial is strong. The others fall off fast, and antiparasitic fails completely because ESCAPE has almost no positive examples for it. A model cannot learn a class it barely sees. That is class imbalance doing exactly what class imbalance does, and the 97% accuracy number is hiding it.
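
The way accuracy hides a dead class is easy to reproduce on toy data. In the sketch below the label counts are invented to echo the pattern and are not ESCAPE's real distribution. It uses a single-label setup and a "model" that never predicts the rare class.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Invented toy labels: 97 common-class examples, 3 rare-class examples.
y_true = ["antibacterial"] * 97 + ["antiparasitic"] * 3
y_pred = ["antibacterial"] * 100  # never predicts the rare class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)

print(f"accuracy: {accuracy:.2%}")
for label, score in f1.items():
    print(f"{label}: F1 = {score:.3f}")
```

Accuracy comes out at 97% while the rare class sits at an F1 of zero, which is why the per-class breakdown has to be reported alongside the headline number.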

The more important failure came next. I took the ESCAPE-trained model and tested it on GenPept, the leakage-free binary set. It scored 2.2% F1. Effectively useless. The model had not learned "what makes a peptide antimicrobial." It had learned "what makes a peptide antibacterial in ESCAPE specifically," and that knowledge did not transfer one inch to a different dataset asking a slightly different question.

This is the result I care about most, and it is a negative one. Cross-benchmark transfer failed. If I had only ever evaluated on ESCAPE, I would have walked away thinking I had a great general AMP model. I had a great ESCAPE model. The only reason I know the difference is that I tested it somewhere it had never seen. That is the whole argument for leakage-aware, multi-dataset evaluation in one experiment.

What I took from it

ESM-2 plus LoRA is parameter- and compute-efficient. A 650M model gets to strong performance on a small adapter and a couple hours of compute. You do not need a cluster to do real work here.
Headline accuracy lies when classes are imbalanced. The 97% was meaningless on its own. Per-class F1 is where the truth was.
One benchmark is not evaluation. The transfer test is the part that told me what the model actually learned, and it is the part most people skip.

Binary AMP detection at 88.3% on a clean benchmark, a ranked list of novel candidates, and a documented failure mode that sharpens how I evaluate everything else. For an afternoon on a rented GPU, that is the shape of result I am after: a tool that produces something usable, and a finding worth writing down.

Code, configs, and the full candidate list are on GitHub. If you work in computational biology and want to talk, reach out.