Therapeutics LLM

Drug Discovery LLM Comparison 2026: TxGemma vs BioMedLM vs PaLM 2

Comparison of the three leading drug-discovery LLMs in 2026. Training data, task coverage, and cost.

SciRouter Team
April 11, 2026
11 min read

The drug-discovery LLM space in 2026 is no longer a one-horse race. At least three families of models compete for the “best LLM for drug discovery” slot, and each one has different training data, different tuning objectives, and different cost curves. This post does a side-by-side comparison of the three we see teams evaluating most often: TxGemma, BioMedLM, and PaLM 2.

The short version: they are not interchangeable. They are three corners of a triangle — chemistry depth, literature depth, and general reasoning — and your job as an engineer is to know which corner you need to be in.

Note
This comparison focuses on the flavor of reasoning each model is good at. For quantitative benchmarks on specific endpoints, run your own evaluation on TDC tasks that match your workflow. Vendor numbers are a starting point, not a verdict.

TxGemma: the chemistry specialist

TxGemma is Google's therapeutics-tuned derivative of the Gemma family. It is open weight, ships in 2B, 9B, and 27B variants, and was instruction tuned on a large mixture of Therapeutic Data Commons datasets covering 66 drug-discovery tasks. It is the most chemistry-literate of the three models in this post.

What TxGemma adds on top of a base LLM is the ability to reason about molecules in a structured way. It has seen SMILES at scale. It has seen ADMET endpoint datasets. It has seen retrosynthetic disconnections. When you ask it “what is the likely hERG liability of this scaffold and why,” it answers in the shape a medicinal chemist expects, with a rationale.

It is not a binding-affinity oracle, and you should not use it as one. Pair it with physical tools like Boltz-2 or DiffDock when you need numbers. Treat TxGemma as a reasoning layer on top of the hard predictors.

BioMedLM: the literature specialist

BioMedLM was one of the first purpose-built biomedical language models. Its training corpus was drawn from PubMed abstracts and PubMed Central full-text articles. That makes it strong at a specific kind of task: reading biomedical literature and answering questions grounded in what that literature says.

  • Extracting information from paper abstracts.
  • Summarizing a body of prior work on a target or mechanism.
  • Answering clinical and preclinical questions that are settled in the published record.

What BioMedLM is not strong at is chemistry. It was not trained on SMILES, fingerprints, or property datasets. Ask it to reason about a novel scaffold and you will get a polite guess. Ask it to summarize the last decade of work on a well-known target and it will do a respectable job.

Think of BioMedLM as the replacement for a graduate student doing a literature search, not the replacement for a medicinal chemist at a whiteboard. The two roles are complementary.

PaLM 2: the general reasoner

PaLM 2 and its successors are general-purpose large language models. They have broad world knowledge, strong code generation, and solid reasoning over structured prompts. They are not drug discovery models, but they show up in drug discovery pipelines as the outer orchestration layer.

What PaLM 2 brings is the glue. It is the model you use to plan a multi-step workflow, to write a summary of what several specialist models have returned, and to handle the natural-language part of a conversation with a user. It is also the fallback when a specialist model does not have training data on the specific target or molecule you care about.

You should not ask PaLM 2 chemistry questions directly without tools. Give it retrieval, give it a function-calling interface to a real chemistry tool, and let it delegate. That is the pattern that works.
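The delegation pattern above can be sketched in a few lines. The tool names and handlers below are illustrative stand-ins, not a real API — in production each handler would call an actual specialist model behind a function-calling interface.

```python
# Sketch of the delegation pattern: a general reasoner routes sub-tasks
# to specialist tools instead of answering chemistry questions itself.
# Tool names and handlers are illustrative stand-ins, not a real API.

def admet_tool(query: str) -> str:
    # Stand-in for a TxGemma-backed ADMET predictor.
    return f"[ADMET specialist] {query}"

def literature_tool(query: str) -> str:
    # Stand-in for a BioMedLM-backed literature QA model.
    return f"[literature specialist] {query}"

TOOLS = {
    "admet": admet_tool,
    "literature": literature_tool,
}

def orchestrate(subtasks: list) -> list:
    """Delegate each (tool_name, query) pair; fall back to the general model."""
    results = []
    for tool_name, query in subtasks:
        handler = TOOLS.get(tool_name)
        if handler is None:
            # Fallback path: no specialist has this task, the general reasoner answers.
            results.append(f"[general reasoner] {query}")
        else:
            results.append(handler(query))
    return results

plan = [
    ("admet", "hERG liability of scaffold X"),
    ("literature", "published inhibitors of target Y"),
    ("unknown", "summarize findings"),
]
print(orchestrate(plan))
```

The key design choice is that the general reasoner only produces the plan and composes the results; every chemistry-shaped sub-question is forced through a specialist.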

Head-to-head: where each model wins

ADMET property prediction

Winner: TxGemma. This is its home turf. It has been trained on most of the standard ADMET datasets in TDC and gives structured, chemistry-aware answers with rationales. BioMedLM can pattern match if the exact molecule is in its literature corpus. PaLM 2 will give you a general-purpose answer that may or may not be calibrated.

Target literature review

Winner: BioMedLM (or a modern retrieval-augmented successor). When you need to know what has been published about a target, pathway, or mechanism, the literature-specialized model is the right choice. TxGemma will also answer literature questions but was not pretrained on the same biomedical corpus.

Multi-step workflow orchestration

Winner: PaLM 2. Or any large general reasoner. The role of the orchestration layer is to decompose a question into sub-questions, delegate them to specialist tools, and compose the results. Specialist models are less reliable at this meta-task. Give the general reasoner function calling and let it drive.

SMILES understanding

Winner: TxGemma. It was trained on SMILES at scale with chemistry-aware pretraining data. General models can parse simple SMILES but hallucinate on complex ones. Literature models parse SMILES the way they parse any string — without chemistry awareness.

Retrosynthetic planning

Winner: TxGemma. It has seen retrosynthetic datasets in its instruction-tuning mixture. It proposes reasonable disconnections at roughly the level you would expect from a senior graduate student, with the standard caveats.

Cost and deployment

Open-weight models have a very different cost curve from closed APIs. TxGemma and BioMedLM can both be run on your own GPUs, which means fixed cost per machine-hour and full control over data. PaLM 2 is accessed through Google's API with per-token pricing. Neither approach is obviously cheaper — it depends on your throughput pattern.

  • TxGemma 2B on a single consumer GPU. Cheapest to run, good for inline hints.
  • TxGemma 9B on a 24 GB GPU. The practical sweet spot for most agentic drug discovery workflows.
  • TxGemma 27B on an A100. Best reasoning quality for offline batch analysis.
  • BioMedLM on a single GPU. Still a reasonable literature QA model if that is the task you need.
  • PaLM 2 via API. No infrastructure, but per-token pricing, and your data leaves your perimeter.
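The hardware tiers above can be sanity-checked with back-of-the-envelope arithmetic. The rule of thumb below (weight bytes = parameters × bits / 8, plus roughly 20% overhead for activations and KV cache) gives rough planning numbers, not vendor specs.

```python
# Rough VRAM estimate for open-weight model sizes at different quantization levels.
# bytes ≈ parameters × bits_per_weight / 8, plus ~20% overhead (assumed) for
# activations and KV cache. Planning numbers only, not measured figures.

def approx_vram_gb(params_billions: float, bits: int, overhead: float = 0.2) -> float:
    weight_bytes = params_billions * 1e9 * bits / 8
    return round(weight_bytes * (1 + overhead) / 1e9, 1)

for params, label in [(2, "TxGemma 2B"), (9, "TxGemma 9B"), (27, "TxGemma 27B")]:
    print(label, approx_vram_gb(params, bits=4), "GB at 4-bit,",
          approx_vram_gb(params, bits=16), "GB at fp16")
```

The estimate is consistent with the tiers above: 9B at fp16 lands around 21.6 GB, which is why a 24 GB card is the practical sweet spot, while 27B at fp16 needs an A100-class GPU.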

How to pick

Do not think of this as a single-winner decision. A realistic production pipeline uses at least two of these models.

  • If your workflow is chemistry-heavy — ADMET, SAR, molecular property prediction — start with TxGemma.
  • If your workflow is literature-heavy — target review, mechanism QA, clinical outcome summarization — add a literature specialist on top.
  • If your workflow is agentic — the LLM decides what to do next — use a general reasoner as the outer loop and have it delegate to the specialists.
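The three rules above amount to a routing function. A production router would use an LLM classifier; the keyword matching below is a deliberately minimal sketch that keeps the idea visible, and the routing table is illustrative, not exhaustive.

```python
# Minimal keyword-based router sketch for the multi-model pattern described above.
# A real router would classify with an LLM; keyword sets are illustrative only.

CHEMISTRY_TERMS = {"admet", "sar", "smiles", "scaffold", "property", "retrosynthesis"}
LITERATURE_TERMS = {"published", "literature", "review", "mechanism", "clinical"}

def route(query: str) -> str:
    words = set(query.lower().split())
    if words & CHEMISTRY_TERMS:
        return "txgemma"        # chemistry-heavy -> chemistry specialist
    if words & LITERATURE_TERMS:
        return "biomedlm"       # literature-heavy -> literature specialist
    return "general-reasoner"   # agentic outer loop / fallback

print(route("predict ADMET endpoints for this scaffold"))   # txgemma
print(route("summarize published work on KRAS mechanism"))  # biomedlm
```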

SciRouter makes this pattern concrete. You call a unified API for TxGemma and the other scientific models, and the gateway handles the GPU routing. You can swap between models by changing one parameter, which makes A/B testing different LLMs in the same pipeline trivial.

Bottom line

Drug discovery LLMs in 2026 are not interchangeable. TxGemma is the chemistry specialist, BioMedLM is the literature specialist, and PaLM 2 is the general reasoner. The winning pattern is a pipeline that uses each one for the task it was trained for, with a routing layer that keeps the code clean.

Try TxGemma on SciRouter →

Frequently Asked Questions

Which LLM is best for drug discovery in 2026?

There is no single winner. For chemistry-heavy reasoning about small molecules, ADMET, and SAR, TxGemma is currently the strongest open-weight option. For literature-grounded question answering over biomedical text, BioMedLM and its descendants remain competitive. For broad, general-purpose reasoning with wide world knowledge, PaLM 2 and its successors are still useful as an outer wrapper. Most production pipelines route between several models.

Is BioMedLM still relevant?

Yes, in a narrower role than it used to hold. BioMedLM was trained on PubMed and PMC, so it still carries a lot of biomedical literature signal. Its strengths are literature QA, extraction from abstracts, and domain-specific summarization. Its weakness is chemistry: it was not trained on SMILES or structured drug-discovery datasets the way TxGemma was.

How do I actually compare two drug discovery LLMs?

Pick a shared benchmark that reflects your workflow. The Therapeutic Data Commons (TDC) covers 66 drug-discovery tasks and is the closest thing to a standard. Run the same prompts through each model, score with the task's native metric (AUROC for classification, Spearman correlation for regression), and then weight by how much each task matters to your downstream decision. Do not rely on vendor-reported numbers alone.
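The two native metrics mentioned above can be computed without any ML dependencies. The helpers below are standard textbook formulations — AUROC as the probability a random positive outscores a random negative, and Spearman as Pearson correlation of ranks (no tie correction) — and the inputs are toy numbers, not real model outputs.

```python
# Stdlib-only scoring helpers for a head-to-head eval: AUROC for classification
# tasks, Spearman rank correlation for regression tasks. Toy inputs only.

def auroc(labels, scores):
    """Probability a random positive is scored above a random negative (ties = 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(auroc([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))            # 1.0: perfect ranking
print(round(spearman([1, 2, 3, 4], [10, 20, 30, 25]), 3))   # 0.8
```

Run the same prompts through each model, score with these, then weight by task importance before comparing totals.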

Can I use PaLM 2 for chemistry without fine-tuning?

You can, but the ceiling is low. PaLM 2 was trained on general text and code. It will answer chemistry questions but will confidently hallucinate SMILES, misread named reactions, and miscount functional groups. You can improve it with retrieval and tool use, but a model that was fine-tuned on chemistry from the start has a meaningful head start.

What does instruction tuning add for drug discovery?

Instruction tuning teaches a base model to follow task-shaped prompts. For drug discovery that means the model learns to respond in the form a chemist expects: a predicted property, a reason for the prediction, and a recommended action. Without instruction tuning you get a model that can pattern-match chemistry but does not answer in a useful shape. TxGemma is instruction tuned on drug-discovery tasks; PaLM 2 is instruction tuned on general tasks.

Which model is cheapest to run?

TxGemma 2B is the cheapest to run on your own hardware. A single consumer GPU in 4-bit quantization will handle it. BioMedLM in its original 2.7B form is similar. PaLM 2 is closed-source and only available via API, so cost depends on usage and pricing tier. SciRouter gives you a unified endpoint so you can switch models without rewriting your code.

How does SciRouter fit in?

SciRouter is a gateway that lets you call TxGemma and other therapeutics models from a single API. You send a question with your SciRouter API key, and the gateway routes it to the correct GPU endpoint and returns a structured response. It also exposes the same models through MCP so agents like Claude and GPT can call them directly.
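The swap-one-parameter pattern looks roughly like the sketch below. The endpoint URL, field names, and header shape here are assumptions for illustration only — check the actual SciRouter documentation for the real schema.

```python
# Hypothetical request shape for a SciRouter-style gateway call.
# URL, field names, and headers are ASSUMED for illustration, not the real schema.
import json

def build_request(model: str, question: str, api_key: str) -> dict:
    return {
        "url": "https://api.scirouter.example/v1/chat",    # placeholder URL
        "headers": {"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        "body": json.dumps({"model": model, "input": question}),
    }

req = build_request("txgemma-9b", "Likely hERG liability of this scaffold?", "sk-...")
print(req["body"])
```

Swapping models is then a one-argument change, e.g. `build_request("biomedlm", ...)`, which is what makes A/B testing two models in the same pipeline cheap.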

Try this yourself

500 free credits. No credit card required.