There are two dominant paradigms for generative drug design in 2026. One is diffusion — train a model on real 3D pocket-ligand complexes and generate new molecules from the learned distribution. The other is reinforcement learning — generate a molecule, score it, and update the generator to produce higher-scoring ones next time. Both work. Both have weaknesses. And they are best used together.
This post puts DiffSBDD and TargetDiff — the diffusion side — head-to-head with REINVENT4, the most widely used RL-based generator in academic and industrial drug discovery today. We will look at what each paradigm learns, where each one fails, and how to combine them into a pipeline that is better than either alone.
How diffusion generators learn
Diffusion generators like DiffSBDD and TargetDiff are trained on datasets of pocket-ligand complexes. During training, the ligand coordinates are progressively corrupted by Gaussian noise, and the model learns to predict the clean coordinates at each noise level. At inference you start from pure noise and iteratively denoise into a real molecule, conditioned on the fixed pocket atoms.
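The noising-and-denoising loop can be made concrete with a toy numerical sketch. This is not DiffSBDD's actual architecture — real models use SE(3)-equivariant networks trained on pocket-ligand complexes — so the `toy_denoiser` below is a hypothetical stand-in for a learned network that pulls atoms toward the pocket, and the schedule is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_schedule(T=50):
    # Illustrative alpha-bar schedule from ~1 (clean) down to ~0 (pure noise).
    return np.linspace(0.99, 0.01, T)

def corrupt(x0, alpha_bar, rng):
    # Forward process used during training: blend clean coordinates
    # with Gaussian noise at a given noise level.
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    return xt, eps

def toy_denoiser(xt, pocket_center):
    # Stand-in for the learned network: nudges coordinates toward the
    # pocket center, mimicking a prior of "ligands sit in the pocket".
    return xt + 0.2 * (pocket_center - xt)

def sample(n_atoms, pocket_center, T=50, rng=rng):
    # Reverse process at inference: start from pure noise and iteratively
    # denoise, conditioned on the fixed pocket.
    x = rng.normal(size=(n_atoms, 3))
    for _ in range(T):
        x = toy_denoiser(x, pocket_center)
    return x

pocket_center = np.array([1.0, 2.0, 3.0])
coords = sample(n_atoms=10, pocket_center=pocket_center)
print(np.round(coords.mean(axis=0), 1))  # atoms converge near the pocket center
```

The point of the sketch is the control flow: training corrupts real ligand coordinates and learns to reverse it; sampling runs only the reverse loop, with the pocket held fixed at every step.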
What the model learns, in the most important sense, is “what a real ligand in a real pocket looks like.” That is a strong structural prior. The model has seen thousands of examples of favorable contacts, correct ring geometries, and well-behaved torsion angles. It generates candidates that respect those patterns by default.
The flip side is that the generator is not trying to maximize anything specific. You cannot easily tell DiffSBDD “please produce molecules with QED above 0.8 and logP between 1 and 3.” You can filter the candidates after the fact, but the generator itself is not aware of those constraints.
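Post-hoc filtering is therefore the standard workaround. A minimal sketch, assuming the property values (QED, logP) have already been computed per candidate, e.g. with RDKit — the molecules and numbers below are illustrative:

```python
# The diffusion generator is unaware of QED or logP, so constraints are
# applied to candidates after sampling rather than during generation.

def passes_filters(mol, qed_min=0.8, logp_range=(1.0, 3.0)):
    # Keep only candidates meeting the drug-likeness and lipophilicity targets.
    lo, hi = logp_range
    return mol["qed"] >= qed_min and lo <= mol["logp"] <= hi

candidates = [
    {"smiles": "c1ccccc1O", "qed": 0.85, "logp": 1.5},        # passes both
    {"smiles": "CCCCCCCCCC", "qed": 0.45, "logp": 5.2},       # fails both
    {"smiles": "CC(=O)Nc1ccccc1", "qed": 0.82, "logp": 2.1},  # passes both
]

kept = [m for m in candidates if passes_filters(m)]
print(len(kept))  # 2
```

Note the asymmetry with RL: here the constraint never reaches the generator, so rejected candidates are simply wasted samples.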
How RL generators learn
RL generators like REINVENT start with a pretrained SMILES language model. The generator produces molecules, each molecule is scored by a reward function (which could include predicted affinity, QED, logP targets, synthetic accessibility, and custom objectives), and the generator is updated to produce higher-scoring molecules next time.
What the model learns is “how to generate molecules that maximize the reward function.” That is exactly what you want when you have a clear numeric objective. It is also exactly what you do not want when your reward function has blind spots — and all reward functions do.
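The generate-score-update loop can be shown with a minimal REINFORCE-style example. This is not REINVENT's exact update (which uses an augmented-likelihood loss over SMILES tokens); here a tiny categorical "generator" picks one of three fragments, a fixed reward vector stands in for the scoring function, and the policy gradient shifts probability mass toward the highest-scoring choice:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                 # the generator's parameters
reward = np.array([0.1, 0.2, 0.9])   # stand-in scoring function; prefers fragment 2
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(3, p=p)           # "generate" a molecule (pick a fragment)
    r = reward[a]                    # score it
    grad = -p                        # d log p(a) / d logits ...
    grad[a] += 1.0                   # ... equals one_hot(a) - p
    logits += lr * r * grad          # update toward higher-reward outputs

print(softmax(logits).argmax())  # 2: the generator now favors the best fragment
```

The same dynamic drives the failure mode discussed below: whatever the reward vector encodes, including its mistakes, is exactly what the generator converges on.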
Where each paradigm shines
Diffusion wins on pocket fit
DiffSBDD sees the pocket at every denoising step and has been trained on real complexes, so the molecules it generates tend to fit the pocket naturally with little post-processing. RL generators that work in SMILES space never see geometry during generation; they defer it to a downstream docking step, which often exposes poor fits the generator had no way to anticipate.
RL wins on property optimization
If you know exactly what you want — a specific logP range, a minimum QED, a particular structural motif — RL generators are excellent at hitting it. You write the objective into the reward function and let the optimizer work. Diffusion has no equivalent mechanism for optimizing a numeric target.
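Writing the objective into the reward function looks roughly like the sketch below. The term names and weights are illustrative, not REINVENT4's actual scoring-component API; the pattern is the standard one of mapping each raw property onto [0, 1] and taking a weighted combination:

```python
def in_range_score(x, lo, hi, tol=1.0):
    # 1.0 inside [lo, hi], decaying linearly to 0 within tol outside it.
    if lo <= x <= hi:
        return 1.0
    dist = (lo - x) if x < lo else (x - hi)
    return max(0.0, 1.0 - dist / tol)

def reward(props, w_qed=0.5, w_logp=0.5):
    # Target QED >= 0.8 and logP in [1, 3]; weights are illustrative.
    qed_score = min(props["qed"] / 0.8, 1.0)
    logp_score = in_range_score(props["logp"], 1.0, 3.0)
    return w_qed * qed_score + w_logp * logp_score

print(reward({"qed": 0.9, "logp": 2.0}))  # 1.0: both objectives satisfied
print(reward({"qed": 0.4, "logp": 4.0}))  # penalized on both terms
```

Soft, shaped terms like `in_range_score` matter in practice: a hard 0/1 cutoff gives the optimizer no gradient of improvement to climb.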
Diffusion wins on diversity
Diffusion samples from a distribution. If you run the same pocket twice you get two different candidate sets, both valid. That is useful for exploration. RL optimizers tend to converge on a narrow mode if the reward function has sharp peaks, which can hurt diversity.
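Diversity is easy to quantify, which makes the contrast measurable. A common choice is internal diversity: the mean pairwise Tanimoto distance over binary fingerprints. In a real pipeline the fingerprints would be Morgan/ECFP bit vectors from RDKit; random bit vectors stand in here so the metric itself is the focus:

```python
import numpy as np

def tanimoto(a, b):
    # Tanimoto similarity between two binary fingerprints.
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union else 1.0

def internal_diversity(fps):
    # Mean pairwise Tanimoto distance (1 - similarity) over the set.
    n = len(fps)
    dists = [1.0 - tanimoto(fps[i], fps[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
diverse_set = rng.integers(0, 2, size=(20, 128), dtype=np.int8)  # unrelated molecules
collapsed_set = np.tile(diverse_set[0], (20, 1))                 # one mode, repeated
print(internal_diversity(diverse_set) > internal_diversity(collapsed_set))  # True
```

A mode-collapsed RL run looks like `collapsed_set`: the reward may be high, but the internal diversity drops toward zero.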
RL wins on targeted scaffolds
RL is better when you have a specific scaffold you want to decorate or a specific chemical series you want to explore. You can bias the generator toward that starting point and let the reward drive the optimization. Diffusion is less controllable in that dimension.
Where each paradigm fails
Diffusion failures
- No property control. The generator has no notion of QED or logP. You filter candidates downstream.
- Synthesis is an afterthought. Some candidates are impractical to make. Run a synthesis checker after generation.
- Dataset bias. The generator can only produce things it has seen in training. Novel chemistry is harder to reach.
RL failures
- Reward hacking. The single biggest weakness. Any imperfection in the reward function will be exploited by the generator.
- Geometry blindness. SMILES-level RL does not see the pocket directly and can generate candidates that do not fit.
- Mode collapse. If the reward has sharp peaks, the generator can collapse onto a narrow region of chemical space.
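One common mitigation for reward hacking, sketched below, is to gate the soft reward behind hard validity and sanity checks so that degenerate outputs score zero instead of exploiting a soft term. The specific checks, field names, and thresholds here are illustrative assumptions, not a standard API:

```python
def gated_reward(mol, raw_reward):
    # Hard gates: invalid or absurd molecules get no reward at all,
    # removing the incentive to exploit loopholes in the soft terms.
    if not mol["valid"]:
        return 0.0
    if not (5 <= mol["num_atoms"] <= 60):   # illustrative size window
        return 0.0
    if mol["largest_ring"] > 8:             # illustrative ring-size sanity check
        return 0.0
    return raw_reward

honest = {"valid": True, "num_atoms": 28, "largest_ring": 6}
hacked = {"valid": True, "num_atoms": 150, "largest_ring": 6}  # absurd size
print(gated_reward(honest, 0.9), gated_reward(hacked, 0.95))  # 0.9 0.0
```

Gates do not fix a flawed soft term, but they bound the damage: the generator cannot cash in on an exploit that fails a hard check.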
The right way to combine them
Two-stage pipelines work well. Use diffusion to generate a pocket-aware starting population and RL to optimize that population toward specific properties.
- Stage 1 (diffusion). Run DiffSBDD against your pocket to produce a few hundred candidates. These are pocket-aware, structurally plausible, and diverse.
- Stage 2 (filtering). Apply hard filters: synthetic accessibility, drug-likeness, forbidden motifs. Cut the set to your top ~50.
- Stage 3 (RL optimization). Seed a REINVENT run from the top candidates and let it optimize on your numeric objectives. The result is a set that is both pocket-aware and property-optimized.
- Stage 4 (validation). Dock the survivors with a traditional docking program and validate with a physics-based or LLM reasoning layer.
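The four stages above can be sketched as a simple orchestration skeleton. Every stage body here is a placeholder with made-up data: in a real run Stage 1 would invoke DiffSBDD against the pocket, Stage 3 would seed a REINVENT run, and Stage 4 would call a docking program — the function names, fields, and thresholds are all hypothetical:

```python
def stage1_diffusion(pocket, n=300):
    # Placeholder for a DiffSBDD run: pocket-aware, diverse candidates.
    return [{"smiles": f"mol_{i}", "sa": (i % 10) / 10, "qed": (i % 7) / 7}
            for i in range(n)]

def stage2_filter(cands, top_k=50):
    # Hard filters (synthetic accessibility, drug-likeness), then keep top ~50.
    ok = [m for m in cands if m["sa"] <= 0.6 and m["qed"] >= 0.4]
    return sorted(ok, key=lambda m: m["qed"], reverse=True)[:top_k]

def stage3_rl_optimize(seeds):
    # Placeholder for a REINVENT run seeded from the filtered candidates.
    return [dict(m, score=m["qed"] * 0.9 + 0.1) for m in seeds]

def stage4_validate(cands):
    # Placeholder for docking plus a validation layer on the survivors.
    return [m for m in cands if m["score"] >= 0.5]

pocket = "example_pocket.pdb"  # hypothetical input file
hits = stage4_validate(stage3_rl_optimize(stage2_filter(stage1_diffusion(pocket))))
print(len(hits) <= 50)  # True: the pipeline narrows ~300 candidates to a short list
```

The value of keeping the stages as separate functions is that each can be swapped independently: a different generator in Stage 1, a different reward in Stage 3, a different validator in Stage 4.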
This pipeline gets you the best of both worlds: the structural prior from diffusion and the property control from RL, with validation steps that catch the worst failures of each.
When to use just one
Sometimes the pipeline is overkill. Two situations where a single generator is enough:
- Pure exploration. You want to see what a generator suggests for a new target. DiffSBDD alone is fine — you want diversity and pocket fit, not property optimization.
- Pure optimization. You have a series and you want to push it toward a specific property target. REINVENT alone is fine — you already have a starting point and you want an objective optimizer.
Bottom line
Diffusion models and reinforcement learning are solving different problems within generative drug design. Diffusion learns the distribution of real pocket-ligand complexes and gives you pocket-aware candidates. RL optimizes molecules against a numeric reward and gives you property-optimized candidates. They are more useful together than apart, and the winning pipelines in 2026 are the ones that chain them.