When the first bacterial genomes were sequenced in the mid-1990s, the working assumption was that each species had a single reference genome, with minor variation around the edges. By the mid-2000s that idea had collapsed.
When you sequence multiple isolates of the same bacterial species, you find that they share a smaller core of genes than expected, and each isolate also carries genes the others lack. The total inventory of genes seen across the species, called the pangenome, is much larger than what any single cell carries.
E. coli is the textbook case. The species pangenome holds over 160,000 genes. A single E. coli cell carries around 5,000. The same pattern shows up in nearly every bacterial species we’ve looked at carefully. Streptomyces species have pangenomes exceeding 140,000 genes with individual genomes of 5,000 to 12,000. Pseudomonas aeruginosa has a vast accessory gene pool. Even species with tighter genomes, like Helicobacter pylori, show the same architecture in miniature.
This is close to universal in prokaryotes. It demands an explanation. So far, none of the explanations on offer has done the job properly.
In a new preprint I argue that the answer is bet-hedging. Pangenomes exist because bacteria face unpredictable environments, and the geometric mean of fitness punishes variance much harder than the arithmetic mean rewards it. A distributed gene pool, held across a population rather than inside any single cell, is the architecture that wins. Below I walk through why.
Three things that need explaining
Any decent theory of pangenomes has to account for three patterns at once.
1. Rare genes persist.
Most accessory genes are rare. They show up in a small fraction of genomes and stay there. They don’t drift to fixation, and they don’t disappear. This is hard to explain by neutral drift alone, which would push rare genes toward loss on reasonable timescales.
2. Pangenome size tracks environmental complexity.
Obligate intracellular bacteria like Chlamydia live in stable environments and have small, nearly closed pangenomes. Free-living generalists like E. coli have huge open pangenomes. The pattern is consistent. More environmental variety, more accessory genes. But existing frameworks give no quantitative threshold for when this kicks in.
3. The pangenome grows without bound while individual genomes stay small.
Streptomyces may carry 140,000 genes across the species, but no individual Streptomyces cell has more than ~12,000. Selfish-element theory can explain why genetic parasites accumulate, but most pangenome genes aren’t parasites. They’re metabolic enzymes, transporters, resistance genes. Functional.
A successful theory needs to explain all three at once, and ideally produce them from a small number of starting assumptions. That’s what bet-hedging does.
Why variance is so dangerous
The argument starts with a fact about how fitness works across generations. Fitness compounds multiplicatively. If a population grows by a factor of wᵢ in generation i, then after n generations its size has been multiplied by w₁ × w₂ × w₃ × … × wₙ. The relevant measure of long-term success isn’t the average per-generation fitness. It’s the product of fitnesses, and the long-run growth rate is governed by the geometric mean, not the arithmetic mean.
This matters because the geometric mean is always less than or equal to the arithmetic mean, with equality only when there is zero variance. For small fluctuations, the gap between them is well approximated by:
arithmetic mean − geometric mean ≈ σ² / (2μ)
Here σ² is the variance in fitness across generations and μ is the arithmetic mean. The penalty grows with the square of the fluctuations. Big swings cost you much more than small ones.
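A minimal Python sketch of that comparison, assuming the standard small-variance approximation (geometric mean ≈ μ − σ² / (2μ)). The function name and the two example fitness series are illustrative, not taken from the paper.

```python
import math

def fitness_summary(fitnesses):
    """Return (arithmetic mean, geometric mean, small-variance approximation)."""
    n = len(fitnesses)
    mu = sum(fitnesses) / n
    var = sum((w - mu) ** 2 for w in fitnesses) / n  # population variance
    # Geometric mean via the mean of logs; requires every fitness > 0.
    geo = math.exp(sum(math.log(w) for w in fitnesses) / n)
    approx = mu - var / (2 * mu)
    return mu, geo, approx

# Small fluctuations: the approximation tracks the exact geometric mean closely.
print(fitness_summary([1.1, 0.9]))
# Large fluctuations: the exact penalty is larger than the approximation suggests.
print(fitness_summary([1.5, 0.5]))
```

Both series have an arithmetic mean of 1, but only the geometric mean tells you where the population ends up.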
The intuition is simple if you think about gambling. Suppose I offer you a bet where you gain 50% on heads and lose 50% on tails. The arithmetic average of those returns is zero. Sounds fair.
Play it four times: heads, tails, heads, tails. Starting with £100, you go to £150, then £75, then £112.50, then £56.25. Two wins and two losses, balanced, and you’re down to 56% of where you started. The order doesn’t matter, because multiplication doesn’t care: 1 × 1.5 × 0.5 × 1.5 × 0.5 = 0.5625 no matter how you arrange the rounds. Keep playing and you head toward zero.
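The arithmetic is easy to check. The sketch below replays the four rounds and then tries every ordering of two wins and two losses; the helper name `play` is just for illustration.

```python
from itertools import permutations

def play(stake, multipliers):
    """Return the stake after each round of a multiplicative bet."""
    history = []
    for m in multipliers:
        stake *= m
        history.append(stake)
    return history

rounds = [1.5, 0.5, 1.5, 0.5]  # heads, tails, heads, tails
print(play(100.0, rounds))  # [150.0, 75.0, 112.5, 56.25]

# Every ordering of the same four rounds ends at the same final stake.
final_values = {play(100.0, order)[-1] for order in permutations(rounds)}
print(final_values)  # {56.25}
```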
This is the fundamental asymmetry of multiplicative processes. A 50% loss requires a 100% gain to recover. A 90% loss requires a 900% gain. The deeper the hole, the more disproportionately large the climb out. Long losing streaks need even longer winning streaks to recover from, and deep enough losses can’t be recovered at all.
Now replace “lose 50%” with “lose everything,” which is what extinction looks like in biology. There is no winning streak that recovers from zero. Variance does more than slow growth. It creates the risk of catastrophe, and catastrophe is permanent.
The simulations in Figure 1 of the paper make this concrete. Three populations have arithmetic mean fitness of 1.0. One has zero variance. One has moderate variance (σ² = 0.04). One has high variance (σ² = 0.25). After 1,000 generations, the high-variance population is 10⁶³ times smaller than the zero-variance population. They have the same mean. They end up in different universes.
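To get a feel for the scale, here is a back-of-envelope sketch. It is not the paper's simulation: it assumes fitness takes just two values, 1 + σ and 1 − σ, each in exactly half of the generations, and reports the resulting change in log₁₀ population size.

```python
import math

def log10_relative_size(variance, generations):
    """log10 of final size relative to the start, assuming fitness is
    1 + sd in half of the generations and 1 - sd in the other half."""
    sd = math.sqrt(variance)
    per_generation = 0.5 * (math.log10(1 + sd) + math.log10(1 - sd))
    return generations * per_generation

for variance in (0.0, 0.04, 0.25):
    print(variance, round(log10_relative_size(variance, 1000), 1))
```

Under this two-outcome assumption the σ² = 0.25 population falls by roughly 62 to 63 orders of magnitude over 1,000 generations, which is the same scale as the figure quoted above.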
The implication is that any biological architecture that reduces variance in fitness will outcompete architectures that don’t, even if it accepts a slight reduction in mean fitness as the cost. This is bet-hedging. Trading arithmetic mean for geometric mean.
How horizontal gene transfer provides the insurance
Bacteria swap genes between lineages. Conjugation, transformation, transduction, gene transfer agents. The mechanisms vary, but the upshot is that prokaryotic genomes are porous. Horizontal gene transfer (HGT) moves genes between individuals and between species.
Without HGT, a gene that is costly on average would be eliminated by selection. The cell carrying it pays a small fitness cost most of the time, and in the rare generations where the gene is useful, the benefit isn’t enough to offset the cumulative drag. Selection wins. Gene gone.
With HGT, the picture changes. Even after selection purges a costly gene from most carriers, HGT keeps reintroducing it. The gene reaches a stable equilibrium frequency in the population, low but non-zero. When the environment switches and the gene becomes useful, the cells carrying it bloom. The population survives the shift.
This is the insurance reservoir. Each gene at low frequency is a hedge against a future environment in which it pays off. The premium you pay is a small reduction in mean fitness, because some cells are always carrying costly genes. The payout is avoiding extinction when conditions change.
Figure 1 of the paper shows the trade-off directly. Without HGT, costly genes go to zero. With HGT, they sit at stable equilibrium frequencies. Mean fitness drops slightly with increasing HGT rate, which is the cost of the insurance. The optimal HGT rate isn’t zero and isn’t very high. It peaks at intermediate environmental switching rates, which makes sense. Insurance is most valuable when the future is genuinely uncertain.
The first threshold: which genes selection alone keeps
To make this quantitative, define some parameters for a single accessory gene.
- s = the selective benefit when the gene is useful (cells carrying it have fitness 1 + s in the right environment)
- c = the carriage cost when the gene isn’t useful (cells carrying it have fitness 1 − c in the wrong environment)
- p = the probability that the gene’s beneficial environment occurs in any given generation
The net expected selection on the gene is:
snet = p × s − (1 − p) × c
This is just the average benefit minus the average cost, weighted by how often each happens. If snet is positive, selection favours the gene. If negative, selection purges it.
The break-even point is where snet = 0. Solving for p gives the selection direction threshold:
p* = c / (s + c)
This is the first key equation. Read it as follows. A gene is favoured by selection if it is useful more often than p*. Below p*, selection works against it.
Plug in some numbers to see how it behaves. If the carriage cost is small (c = 0.02) and the benefit when useful is substantial (s = 0.3), then p* = 0.02 / 0.32 ≈ 0.063. So a gene needs to be useful in more than about 6% of generations to be net favoured. If it’s useful less often than that, selection alone will eliminate it.
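The same calculation as a short Python sketch. It takes net selection to be p × s − (1 − p) × c, which is the form consistent with the threshold p* = c / (s + c); the function names are illustrative.

```python
def net_selection(p, s, c):
    """Expected per-generation selection on a gene: benefit s with
    probability p, carriage cost c with probability 1 - p."""
    return p * s - (1 - p) * c

def selection_threshold(s, c):
    """Environmental frequency p* at which net selection is zero."""
    return c / (s + c)

p_star = selection_threshold(s=0.3, c=0.02)
print(round(p_star, 4))  # 0.0625

# Above the threshold the gene is favoured; below it, selection purges it.
print(net_selection(0.10, s=0.3, c=0.02) > 0)  # True
print(net_selection(0.03, s=0.3, c=0.02) > 0)  # False
```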
The selection direction threshold is asymmetric in a useful way. Cheap genes with big benefits have low p*, meaning even rarely useful versions are worth keeping. Expensive genes with marginal benefits have high p*, meaning they need to be useful often to justify themselves. This asymmetry is one source of the U-shaped distribution we see in real pangenomes.
The equilibrium frequency: how HGT sets the level
For genes below p* (the ones selection alone would eliminate), HGT keeps them in the population at a stable equilibrium frequency. At equilibrium, the rate at which HGT introduces the gene to new cells balances the rate at which selection removes it. This gives:
f ≈ h / |snet|
Here h is the HGT rate and |snet| is the absolute value of the net selection coefficient against the gene. The bigger the HGT rate, the higher the equilibrium frequency. The stronger selection works against the gene, the lower the equilibrium.
This is mathematically the same form as classical mutation-selection balance, with HGT rate substituting for mutation rate. Migration-selection balance has the same structure too. The general principle is that any process introducing a deleterious allele balances against selection removing it, and the equilibrium is the ratio.
Real HGT rates vary by many orders of magnitude. Conjugative plasmids and integrative conjugative elements can transfer at rates h ~ 10⁻² per cell per generation. Chromosomal genes requiring rare homologous recombination transfer at rates h ~ 10⁻⁶ to 10⁻⁸. Plug those into the equilibrium equation and you get equilibrium frequencies spanning 30% down to below 1%, which matches the empirical spread of accessory gene frequencies in real pangenomes.
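A quick sketch of that plug-in, using the equilibrium relation f ≈ h / |snet|. The value |snet| = 0.03 is an assumed illustrative figure, not one from the paper, and the cap at 1 is added here only because a frequency cannot exceed 1.

```python
def equilibrium_frequency(h, s_net_abs):
    """HGT-selection balance: f ~ h / |s_net|, capped at 1."""
    return min(1.0, h / s_net_abs)

S_NET_ABS = 0.03  # assumed strength of net selection against the gene

for label, h in [("conjugative element", 1e-2),
                 ("chromosomal, fast", 1e-6),
                 ("chromosomal, slow", 1e-8)]:
    f = equilibrium_frequency(h, S_NET_ABS)
    print(f"{label}: h = {h:g}, f = {f:.2e}")
```

With this choice of |snet| the fastest-transferring genes sit at about a third of genomes, while the slowest are far below 1%.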
Why one genome can’t carry everything
So far the argument has worked one gene at a time. But a real genome carries thousands of genes simultaneously. Each one imposes its own carriage cost. The costs add up. The benefits don’t, because in any given environment only a few of the contingency genes are actually being used.
Imagine a bacterium facing E distinct adaptive challenges, with each challenge requiring its own gene. If the bacterium carries genes for all E challenges, it pays cost cE every generation. But it only benefits from whichever challenge happens to be active that generation, giving expected benefit s.
Looking at a single gene in this collection: it costs c every generation, and benefits s with probability 1/E. Net expected value per gene is:
s / E − c
This is positive only when E < s/c. Define the complexity threshold:
Ecrit = s / c
This is the second key equation. Below Ecrit, individual genes still pay their way in expectation. A bacterium can afford to carry coverage for every contingency. Above Ecrit, the maths flips. Each individual gene now has negative expected value, because it’s so rarely useful that its cost can’t be recovered.
What happens above Ecrit? Carrying coverage for everything becomes a losing strategy. Selection forces individual genomes down toward minimal coverage, holding only the genes that are useful often enough to pay for themselves. But the population as a whole still needs broad coverage to handle environmental variety. The only way to square this is to distribute the burden. Different cells carry different genes. No individual is comprehensive, but the population is.
This is the moment the distributed pangenome becomes obligate. The only strategy left standing. Using the numbers from earlier (s = 0.3, c = 0.02), Ecrit = 15. So a bacterium facing more than 15 distinct adaptive challenges cannot afford to be a generalist at the individual level. It must distribute.
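The flip at the threshold is easy to see numerically. This sketch assumes the per-gene expected value s / E − c and the threshold Ecrit = s / c; the function names are illustrative.

```python
def complexity_threshold(s, c):
    """Number of challenges above which a fully covered genome stops paying."""
    return s / c

def per_gene_value(s, c, n_challenges):
    """Expected value of one contingency gene when each of n_challenges
    is equally likely to be the active one in a given generation."""
    return s / n_challenges - c

s, c = 0.3, 0.02
print(round(complexity_threshold(s, c), 6))  # 15.0

for n in (5, 10, 15, 20, 40):
    print(n, round(per_gene_value(s, c, n), 4))
```

Below 15 challenges each gene has positive expected value, at 15 it breaks even, and beyond that every additional contingency gene is a net loss for the cell that carries it.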
This matches the empirical pattern. Bacteria in simple, stable environments (low E) have small closed pangenomes. Bacteria in complex, variable environments (high E) have huge open pangenomes. The threshold predicts where the transition happens.
The U-shaped distribution falls out for free
Once you have p* and HGT-selection balance, the U-shaped distribution of gene frequencies that we see in every pangenome study just emerges. Here’s why.
Each gene has its own p* depending on its specific cost and benefit. Real bacterial genes span a wide range of c/s ratios, so they span a wide range of p* values. Now think about the environment. The probability that a given environment occurs (the gene’s actual p) is also a distribution. Most environments are rare, a few are common. This is a generic feature of ecological variability.
When you cross these two distributions, you get a U-shape. Many genes have p well above their own p*, so they’re nearly fixed (the right peak of the U: the core). Many genes have p well below their own p*, so they’re held at low equilibrium by HGT (the left peak of the U: the rare accessories). Few genes happen to sit in the narrow band where their p is close to their p*, which is what would produce intermediate frequencies.
The U-shape isn’t a separate phenomenon needing its own explanation. It’s a consequence of the two thresholds.
The empirical test: insurance versus niche adaptation
Theory is one thing. The decisive question is whether real pangenomes look like bet-hedging when you actually look.
The alternative to bet-hedging is niche adaptation. Under niche adaptation, accessory genes are tied to specific environments. A gene that helps in the gut is present in gut-isolated bacteria and absent elsewhere. Under bet-hedging, the same gene should be present in gut isolates more often, but it should also be substantially present in bacteria isolated from other body sites, carried as insurance against environmental shifts.
These two models make different predictions about the same data, and the E. coli pangenome has the statistical power to distinguish them.
I used the Horesh dataset of over 10,000 E. coli genomes, focusing on 1,705 phylogroup B2 isolates from three body sites: blood, feces, and urine. These three are physiologically distinct (different pH, oxygen levels, nutrient profiles), so they should impose different selection pressures on accessory genes.
The first observation: gene content really does track body site. PERMANOVA decomposition shows that 13% of variance in accessory gene content is explained by isolation source, with 43% by phylogroup and 44% residual. So there is real environment coupling. Niche adaptation has some grip on the data.
The second observation: niche genes are heavily retained in away environments. For every gene classified as niche-specific, I defined its “home” body site as the one where it occurs most often, and measured its frequency in the best “away” site. The ratio of the away frequency to the home frequency is the retention fraction.
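As a concrete sketch of that definition: the function below takes per-site frequencies for one gene and returns its home site and retention fraction. It assumes the "best" away site means the non-home site with the highest frequency, and the example frequencies are made up for illustration.

```python
def retention_fraction(site_frequencies):
    """Return (home site, retention) for one gene.

    site_frequencies maps body site -> fraction of isolates carrying the gene.
    Retention is the highest away-site frequency divided by the home frequency.
    """
    home = max(site_frequencies, key=site_frequencies.get)
    home_freq = site_frequencies[home]
    if home_freq == 0:
        raise ValueError("gene is absent from every site")
    away_freq = max(f for site, f in site_frequencies.items() if site != home)
    return home, away_freq / home_freq

# Hypothetical gene: most common in feces, still frequent in urine isolates.
home, retention = retention_fraction({"blood": 0.2, "feces": 0.8, "urine": 0.5})
print(home, round(retention, 3))  # feces 0.625
```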
Under strict niche adaptation, retention should be close to zero. Genes adapted to the gut shouldn’t be carried in blood isolates.
The median retention is 0.63. Niche genes are kept at nearly two-thirds of their home frequency in non-home environments. Only 1% of niche genes have retention below 0.20. Even genes with the strongest niche signal (top quartile of effect size) have median retention of 0.61. These genes are not absent from the wrong environment. They are heavily present.
This already looks more like bet-hedging than niche adaptation. But there’s a sharper test.
The decisive test: the slope
Migration-selection balance can in principle produce retention of niche genes in away environments. If migration is fast and selection against the gene in the away environment is weak, the gene will still be present in away environments at some equilibrium frequency.
The way to distinguish migration-selection balance from bet-hedging is to look at how retention scales with niche-effect size. Under migration-selection balance, genes with stronger niche effects (higher Cramér’s V, which measures the strength of the gene-environment association) should face stronger counter-selection in away niches. So you’d expect a steep negative relationship between effect size and retention.
Under bet-hedging, this relationship should be flat. Insurance logic is symmetric across environments, so the strength of a gene’s niche association in its home environment shouldn’t predict how much of it gets retained away.
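For readers who haven't met Cramér's V, here is a minimal sketch of the standard definition, V = √(χ² / (n × (min(rows, columns) − 1))), applied to a gene-by-site contingency table. The counts are hypothetical, and the code assumes no row or column is entirely empty.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.
    Assumes every row total and column total is non-zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Rows: gene present / gene absent. Columns: blood, feces, urine (made-up counts).
table = [[30, 70, 40],
         [70, 30, 60]]
print(round(cramers_v(table), 3))
```

V runs from 0 (gene presence is independent of body site) to 1 (presence is fully determined by site).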
I ran 500 simulations under each model to see how tightly retention tracks effect size (measured by R²) in each scenario. Migration-selection balance produces R² ≈ 0.41 (median). Bet-hedging produces R² ≈ 0.07. These distributions don’t overlap.
The observed value in E. coli is R² = 0.076. Sits cleanly inside the bet-hedging distribution. Sits entirely outside the migration distribution. The lowest R² produced by any of 500 migration simulations was 0.383, more than five times the observed value.
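For reference, R² here is the squared correlation between effect size and retention in a simple linear fit. The sketch below computes it from scratch and then checks the comparison quoted above; the function name is illustrative and the two small data sets exist only to show the two extremes.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x.
    Assumes x and y both have non-zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly linear: 1.0
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear trend: 0.0

# Values quoted in the text.
observed = 0.076
lowest_migration = 0.383
print(round(lowest_migration / observed, 2))  # 5.04
```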
The retention of niche genes in away environments has the statistical signature of bet-hedging, not migration-selection balance.
Fitness consequences
Pattern doesn’t automatically imply functional consequence. To check that the bet-hedging architecture actually affects fitness, I compared three strategies on the same gene set:
- Pure niche: carry only genes optimised for the current home environment.
- Pure bet-hedging: carry every gene at its cross-niche mean frequency.
- Observed E. coli: use the actual per-niche frequencies measured in the data.
For each, I computed expected fitness as a product across genes, integrating over environmental switching rates.
The pure niche strategy loses 55% of fitness at p = 0 (no switching). Purging away-niche genes is expensive when the environment shifts. The observed strategy performs within 4% of pure bet-hedging at every switching rate tested. Even in the worst transition, the observed strategy retains 96% of fitness while pure niche retains only 60%.
E. coli is, in fitness terms, almost a pure bet-hedger with some niche flavouring on top.
The cross-species check
If bet-hedging is a general principle, the variance constraint should show up across species, not just within E. coli.
I reanalysed pangenome data from 670 prokaryotic species, comparing functional genes to pseudogenes. Pseudogenes are useful because they evolve without selective constraint and serve as a neutral baseline for gene turnover.
Two patterns emerge. First, functional accessory genes have singleton rates roughly 8.5-fold lower than pseudogenes. Functional genes are shared across genomes at stable frequencies. Pseudogenes turn over rapidly, each confined to the genome it arose in. This is what HGT-selection balance predicts.
Second, the coefficient of variation of gene content is systematically lower than the coefficient of variation of pseudogene content at matched means. And this constraint tightens with selection intensity. Species under stronger purifying selection (lower dN/dS) show proportionally less variance in gene content.
Selection isn’t just retaining useful genes. It is suppressing variance. That is the cross-species signature of bet-hedging.
What this means
A few things follow if this framework is correct.
The core-accessory distinction dissolves.
Core genes are not a different kind of thing from accessory genes. They are simply genes whose p sits comfortably above their own p*, so selection drives them to fixation. Accessory genes are genes whose p sits below their p*, so they are held at low equilibrium by HGT. Same dynamics, same threshold equation. A continuum, no divide.
Accessory genes are not optional.
“Accessory” is a misleading word because it implies dispensable. Under bet-hedging, accessory genes are essential to the population across evolutionary time, even if not to every cell in every generation. A gene rarely needed is not a gene unneeded. It is insurance, and populations without insurance don’t last.
No single prokaryotic genome is complete on its own.
Each genome samples from a larger whole and executes one strategy from a population-level portfolio. The cell is best understood as a participant in a distributed system. Lineage persistence depends on diversity the individual does not itself carry.
Pangenomes are the inevitable consequence of life under uncertainty.
Above the complexity threshold Ecrit = s/c, no single genome can afford full coverage of all contingencies. The cost grows faster than the benefit. Distribution becomes obligate. Bacteria have run this strategy for nearly four billion years, through climates and chemistries no individual genome could have anticipated. The pangenome is a large part of why they’re still here.
A note on what the equations are telling you
If you read only one bit of maths from all of this, read these two lines.
p* = c / (s + c)
If the environment in which the gene helps occurs more often than this threshold, selection wins and the gene fixes. If less often, selection alone would lose the gene, and HGT has to compensate.
Ecrit = s / c
If a bacterium faces fewer adaptive challenges than this threshold, individual genomes can be comprehensive. Above the threshold, comprehensiveness collapses and distribution becomes mandatory.
Everything else in this paper, including the U-shape, the rare-gene persistence, the genome size plateau, the within-species E. coli results, the cross-species variance constraint, follows from these two thresholds and the geometric-mean argument. No additional mechanisms required.
That’s the claim, anyway. Pushback welcome.
Questions from readers
A few questions that came up after the preprint went online. If you have more, please get in touch.
1. Does the model distinguish between presence and expression? If a gene is tightly regulated, isn’t the cost basically just replicating the DNA?
The model rolls regulation into the carriage cost c. A silent or tightly regulated gene has a very small c — basically the DNA replication burden plus any leaky-expression risk — while a constitutively expressed gene has a much larger c. So c in the framework is an effective parameter that aggregates replication, transcription, translation, and any folding or aggregation costs.
This actually strengthens the bet-hedging argument rather than weakening it. Plug a smaller c into p* = c / (s + c) and the threshold drops. Tightly regulated genes can persist at much lower environmental frequencies than unregulated ones. Regulation makes the insurance reservoir cheaper to maintain, so populations can carry broader portfolios. From this angle, conditional regulation is part of the bet-hedging architecture — a way for the cell to pay the expensive part of the gene only when it pays off.
What the model doesn’t do is track expression dynamics explicitly. A more granular version would split c into a baseline replication cost and a conditional expression cost, but the threshold structure is unchanged.
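One way such a split could look, as a sketch only. The decomposition c = c_replication + leak × c_expression, and all three parameter names and values, are assumptions made for illustration and are not part of the model in the paper.

```python
def effective_cost(c_replication, c_expression, leak):
    """Hypothetical effective carriage cost: a baseline replication cost plus
    the expression cost scaled by how often the gene is expressed when it
    is not needed (leak = 0 is perfectly silent, leak = 1 is constitutive)."""
    return c_replication + leak * c_expression

def selection_threshold(s, c):
    """p* = c / (s + c), the threshold used throughout."""
    return c / (s + c)

s = 0.3
for label, leak in [("tightly regulated", 0.0),
                    ("leaky", 0.2),
                    ("constitutive", 1.0)]:
    c = effective_cost(c_replication=0.001, c_expression=0.02, leak=leak)
    print(f"{label}: c = {c:.4f}, p* = {selection_threshold(s, c):.4f}")
```

The threshold equation is untouched; regulation only changes the value of c fed into it, which is the sense in which the threshold structure is unchanged.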
2. How does functional redundancy fit in? If two genes do roughly the same thing, or if one gene changes the cost of another, does the model still treat each gene’s cost and benefit as independent?
The model treats each gene as having independent s and c, which is a deliberate simplification. Both cases the question asks about can be handled within the same framework.
Functional redundancy. If two genes do roughly the same job, the marginal s of the second is small — its benefit only materialises when the first fails or is absent. So the second gene’s effective p* is high and selection’s grip on it is weak. Redundancy is a soft form of within-cell bet-hedging.
Epistasis. A gene’s cost can depend on what else is present (a transporter is only useful with the right downstream enzyme). To handle this, bundle interacting genes as functional units and apply the same thresholds at the unit level. The maths doesn’t change, only the granularity.
Bet-hedging operates on functional capability, not specific genes. The population needs coverage of E adaptive challenges, and which gene combination delivers that coverage is implementation detail. The thresholds describe the architecture; the per-gene parameters are a convenient but somewhat arbitrary partition of it.
3. Costs and benefits aren’t binary in real environments. They vary along gradients. Does the threshold framework hold when s and c are distributions rather than fixed values?
Yes, it smears, but the qualitative structure survives — and the smearing is what we actually observe.
When s and c are distributions, p* = c / (s + c) becomes a distribution of thresholds across genes. A given gene with environmental frequency p doesn’t sit cleanly on one side of “the” threshold; it has a probability of being net favoured. The binary “selection wins / HGT wins” becomes a gradient.
This is why the U-shape comes out as a distribution rather than two spikes. Real pangenomes don’t have all genes at two frequencies — they have a continuous distribution peaked at the extremes, which is what you’d predict if the underlying parameters are themselves distributed.
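A small deterministic sketch of how one threshold becomes a gradient. Instead of sampling, it crosses a handful of s values with a handful of c values and asks what fraction of the combinations leave a gene with environmental frequency p net favoured. The grids are illustrative choices, not estimates from data.

```python
from itertools import product

def fraction_favoured(p, s_values, c_values):
    """Fraction of (s, c) combinations for which p exceeds p* = c / (s + c)."""
    thresholds = [c / (s + c) for s, c in product(s_values, c_values)]
    return sum(p > t for t in thresholds) / len(thresholds)

s_values = [0.1, 0.3]    # assumed spread of benefits
c_values = [0.02, 0.05]  # assumed spread of carriage costs

for p in (0.05, 0.10, 0.20, 0.50):
    print(p, fraction_favoured(p, s_values, c_values))
```

As p rises, the favoured fraction climbs in steps from 0 to 1 rather than jumping at a single value; with continuous distributions of s and c the steps blur into a smooth curve.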
For temporal gradients — s and c shifting continuously over time rather than switching between discrete states — the variance / geometric-mean argument gets stronger, not weaker. Continuous environmental fluctuation is precisely what generates the fitness variance the model responds to. I use a switching environment for tractability, but the geometric-mean penalty doesn’t care about the shape of the fluctuation, only its variance.
So the framework is a deliberately stripped-down skeleton. The thresholds give the qualitative architecture; the empirical patterns (continuous frequency distributions, the ~0.6 retention values, the cross-species variance constraint) are what you’d expect from that skeleton operating under realistic distributions of s, c, and environmental noise.
4. How should I think about h and f? Is h a bare recombination rate, or does it include the frequency of the gene in the source population?
In the model, h is an effective net rate, not a bare recombination rate.
In the empirical analysis, f is species-level frequency — e.g. the fraction of all E. coli genomes carrying the gene. For the equilibrium relation f ≈ h / |snet| to give a non-zero f against selection, h has to represent input from outside the focal lineage. Pure intra-species recombination among carriers and non-carriers redistributes f but can’t sustain it against purifying selection. So h is, in mechanism, a rate of gene re-entry from a broader gene pool: other strains, related species, MGEs, environmental DNA.
Mechanistically, h decomposes as h = r × pext, where r is the bare per-cell per-generation recombination/transfer rate and pext is the frequency of the gene in the external source pool. The h values quoted in the paper (~10⁻² for conjugative plasmids down to ~10⁻⁶–10⁻⁸ for chromosomal homologous recombination) are intended as effective rates that already absorb pext. They’re calibrated to empirically observed transfer rates, not bare encounter rates with pext = 1.
This matters when reading the predictions. AMR cassettes carried on broad-host-range conjugative elements have substantial pext across Enterobacteriaceae and beyond, so the effective rate heff = r × pext stays moderate. Genes that are uniformly rare across the broader pool do give small heff, and the model predicts correspondingly low equilibrium f. The U-shape architecture is robust to this; the rare peak just sits closer to the floor.
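The bookkeeping can be sketched in a few lines, using h = r × pext and f ≈ h / |snet|. The values of r, pext and |snet| below are assumed for illustration; only the two relations come from the text.

```python
def effective_hgt_rate(r, p_ext):
    """Effective re-entry rate: bare transfer rate times the gene's
    frequency in the external source pool."""
    return r * p_ext

def equilibrium_frequency(h_eff, s_net_abs):
    """f ~ h_eff / |s_net|, capped at 1."""
    return min(1.0, h_eff / s_net_abs)

R = 1e-2          # assumed bare transfer rate for a conjugative element
S_NET_ABS = 0.03  # assumed net selection against the gene

for label, p_ext in [("common in the wider pool", 0.5),
                     ("rare in the wider pool", 0.01)]:
    h_eff = effective_hgt_rate(R, p_ext)
    f = equilibrium_frequency(h_eff, S_NET_ABS)
    print(f"{label}: h_eff = {h_eff:g}, f = {f:.4f}")
```

The same transfer machinery gives very different equilibrium frequencies depending on how well stocked the external pool is.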
A more complete version would track a coupled system: focal-species f alongside a broader gene pool with its own dynamics, with heff = r × pext doing the bookkeeping. I haven’t done that here. It’s a natural follow-up paper, and would let one ask which genes are pool-supported vs. self-supported, why some species are more “open” than others (gene-pool diversity rather than just r), and how the pool itself evolves. The present paper treats h as an effective net rate to keep the model minimal and the predictions testable; the decomposition is the next step rather than a footnote in this one.
Thanks to Paul Carini (questions 1–3) and Ben Good (question 4) for raising these in discussion of the preprint on Bluesky.