Mapping the bias and failure modes of a classic evolutionary test across 500 replicates, thousands of conditions, and two population structures.
The QST–FST comparison is a cornerstone method in evolutionary genetics. The logic: if phenotypic divergence between populations (QST) exceeds neutral genetic divergence (FST), then diversifying selection is likely acting on the trait. If QST ≈ FST, drift alone can explain the pattern.
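For a purely additive diploid trait, the quantity compared against FST is usually defined as

```latex
Q_{ST} \;=\; \frac{\sigma^2_B}{\sigma^2_B + 2\,\sigma^2_W}
```

where \(\sigma^2_B\) is the between-population and \(\sigma^2_W\) the within-population additive genetic variance; the factor of 2 reflects diploidy. The neutral expectation \(E[Q_{ST}] = F_{ST}\) holds only under these additive assumptions, which is precisely why the trait architectures examined below matter.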
The theoretical framework is elegant, but its assumptions are rarely satisfied in real data. This thesis asked: under what conditions does the test produce false positives or false negatives — and can we quantify those failure modes precisely?
The specific axes of investigation were trait architecture (number of QTLs, dominance, epistasis), population structure (island model vs. stepping-stone), and MAF filtering thresholds.
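As a sketch of how such a factorial design can be enumerated (the level values below are illustrative placeholders, not the exact levels used in the thesis):

```python
from itertools import product

# Hypothetical enumeration of the factorial condition grid described above.
# The specific levels here are assumed examples, not the thesis's exact values.
N_QTLS = [10, 100]                                          # trait architecture axis
ARCHITECTURES = ['additive', 'dominant', 'recessive', 'epistatic']
DEMOGRAPHIES = ['island', 'stepping_stone']                 # population structure axis
MAF_THRESHOLDS = [0.0, 0.01, 0.05]                          # MAF filtering axis

CONDITIONS = [
    {'n_qtls': n, 'arch': a, 'demography': d, 'maf': m}
    for n, a, d, m in product(N_QTLS, ARCHITECTURES, DEMOGRAPHIES, MAF_THRESHOLDS)
]
print(len(CONDITIONS))  # 2 × 4 × 2 × 3 = 48 conditions, each replicated 500 times
```

Enumerating the grid up front gives each condition a stable integer index, which is convenient for dispatching SLURM array jobs.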
All simulations were run in SLiM 5.0, a forward-time population genetics simulator. Each condition was replicated 500 times to obtain stable false-positive rate estimates.
Demographic models: Island model (all demes equally connected, panmictic assumption holds) and stepping-stone model (demes connected only to neighbours, creates isolation-by-distance).
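The two demographies differ only in their migration matrices. A minimal sketch of the contrast, assuming a total per-generation emigration fraction `m` (the helper names and the non-circular 1-D arrangement are assumptions, not taken from the thesis):

```python
import numpy as np

def island_matrix(n_demes, m):
    """Island model: each deme sends a total fraction m of migrants,
    split equally among all other demes (no isolation-by-distance)."""
    M = np.full((n_demes, n_demes), m / (n_demes - 1))
    np.fill_diagonal(M, 1 - m)
    return M

def stepping_stone_matrix(n_demes, m):
    """1-D stepping-stone model: migration only between adjacent demes,
    which generates isolation-by-distance."""
    M = np.zeros((n_demes, n_demes))
    for i in range(n_demes):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n_demes]
        for j in neighbours:
            M[i, j] = m / len(neighbours)
        M[i, i] = 1 - m
    return M
```

Both constructions keep row sums at 1, so each row is a valid per-deme migration distribution.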
Trait architectures: Additive, dominant (h=1), recessive (h=0), and epistatic (pairwise interactions). Effect sizes were drawn from an L-shaped distribution to mimic empirical QTL studies.
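One common way to generate an L-shaped effect-size distribution is exponentially distributed magnitudes with random signs; this is a sketch under that assumption, and the thesis may have used a different L-shaped family (e.g. gamma):

```python
import numpy as np

def draw_effect_sizes(n_qtls, mean_effect=0.1, rng=None):
    """Draw QTL effect sizes from an L-shaped distribution:
    many loci of small effect, a few of large effect.
    Sketch: exponential magnitudes, each given a random sign."""
    rng = rng or np.random.default_rng()
    magnitudes = rng.exponential(scale=mean_effect, size=n_qtls)
    signs = rng.choice([-1.0, 1.0], size=n_qtls)
    return magnitudes * signs
```

With an exponential, roughly 63% of effects fall below the mean, which mimics the skew typically reported in empirical QTL studies.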
```python
# SLiM simulation dispatch — Python wrapper
import subprocess

def run_condition(n_qtls, arch, demography, replicate_id):
    slim_script = generate_slim_script(
        n_loci=n_qtls,
        h=DOMINANCE[arch],       # 0, 0.5, 1
        n_demes=10,
        model=demography,        # 'island' | 'stepping_stone'
        seed=replicate_id * 42,
    )
    result = subprocess.run(
        ['slim', '-s', str(replicate_id), slim_script],
        capture_output=True,
        text=True,               # decode stdout to str for parsing
    )
    return parse_slim_output(result.stdout)

# SLURM array — 500 replicates × N conditions in parallel
# sbatch --array=0-499 run_simulation.sh ${CONDITION_ID}
```
All FPR estimates are reported with exact binomial confidence intervals at the 95% level. Differences between conditions were tested with Wilcoxon rank-sum tests, a non-parametric choice appropriate here because QST distributions are rarely normal.
The 500-replicate design was chosen to give confidence intervals narrow enough to distinguish FPR = 5% (the nominal level) from FPR = 10% with adequate power. Power analysis confirmed this was sufficient for the expected effect sizes.
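The power claim can be checked directly. A minimal sketch (my own calculation, not code from the thesis): find the upper rejection threshold of an exact binomial test at the nominal level, then evaluate that tail under the inflated rate.

```python
from scipy.stats import binom

def binomial_test_power(n, p0, p1, alpha=0.05):
    """Power of the upper tail of a two-sided exact binomial test:
    smallest k with P(X >= k | p0) <= alpha/2, evaluated under p1."""
    k = 0
    while binom.sf(k - 1, n, p0) > alpha / 2:
        k += 1
    return binom.sf(k - 1, n, p1)

# Power to distinguish a true FPR of 10% from the nominal 5% with 500 replicates
power = binomial_test_power(n=500, p0=0.05, p1=0.10)
```

For n = 500 this comes out well above 0.9, consistent with the design rationale stated above.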
```r
# R — exact binomial CI for FPR estimate
compute_fpr_ci <- function(n_replicates, n_false_pos, alpha = 0.05) {
  binom.test(
    x = n_false_pos,
    n = n_replicates,
    p = alpha,
    alternative = "two.sided"
  ) |> broom::tidy()
}

# Wilcoxon test for QST distribution differences
wilcox.test(
  qst_island ~ architecture,
  data = sim_results,
  paired = FALSE,
  exact = FALSE
)
```
Running 500 replicates across dozens of conditions required careful HPC job management. The pipeline was orchestrated with SLURM job arrays — each condition dispatched as a separate array job, with output files collected and merged post-hoc.
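Inside each array task, the job identity has to be reconstructed from the SLURM environment. A hypothetical sketch of that mapping, matching the `sbatch --array=0-499 run_simulation.sh ${CONDITION_ID}` invocation shown earlier (the function name and argument handling are assumptions):

```python
import os
import sys

def job_from_environment(conditions, argv=None, env=None):
    """Hypothetical: resolve (condition, replicate_id) for one SLURM array task.
    The replicate index comes from SLURM_ARRAY_TASK_ID (0-499); the condition
    index is passed as the first CLI argument by the sbatch invocation."""
    argv = argv if argv is not None else sys.argv[1:]
    env = env if env is not None else os.environ
    condition = conditions[int(argv[0])]
    replicate_id = int(env['SLURM_ARRAY_TASK_ID'])
    return condition, replicate_id
```

Keeping this mapping in one place makes the post-hoc merge step simple: each task writes one output file named by its (condition, replicate) pair.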
The entire simulation-to-result pipeline is reproducible: a single shell script regenerates all figures from raw SLiM output, all random seeds are recorded, and intermediate files are versioned.