Mapping the bias and failure modes of a classic evolutionary test across 500 replicates, thousands of conditions, and two population structures.
The QST–FST comparison is a cornerstone method in evolutionary genetics. The logic: if phenotypic divergence between populations (QST) exceeds neutral genetic divergence (FST), then diversifying selection is likely acting on the trait. If QST ≈ FST, drift alone can explain the pattern.
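For a purely additive diploid trait, the quantity compared against FST is usually defined as

```latex
Q_{ST} \;=\; \frac{\sigma^2_B}{\sigma^2_B + 2\,\sigma^2_W}
```

where \(\sigma^2_B\) is the between-population and \(\sigma^2_W\) the within-population additive genetic variance; the factor of 2 reflects diploidy. The neutral expectation \(E[Q_{ST}] = F_{ST}\) holds only under these additive assumptions, which is precisely why the trait architectures examined below matter.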
The theoretical framework is elegant, but its assumptions are rarely satisfied in real data. This thesis asked: under what conditions does the test produce false positives or false negatives — and can we quantify those failure modes precisely?
The specific axes of investigation were trait architecture (number of QTLs, dominance, epistasis), population structure (island model vs. stepping-stone), and MAF filtering thresholds.
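As a sketch of how such a factorial design can be enumerated (the level values below are illustrative placeholders, not the exact levels used in the thesis):

```python
from itertools import product

# Hypothetical enumeration of the factorial condition grid described above.
# The specific levels here are assumed examples, not the thesis's exact values.
N_QTLS = [10, 100]                                          # trait architecture axis
ARCHITECTURES = ['additive', 'dominant', 'recessive', 'epistatic']
DEMOGRAPHIES = ['island', 'stepping_stone']                 # population structure axis
MAF_THRESHOLDS = [0.0, 0.01, 0.05]                          # MAF filtering axis

CONDITIONS = [
    {'n_qtls': n, 'arch': a, 'demography': d, 'maf': m}
    for n, a, d, m in product(N_QTLS, ARCHITECTURES, DEMOGRAPHIES, MAF_THRESHOLDS)
]
print(len(CONDITIONS))  # 2 × 4 × 2 × 3 = 48 conditions, each replicated 500 times
```

Enumerating the grid up front gives each condition a stable integer index, which is convenient for dispatching SLURM array jobs.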
All simulations were run in SLiM 5.0, a forward-time population genetics simulator. Each condition was replicated 500 times to obtain stable false-positive rate estimates.
Demographic models: Island model (all demes equally connected, panmictic assumption holds) and stepping-stone model (demes connected only to neighbours, creates isolation-by-distance).
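The two demographies differ only in their migration matrices. A minimal sketch of the contrast, assuming a total per-generation emigration fraction `m` (the helper names and the non-circular 1-D arrangement are assumptions, not taken from the thesis):

```python
import numpy as np

def island_matrix(n_demes, m):
    """Island model: each deme sends a total fraction m of migrants,
    split equally among all other demes (no isolation-by-distance)."""
    M = np.full((n_demes, n_demes), m / (n_demes - 1))
    np.fill_diagonal(M, 1 - m)
    return M

def stepping_stone_matrix(n_demes, m):
    """1-D stepping-stone model: migration only between adjacent demes,
    which generates isolation-by-distance."""
    M = np.zeros((n_demes, n_demes))
    for i in range(n_demes):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n_demes]
        for j in neighbours:
            M[i, j] = m / len(neighbours)
        M[i, i] = 1 - m
    return M
```

Both constructions keep row sums at 1, so each row is a valid per-deme migration distribution.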
Trait architectures: Additive, dominant (h=1), recessive (h=0), and epistatic (pairwise interactions). Effect sizes were drawn from an L-shaped distribution to mimic empirical QTL studies.
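One common way to generate an L-shaped effect-size distribution is exponentially distributed magnitudes with random signs; this is a sketch under that assumption, and the thesis may have used a different L-shaped family (e.g. gamma):

```python
import numpy as np

def draw_effect_sizes(n_qtls, mean_effect=0.1, rng=None):
    """Draw QTL effect sizes from an L-shaped distribution:
    many loci of small effect, a few of large effect.
    Sketch: exponential magnitudes, each given a random sign."""
    rng = rng or np.random.default_rng()
    magnitudes = rng.exponential(scale=mean_effect, size=n_qtls)
    signs = rng.choice([-1.0, 1.0], size=n_qtls)
    return magnitudes * signs
```

With an exponential, roughly 63% of effects fall below the mean, which mimics the skew typically reported in empirical QTL studies.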
```python
# SLiM simulation dispatch — Python wrapper
import subprocess

def run_condition(n_qtls, arch, demography, replicate_id):
    slim_script = generate_slim_script(
        n_loci=n_qtls,
        h=DOMINANCE[arch],       # 0, 0.5, 1
        n_demes=10,
        model=demography,        # 'island' | 'stepping_stone'
        seed=replicate_id * 42,
    )
    result = subprocess.run(
        ['slim', '-s', str(replicate_id), slim_script],
        capture_output=True,
        text=True,               # decode stdout to str for parsing
    )
    return parse_slim_output(result.stdout)

# SLURM array — 500 replicates × N conditions in parallel
# sbatch --array=0-499 run_simulation.sh ${CONDITION_ID}
```
All FPR estimates are reported with exact binomial confidence intervals at the 95% level. Differences between conditions were tested with Wilcoxon rank-sum tests, a non-parametric choice appropriate here because QST distributions are rarely normal.
The 500-replicate design was chosen to give confidence intervals narrow enough to distinguish FPR = 5% (the nominal level) from FPR = 10% with adequate power. Power analysis confirmed this was sufficient for the expected effect sizes.
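The power claim can be checked directly. A minimal sketch (my own calculation, not code from the thesis): find the upper rejection threshold of an exact binomial test at the nominal level, then evaluate that tail under the inflated rate.

```python
from scipy.stats import binom

def binomial_test_power(n, p0, p1, alpha=0.05):
    """Power of the upper tail of a two-sided exact binomial test:
    smallest k with P(X >= k | p0) <= alpha/2, evaluated under p1."""
    k = 0
    while binom.sf(k - 1, n, p0) > alpha / 2:
        k += 1
    return binom.sf(k - 1, n, p1)

# Power to distinguish a true FPR of 10% from the nominal 5% with 500 replicates
power = binomial_test_power(n=500, p0=0.05, p1=0.10)
```

For n = 500 this comes out well above 0.9, consistent with the design rationale stated above.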
```r
# R — exact binomial CI for FPR estimate
compute_fpr_ci <- function(n_replicates, n_false_pos, alpha = 0.05) {
  binom.test(
    x = n_false_pos,
    n = n_replicates,
    p = alpha,
    alternative = "two.sided"
  ) |> broom::tidy()
}

# Wilcoxon test for QST distribution differences
wilcox.test(
  qst_island ~ architecture,
  data = sim_results,
  paired = FALSE,
  exact = FALSE
)
```
Running 500 replicates across dozens of conditions required careful HPC job management. The pipeline was orchestrated with SLURM job arrays — each condition dispatched as a separate array job, with output files collected and merged post-hoc.
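Inside each array task, the job identity has to be reconstructed from the SLURM environment. A hypothetical sketch of that mapping, matching the `sbatch --array=0-499 run_simulation.sh ${CONDITION_ID}` invocation shown earlier (the function name and argument handling are assumptions):

```python
import os
import sys

def job_from_environment(conditions, argv=None, env=None):
    """Hypothetical: resolve (condition, replicate_id) for one SLURM array task.
    The replicate index comes from SLURM_ARRAY_TASK_ID (0-499); the condition
    index is passed as the first CLI argument by the sbatch invocation."""
    argv = argv if argv is not None else sys.argv[1:]
    env = env if env is not None else os.environ
    condition = conditions[int(argv[0])]
    replicate_id = int(env['SLURM_ARRAY_TASK_ID'])
    return condition, replicate_id
```

Keeping this mapping in one place makes the post-hoc merge step simple: each task writes one output file named by its (condition, replicate) pair.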
The entire simulation-to-result pipeline is reproducible: a single shell script regenerates all figures from raw SLiM output, all random seeds are recorded, and intermediate files are versioned.