A multi-branch CNN with attention mechanisms that learns to distinguish local adaptation from genetic drift — trained on spatially explicit forward-time simulations.
Distinguishing local adaptation from neutral genetic drift is a fundamental challenge in population genetics. Classical approaches rely on QST–FST comparisons — but as my Master's thesis demonstrated, this framework structurally breaks down under isolation-by-distance and non-additive trait architectures, where false positive rates remain elevated regardless of filtering strategy.
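For readers unfamiliar with the classical framework: QST summarises trait differentiation between demes, FST summarises neutral allele-frequency differentiation, and under neutrality the two are expected to be roughly equal. A minimal numpy sketch of both statistics (toy inputs, not from the project; simple estimators for illustration only):

```python
import numpy as np

def qst(v_between, v_within):
    """QST = Vb / (Vb + 2*Vw) for a diploid, purely additive trait."""
    return v_between / (v_between + 2.0 * v_within)

def fst(p):
    """Simple FST estimate from per-deme allele frequencies p."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean()
    return p.var() / (p_bar * (1.0 - p_bar))

# Under neutrality E[QST] is approximately E[FST]; a large excess of QST
# is read as evidence of divergent local adaptation.
print(qst(0.4, 0.3))                 # → 0.4
print(fst([0.1, 0.3, 0.5, 0.7]))     # ≈ 0.2083
```

The comparison breaks down precisely when these expectations diverge under neutrality, e.g. under isolation-by-distance, which is the failure mode the thesis documented.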
Rather than relying on these traditional comparisons, this project trains a neural network directly on spatially explicit forward-time simulations to learn the complex genomic and phenotypic signatures that differentiate adaptation from drift. The key innovation is using Simulation-Based Inference (SBI) — bypassing the need for confirmed empirical datasets by training on ground-truth simulations where the answer is always known.
The model achieves 100% validation accuracy on held-out simulations, converging in approximately 27 epochs. This demonstrates that the spatial genomic and phenotypic patterns generated by selection are sufficiently distinct from drift patterns for a neural network to learn a reliable decision boundary — even across the demographic scenarios where QST–FST fails.
The model is a multi-branch CNN combining three components: a genomic branch processing spatial allele frequency matrices, a phenotypic branch processing per-deme trait summary statistics, and attention mechanisms with spatial gradient detection to weight the most informative regions of the input.
```
# Multi-branch architecture — genomic + phenotypic fusion
# Branch A: spatial allele-frequency CNN
#   → 1D-CNN detecting haplotype sweeps & linkage patterns
#   → Attention mechanism over spatial (deme) axis
#   → Spatial gradient detection for selection signals
# Branch B: phenotypic MLP
#   → Per-deme trait means, QST, variance components
# Fusion head: concatenate → binary classification
#   → P(local adaptation) vs P(drift)
# Training data: 3,000 forward-time SLiM simulations
#   → Island model + 1D stepping-stone demographics
#   → Balanced: adaptation vs drift scenarios
```
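The data flow of this architecture can be sketched as a minimal numpy forward pass. Everything here is hypothetical (shapes, random weights, a single sample); the real model is a trained deep network, but the branch-and-fuse structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, P = 50, 10, 4        # loci, demes, phenotypic summaries per deme

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def genomic_branch(freqs, w_conv, w_attn):
    """1D conv over the deme axis per locus, then attention-weighted pooling."""
    conv = np.stack([np.convolve(row, w_conv, mode="valid") for row in freqs])
    feat = np.maximum(conv, 0.0)                  # (L, D-2) after ReLU
    attn = softmax(feat.mean(axis=0) * w_attn)    # weights over spatial positions
    return feat @ attn                            # (L,) pooled embedding

def phenotypic_branch(pheno, w1, w2):
    """Small MLP over flattened per-deme trait summaries."""
    h = np.maximum(pheno.reshape(-1) @ w1, 0.0)
    return h @ w2

def classify(freqs, pheno, params):
    """Concatenate branch embeddings, output P(local adaptation)."""
    z = np.concatenate([
        genomic_branch(freqs, params["conv"], params["attn"]),
        phenotypic_branch(pheno, params["w1"], params["w2"]),
    ])
    return 1.0 / (1.0 + np.exp(-(z @ params["head"])))

params = {
    "conv": rng.normal(size=3),
    "attn": rng.normal(size=D - 2),
    "w1": rng.normal(size=(D * P, 8)),
    "w2": rng.normal(size=(8, 8)),
    "head": rng.normal(size=L + 8),
}
prob = classify(rng.uniform(size=(L, D)), rng.normal(size=(D, P)), params)
print(prob)   # a probability in (0, 1)
```

The fusion head sees both branches at once, so the decision boundary can combine genomic sweep signatures with phenotypic divergence rather than thresholding either alone.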
Why attention mechanisms? Not all loci and not all demes contribute equally to the selection signal. Attention lets the model learn which spatial positions and genomic regions are most informative for each scenario — particularly important under stepping-stone structure where the relevant signal is localised in the geographic gradient.
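The effect described above is easy to see in isolation. In this toy example (hypothetical features and query vector), softmax attention over the deme axis concentrates nearly all of the pooling weight on the one deme carrying a strong signal:

```python
import numpy as np

def attention_pool(features, query):
    """features: (demes, dim); query: (dim,) learned vector."""
    scores = features @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax weights over demes
    return w, w @ features            # weights and weighted-pooled features

features = np.array([[0.1, 0.0],
                     [0.2, 0.1],
                     [2.0, 1.5],     # deme carrying the selection signal
                     [0.1, 0.2]])
query = np.array([1.0, 1.0])
w, pooled = attention_pool(features, query)
print(np.round(w, 3))    # weight concentrates on the third deme
```

A plain mean over demes would dilute that deme's contribution by a factor of four; attention lets the network recover it.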
Why SLiM + tskit over VCF? Tree Sequence Recording stores the full genealogy in a compact format, making it practical to generate thousands of replicates on personal hardware without HPC infrastructure. Each simulation produces a complete ground-truth label (adaptation or drift), making it ideal for supervised training.
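Given a sites-by-samples genotype matrix exported from a tree sequence (e.g. via tskit's `TreeSequence.genotype_matrix()`), assembling the spatial allele-frequency input reduces to a per-deme mean. A sketch with a tiny hand-made matrix and assumed deme assignments:

```python
import numpy as np

def deme_frequency_matrix(genotypes, deme_of_sample, n_demes):
    """genotypes: (sites, samples) 0/1 array; returns (sites, n_demes) frequencies."""
    out = np.empty((genotypes.shape[0], n_demes))
    for d in range(n_demes):
        cols = np.flatnonzero(deme_of_sample == d)
        out[:, d] = genotypes[:, cols].mean(axis=1)
    return out

G = np.array([[1, 0, 1, 1],      # site 0 across four haploid samples
              [0, 0, 1, 0]])     # site 1
demes = np.array([0, 0, 1, 1])   # sample-to-deme assignment (assumed)
print(deme_frequency_matrix(G, demes, 2))
# → [[0.5 1. ]
#    [0.  0.5]]
```

In the real pipeline the same reduction runs over every SLiM replicate, turning each tree sequence into one fixed-shape input for the genomic branch.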
This project is directly connected to my Master's thesis findings. The thesis identified that QST–FST fails under stepping-stone demography and non-additive architectures. DeepLocalAdaptation is the natural computational response: if classical statistics cannot handle these regimes, can a model that learns directly from simulations do better?
The 100% validation accuracy on the current simulation set is a strong proof-of-concept. Next steps include testing generalisation across a broader range of demographic scenarios, QTL numbers, and effect size distributions — particularly the hard cases identified in the thesis.