
BioVAE-Phenotyper

A production-grade Variational Autoencoder for unsupervised cell phenotype classification — with CI/CD from day one.

Status: Complete
Timeline: 2024 – 2025
Stack: PyTorch · GitHub Actions · Poetry · Pytest

Overview

BioVAE-Phenotyper is a deep learning project built to answer a practical biology question: can we classify cell phenotypes from imaging data without any labels? The answer is a Convolutional Variational Autoencoder (Conv-VAE) that learns a compressed latent representation of cell morphology, which is then visualised with t-SNE to reveal natural phenotypic clusters.

Beyond the science, this project was deliberately built to production standards — full CI/CD, dependency management with Poetry, unit tests on every commit, and structured code organisation. The goal was to demonstrate that bioinformatics code doesn't have to be a collection of messy notebooks.

Pipeline

1. Data Ingestion — cell images + metadata
2. Conv Encoder — μ, σ → latent z
3. Conv Decoder — reconstruct image
4. t-SNE — latent → 2D clusters
5. Phenotype Labels — post-hoc cluster IDs

Architecture

The model is a Convolutional VAE. The encoder progressively downsamples the input image through conv layers, producing a mean (μ) vector and a log-variance (log σ²) vector. A sample z is drawn via the reparameterisation trick. The decoder mirrors the encoder, using transposed convolutions to reconstruct the input.
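
The encoder/decoder pair described above can be sketched as follows. This is a minimal illustration, not the project's actual architecture: the layer counts, channel widths, 64×64 input size, and latent dimension are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Downsample a 1x64x64 image to mu and log_var (illustrative sizes)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.fc_log_var = nn.Linear(32 * 16 * 16, latent_dim)

    def forward(self, x):
        h = self.conv(x)
        return self.fc_mu(h), self.fc_log_var(h)

class ConvDecoder(nn.Module):
    """Mirror the encoder with transposed convolutions."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 32 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 32, 16, 16)
        return self.deconv(h)
```

With kernel 4, stride 2, padding 1, each transposed conv exactly doubles the spatial size, so the decoder lands back on the encoder's input resolution.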

The loss is the standard ELBO: reconstruction loss (MSE) + KL divergence regularisation. The KL term encourages a well-organised, continuous latent space — which is precisely what makes t-SNE visualisation meaningful rather than arbitrary.

import torch
import torch.nn.functional as F

# VAE core — reparameterisation trick (a method on the VAE module)
def reparameterise(self, mu, log_var):
    """Sample z ~ N(mu, exp(log_var)) during training;
       use mu directly at inference."""
    if self.training:
        std = torch.exp(0.5 * log_var)  # log_var = log(sigma^2), so std = exp(log_var / 2)
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + eps * std
    return mu

# ELBO loss: reconstruction (MSE) + beta-weighted KL divergence
def loss(recon_x, x, mu, log_var, beta=1.0):
    recon = F.mse_loss(recon_x, x, reduction='sum')
    kld   = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kld
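
To show how the two functions fit together, here is a minimal training step on a tiny fully-connected VAE. The model itself is a stand-in (layer sizes, latent dimension, and the dummy batch are all assumptions for illustration, not the project's conv architecture); the `reparameterise` and `loss` bodies are the ones shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Stand-in fully-connected VAE, just to exercise the core functions."""
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 32)
        self.fc_mu = nn.Linear(32, latent_dim)
        self.fc_log_var = nn.Linear(32, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def reparameterise(self, mu, log_var):
        if self.training:
            std = torch.exp(0.5 * log_var)
            eps = torch.randn_like(std)
            return mu + eps * std
        return mu  # deterministic at inference

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        z = self.reparameterise(mu, log_var)
        return self.dec(z), mu, log_var

def loss(recon_x, x, mu, log_var, beta=1.0):
    recon = F.mse_loss(recon_x, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kld

# One optimisation step on a dummy batch
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 64)
recon, mu, log_var = model(x)
l = loss(recon, x, mu, log_var)
opt.zero_grad()
l.backward()
opt.step()
```

Note the train/eval asymmetry: sampling happens only while `self.training` is true, so calling `model.eval()` before encoding makes the latent codes deterministic, which is what you want when extracting embeddings for t-SNE.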

Engineering: CI/CD Pipeline

Every commit to the repository triggers an automated GitHub Actions workflow. This ensures the model code stays correct as it evolves, and makes the project immediately trustworthy to any collaborator or reviewer.

1. git push → GitHub Actions trigger (on: push, pull_request)
2. Poetry installs dependencies (reproducible environment)
3. Pytest runs unit tests (model forward pass, loss, shapes)
4. Lint & type checks (flake8 / mypy)
5. Status badge → README (always green, or fix it)
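
A workflow implementing the steps above might look like this. This is an illustrative sketch under assumptions (file path, action versions, and job layout are not taken from the project's actual workflow file):

```yaml
# .github/workflows/ci.yml — illustrative sketch, not the project's actual file
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Poetry
        run: pipx install poetry
      - name: Install dependencies
        run: poetry install
      - name: Lint & type-check
        run: |
          poetry run flake8 .
          poetry run mypy .
      - name: Run tests
        run: poetry run pytest
```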

This CI setup catches shape mismatches, broken imports, and regressions immediately — discipline that distinguishes reproducible research from throwaway code.

Results

- CI: passing on every commit
- t-SNE: clear cluster separation in latent space
- Environment: 100% reproducible (Poetry lock)

The t-SNE projection of the VAE's latent space shows clearly separated phenotypic clusters that align with biological expectations — without any labels being provided during training. This validates the VAE as a useful dimensionality reduction tool for cell morphology data.
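
The projection step itself can be sketched as below. The latent matrix here is synthetic (two fake "phenotypes" drawn from separated Gaussians stand in for the encoder's μ vectors); in the real pipeline the rows would come from encoding each cell image in eval mode.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the VAE latent codes: rows = cells, cols = latent dims.
rng = np.random.default_rng(0)
latents = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 32)),  # pretend phenotype A
    rng.normal(3.0, 0.5, size=(50, 32)),  # pretend phenotype B
])

# Project to 2D for plotting; perplexity must be smaller than n_samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
```

The 2D `embedding` is what gets scattered and coloured by post-hoc cluster ID; because t-SNE is stochastic, fixing `random_state` keeps the figure reproducible across runs.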

Tech Stack

PyTorch Python 3.11 GitHub Actions Poetry Pytest t-SNE NumPy Matplotlib