A production-grade Variational Autoencoder for unsupervised cell phenotype classification — with CI/CD from day one.
BioVAE-Phenotyper is a deep learning project built to answer a practical biology question: can we classify cell phenotypes from imaging data without any labels? The answer is a Convolutional Variational Autoencoder (Conv-VAE) that learns a compressed latent representation of cell morphology, which is then visualised with t-SNE to reveal natural phenotypic clusters.
Beyond the science, this project was deliberately built to production standards — full CI/CD, dependency management with Poetry, unit tests on every commit, and structured code organisation. The goal was to demonstrate that bioinformatics code doesn't have to be a collection of messy notebooks.
The model is a Convolutional VAE. The encoder progressively downsamples the input image through convolutional layers and outputs two vectors: a mean (μ) and a log-variance (log σ²). A latent sample z is drawn via the reparameterisation trick, z = μ + σ·ε with ε ~ N(0, I), which keeps the sampling step differentiable. The decoder mirrors the encoder with transposed convolutions to reconstruct the input.
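To make the downsample-then-mirror idea concrete, here is a small sketch of the spatial-size arithmetic only. The 64×64 input, the three layers, and the kernel 4 / stride 2 / padding 1 settings are assumptions chosen for illustration, not the project's actual configuration; the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after a standard convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace(size: int, layers: int, step) -> list[int]:
    """Apply `step` repeatedly and record the size after each layer."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(step(sizes[-1]))
    return sizes


# Assumed 64x64 input and three layers on each side.
encoder_sizes = trace(64, 3, conv_out)                            # [64, 32, 16, 8]
decoder_sizes = trace(encoder_sizes[-1], 3, conv_transpose_out)   # [8, 16, 32, 64]
print(encoder_sizes, decoder_sizes)
```

With these assumed settings each encoder layer halves the feature map and each decoder layer doubles it, so the reconstruction comes back at the input resolution.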
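The loss is the negative ELBO: a reconstruction term (summed MSE) plus a KL divergence regulariser weighted by β, where β = 1 gives the standard VAE objective. The KL term pulls the approximate posterior towards a unit Gaussian prior, which encourages a continuous, well-organised latent space. That structure is what makes the t-SNE visualisation meaningful rather than arbitrary.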
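Written out, the objective can be sketched as follows, assuming a diagonal Gaussian posterior and a standard normal prior. The symbols x̂ (the reconstruction), d (the latent dimension), and β (the KL weight, `beta` in the code) are notation introduced here for illustration.

```latex
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2
  + \beta \, D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, \mathcal{N}(0, I) \big),
\qquad
D_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

The closed-form KL term follows from both distributions being Gaussian, so no sampling is needed to evaluate it.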
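```python
import torch
import torch.nn.functional as F


# VAE core: reparameterisation trick (a method on the VAE's nn.Module)
def reparameterise(self, mu, log_var):
    """Sample z ~ N(mu, exp(log_var)) during training; use mu directly at inference."""
    if self.training:
        std = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + eps * std
    return mu


# ELBO loss (negative ELBO, summed over the batch)
def loss(recon_x, x, mu, log_var, beta=1.0):
    recon = F.mse_loss(recon_x, x, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I)
    kld = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kld
```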
Every commit to the repository triggers an automated GitHub Actions workflow. This ensures the model code stays correct as it evolves, and makes the project immediately trustworthy to any collaborator or reviewer.
This CI setup catches shape mismatches, broken imports, and regressions immediately — discipline that distinguishes reproducible research from throwaway code.
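The project's own test suite is not reproduced here. As an illustration of the kind of check that catches shape mismatches, a pytest-style sketch might look like the following. `TinyConvVAE` is a hypothetical stand-in rather than the project's model, and the 1×32×32 input size, layer widths, and latent dimension are all assumptions.

```python
import torch
from torch import nn


class TinyConvVAE(nn.Module):
    """Hypothetical stand-in model for 1x32x32 inputs (not the project's code)."""

    def __init__(self, latent_dim: int = 8):
        super().__init__()
        # Two stride-2 convolutions: 32 -> 16 -> 8.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(16 * 8 * 8, latent_dim)
        self.fc_log_var = nn.Linear(16 * 8 * 8, latent_dim)
        self.fc_decode = nn.Linear(latent_dim, 16 * 8 * 8)
        # Mirror of the encoder: 8 -> 16 -> 32.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),
        )

    def reparameterise(self, mu, log_var):
        if self.training:
            std = torch.exp(0.5 * log_var)
            return mu + torch.randn_like(std) * std
        return mu

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        z = self.reparameterise(mu, log_var)
        recon = self.decoder(self.fc_decode(z).view(-1, 16, 8, 8))
        return recon, mu, log_var


def test_forward_shapes():
    # The reconstruction must match the input; mu and log_var must be (batch, latent_dim).
    model = TinyConvVAE(latent_dim=8)
    x = torch.randn(4, 1, 32, 32)
    recon, mu, log_var = model(x)
    assert recon.shape == x.shape
    assert mu.shape == (4, 8)
    assert log_var.shape == (4, 8)


def test_eval_mode_is_deterministic():
    # In eval mode the model uses mu directly, so repeated passes must agree.
    model = TinyConvVAE().eval()
    x = torch.randn(2, 1, 32, 32)
    with torch.no_grad():
        first, _, _ = model(x)
        second, _, _ = model(x)
    assert torch.equal(first, second)
```

Tests like these run in seconds on a CPU, so they suit a per-commit workflow without needing any training data.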
The t-SNE projection of the VAE's latent space shows clearly separated phenotypic clusters that align with biological expectations — without any labels being provided during training. This validates the VAE as a useful dimensionality reduction tool for cell morphology data.
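For readers who want to reproduce this kind of projection, a minimal sketch using scikit-learn's `TSNE` is shown below. The latent vectors here are synthetic stand-ins (three Gaussian blobs in an assumed 16-dimensional space), not outputs of the trained model, and `project_latents` is a hypothetical helper name. In practice the input would be the encoder means collected over the dataset.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_latents(latents: np.ndarray, perplexity: float = 30.0, seed: int = 0) -> np.ndarray:
    """Project latent vectors of shape (n_samples, latent_dim) to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(latents)


# Synthetic stand-in for encoder means: 3 blobs of 100 points in 16 dimensions.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 16))
latents = np.vstack([c + rng.normal(size=(100, 16)) for c in centres])

embedding = project_latents(latents)
print(embedding.shape)  # (300, 2)
```

Perplexity must be smaller than the number of samples, and t-SNE layouts vary with the seed and perplexity, so it is worth comparing a few settings before reading much into cluster shapes or distances.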