Rare cell types play a pivotal role in biology because of their potential significance in various physiological and pathological processes. Identifying these rare cell types within an organism is crucial for uncovering novel insights into disease mechanisms. In recent years, single-cell RNA sequencing (scRNA-seq) has enabled breakthroughs in characterising cell types across biological systems at much greater depth.
However, owing to the highly unbalanced nature of cell type composition, it remains challenging to identify both common and rare cell types accurately. To address this challenge, we developed an ensemble-based approach for detecting rare cell types, called RareEnsemble. The method combines deep generative modelling (scVI) with Bayesian approaches to improve detection.
Briefly, we hypothesise that the cell-cell distances calculated with scVI for a scRNA-seq dataset containing multiple cell types follow a Gaussian Mixture Model (GMM) with two components. The first component represents distances between cells of the same cell type, and the second component reflects distances between cells of different cell types. Based on this assumption, we first classify cells into highly reliable small cell groups using ensembled outcomes, and then merge the relatively larger cell groups into main cell types based on the estimated distance distributions. We then construct a Bayesian model to test the remaining cells and identify robust rare cell groups.
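The two-component assumption can be illustrated with a minimal sketch. It is not the RareEnsemble implementation: the synthetic array `latent` stands in for an scVI latent embedding, the Euclidean metric is an assumption, and `fit_distance_gmm` is a hypothetical helper name. The sketch fits a two-component GMM to all pairwise distances and reports the components sorted by mean, so the first can be read as within-type distances and the second as between-type distances.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.mixture import GaussianMixture


def fit_distance_gmm(latent, random_state=0):
    """Fit a two-component GMM to pairwise cell-cell distances.

    `latent` is a (cells x dimensions) array; here it is assumed to be
    a latent embedding such as one produced by scVI.
    Returns component means and weights, sorted by increasing mean.
    """
    # Condensed vector of all pairwise distances (metric is an assumption).
    distances = pdist(latent, metric="euclidean")
    gmm = GaussianMixture(n_components=2, random_state=random_state)
    gmm.fit(distances.reshape(-1, 1))
    # Sort so index 0 is the short-distance (within-type) component.
    order = np.argsort(gmm.means_.ravel())
    return gmm.means_.ravel()[order], gmm.weights_[order]


# Synthetic stand-in for an embedding with two well-separated cell types.
rng = np.random.default_rng(0)
latent = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 10)),
    rng.normal(8.0, 1.0, size=(150, 10)),
])
means, weights = fit_distance_gmm(latent)
print("component means:", means)
print("component weights:", weights)
```

On real data the two components may overlap considerably, so the fitted means and weights would need to be inspected before being used to decide which groups to merge.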
We systematically evaluated different distance calculation methods in integrated clustering tests, including the spatial information-based Euclidean and city block distances and the orientation-based cosine distance. We also investigated methods for obtaining the most reliable small cell groups through binarisation based on ensemble and feature information. Finally, we applied RareEnsemble to a range of biological and simulated datasets. RareEnsemble is a robust method that addresses the long-standing need to find rare cell types reliably, and we expect it to be broadly applicable to multiple systems and projects.
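A comparison of distance metrics of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the evaluation protocol used here: the clustering step is plain average-linkage hierarchical clustering, agreement with known labels is scored with the adjusted Rand index, the data are synthetic with one deliberately rare group, and `compare_metrics` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score


def compare_metrics(latent, labels, n_clusters,
                    metrics=("euclidean", "cityblock", "cosine")):
    """Cluster with each distance metric and score against known labels."""
    scores = {}
    for metric in metrics:
        condensed = pdist(latent, metric=metric)
        # Average linkage accepts any precomputed condensed distance vector.
        tree = linkage(condensed, method="average")
        predicted = fcluster(tree, t=n_clusters, criterion="maxclust")
        scores[metric] = adjusted_rand_score(labels, predicted)
    return scores


# Synthetic data: one common group (200 cells) and one rare group (10 cells),
# centred on different axes so they differ in both position and orientation.
rng = np.random.default_rng(1)
common = rng.normal(0.0, 0.5, size=(200, 5)) + np.array([5.0, 0, 0, 0, 0])
rare = rng.normal(0.0, 0.5, size=(10, 5)) + np.array([0, 5.0, 0, 0, 0])
latent = np.vstack([common, rare])
labels = np.array([0] * 200 + [1] * 10)

scores = compare_metrics(latent, labels, n_clusters=2)
for metric, score in scores.items():
    print(f"{metric}: ARI = {score:.3f}")
```

Euclidean and city block distances depend on vector magnitude, whereas cosine distance depends only on direction, so the metrics can rank cell pairs differently when groups differ mainly in overall expression scale.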