Non-negative Matrix Factorization (NMF) is one of the most popular methods for analyzing high-dimensional count data and is used in many fields of research. This dissertation serves as a guideline for analyzing count data with NMF, with a particular focus on its application to mutational counts from tumors. The statistical properties and challenges of NMF are explained, along with proposed solutions.
In broad terms, NMF approximates a high-dimensional matrix of counts by the product of two smaller non-negative matrices, with the goal of retaining the essential information in the data. To achieve this, several challenges of NMF need to be considered. These include the possible non-uniqueness of the factorization, the effects of the underlying distributional assumptions, the choice of the rank of the factorization, and the regularization of the results. These challenges and their interconnections are elaborated upon in the introduction of this dissertation, and Papers A-D explore solutions to these issues, with a particular focus on applications to mutational counts in cancer. The final paper discusses the application of NMF to spatial count data, where the locations of the observations are known. This is relevant to spatial transcriptomics data, where both the location and gene expression of single cells are known within tissue slices.
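As a concrete illustration of the factorization described above, the following is a minimal sketch rather than the method developed in the papers. It assumes a count matrix `V` approximated by `W @ H`, and it fits the factors with the standard multiplicative updates for the generalized Kullback-Leibler divergence, which corresponds to a Poisson assumption on the counts. The names `nmf_kl`, `kl_divergence`, `V`, `W`, `H`, and the simulated toy data are illustrative assumptions only; the rank is fixed by hand and no regularization is applied.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative N x M matrix V by W @ H.

    W is N x rank and H is rank x M, both non-negative. The updates are
    the classical multiplicative rules for the generalized KL divergence.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Random positive starting values (an arbitrary choice for this sketch).
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, WH, eps=1e-10):
    """Generalized KL divergence between the data V and the fit WH."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

# Hypothetical toy data: Poisson counts with a rank-3 mean structure.
rng = np.random.default_rng(1)
V = rng.poisson(50 * rng.random((20, 3)) @ rng.random((3, 30)))

W, H = nmf_kl(V, rank=3)
print(W.shape, H.shape, kl_divergence(V, W @ H))
```

Rerunning the sketch with a different `seed` can return different factors with a similar fit, which is one way to see the non-uniqueness issue mentioned above in practice.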
Although all the applications discussed focus on cancer genomics data, this dissertation should equip the reader to use NMF to analyze any type of count data and obtain an interpretable and robust factorization.