Aarhus Universitets segl

The Reach of Manifold Learning – Uniqueness Bounds for Latent Representations

by Helene Hauschultz
PhD Dissertations February 2023

In manifold learning the aim is to represent data sets in lower dimension by fitting a manifold to the data. Here the word manifold refers to the more general idea of a subsets which can be described in fewer variables than the ambient space, as opposed to the stringent mathematical definition of a manifold. Lower dimensional representations can be found by projecting a data point to the (unique) nearest point on the manifold. However, for any non-linear manifold, such a unique projection does not exist everywhere.

One way to study where a unique projection does exist is through reach. Reach gives a global bound on how far one can move from the manifold while ensuring that a unique nearest point exists. However, this bound is too restrictive in a practical setting, as many points further away will still have a unique nearest point. Instead we introduce a new uniqueness bound called the pointwise normal reach, which gives a bound on how far one can move in a normal direction while ensuring unique projections. In addition to the pointwise normal reach, we introduce a related unique- ness bound on immersed manifolds. This bound is beneficial as it utilises standard analytical tools to compute the bound.

Finally, we use these bounds in practice, as we aim to understand the uniqueness of representation in the autoencoder algorithm. We create a test which asks if a datapoint is within reach of the assigned latent representation. A datapoint which fails this test is not ensured to have a unique best choice of latent representation. We employ Monte Carlo Sampling to estimate the pointwise normal reach of three autoencoders. We find that the algorithms fail this test for most data points. Thus, almost no data points are certain to have a unique best choice of representation. To improve this, we introduce a regularising term into the training algorithm. This significantly improves the amount of points which passes the reach test. However, the experimental setup is currently too expensive to employ for non-toy datasets

Format available: PDF (4 MB)
Dissertation supervisor: Andrew du Plessis