Diffusion Geometry User's Guide¶

A practical introduction to the Diffusion Geometry package¶

Starting from a finite set of points (pointcloud), we compute extensions of classical geometric and analytic objects, such as gradients, vector fields, differential form, and differential operators, directly from the data, infering the necessary geometry using diffusion processes. Importantly, this is done without assuming meshes, coordinates, or a manifold structure.

Design goals¶

This codebase is designed to be fast and ergonomic.

All computations are intended to be efficient enough for interactive use on moderate-to-large point clouds, relying on sparse and low-rank linear algebra and reusing intermediate results wherever possible.

At the same time, the API is deliberately close to the notation of classical differential geometry: functions, gradients, vector fields, inner products, and operators behave as their mathematical counterparts. Once the diffusion geometry backend is constructed, working with data should feel as close as possible to working with smooth geometry — without assuming a manifold.

Point Clouds¶

We take a minimal view of data. A point cloud is just a finite collection of points, possibly embedded in a high-dimensional space, noisy, and even non-manifold. The following code shows an example.

In [1]:

import sys
import os
sys.path.append(os.path.abspath('..'))  # Add parent directory to path
from diffusion_geometry.visualisation import plot_scatter_2d
import numpy as np

from plotly.subplots import make_subplots

np.random.seed(0)

# Generate synthetic 2D point cloud M
n = 150
xx = np.linspace(0, 2*np.pi, n)
M = np.array([[np.cos(x), np.sin(x)] for x in xx]) + 0.2*np.random.randn(xx.shape[0], 2)
M = 1*np.array([[np.cos(x)/(1 + np.sin(x)**2), np.sin(x)*np.cos(x)/(1 + np.sin(x)**2)] for x in xx]) + 0.1*M + 0.0*np.random.randn(xx.shape[0], 2) 
M = np.concatenate((M, 0.18*np.random.randn(200,2) - [0.2,0]))

plot_scatter_2d(M, color = 'black').show()

The point cloud exhibits two key features that are immediately apparent to the human eye and serve as motivating examples for the methods we develop:

Two loops: The dataset forms two distinct loop structures.
Dimensionality variation: While the outer loops are essentially 1-dimensional, the interior region spans a 2-dimensional area.

The backbone of the package is the DiffusionGeometry class. Given a point cloud $M$ embedded in $\mathbb{R}^d$ consiting of $n$ points, it constructs a diffusion-based representation of the data by first computing a Markov triple $(M, \mu, K)$, where $\mu$ is a measure and $K$ is a kernel matrix. Starting from a variable bandwidth kernel: $$ k(x,y)=\mathrm{exp}\left(- \frac{||x-y||^2}{\rho(x)\rho(y)} \right) $$ for some bandwidth function $\rho$. We obtain a Markov triple by

Computing the kernel matrix $K_{ij} = k(p_i, p_j)$.
Defining $D_i = \sum^n_{j=1} K_{ij}$ to be the row sums, and then normalising $D$ into a probability measure $$ \mu = \frac{D}{\sum_{k=1}^n D_k}. $$ All of this data is computed upon the creation of a DiffusionGeometry instance.

In [2]:

from diffusion_geometry.core import DiffusionGeometry


dg = DiffusionGeometry.from_point_cloud(M)

# We can query the coordinates M and the measure of the Markov triple.
# The measure mu gives us the weights associated to each point in M
# that we can use for integration.
triple = dg.triple
M = triple.immersion_coords
mu = triple.measure


print("Shapes of M and mu:", M.shape, mu.shape)

Shapes of M and mu: (350, 2) (350,)

In the next chapter, we introduce functions that will immediately make use of this Markov triple to define gradients.