Graph embeddings

A graph embedding rolls a graph’s per-node features up into a single fixed-length vector, so each graph becomes one row you can cluster, compare, or feed to a model. NEExT offers four algorithms behind one call.

embeddings.py
# nxt, graphs, and features are assumed to exist from the earlier steps
embeddings = nxt.compute_graph_embeddings(
  graph_collection=graphs,
  features=features,
  embedding_algorithm="approx_wasserstein",
  embedding_dimension=16,
)

embedding_dimension is required and sets the output vector length.

The four algorithms

embedding_algorithm   Family                 Notes
approx_wasserstein    Distribution-based     Fast approximate Wasserstein embedding (a good default)
wasserstein           Distribution-based     Exact Wasserstein embedding
sinkhornvectorizer    Distribution-based     Sinkhorn-based vectorizer
gnn                   Graph neural network   Pure-PyTorch GCN / GraphSAGE / GIN

The three distribution-based algorithms come from the vectorizers library (included in the core install) and accept feature_columns, random_state, and memory_size.
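
One way to read "distribution-based" is that each graph is treated as a sample of node-feature values, so graphs with different node counts can still map to vectors of the same length. The sketch below is a toy one-dimensional stand-in for that idea, not the vectorizers implementation: the function name quantile_embedding and the sample values are hypothetical. It summarizes each sample by evenly spaced quantiles; in one dimension, distances between such quantile vectors track the Wasserstein distance between the samples.

```python
def quantile_embedding(values, dimension):
    """Summarize a 1-D sample by `dimension` evenly spaced quantiles (toy illustration)."""
    ordered = sorted(values)
    n = len(ordered)
    embedding = []
    for k in range(dimension):
        q = (k + 0.5) / dimension      # midpoint quantile levels in (0, 1)
        pos = q * (n - 1)              # fractional index into the sorted sample
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring order statistics
        embedding.append(ordered[lo] * (1 - frac) + ordered[hi] * frac)
    return embedding


# Hypothetical per-node feature values (for example, degree) for two graphs of different size
graph_a = [1, 2, 2, 3, 5]
graph_b = [1, 1, 2, 2, 2, 3, 4, 6]

emb_a = quantile_embedding(graph_a, dimension=4)
emb_b = quantile_embedding(graph_b, dimension=4)
print(emb_a)
print(emb_b)
```

Both graphs end up as 4-element vectors even though one has 5 nodes and the other 8, which is the role embedding_dimension plays in the real call.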

GNN embeddings

The gnn algorithm trains a graph neural network unsupervised (node-feature reconstruction) and pools node representations to the graph level. It is pure PyTorch — no DGL or PyTorch Geometric — and requires the gnn extra (pip install "NEExT[gnn]").

gnn.py
embeddings = nxt.compute_graph_embeddings(
  graph_collection=graphs,
  features=features,
  embedding_algorithm="gnn",
  embedding_dimension=16,
  architecture="GraphSAGE",     # "GCN", "GraphSAGE", or "GIN"
  hidden_dims=[64, 32],
  epochs=100,
  learning_rate=0.01,
  weight_decay=5e-4,
  dropout=0.0,
  pooling="mean",               # "mean", "sum", or "max"
  early_stopping_patience=10,
)

These GNN-only parameters are ignored by the other algorithms:

Parameter                 Default     Meaning
architecture              "GCN"       "GCN", "GraphSAGE", or "GIN"
hidden_dims               [64, 32]    Hidden layer sizes
epochs                    100         Training epochs
learning_rate             0.01        Adam learning rate
weight_decay              5e-4        Adam weight decay
dropout                   0.0         Dropout between layers (0–1)
pooling                   "mean"      Node-to-graph pooling ("mean", "sum", "max")
early_stopping_patience   10          Epochs without validation improvement before stopping
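
The pooling options reduce a variable number of node vectors to a single graph vector. As a minimal sketch of what "mean", "sum", and "max" pooling compute (plain Python, with a hypothetical helper name pool and made-up node vectors, not NEExT's internal code):

```python
def pool(node_vectors, mode="mean"):
    """Collapse a list of equal-length node vectors into one graph-level vector."""
    columns = list(zip(*node_vectors))  # transpose: one tuple of values per feature
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Hypothetical 2-dimensional representations for a 3-node graph
node_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]

mean_vec = pool(node_vectors, "mean")
sum_vec = pool(node_vectors, "sum")
max_vec = pool(node_vectors, "max")
```

Mean pooling is insensitive to graph size, sum pooling grows with the number of nodes, and max pooling keeps only the strongest activation per dimension, so the choice depends on whether graph size itself should influence the embedding.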

Each graph is processed with a dense adjacency matrix, which suits NEExT’s typically small graphs. NEExT warns when a graph exceeds ~5,000 nodes.
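
A dense adjacency matrix stores one entry per node pair, so its memory grows with the square of the node count. The rough estimate below assumes 4-byte (float32) entries, which is an assumption about the storage type rather than something NEExT documents here; the function name is hypothetical.

```python
def dense_adjacency_bytes(num_nodes, bytes_per_entry=4):
    """Memory for an n x n dense matrix, assuming float32 entries by default."""
    return num_nodes * num_nodes * bytes_per_entry


for n in (100, 1_000, 5_000, 20_000):
    megabytes = dense_adjacency_bytes(n) / 1e6
    print(f"{n:>6} nodes: {megabytes:,.1f} MB")
```

Under that assumption a 5,000-node graph needs about 100 MB for the adjacency matrix alone, and quadrupling the node count multiplies that by sixteen.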

The Embeddings container

compute_graph_embeddings returns an Embeddings object:

  • embeddings.embeddings_df — DataFrame with graph_id plus emb_0 … emb_{D-1}.
  • embeddings.embedding_name — the algorithm used.
  • embeddings.embedding_columns — the embedding column names.
  • emb_a + emb_b — merge two embeddings on graph_id, prefixing columns with each algorithm name so you can stack representations.
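
The merge described in the last bullet can be pictured with plain pandas. This is a sketch of the idea, not NEExT's implementation: the helper prefixed, the sample values, and the exact "<algorithm>_emb_i" column naming are assumptions made for illustration.

```python
import pandas as pd


def prefixed(df, name):
    """Prefix every column except the join key with the algorithm name."""
    return df.rename(columns={c: f"{name}_{c}" for c in df.columns if c != "graph_id"})


# Hypothetical embeddings_df contents from two different algorithms
wasserstein_df = pd.DataFrame(
    {"graph_id": [0, 1, 2], "emb_0": [0.1, 0.2, 0.3], "emb_1": [1.0, 1.1, 1.2]}
)
gnn_df = pd.DataFrame({"graph_id": [0, 1, 2], "emb_0": [5.0, 6.0, 7.0]})

# Join on graph_id so each row holds both representations of the same graph
stacked = prefixed(wasserstein_df, "approx_wasserstein").merge(
    prefixed(gnn_df, "gnn"), on="graph_id"
)
print(stacked.columns.tolist())
```

Prefixing avoids the name clash between the two emb_0 columns, so the stacked frame can be passed to a model as one wider feature matrix.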

Next: train a model on these embeddings.