ML models

Once you have embeddings, train_ml_model trains a supervised model that maps each graph’s embedding vector to its graph_label and evaluates it over several train/test splits.

train.py

import numpy as np

# `nxt`, `graphs`, and `embeddings` come from the earlier steps
results = nxt.train_ml_model(
  graph_collection=graphs,
  embeddings=embeddings,
  model_type="classifier",   # or "regressor"
  balance_dataset=False,
  sample_size=5,             # number of train/test iterations
)

print("mean accuracy:", np.mean(results["accuracy"]))

Parameters

Parameter	Default	Meaning
`model_type`	—	`"classifier"` or `"regressor"` (required)
`balance_dataset`	`False`	Balance classes with SMOTE before training (classification)
`sample_size`	`5`	Number of train/test iterations to average over
`n_jobs`	`-1`	Parallel workers (`-1` = all CPUs)
`parallel_backend`	`"process"`	`"process"` or `"thread"`
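
To make the role of sample_size concrete, here is a minimal sketch of repeated train/test evaluation: the data is reshuffled and split on every iteration, and one score is recorded per split. This is not NEExT's implementation. The function repeated_split_accuracy, its test_fraction argument, the toy data, and the nearest-centroid stand-in model are all assumptions chosen so that the example runs with NumPy alone.

```python
import numpy as np

def repeated_split_accuracy(X, y, n_iterations=5, test_fraction=0.2, seed=0):
    """Return one accuracy score per random train/test split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = len(y)
    n_test = max(1, int(round(test_fraction * n_samples)))
    scores = []
    for _ in range(n_iterations):
        # Draw a fresh random split on every iteration.
        order = rng.permutation(n_samples)
        test_idx, train_idx = order[:n_test], order[n_test:]
        X_train, y_train = X[train_idx], y[train_idx]
        # Stand-in model: nearest class centroid (NOT the model NEExT trains).
        classes = np.unique(y_train)
        centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
        distances = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
        predictions = classes[distances.argmin(axis=1)]
        scores.append(float((predictions == y[test_idx]).mean()))
    return scores

# Toy data: two well-separated clusters standing in for graph embeddings.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, (20, 4)), data_rng.normal(5.0, 0.5, (20, 4))])
y = np.array([0] * 20 + [1] * 20)

scores = repeated_split_accuracy(X, y, n_iterations=5)
print("per-iteration accuracy:", scores)
print("mean accuracy:", np.mean(scores))
```

More iterations cost more training time but reduce the influence of any single lucky or unlucky split on the averaged score.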

The result

train_ml_model returns a dictionary of evaluation metrics, each a list with one entry per iteration. Classifiers report results["accuracy"]; regressors report results["rmse"]. Averaging over the iterations gives a more stable estimate than any single split:

metrics.py

# Classification
print("accuracy:", np.mean(results["accuracy"]))

# Regression
# results = nxt.train_ml_model(graph_collection=graphs, embeddings=embeddings, model_type="regressor")
# print("rmse:", np.mean(results["rmse"]))
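
Because each metric is a list with one entry per iteration, it can be useful to report the spread alongside the mean. The sketch below assumes only that list-of-scores shape. The helper summarize_metric is hypothetical, and the scores in example_results are made-up values for illustration, not output from NEExT.

```python
import numpy as np

def summarize_metric(values):
    """Return mean, sample standard deviation, min, and max of per-iteration scores."""
    scores = np.asarray(values, dtype=float)
    return {
        "mean": float(scores.mean()),
        # ddof=1 gives the sample standard deviation; it needs at least two scores.
        "std": float(scores.std(ddof=1)) if scores.size > 1 else 0.0,
        "min": float(scores.min()),
        "max": float(scores.max()),
    }

# Hypothetical result dictionary with made-up scores, shaped like the
# per-iteration lists described above.
example_results = {"accuracy": [0.80, 0.85, 0.75, 0.90, 0.70]}

summary = summarize_metric(example_results["accuracy"])
print(f"accuracy: {summary['mean']:.3f} +/- {summary['std']:.3f}")
print(f"range: {summary['min']:.2f} to {summary['max']:.2f}")
```

A large standard deviation relative to the mean suggests that the score depends heavily on which graphs land in the test split, in which case a larger sample_size may help.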

Models used

NEExT trains XGBoost by default (XGBClassifier / XGBRegressor). If XGBoost isn’t installed, it falls back to scikit-learn’s random forest. Class balancing uses SMOTE from imbalanced-learn when balance_dataset=True and the package is available.
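
To check which of these paths your environment would take, you can test whether the optional packages are importable. This is a sketch of the general availability-check pattern using only the standard library, not NEExT's actual code. The helper names choose_backend and smote_available, and the backend labels they return, are assumptions made for illustration.

```python
import importlib.util

def choose_backend(preferred="xgboost", fallback="random_forest"):
    """Return `preferred` if that package is importable, else `fallback` (hypothetical helper)."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback

def smote_available():
    """Report whether imbalanced-learn (import name: imblearn) is installed."""
    return importlib.util.find_spec("imblearn") is not None

print("model backend:", choose_backend())
print("SMOTE available:", smote_available())
```

If the first line reports random_forest or the second reports False, installing xgboost or imbalanced-learn should enable the corresponding behaviour described above.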

For finer control (e.g. test_size, choosing random_forest explicitly), use the underlying MLModels class from NEExT.ml_models directly; see the API reference.

To understand why a model works, use Feature importance to rank the structural features that feed it.