GraphBEAN: A Powerful Tool for Anomaly Detection on Bipartite Graphs

Introduction

This tutorial provides a hands-on introduction to the GraphBEAN model, a novel graph neural network architecture for unsupervised anomaly detection on bipartite node-and-edge-attributed graphs. This model was originally presented in the paper “Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs” by Fathony et al. (2023), which we have implemented as part of our FinTorch project. Note that we generalized the concepts of GraphBEAN from bipartite networks to k-partite networks.

GraphBEAN addresses the limitations of existing anomaly detection models, which typically focus on homogeneous graphs or neglect rich edge information. It leverages an autoencoder-like approach, employing a customized encoder-decoder structure to effectively encode both node and edge attributes, as well as the underlying graph structure, into low-dimensional latent representations. These representations are then used to reconstruct the original graph, and reconstruction errors are used to identify anomalous edges and nodes.
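To make the scoring idea concrete, here is a minimal sketch of how reconstruction errors can be turned into an anomaly ranking. It is not GraphBEAN's implementation: the feature rows and their "reconstructions" are made-up toy values, and the helper `reconstruction_error` is a hypothetical name chosen for this illustration. In the real model, the reconstructions would come from the trained decoder.

```python
# Illustrative only: toy feature rows and hand-written "reconstructions".
# In GraphBEAN the reconstructions would be produced by the trained decoder.

def reconstruction_error(original, reconstructed):
    """Mean squared error between one feature row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

originals = [
    [1.0, 0.0, 2.0],
    [0.5, 0.5, 0.5],
    [3.0, 1.0, 0.0],
]
reconstructions = [
    [1.0, 0.0, 2.0],  # reconstructed exactly
    [0.5, 0.5, 0.0],  # small deviation
    [0.0, 0.0, 0.0],  # large deviation, so a likely anomaly
]

# One score per row: the higher the error, the more anomalous the row.
scores = [reconstruction_error(o, r) for o, r in zip(originals, reconstructions)]

# Row indices ordered from most to least anomalous.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

The same ranking logic applies whether the rows describe nodes or edges; only the source of the reconstruction changes.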

This tutorial walks through the core concepts of GraphBEAN with a practical example on the Elliptic dataset. You will learn how to:

Load and explore bipartite node-and-edge-attributed graph data.
Define and train a GraphBEAN model using PyTorch Lightning.
Analyze and interpret anomaly detection results.

This tutorial will enable you to effectively apply GraphBEAN to diverse applications involving bipartite graphs, such as fraud detection in financial transactions, malicious activity detection in network security, or anomaly detection in user-item interaction networks.

Install FinTorch

[ ]:

!pip install fintorch

[ ]:

import torch

# Installation of PyTorch Geometric and dependencies based on detected versions
def install_pyg_and_dependencies():
  !pip install pyg-lib -f https://data.pyg.org/whl/torch-{torch.__version__}.html
  !pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-{torch.__version__}.html

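# Detect PyTorch version; compare numerically, since string comparison would rank "1.9.0" above "1.13.0"
major, minor = (int(part) for part in torch.__version__.split("+")[0].split(".")[:2])
if (major, minor) >= (1, 13):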
  print("PyTorch version 1.13.0 or newer detected. Installing PyG and dependencies...")
  install_pyg_and_dependencies()
else:
  print("PyTorch version is older than 1.13.0. PyG might not work correctly. Please upgrade PyTorch or use the pip install torch_geometric method.")


# Verify installation
try:
  import torch_geometric
  print(f"PyTorch Geometric successfully installed. Version: {torch_geometric.__version__}")
except ImportError:
  print("PyTorch Geometric not found. Installation might have failed.")

Code

We start by importing the necessary libraries: PyTorch Lightning for training, PyTorch Geometric for graph convolution operations, and the FinTorch modules that load the Elliptic dataset and provide the GraphBEAN model. We then create an instance of the EllipticppDataModule, defining the dataset’s bipartite structure with “wallets” and “transactions” as node types and “to” as the edge type. This module handles data loading, splitting, and the creation of data loaders for training.

[1]:

import lightning as L
from torch_geometric.nn.conv import TransformerConv

from fintorch.datasets.ellipticpp import EllipticppDataModule
from fintorch.models.graph.graphbean.graphBEAN import GraphBEANModule

Next, we initialize the data module and display the dataset, which shows the node types, their attributes, and the edge connections. We then retrieve the dataset’s metadata for a high-level view of the node types and edge types in the graph.

[2]:

# Example data module for the Elliptic dataset, which is bipartite
data_module = EllipticppDataModule(("wallets", "to", "transactions"))

[3]:

data_module.setup()

data_module.dataset

Start download from HuggingFace...

100%|██████████| 7/7 [00:01<00:00,  6.68it/s]

[3]:

HeteroData(
  wallets={
    x=[1268260, 55],
    y=[1268260],
    train_mask=[1268260],
    val_mask=[1268260],
    test_mask=[1268260],
  },
  transactions={
    x=[203769, 182],
    y=[203769],
    train_mask=[203769],
    val_mask=[203769],
    test_mask=[203769],
  },
  (transactions, to, transactions)={ edge_index=[2, 234355] },
  (transactions, to, wallets)={ edge_index=[2, 837124] },
  (wallets, to, transactions)={ edge_index=[2, 477117] },
  (wallets, to, wallets)={ edge_index=[2, 2868964] }
)

[4]:

data_module.dataset.metadata()

[4]:

(['wallets', 'transactions'],
 [('transactions', 'to', 'transactions'),
  ('transactions', 'to', 'wallets'),
  ('wallets', 'to', 'transactions'),
  ('wallets', 'to', 'wallets')])
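The metadata lists four edge types, but only two of them connect different node types. As a small sketch, the snippet below picks out those cross-type edge types from a copy of the metadata shown above. The helper name `cross_type_edges` is hypothetical and not part of FinTorch or PyTorch Geometric.

```python
# Metadata copied from the output above: (node_types, edge_types).
node_types = ["wallets", "transactions"]
edge_types = [
    ("transactions", "to", "transactions"),
    ("transactions", "to", "wallets"),
    ("wallets", "to", "transactions"),
    ("wallets", "to", "wallets"),
]

def cross_type_edges(edge_types):
    """Keep only edge types whose source and target node types differ."""
    return [et for et in edge_types if et[0] != et[2]]

bipartite_edge_types = cross_type_edges(edge_types)
print(bipartite_edge_types)
```

These two edge types are the ones passed to the model as `edge_types` when the GraphBEAN module is created later in this tutorial.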

We then create a GraphBEAN model instance, specifying the node and edge types used in the convolution, a mapping from each node type to its feature dimension, the learning rate, and the encoder, decoder, and hidden-layer settings; no convolution type is passed in this example. We also set up a PyTorch Lightning trainer, defining the maximum number of training epochs and letting Lightning select an available accelerator (a GPU in the run shown here).

[ ]:

mapping = dict()
for key in data_module.dataset.metadata()[0]:
    mapping[key] = data_module.dataset[key].x.shape[1]

# Create an instance of the GraphBEANModule
module = GraphBEANModule(
    ("wallets", "to", "transactions"),
    edge_types=[("wallets", "to", "transactions"),
                ("transactions", "to", "wallets")],
    mapping=mapping,
    learning_rate=0.001,
    encoder_layers=5,
    decoder_layers=5,
    hidden_layers=50,
)

# Create a PyTorch Lightning Trainer; training starts in the next cell
trainer = L.Trainer(max_epochs=1, accelerator="auto")

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
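The `mapping` built in the cell above pairs each node type with the width of its feature matrix. As a quick sanity check, the sketch below reproduces it from the feature shapes shown in the HeteroData printout earlier in this tutorial, without loading the dataset. The `feature_shapes` dictionary is a hand-copied stand-in, not a FinTorch object.

```python
# Feature shapes copied from the HeteroData printout: (num_nodes, num_features).
feature_shapes = {
    "wallets": (1268260, 55),
    "transactions": (203769, 182),
}

# Same logic as the loop above: node type -> number of input features.
mapping = {node_type: shape[1] for node_type, shape in feature_shapes.items()}
print(mapping)  # {'wallets': 55, 'transactions': 182}
```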

Finally, we train the model using the PyTorch Lightning trainer with the data loaders provided by the EllipticppDataModule. The trainer manages the training loop, logging, and progress tracking.

[6]:

# Train the module using the dataloaders
trainer.fit(module, datamodule=data_module)

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

Start download from HuggingFace...

100%|██████████| 7/7 [00:00<00:00,  9.29it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/marcel/Documents/research/FinTorch/.conda/lib/python3.11/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:454: A layer with UninitializedParameter was found. Thus, the total number of parameters detected may be inaccurate.

  | Name      | Type                      | Params
--------------------------------------------------------
0 | accuracy  | MulticlassAccuracy        | 0
1 | f1        | MulticlassF1Score         | 0
2 | recall    | MulticlassRecall          | 0
3 | precision | MulticlassPrecision       | 0
4 | confmat   | MulticlassConfusionMatrix | 0
5 | aucroc    | MulticlassAUROC           | 0
6 | model     | GraphBEAN                 | 165 K
--------------------------------------------------------
165 K     Trainable params
0         Non-trainable params
165 K     Total params
0.664     Total estimated model params size (MB)

`Trainer.fit` stopped: `max_epochs=1` reached.
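The model summary lists metrics such as precision, recall, and F1. As a rough sketch of how anomaly scores could be compared against labels once they are available, the snippet below thresholds a list of scores and computes these three metrics by hand. The scores, labels, and threshold are made-up values, and `flag_anomalies` and `precision_recall_f1` are hypothetical helpers for this illustration, not FinTorch functions.

```python
def flag_anomalies(scores, threshold):
    """Mark an item as anomalous (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

def precision_recall_f1(labels, predictions):
    """Binary precision, recall, and F1 with 1 as the anomalous class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy anomaly scores and ground-truth labels (1 = anomalous), for illustration only.
scores = [0.02, 0.91, 0.10, 0.75, 0.05, 0.40]
labels = [0, 1, 0, 0, 0, 1]

predictions = flag_anomalies(scores, threshold=0.5)
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(predictions)
print(precision, recall, f1)
```

The threshold trades precision against recall: lowering it to 0.3 in this toy example would catch the missed anomaly with score 0.40 while keeping the false positive at 0.75.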

Conclusion

This concludes our tutorial on GraphBEAN, a powerful tool for anomaly detection on bipartite graphs. You’ve learned how to load and explore bipartite node-and-edge-attributed graphs using the Elliptic dataset and FinTorch’s data module. You’ve also gained experience in defining and configuring a GraphBEAN model, customizing its parameters to suit your specific needs. Finally, you’ve trained the GraphBEAN model using PyTorch Lightning.

This tutorial provided a foundation for applying GraphBEAN to various real-world applications involving bipartite graphs. You can now adapt these concepts and further explore GraphBEAN’s capabilities. Experiment with different datasets, explore hyperparameter tuning, and delve into advanced anomaly detection techniques.

GraphBEAN can surface anomalies in complex bipartite data in domains such as fraud detection, network security, and user behavior analysis. We encourage you to keep exploring the model and to apply it to your own bipartite graphs.