{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Forecasting in real-world financial market data\n",
    "In this tutorial, we will load data from the Jane Street Kaggle competition and apply advanced neural forecasting models. We'll evaluate the performance of these models and understand their suitability for this task.\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/AI4FinTech/FinTorch/blob/main/docs/tutorials/marketdata/marketdata.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## WARNING\n",
    "This tutorial requires a large amount of RAM. It is currently not yet possible to stream load the data. Experiments are conducted on a machine with 128GB of RAM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "First, we import the necessary libraries and configure logging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/marcel/Documents/research/FinTorch/.conda/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "2024-11-30 12:04:18,606\tINFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.\n",
      "2024-11-30 12:04:18,697\tINFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.\n"
     ]
    }
   ],
   "source": [
    "import logging\n",
    "import torch\n",
    "from neuralforecast import NeuralForecast\n",
    "from neuralforecast.losses.numpy import mae, mse\n",
    "from neuralforecast.models import NBEATS, NHITS, BiTCN, NBEATSx\n",
    "from fintorch.datasets.marketdata import MarketDataset\n",
    "\n",
    "# Set up logging\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "torch.set_float32_matmul_precision(\"medium\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining parameters\n",
    "We define hyperparameters such as the input size (past time steps for prediction), forecast horizon, and batch sizes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameters\n",
    "input_size = 2  # Number of past time steps used for prediction\n",
    "days = 1  # Number of days to forecast\n",
    "steps_per_day = 5  # Number of steps per day\n",
    "horizon = days * steps_per_day  # Forecast horizon\n",
    "max_steps = 100  # Max training steps\n",
    "\n",
    "# Validation and test set sizes\n",
    "val_size = horizon\n",
    "test_size = horizon\n",
    "\n",
    "# Batch sizes\n",
    "batch_size = 16\n",
    "windows_batch_size = 16\n",
    "valid_batch_size = 16\n",
    "batch_size = 1  # Single series per batch\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataloading\n",
    "We begin by loading the dataset using the MarketDataset class from the fintorch library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Load train market data\n",
      "INFO:root:Downloading dataset from Kaggle\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading jane-street-real-time-market-data-forecasting.zip to ~/.fintorch_data/marketdata-janestreet/raw\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 11.5G/11.5G [07:15<00:00, 28.3MB/s]  \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Processing: apply transformations to train market data\n",
      "INFO:root:Processing: apply transformations to train market data\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of unique series: 39\n"
     ]
    }
   ],
   "source": [
    "# Load dataset\n",
    "df = MarketDataset(\"~/.fintorch_data/marketdata-janestreet/\")\n",
    "df = df.data.collect()\n",
    "\n",
    "# Number of unique series\n",
    "n_series = len(df[\"unique_id\"].unique())\n",
    "print(f\"Number of unique series: {n_series}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code initializes the dataset from the specified path and collects it into a DataFrame for further processing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Models\n",
    "We utilize the NeuralForecast library to implement four models: NHITS, BiTCN, NBEATS, and NBEATSx. These models are well-suited for time series forecasting due to their unique architectures:\n",
    "\n",
    "* NHITS: Builds upon NBEATS by specializing its outputs in different frequencies of the time series through hierarchical interpolation and multi-rate input processing, enhancing long-horizon forecasting accuracy. \n",
    "\n",
    "\n",
    "* BiTCN: Employs bidirectional temporal convolutional networks to capture both past and future dependencies in time series data, making it effective for sequential data modeling. \n",
    "\n",
    "\n",
    "* NBEATS: A deep neural architecture with backward and forward residual links, capable of modeling trend and seasonality components in time series data. \n",
    "\n",
    "\n",
    "* NBEATSx: Extends NBEATS by incorporating exogenous variables, allowing the model to leverage additional information for improved forecasting accuracy. \n",
    "\n",
    "\n",
    "We configure each model with parameters such as input_size, horizon, max_steps, and batch sizes, and specify any exogenous features to be included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Seed set to 1\n",
      "Seed set to 1\n",
      "Seed set to 1\n",
      "Seed set to 1\n"
     ]
    }
   ],
   "source": [
    "# Initialize models\n",
    "models = [\n",
    "    NHITS(\n",
    "        input_size=input_size,\n",
    "        h=horizon,\n",
    "        futr_exog_list=[\"feature_00\", \"feature_01\", \"feature_02\"],\n",
    "        scaler_type=\"robust\",\n",
    "        max_steps=max_steps,\n",
    "        windows_batch_size=windows_batch_size,\n",
    "        batch_size=batch_size,\n",
    "        valid_batch_size=valid_batch_size,\n",
    "    ),\n",
    "    BiTCN(\n",
    "        input_size=input_size,\n",
    "        h=horizon,\n",
    "        futr_exog_list=[\"feature_00\", \"feature_01\", \"feature_02\"],\n",
    "        scaler_type=\"robust\",\n",
    "        max_steps=max_steps,\n",
    "        windows_batch_size=windows_batch_size,\n",
    "        batch_size=batch_size,\n",
    "        valid_batch_size=valid_batch_size,\n",
    "    ),\n",
    "    NBEATS(\n",
    "        input_size=input_size,\n",
    "        h=horizon,\n",
    "        max_steps=max_steps,\n",
    "        windows_batch_size=windows_batch_size,\n",
    "        batch_size=batch_size,\n",
    "        valid_batch_size=valid_batch_size,\n",
    "    ),\n",
    "    NBEATSx(\n",
    "        input_size=input_size,\n",
    "        futr_exog_list=[\"feature_00\", \"feature_01\", \"feature_02\"],\n",
    "        h=horizon,\n",
    "        max_steps=max_steps,\n",
    "        windows_batch_size=windows_batch_size,\n",
    "        batch_size=batch_size,\n",
    "        valid_batch_size=valid_batch_size,\n",
    "    ),\n",
    "]\n",
    "\n",
    "# List of model names\n",
    "model_names = [\"NHITS\", \"BiTCN\", \"NBEATS\", \"NBEATSx\"]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross-validation\n",
    "To assess model performance, we perform cross-validation using the cross_validation method of the NeuralForecast object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "Missing logger folder: /home/marcel/Documents/research/FinTorch/docs/tutorials/marketdata/lightning_logs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name         | Type          | Params | Mode \n",
      "-------------------------------------------------------\n",
      "0 | loss         | MAE           | 0      | train\n",
      "1 | padder_train | ConstantPad1d | 0      | train\n",
      "2 | scaler       | TemporalNorm  | 0      | train\n",
      "3 | blocks       | ModuleList    | 2.4 M  | train\n",
      "-------------------------------------------------------\n",
      "2.4 M     Trainable params\n",
      "0         Non-trainable params\n",
      "2.4 M     Total params\n",
      "9.591     Total estimated model params size (MB)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=0, train_loss_step=14.50, train_loss_epoch=14.40, valid_loss=0.270]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_steps=100` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=0, train_loss_step=14.50, train_loss_epoch=14.40, valid_loss=0.270]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicting DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  0.24it/s] \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "   | Name          | Type          | Params | Mode \n",
      "---------------------------------------------------------\n",
      "0  | loss          | MAE           | 0      | train\n",
      "1  | padder_train  | ConstantPad1d | 0      | train\n",
      "2  | scaler        | TemporalNorm  | 0      | train\n",
      "3  | lin_hist      | Linear        | 80     | train\n",
      "4  | drop_hist     | Dropout       | 0      | train\n",
      "5  | net_bwd       | Sequential    | 1.1 K  | train\n",
      "6  | lin_futr      | Linear        | 64     | train\n",
      "7  | drop_futr     | Dropout       | 0      | train\n",
      "8  | net_fwd       | Sequential    | 3.2 K  | train\n",
      "9  | drop_temporal | Dropout       | 0      | train\n",
      "10 | temporal_lin1 | Linear        | 48     | train\n",
      "11 | temporal_lin2 | Linear        | 85     | train\n",
      "12 | output_lin    | Linear        | 49     | train\n",
      "---------------------------------------------------------\n",
      "4.6 K     Trainable params\n",
      "0         Non-trainable params\n",
      "4.6 K     Total params\n",
      "0.018     Total estimated model params size (MB)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=2, train_loss_step=13.90, train_loss_epoch=14.70, valid_loss=0.275]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_steps=100` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=2, train_loss_step=13.90, train_loss_epoch=14.70, valid_loss=0.275]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicting DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  0.24it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "  | Name         | Type          | Params | Mode \n",
      "-------------------------------------------------------\n",
      "0 | loss         | MAE           | 0      | train\n",
      "1 | padder_train | ConstantPad1d | 0      | train\n",
      "2 | scaler       | TemporalNorm  | 0      | train\n",
      "3 | blocks       | ModuleList    | 2.4 M  | train\n",
      "-------------------------------------------------------\n",
      "2.4 M     Trainable params\n",
      "77        Non-trainable params\n",
      "2.4 M     Total params\n",
      "9.534     Total estimated model params size (MB)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=4, train_loss_step=0.441, train_loss_epoch=0.454, valid_loss=0.234]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_steps=100` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=4, train_loss_step=0.441, train_loss_epoch=0.454, valid_loss=0.234]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicting DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  0.24it/s] "
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "  | Name         | Type          | Params | Mode \n",
      "-------------------------------------------------------\n",
      "0 | loss         | MAE           | 0      | train\n",
      "1 | padder_train | ConstantPad1d | 0      | train\n",
      "2 | scaler       | TemporalNorm  | 0      | train\n",
      "3 | blocks       | ModuleList    | 2.4 M  | train\n",
      "-------------------------------------------------------\n",
      "2.4 M     Trainable params\n",
      "77        Non-trainable params\n",
      "2.4 M     Total params\n",
      "9.663     Total estimated model params size (MB)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=6, train_loss_step=0.428, train_loss_epoch=0.467, valid_loss=0.234]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_steps=100` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 2:  56%|█████▋    | 22/39 [00:37<00:28,  0.59it/s, v_num=6, train_loss_step=0.428, train_loss_epoch=0.467, valid_loss=0.234]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicting DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  0.24it/s] \n"
     ]
    }
   ],
   "source": [
    "# Create NeuralForecast object\n",
    "nf = NeuralForecast(models=models, freq=\"10s\")\n",
    "\n",
    "# Perform cross-validation\n",
    "Y_hat_df = nf.cross_validation(\n",
    "    df=df,\n",
    "    val_size=val_size,\n",
    "    test_size=test_size,\n",
    "    n_windows=None,  # Expanding window\n",
    ").to_pandas()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This approach provides insights into how each model generalizes to unseen data. We then compute evaluation metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) to quantify the models' predictive accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "For each model, we calculate the Mean Squared Error (MSE) and Mean Absolute Error (MAE) as performance metrics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Model: NHITS\n",
      "MSE: 0.1669229418039322\n",
      "MAE: 0.28757739067077637\n",
      "\n",
      "Model: BiTCN\n",
      "MSE: 0.17169682681560516\n",
      "MAE: 0.3052093982696533\n",
      "\n",
      "Model: NBEATS\n",
      "MSE: 0.12209504097700119\n",
      "MAE: 0.257291316986084\n",
      "\n",
      "Model: NBEATSx\n",
      "MSE: 0.13161614537239075\n",
      "MAE: 0.27656877040863037\n"
     ]
    }
   ],
   "source": [
    "from neuralforecast.losses.numpy import mae, mse\n",
    "\n",
    "# Iterate over models to compute metrics\n",
    "for model_name in model_names:\n",
    "    # Extract true values and predictions\n",
    "    y_true = Y_hat_df.y.values\n",
    "    y_hat = Y_hat_df[model_name].values\n",
    "\n",
    "    # Reshape arrays\n",
    "    y_true = y_true.reshape(n_series, -1, horizon)\n",
    "    y_hat = y_hat.reshape(n_series, -1, horizon)\n",
    "\n",
    "    # Compute metrics\n",
    "    mse_score = mse(y_true, y_hat)\n",
    "    mae_score = mae(y_true, y_hat)\n",
    "\n",
    "    # Print results\n",
    "    print(f\"\\nModel: {model_name}\")\n",
    "    print(f\"MSE: {mse_score}\")\n",
    "    print(f\"MAE: {mae_score}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "By leveraging these advanced neural forecasting models, we can effectively tackle the time series forecasting challenges presented in the Jane Street Kaggle competition. Each model offers distinct advantages, enabling us to capture various patterns and dependencies within the financial data, ultimately enhancing our predictive capabilities."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}