High-frequency trading happens in a world where decisions are made in milliseconds, and sometimes even faster. At the center of this world is the limit order book: a constantly changing record of buy and sell orders where prices emerge from the interaction of many market participants.
For my bachelor’s thesis in computer science, I am studying how reinforcement learning agents behave in this setting when they are not only trained to make money, but also trained to care about risk.
The working title of the project is:
A Controlled Comparison of Risk-Sensitive Objectives in Multi-Agent Reinforcement Learning for High-Frequency Trading using GPU Acceleration
The core idea is simple to state but challenging to study: if two trading agents interact in the same market, does the way one agent thinks about risk change not only its own behavior, but also the behavior and performance of the other agent?
Why risk matters in trading agents
Many reinforcement learning approaches to trading optimize expected return. In other words, the agent is rewarded for making as much profit as possible on average. That is a useful starting point, but it misses something central to real trading: not all profits and losses are equal.
A strategy with a high average return can still be unacceptable if it occasionally produces very large losses. This is especially important in high-frequency trading, where market participants must manage inventory, transaction costs, slippage, and sudden changes in liquidity.
Classical mathematical finance already treats risk as a central part of the problem. For example, the Avellaneda–Stoikov model for market making includes inventory risk, while the Almgren–Chriss model for order execution balances expected cost against timing risk. These models do not treat risk as an afterthought.
In contrast, many reinforcement learning systems add risk through hand-tuned penalty terms. That makes it hard to know whether the learned behavior comes from a meaningful risk preference or simply from a convenient training trick.
My thesis aims to make this comparison more controlled.
The main research question
The project asks:
To what extent do mean-variance, CVaR, and entropic utility induce different behaviors and risk-return trade-offs in interacting market-making and order-execution agents trained concurrently in a two-agent limit-order-book environment?
In this environment, there are two main agents.
The first is a market maker. This agent continuously places buy and sell quotes and tries to profit from the bid-ask spread. Its main challenge is inventory risk: if it accumulates too much of an asset and the price moves against it, losses can grow quickly.
The second is an order execution agent. This agent needs to buy or sell a target quantity within a fixed time period. Its goal is not just to trade, but to trade efficiently, avoiding unnecessary market impact and slippage.
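To make "slippage" concrete, here is a minimal sketch of one common way to measure it: the volume-weighted average fill price compared with the price at the moment the order arrived. The function name, the `(price, quantity)` fill format, and the example numbers are my own illustrative assumptions, not part of JaxMARL-HFT.

```python
def execution_slippage(fills, arrival_price, side):
    """Per-share slippage of a set of fills relative to the arrival price.

    fills: list of (price, quantity) tuples (hypothetical format).
    side: "buy" or "sell".
    Positive values mean the agent traded at a worse price than on arrival.
    """
    total_qty = sum(qty for _, qty in fills)
    if total_qty <= 0:
        raise ValueError("no quantity was filled")
    # Volume-weighted average price actually paid or received.
    vwap = sum(price * qty for price, qty in fills) / total_qty
    # A buyer is hurt by paying more, a seller by receiving less.
    sign = 1.0 if side == "buy" else -1.0
    return sign * (vwap - arrival_price)


# Illustrative numbers only: buying 1,000 shares in three fills.
fills = [(100.02, 300), (100.05, 500), (100.10, 200)]
slippage = execution_slippage(fills, arrival_price=100.00, side="buy")
```

In this made-up example the agent pays roughly five cents per share more than the arrival price.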
These two agents do not act in isolation. Their decisions affect the same limit order book, which means their risk preferences may influence each other.
The three risk-sensitive objectives
The thesis compares three different ways of making reinforcement learning agents risk-sensitive.
Mean-variance balances expected return against return variability. It penalizes strategies that have unstable outcomes, even if the average result is good. This connects naturally to classical portfolio theory and to the Almgren–Chriss approach to optimal execution.
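As a rough illustration, a mean-variance score can be computed from a batch of episode returns as the mean minus a multiple of the variance. This is only a sketch: the function name, the `risk_aversion` weight, and the choice to apply it to whole-episode returns are assumptions for illustration, not the final thesis implementation.

```python
from statistics import fmean, pvariance


def mean_variance_score(returns, risk_aversion):
    """Mean return minus risk_aversion times the (population) variance."""
    return fmean(returns) - risk_aversion * pvariance(returns)


# Two made-up strategies with the same average return of 1.0.
steady = [1.0, 1.0, 1.0, 1.0]
volatile = [5.0, -3.0, 5.0, -3.0]

steady_score = mean_variance_score(steady, risk_aversion=0.1)
volatile_score = mean_variance_score(volatile, risk_aversion=0.1)
```

Both strategies look identical to a risk-neutral objective, but the volatile one scores lower here because its variance of 16 is penalized.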
CVaR, or Conditional Value at Risk, focuses specifically on the worst outcomes. Instead of caring about all variability equally, it asks: what happens in the bad tail of the distribution? This is especially relevant in trading, where rare but severe losses can dominate practical risk management.
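A minimal sketch of an empirical CVaR estimate, assuming the convention that CVaR at level `alpha` is the average of the worst `alpha` fraction of returns. Other conventions exist (for example, working with losses instead of returns), and the function name and sample data below are illustrative assumptions.

```python
import math


def empirical_cvar(returns, alpha):
    """Average of the worst alpha-fraction of returns (the lower tail)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)  # worst outcomes first
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Made-up P&L sample: mostly small gains, two large losses.
pnl = [1.0] * 6 + [-4.0, -12.0]
average_pnl = sum(pnl) / len(pnl)
tail_pnl = empirical_cvar(pnl, alpha=0.25)
```

Here the average P&L is -1.25, while the average over the worst quarter of outcomes is -8.0, which is the number a CVaR objective would focus on.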
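Entropic utility comes from exponential utility and corresponds to constant absolute risk aversion. It represents risk aversion through a single parameter, and it is connected to robust decision-making: it can be read as a worst-case expected return over alternative probability models, where each alternative is penalized by its relative entropy from the reference model.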
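The sketch below computes the entropic certainty equivalent, -(1/theta) * log E[exp(-theta * X)], from a sample of returns, using a log-sum-exp shift for numerical stability. The function name and the two-outcome example are assumptions for illustration only.

```python
import math


def entropic_certainty_equivalent(returns, risk_aversion):
    """-(1/theta) * log(mean(exp(-theta * x))), computed with log-sum-exp."""
    theta = risk_aversion
    if theta <= 0:
        raise ValueError("risk_aversion must be positive")
    scaled = [-theta * x for x in returns]
    shift = max(scaled)  # subtract the max exponent to avoid overflow
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return -(shift + math.log(mean_exp)) / theta


# A made-up gamble: lose 1 or win 3 with equal probability (mean 1.0).
gamble = [-1.0, 3.0]
ce = entropic_certainty_equivalent(gamble, risk_aversion=1.0)

# Constant absolute risk aversion: adding a sure amount of 10 to every
# outcome shifts the certainty equivalent by the same amount.
ce_shifted = entropic_certainty_equivalent([x + 10.0 for x in gamble], 1.0)
```

The certainty equivalent is below the mean of 1.0, and shifting every outcome by a constant shifts it by the same constant, which is the constant-absolute-risk-aversion property mentioned above.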
These objectives are not interchangeable. Mean-variance penalizes dispersion in both directions, so it treats an unexpectedly large gain much like an unexpectedly large loss. CVaR looks only at the downside tail. Entropic utility weights outcomes exponentially, so a loss counts for more the larger it is. Comparing them in the same environment should help isolate how the structure of the risk objective changes agent behavior.
Why a multi-agent setup is important
A single-agent trading experiment can tell us how one agent behaves against a fixed environment. But real markets are not fixed. They are made up of many participants adapting to one another.
That is why this thesis uses a multi-agent reinforcement learning setup. The goal is not only to ask whether a risk-sensitive market maker behaves differently, but also whether that behavior changes the execution agent’s learning problem, and vice versa.
For example, a more risk-averse market maker may quote differently, manage inventory more conservatively, or reduce exposure to unfavorable market conditions. Those changes could affect liquidity, execution costs, and the strategies learned by the other agent.
This interaction effect is one of the main reasons the project uses a two-agent limit-order-book environment rather than a simpler single-agent simulation.
The technical foundation
The project builds on JaxMARL-HFT, a GPU-accelerated multi-agent reinforcement learning framework for high-frequency trading. This framework makes it possible to train heterogeneous trading agents concurrently on limit-order-book data.
The first stage of the project is infrastructure-focused: setting up the GPU environment, installing JAX with CUDA, cloning and running JaxMARL-HFT, and reproducing the reference two-agent training setup. This baseline is important because every later experiment depends on having a reliable and reproducible starting point.
The thesis plan also considers data availability. If direct LOBSTER access is limited, an alternative path is to convert Databento market-by-order and market-by-price data into a LOBSTER-like format. This would allow the experiments to remain close to the original framework while still being feasible within the thesis timeline.
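To give a feel for what such a conversion involves, here is a rough sketch that maps a single market-by-order record to a LOBSTER-style message row (time, event type, order ID, size, price, direction). Everything about the input is an assumption on my part: I represent the record as a plain dictionary with Databento-style field names (`ts_event`, `action`, `side`, `price`, `size`, `order_id`), assume prices are integers in units of 1e-9 dollars, and assume every cancel removes the whole order. Modifications, trades, and time-zone handling are deliberately left out, so this is a starting point rather than a working converter.

```python
# Assumed LOBSTER event codes: 1 = new limit order, 3 = full deletion,
# 4 = execution of a visible order.
ACTION_TO_LOBSTER_TYPE = {"A": 1, "C": 3, "F": 4}
# Assumed LOBSTER direction codes: 1 = buy order, -1 = sell order.
SIDE_TO_DIRECTION = {"B": 1, "A": -1}


def mbo_to_lobster_row(record, midnight_ns):
    """Map one MBO-style record (a dict) to a LOBSTER-like message tuple.

    Returns None for records this sketch does not handle
    (for example modifications, trades, or records without a side).
    """
    event_type = ACTION_TO_LOBSTER_TYPE.get(record["action"])
    direction = SIDE_TO_DIRECTION.get(record["side"])
    if event_type is None or direction is None:
        return None
    # LOBSTER-style time: seconds after midnight of the trading day.
    seconds_after_midnight = (record["ts_event"] - midnight_ns) / 1e9
    # Assumed input price unit is 1e-9 dollars; LOBSTER uses dollars * 10000.
    price_times_10000 = record["price"] // 100_000
    return (
        seconds_after_midnight,
        event_type,
        record["order_id"],
        record["size"],
        price_times_10000,
        direction,
    )


# Hypothetical record: a 200-share bid at 100.25 added at 09:30:00.5.
midnight_ns = 1_700_000_000_000_000_000  # placeholder start-of-day timestamp
record = {
    "ts_event": midnight_ns + 34_200_500_000_000,
    "action": "A",
    "side": "B",
    "price": 100_250_000_000,
    "size": 200,
    "order_id": 42,
}
row = mbo_to_lobster_row(record, midnight_ns)
```

A real converter would also need to rebuild the matching order-book snapshot file and decide how to represent partial cancels and modifications.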
Planned experiments
The experimental design is built around a controlled comparison.
First, both the market-making agent and the execution agent will be trained under four objectives:
- risk-neutral expected return,
- mean-variance,
- CVaR,
- entropic utility.
Then, the project will train every pairing of objectives between the two agents. This creates a 4×4 grid of 16 configurations: each market-maker objective is paired with each execution-agent objective.
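The grid itself is easy to enumerate. The sketch below uses hypothetical objective labels and a hypothetical configuration format; the actual experiment configs will depend on how JaxMARL-HFT is set up.

```python
from itertools import product

# Hypothetical labels for the four objectives.
OBJECTIVES = ("risk_neutral", "mean_variance", "cvar", "entropic")

# One configuration per (market-maker objective, execution objective) pair.
experiment_grid = [
    {"market_maker": mm_objective, "execution": exec_objective}
    for mm_objective, exec_objective in product(OBJECTIVES, repeat=2)
]
```

The four cells on the diagonal are the runs where both agents share the same objective; the other twelve are the mixed pairings.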
This setup makes it possible to study both individual and interaction effects. For example, I can compare a CVaR-sensitive market maker against a risk-neutral execution agent, or an entropic market maker against a mean-variance execution agent.
The key question is not just which objective performs best, but what kind of behavior each one produces.
How the agents will be evaluated
The evaluation will look at both performance and behavior.
On the performance side, the project will measure quantities such as mean profit and loss, variance, CVaR, drawdown, inventory tails, and slippage. These metrics help describe the risk-return profile of each trained agent.
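As one example of these metrics, maximum drawdown can be computed from a cumulative P&L path as the largest drop from a running peak. This is a minimal sketch with an assumed function name and a made-up P&L path.

```python
def max_drawdown(cumulative_pnl):
    """Largest peak-to-trough decline along a cumulative P&L path."""
    if not cumulative_pnl:
        raise ValueError("cumulative_pnl must not be empty")
    peak = cumulative_pnl[0]
    worst = 0.0
    for value in cumulative_pnl:
        peak = max(peak, value)           # highest P&L seen so far
        worst = max(worst, peak - value)  # deepest drop below that peak
    return worst


# Made-up path: rises to 8.0, then falls to 2.0 before recovering.
path = [0.0, 5.0, 3.0, 8.0, 2.0, 6.0]
drawdown = max_drawdown(path)
```

For this path the drawdown is 6.0, the fall from the peak of 8.0 to the later low of 2.0.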
On the behavioral side, the project will examine action distributions, inventory trajectories, quote skewing, and execution schedule shapes. This is important because two agents may achieve similar returns while using very different strategies.
The goal is to see whether learned policies resemble the behavior predicted by classical models such as Avellaneda–Stoikov for market making and Almgren–Chriss for order execution.
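For reference, the benchmark behaviors can be written down in a few lines. The sketch below uses the textbook closed forms as I understand them: the Avellaneda–Stoikov reservation price and total quoted spread, and the Almgren–Chriss remaining-inventory schedule. The function names, parameter names, and example values are my own assumptions, and the formulas hold only under the simplifying assumptions of the original models.

```python
import math


def as_reservation_price(mid, inventory, gamma, sigma, time_left):
    """Avellaneda-Stoikov reservation price: mid shifted against inventory."""
    return mid - inventory * gamma * sigma**2 * time_left


def as_total_spread(gamma, sigma, time_left, k):
    """Avellaneda-Stoikov total spread (ask offset plus bid offset).

    k is the decay rate of order arrival intensity with quote distance.
    """
    return gamma * sigma**2 * time_left + (2.0 / gamma) * math.log(1.0 + gamma / k)


def ac_remaining_inventory(total_shares, kappa, horizon, n_steps):
    """Almgren-Chriss holdings at each step: X * sinh(kappa*(T - t)) / sinh(kappa*T).

    kappa > 0 is the urgency parameter; as kappa approaches 0 the schedule
    approaches a straight line (this formula itself requires kappa > 0).
    """
    times = [j * horizon / n_steps for j in range(n_steps + 1)]
    denom = math.sinh(kappa * horizon)
    return [total_shares * math.sinh(kappa * (horizon - t)) / denom for t in times]


# Illustrative values only.
reservation = as_reservation_price(mid=100.0, inventory=5, gamma=0.1, sigma=2.0, time_left=1.0)
spread = as_total_spread(gamma=0.1, sigma=2.0, time_left=1.0, k=1.5)
schedule = ac_remaining_inventory(total_shares=1000.0, kappa=1.5, horizon=1.0, n_steps=10)
```

With a long inventory the reservation price sits below the mid price, which is the quote skewing a learned market maker could be compared against. With positive urgency the execution schedule is front-loaded: at the halfway point, less than half of the order remains.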
Current thesis progress
At this stage, the project has a clear research direction, defined terminology, a set of risk-sensitive objectives, and a staged methodology.
The main pieces now in place are:
- the problem formulation,
- the motivation for comparing risk-sensitive objectives,
- the main research question and sub-questions,
- the choice of benchmark models,
- the implementation plan inside JaxMARL-HFT,
- the experimental matrix,
- and the evaluation strategy.
The next major step is to move from planning into implementation: reproducing the baseline JaxMARL-HFT training setup, implementing the three risk-sensitive objectives, and validating them against simple analytical cases before running the full experimental grid.
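One example of such an analytical check, sketched under my own assumptions: for normally distributed returns with mean mu and standard deviation sigma, the entropic certainty equivalent has the closed form mu - theta * sigma^2 / 2, so a sample-based estimate should land close to it. The function name, parameter values, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random


def entropic_certainty_equivalent(returns, theta):
    """Sample estimate of -(1/theta) * log E[exp(-theta * X)]."""
    mean_exp = sum(math.exp(-theta * x) for x in returns) / len(returns)
    return -math.log(mean_exp) / theta


mu, sigma, theta = 1.0, 2.0, 0.5

# Fixed seed so the check is repeatable.
rng = random.Random(0)
samples = [rng.gauss(mu, sigma) for _ in range(100_000)]

estimate = entropic_certainty_equivalent(samples, theta)
closed_form = mu - 0.5 * theta * sigma**2  # Gaussian closed form, 0.0 here
```

If the estimate and the closed form disagree by more than sampling noise would explain, that points to a bug in the objective before any expensive training run is started.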
What I hope to learn
By the end of the thesis, I hope to better understand how different mathematical notions of risk shape the behavior of learning agents in financial markets.
A risk-neutral agent may learn to maximize average profit, but a risk-sensitive agent may learn to avoid dangerous inventory positions, reduce tail losses, or execute trades more cautiously. In a multi-agent setting, these choices may also affect the other participants in the market.
That is the broader motivation behind the project: not just to train trading agents that perform well, but to understand how their risk preferences change the market interactions they participate in.
In high-frequency trading, intelligence is not only about making fast decisions. It is also about knowing which risks are worth taking.