William Webster
- Jan 4

Coping with Hallucinations in Treasury & Risk

What they are and what causes them

Hallucination is the term for a Large Language Model (LLM) producing incorrect outputs. LLMs such as GPT-4 are trained on vast datasets consisting largely of text from the internet. They learn to predict the next word in a sequence, which makes them powerful tools for generating human-like text. This strength is also the source of the weakness: the model can generate plausible but factually incorrect or misleading information. This often stems from the model's reliance on patterns in its training data rather than an understanding of factual truth. Hallucinations are caused by:

Overgeneralisation: LLMs sometimes overgeneralise from their training data, leading to outputs that are too broad or generic, and at times, inaccurate.
Data Quality: The training data may contain inaccuracies, biases, or outdated information, which the model can inadvertently learn and replicate.
Context Limitation: LLMs have a limited context window (the number of tokens, roughly words or word fragments, they can consider at one time), which can result in missing or misunderstanding key parts of an input query, leading to incorrect outputs.
Lack of Real-World Understanding: Unlike humans, LLMs don’t have real-world experience or common-sense reasoning. They operate purely based on statistical patterns in the data they’ve seen.

Not entirely detrimental

In the context of treasury and risk, hallucinations are seen as detrimental because they produce unexpected and potentially erroneous results. However, it's worth asking whether hallucinations are intrinsic to predictive-text models, which are essentially statistical models that generate output based on the data they were trained on. We can prompt these models in various ways to steer the output, and I believe hallucination is almost a natural part of how LLMs work. In many respects, we can benefit from this in risk management. Many of the reports we run, whether linear or scenario-based, are shaped by our ingrained biases, our views on how the world works, and regulatory norms. A savvy risk manager should consider using these models to explore some of the more nuanced features of risk in the business by questioning whether the reports currently being generated have any obvious holes or biases. Working with LLMs to generate specific stress scenarios can help investigate whether there are weaknesses in the business model. I addressed this in an earlier article in this series, "Simple Uses for LLMs - Stress and Scenario Creation," which works through examples of creating different stress scenarios and the mitigants that can arise.

When using these LLMs in the context of treasury and risk management, overgeneralisation tends to be more prevalent when you start from scratch and ask the model to generate output without giving much thought to the prompting process, which is something we have control over. Nevertheless, I have regularly experienced overgeneralisation in the responses from GPT-4, particularly when I've asked broad questions about risk.

The data quality issue is pertinent when dealing with treasury-type issues. The general-purpose model behind the LLM has been trained on text from the internet, so you have to be wary of some of its outputs. When conducting specific analysis for the business, you are likely to supply your own data, either through an API or in the prompt window. This means that the data needs to be clean; there is nothing new in this: garbage in, garbage out.
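A minimal sketch of the kind of pre-prompt data check this implies is shown below. The field names, the sample rows and the `find_data_issues` helper are illustrative assumptions, not part of any particular system; a real check would follow your own balance sheet schema.

```python
import math

# Assumed schema for illustration; replace with your own field names.
REQUIRED_FIELDS = ("instrument", "notional", "maturity_years")


def find_data_issues(rows):
    """Return (row_index, problem) pairs for rows that look unclean."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Flag required fields that are absent or empty.
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                issues.append((i, f"missing {field}"))
        notional = row.get("notional")
        if isinstance(notional, float) and math.isnan(notional):
            issues.append((i, "notional is NaN"))
        maturity = row.get("maturity_years")
        if isinstance(maturity, (int, float)) and maturity < 0:
            issues.append((i, "negative maturity"))
        # Flag exact repeats of an earlier row.
        key = (row.get("instrument"), row.get("notional"), row.get("maturity_years"))
        if key in seen:
            issues.append((i, "duplicate row"))
        seen.add(key)
    return issues


# Hypothetical rows, for illustration only.
rows = [
    {"instrument": "Fixed-rate loan", "notional": 100.0, "maturity_years": 5},
    {"instrument": "Term deposit", "notional": None, "maturity_years": 1},
    {"instrument": "Fixed-rate loan", "notional": 100.0, "maturity_years": 5},
    {"instrument": "Swap", "notional": 50.0, "maturity_years": -2},
]

issues = find_data_issues(rows)
for index, problem in issues:
    print(f"row {index}: {problem}")
```

The point of running something like this before building the prompt is that the model will not tell you the input was wrong; it will simply analyse whatever it is given.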

Context limitation is an issue that I experienced frequently last year. However, the newer models support much longer context windows, and I see this as a less important issue going forward.

Experience counts

Lack of real-world experience and understanding is a problem for these LLMs. They currently do extremely well in analysing situations, but they often miss the nuances that we see and have gained experience in while working in treasury and risk management. In extreme cases, the output can range from amusing to outright harmful. Let's illustrate this with an example where I've experienced a hallucination using GPT-4, and discuss some of the implications and consequences.

When analysing a gap report, GPT-4 focused on liquidity risk management and prioritised it. This was somewhat surprising, and although insightful, it wasn't quite what I had in mind. Without my experience in markets, it would have been difficult to detect the subtle difference between liquidity and interest rate risk, as the two are often closely related but not the same. To overcome the problem, a specific prompt focusing on interest rate risk was necessary.
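The difference between the two views can be sketched with a toy gap report. The positions, the bucket boundaries and the function names below are assumptions made for illustration: the same balance sheet is bucketed once by contractual maturity (a liquidity view) and once by next repricing date (an interest rate view).

```python
from collections import defaultdict


def bucket(years):
    """Map a time in years to a coarse reporting bucket (assumed boundaries)."""
    if years <= 0.25:
        return "0-3m"
    if years <= 1.0:
        return "3-12m"
    return ">1y"


def gap_report(positions, date_field):
    """Sum signed amounts per bucket, using the chosen date field."""
    gaps = defaultdict(float)
    for position in positions:
        gaps[bucket(position[date_field])] += position["amount"]
    return dict(gaps)


# Hypothetical positions: assets positive, liabilities negative.
positions = [
    {"name": "5y floating-rate loan", "amount": 100.0,
     "maturity_years": 5.0, "repricing_years": 0.25},
    {"name": "1y fixed-rate deposit", "amount": -100.0,
     "maturity_years": 1.0, "repricing_years": 1.0},
]

liquidity_gap = gap_report(positions, "maturity_years")
repricing_gap = gap_report(positions, "repricing_years")

print("Liquidity gap:", liquidity_gap)
print("Repricing gap:", repricing_gap)
```

In this example the floating-rate loan sits in the longest bucket for liquidity purposes but in the shortest bucket for interest rate purposes, which is why a commentary written from one view cannot simply be read as the other.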

At a later juncture, I asked GPT-4 to provide a delta report. Normally, you would expect a delta report to represent a shift in the yield curve leading to a change in the present value of the portfolio or balance sheet in question. GPT-4, however, provided a report based on the nominal amounts of the balance sheet rather than the risk amounts. A novice might accept this at face value, leading to significant trouble. Following it blindly, your interest rate risk analysis would be entirely wrong, and acting on it could lead to hedging non-existent risks, exacerbating the situation.
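To make the distinction concrete, here is a minimal sketch of what a delta report measures, assuming a flat yield curve, annual compounding and a parallel shift of one basis point. The cash flows, the 4% rate and the function names are illustrative assumptions, not figures from the analysis described above.

```python
def present_value(cash_flows, rate):
    """Discount (time_in_years, amount) pairs at a flat, annually compounded rate."""
    return sum(amount / (1.0 + rate) ** t for t, amount in cash_flows)


def delta_per_basis_point(cash_flows, rate, shift=0.0001):
    """Change in present value for a parallel upward shift in the curve."""
    return present_value(cash_flows, rate + shift) - present_value(cash_flows, rate)


# Hypothetical positions: assets are positive cash flows, liabilities negative.
five_year_loan = [(5.0, 100.0)]      # receive 100 in five years
one_year_deposit = [(1.0, -100.0)]   # repay 100 in one year
balance_sheet = five_year_loan + one_year_deposit

# What a nominal-based report would show: the amounts net to zero.
nominal_net = sum(amount for _, amount in balance_sheet)

# What a delta report should show: the present value sensitivity to rates.
risk_delta = delta_per_basis_point(balance_sheet, rate=0.04)

print(f"Net nominal position: {nominal_net:.2f}")
print(f"Delta per basis point: {risk_delta:.4f}")
```

The nominal amounts net to zero, yet the present value falls when rates rise because the asset is much longer-dated than the liability. A report built on nominal amounts would miss this exposure entirely.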

To add more detail, I asked GPT-4 to provide a swap overlay onto the balance sheet risk for an institution, and the result was incorrect. Following it would be financially risky, leaving you with massive market exposure and blowing your limits out of the water. Relying strictly on the recommendations of GPT-4 could quickly damage your credibility.

Despite these issues, the problem of hallucination seems to be far less prevalent than it was 12 months ago when we first started using these models. The current set of models and their training appear to have reduced many of the possibilities for hallucinations. However, when using an LLM, you need to be aware that it can produce outputs that seem wholly correct and plausible when they are in fact false or inaccurate. This can be embarrassing, if not dangerous, for organisations, especially for financial services firms regulated by the PRA or FCA.

Credibility at stake

When considering the use of LLMs in treasury and risk management, it's important to note that we are typically not customer-facing. The users we work with are usually internal, digesting financial information to make business decisions. Thus, the issue becomes one of credibility. If you create content using LLMs, disseminate it within the organisation, such as in a management meeting, and it is found to be incorrect, your credibility is at stake. This, I believe, is a major reason to exercise caution when producing output with LLMs. It would be unwise to allow a junior staff member with little experience to use these models to generate content that feeds directly into senior management. Often, junior staff are unable to discern whether the output is correct, misleading, or factually inaccurate. Senior management, provided with incorrect information, would not only be annoyed but would also likely take action. Indeed, such errors could put the organisation at risk, especially if they lead to changed management decisions or result in misleading or inaccurate reporting to regulators. At this stage, you must read and thoroughly understand whatever you produce before using it in the workplace.

Colour coding

When using LLMs in treasury and risk management, a clear and open discussion about their use is necessary. Initially, we should question the output until we are entirely comfortable with its accuracy and reliability. For instance, if LLMs are used in generating reports for an ALCO meeting or a board meeting, the text could be colour-coded, or a graph could be annotated to indicate that an LLM was used in the generation. This transparency is helpful as it not only demonstrates the power of LLMs in the business but also draws attention to their usage, prompting caution until we are entirely comfortable with the material.
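One way to carry this labelling through a reporting pipeline is to record provenance alongside each section of text. The sketch below is an assumption about how that might look: the `ReportSection` class, its fields and the label wording are hypothetical, and a real report would map the label to a colour or annotation in whatever tool produces the pack.

```python
from dataclasses import dataclass


@dataclass
class ReportSection:
    title: str
    body: str
    llm_generated: bool = False
    reviewed_by: str = ""  # empty until a named person has checked the text


def render(section):
    """Return the section as text, with a visible label if an LLM was involved."""
    if not section.llm_generated:
        label = ""
    elif section.reviewed_by:
        label = f" [LLM-assisted, reviewed by {section.reviewed_by}]"
    else:
        label = " [LLM-generated, NOT YET REVIEWED]"
    return f"{section.title}{label}\n{section.body}"


# Hypothetical sections, for illustration only.
manual = ReportSection("Funding position", "Prepared by the treasury team.")
draft = ReportSection("Liquidity commentary", "Draft commentary.", llm_generated=True)
checked = ReportSection("Rate outlook", "Checked commentary.",
                        llm_generated=True, reviewed_by="Head of ALM")

for section in (manual, draft, checked):
    print(render(section))
    print()
```

Keeping the flag on the data rather than in the formatting means the label cannot be lost when text is copied from one report to another.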

Parallel running

I also suggest that it is important to run processes and risk management systems in parallel with LLMs, rather than substituting them, until you are entirely comfortable with what they are generating. This parallel running is a “compare and contrast” exercise and may help highlight either anomalies in what the LLM produces or gaps in the information provided by existing legacy systems.

For example, if you were producing a liquidity report, you might ask the LLM to comment on the liquidity situation and add this to the regular information you generate before fully transitioning from one approach to another. Additionally, I wouldn’t overload reporting with LLM-generated material too quickly, as it would be difficult to trace exactly what is being generated and to compare the quality and consistency of the new with the old.
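The "compare and contrast" step can be sketched as a simple reconciliation between figures from the existing system and figures quoted in the LLM's output. The metric names, values, tolerance and the `compare_parallel_run` helper are all hypothetical, chosen only to illustrate the idea.

```python
def compare_parallel_run(legacy, llm, tolerance=0.01):
    """Return (metric, legacy_value, llm_value, reason) for each discrepancy.

    A metric is flagged if it is missing from either source or if the
    relative difference between the two values exceeds `tolerance`.
    """
    issues = []
    for metric in sorted(set(legacy) | set(llm)):
        old, new = legacy.get(metric), llm.get(metric)
        if old is None or new is None:
            issues.append((metric, old, new, "missing from one source"))
            continue
        scale = max(abs(old), abs(new), 1e-12)
        if abs(old - new) / scale > tolerance:
            issues.append((metric, old, new, "outside tolerance"))
    return issues


# Hypothetical figures, for illustration only.
legacy_report = {"lcr": 1.42, "nsfr": 1.18, "survival_days": 95}
llm_report = {"lcr": 1.42, "nsfr": 1.31, "hqla_buffer": 250.0}

discrepancies = compare_parallel_run(legacy_report, llm_report)
for metric, old, new, reason in discrepancies:
    print(f"{metric}: legacy={old}, llm={new} ({reason})")
```

A mismatch is not automatically an LLM error. A metric that appears only on the LLM side may point to a gap in the legacy reporting, which is the other half of the exercise.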

Use the three lines of defence

Another way to manage the introduction of LLMs in treasury and risk management is to ensure there is independent oversight of these models. We use a three-lines-of-defence approach to many aspects of our operations, and this can be valuable in assessing whether an LLM inadvertently exposes us to more risk. There needs to be an open and frank discussion about the use of LLMs in treasury and risk management, including with senior management and risk management. Risk management needs the skills and understanding to assess whether the output generated by the models, and particularly how it is used in the firm, is beneficial or could create harmful situations for the business. This discussion should be constructive, recognising that LLMs offer a significant increase in efficiency, provided there is solid independent oversight of their use.

A further area of investigation should look at the use of LLMs in compliance and regulatory reporting. Incorrect reporting in these areas can raise questions about the governance of the firm. In treasury and risk, particular attention should be paid to reporting on liquidity, interest rate risk, market risk, and capital. These are all areas where LLMs can significantly improve efficiency but require full oversight by those in charge.

Scope of audit

Moreover, the audit function has a responsibility to delve into these models to ensure they do not lead to unintended consequences. For example, do these models reinforce bias in the business? Are we overlooking pipeline risk? Are we risk-seeking with deposits without considering the consequences? Is there a tendency to think that interest rates are always moving in one direction? Internal audit should also examine whether balance sheet and risk data are leaking out of the business via the LLM into the public domain. A themed audit could drill down into the data produced and its use in the firm or reach upwards towards the management suite to assess governance and oversight.

I’m also aware that GPT-4 can subtly alter content, giving it a slightly different meaning from what was intended. This too is something to be wary of, and I advise thoroughly checking any output before it is used in the firm. What does this mean for the future?

An opportunity to be grasped

It means that LLMs currently offer an opportunity to increase the efficiency of those working in treasury and risk, but output from LLMs used in the business, either from a reporting or governance perspective, must be sense-checked to ensure its robustness. As LLMs improve, we will likely grow more reliant on their accuracy and efficacy, similar to the first risk management systems used in treasury 40 years ago. The rapid development of LLMs suggests that within a few years, they will reach "escape velocity," capable of completing tasks at a human level with minimal prompting, significantly reducing the need for human involvement in the process between trades and the C-suite.
