AI traders lose 33% in market test, failing Wall Street interview

An ambitious competition that gave eight of the world's leading artificial intelligence models access to a trading account ended in a collective portfolio loss of roughly 33 percent, a stark demonstration of the gap between AI's analytical prowess and real-world trading acumen. In the event, run by tech startup Nof1, only six of the 32 results (eight models across four rounds) were profitable, challenging the narrative that large language models (LLMs) are ready to trade financial markets autonomously.

"Now is not the time to just give money to an LLM and let it trade on its own," Jay Azhang, founder of Nof1, said in a blunt assessment of the results. "That path is not yet viable."

The Alpha Arena competition provided models including OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude with $10,000 each across four independent rounds to trade U.S. tech stocks over two-week periods. The results were not only poor; the models' behavior was also wildly inconsistent. In one round, Alibaba's Qwen model executed 1,418 trades, while Grok 4.20, a model from Elon Musk's xAI, made just 158.

The outcome highlights a critical distinction for the $1.8 trillion AI industry: the difference between research and execution. While the models from tech giants like Google and OpenAI can process vast amounts of data, they currently lack the nuanced understanding of market timing, position sizing, and risk management essential for profitable trading. This failure suggests that the most immediate impact of AI in finance will be as a co-pilot for human traders, not as an autonomous agent.

Research vs. Reality

Experts note that LLMs excel at research-oriented tasks but falter when executing trades. Azhang pointed out that the models struggle to properly weigh the significance of countless market variables, from analyst ratings to insider trading activity, leading to ill-timed and poorly sized bets. The models also developed distinct "personalities": Claude reportedly favored long positions, while Gemini showed no hesitation in shorting stocks.

The models' strength in research was borne out in a separate benchmark test by Intelligent Alpha. In that study, which focused on predicting the direction of earnings estimate revisions, OpenAI's ChatGPT achieved a 68 percent accuracy rate for the fourth quarter of 2025. This suggests LLMs are powerful tools for analysis that can support human decision-making, even if they cannot yet be trusted to manage a portfolio alone.
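For readers unfamiliar with the metric, a directional accuracy figure like this is typically a simple hit rate: the share of calls where the predicted direction matched what actually happened. The sketch below assumes that definition; the article does not describe how Intelligent Alpha scored its benchmark, and the function name and sample data are hypothetical.

```python
def directional_accuracy(predicted, actual):
    """Share of predictions whose direction matches the realized direction."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Hypothetical calls on five earnings estimate revisions (not real data).
predicted = ["up", "up", "down", "up", "down"]
actual = ["up", "down", "down", "up", "up"]

print(directional_accuracy(predicted, actual))  # 0.6
```

Under this definition, a coin flip on an evenly split sample would score about 50 percent, which is the baseline a figure such as 68 percent should be read against.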

The Problem With Proving Profits

Evaluating AI's trading ability is complicated by a fundamental methodological flaw known as "lookahead bias." A model tested in 2026 on 2020 market data already "knows" the outcome, because its training data includes what happened after 2020, so a historical backtest cannot show whether it could have traded that market in real time. This has forced researchers to use live competitions like Alpha Arena for genuine assessment, though these have their own limitations.
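The lookahead problem can be expressed as a simple date check: a test period is only clean if it begins after the model's training data ends. The following is a minimal sketch under that assumption. The cutoff date, the test windows, and all names in it are hypothetical and do not come from Nof1 or any model vendor.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TestWindow:
    start: date
    end: date


def has_lookahead_risk(training_cutoff: date, window: TestWindow) -> bool:
    """True if the test window begins on or before the training cutoff,
    meaning the model may have seen some of the period's outcomes."""
    return window.start <= training_cutoff


# Hypothetical dates for illustration only.
cutoff = date(2025, 6, 30)
historical = TestWindow(date(2020, 1, 1), date(2020, 12, 31))
live = TestWindow(date(2026, 1, 5), date(2026, 1, 16))

print(has_lookahead_risk(cutoff, historical))  # True
print(has_lookahead_risk(cutoff, live))        # False
```

A live competition passes this check by construction, since every trade happens after the models were trained.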

Jim Moran, a co-founder of YipitData who now writes the Flat Circle blog, argued that most public experiments are too short and noisy to draw firm conclusions. Furthermore, Alexander Izydorczyk, formerly of Coatue Management, noted that none of the AI trading bots he tracks have demonstrated persistent excess returns, likely because they lack the proprietary quantitative techniques used by major hedge funds. As Izydorczyk wrote on his blog, "When an LLM agent trading strategy really starts to work, you won't hear about it right away."

Nof1 plans to run a second season of Alpha Arena, giving the AIs more data and capabilities. However, the firm's core business is providing tools for retail traders to build their own AI agents, not deploying autonomous funds. This business model itself serves as a pragmatic acknowledgment of the current state of AI: it is a powerful tool, but for now, it still needs a human in the loop.

This article is for informational purposes only and does not constitute investment advice.