Anthropic Launches Claude Opus 4.5, Outperforming Competitors in Programming Tasks

Claude Opus 4.5 by Anthropic sets new benchmarks in programming AI, surpassing Gemini 3 Pro and GPT-5.1 in various tests and capabilities.

Claude Opus 4.5 Released

On November 25, Anthropic announced the launch of its flagship programming model, Claude Opus 4.5. According to the company, this model is the most powerful in the world for programming, agents, and computer usage.

In the real-world software engineering test, SWE-bench Verified, Claude Opus 4.5 became the first AI model to score over 80%, outperforming its predecessor Claude Sonnet 4.5, as well as Gemini 3 Pro and GPT-5.1 Codex-Max released last week.

Image 1

Anthropic also tested Claude Opus 4.5 on a challenging home exam for human engineers, where it scored higher than any previous human candidates within the two-hour limit. This indicates that the AI model has surpassed excellent human applicants in critical technical skills.

Programming is not the only area where Claude Opus 4.5 has improved; its visual, reasoning, and mathematical capabilities are superior to previous versions, making it well-suited for deep research and handling everyday tasks like slides and spreadsheets.

Image 2

Moreover, the pricing for the Claude Opus series has significantly decreased. Claude Opus 4.5 is priced at $5 per million tokens (input) and $25 (output), only one-third of the cost of the previous Claude Opus 4.1. Anthropic has also removed usage limits specifically for the Opus series.

Image 3

Claude Opus 4.5 is now available in the Claude application and API, but users must subscribe to the highest tier plan at $200/month before accessing Opus. It is also available on major cloud platforms including AWS, Google Cloud, and Microsoft Azure.

Frontend Performance Leap: Perfectly Recreating Minecraft

What is the practical performance of Claude Opus 4.5? In the comments section of the official model release announcement, many users have shared their firsthand experiences.

In terms of frontend capabilities, Guillermo, CEO of the frontend developer platform Vercel, created a shopping website using Claude Opus 4.5, producing the following results in one go:

Image 4

Guillermo remarked that Claude Opus 4.5 is on a completely different level, astonishingly good.

Image 5

Another user showcased four Hero Sections created with Claude Opus 4.5, which are important areas in websites or apps designed to attract user attention. The designs exhibit a high-end feel in both typography and layout.

Image 6

One user even recreated a clone of Minecraft using Claude Opus 4.5, testing the model’s performance on more complex projects. Claude Opus 4.5 successfully generated 3,500 lines of code in one attempt, suggesting it won’t cut corners like Gemini 3.0 Pro.

Image 7

The recreated Minecraft game features various biomes (plains, deserts, snowy areas), transparent blocks for leaves and water, an impressive inventory, and crafting system—all integrated into a single game. It even created cloud effects, with users claiming they had never seen any model achieve this before.

Image 8

Dan Shipper, co-founder and CEO of the AI subscription platform Every, expressed that every six months to a year, a truly industry-changing model emerges, and Claude Opus 4.5 is that model. Shipper stated it is the best programming model he has ever used, bar none.

Image 9

Leading in Seven Programming Language Tests with Enhanced Security

Before its release, Anthropic conducted internal testing on the Claude Opus 4.5 model. Testers reported that it could handle ambiguous situations and weigh pros and cons without excessive guidance.

When faced with complex multi-system errors, Claude Opus 4.5 could find solutions on its own, tackling tasks that Claude Sonnet 4.5 struggled with weeks earlier. Anthropic’s testers informed the model team that Claude Opus 4.5 truly understands its domain.

Anthropic shared Claude Opus 4.5’s performance across various benchmark tests. In the SWE-bench Multilingual test, which assesses proficiency in multiple programming languages, Claude Opus 4.5 led in performance across seven out of eight languages.

Image 10

In the BrowseComp-Plus test, which evaluates deep search agent capabilities, Claude Opus 4.5 showed about a 4.7% advantage over Claude Sonnet 4.5.

Image 11

Claude Opus 4.5 also excelled in several common benchmark tests. For instance, in the τ2-bench, which measures agent capabilities, the model was tasked with acting as an airline customer service representative to assist a passenger in difficulty.

The benchmark required the model to refuse to modify an economy class ticket, as the airline does not allow changes to that fare class. However, Claude Opus 4.5 devised a clever and reasonable solution: upgrade the ticket first and then modify the flight.

From a technical standpoint, the benchmark deemed this approach a failure due to its unexpected method of assisting the customer. However, this creative problem-solving is a significant advancement.

In other cases, finding clever ways to bypass expected restrictions might be seen as rewarding hacking—where the model manipulates rules or objectives in unexpected ways.

Preventing such deviations is one of the goals of Anthropic’s safety testing. In internal evaluations, Claude Opus 4.5 exhibited concerning behavior slightly over 10% of the time, significantly lower than the 20% seen in GPT-5.1 and Gemini 3 Pro.

Image 12

Claude Opus 4.5 has made significant progress in defending against prompt injection attacks, which secretly embed deceptive instructions to induce harmful behavior. Opus 4.5 is harder to deceive through prompt injection than any other leading model in the industry.

Image 13

New Thinking Intensity Control and Context Compression Features

Alongside the latest model release, Anthropic announced a series of new features for the Claude developer platform.

With the increase in model intelligence, they can solve problems in fewer steps: reducing backtracking, redundant exploration, and lengthy reasoning. Claude Opus 4.5 significantly reduces token consumption while achieving the same or better results compared to previous models. However, different tasks require different trade-offs—developers may sometimes want the model to think through problems longer, while at other times, they need a quicker response.

Through the new “effort parameter” in the Claude API, developers can choose to minimize time costs or maximize model capabilities.

At medium intensity settings, Claude Opus 4.5 achieved the best score in the SWE-bench Verified test while reducing output tokens by 76% compared to Sonnet 4.5.

At maximum intensity, its performance surpassed Claude Sonnet 4.5 by 4.3 percentage points while saving 48% of tokens.

Image 14

Combining intensity control, context compression, and advanced tool usage capabilities, Claude Opus 4.5 can handle more persistent and complex tasks while reducing human intervention. Notably, OpenAI’s recently launched GPT-5.1 Codex Max also features the new context compression capability.

The Claude developer platform has made breakthroughs in context management and memory capabilities, significantly enhancing agent task performance. Claude Opus 4.5 excels in coordinating teams of sub-agents, supporting the construction of complex and well-coordinated multi-agent systems. Test data shows that this combination of technologies has improved Claude Opus 4.5’s performance in deep research assessments by nearly 15 percentage points.

Anthropic continues to enhance the composability of its developer platform, providing foundational modules for efficiency control, tool usage, and context management to help developers precisely build the desired functionalities.

In terms of products, Claude Code received a dual upgrade with Claude Opus 4.5: the planning mode can create more precise plans and execute them thoroughly—first proactively asking clarifying questions, then generating an editable plan.md file before carrying out operations.

This feature is now available on the desktop application, supporting parallel execution of local and remote sessions, enabling multi-agent collaboration (such as simultaneous code fixes, GitHub research, and document updates).

For users of the Claude application, long conversations are no longer limited by context length; the system will automatically summarize earlier dialogue content to maintain continuity in communication.

Claude for Chrome, available to all Max users, now supports task handling across browser tabs; the Claude for Excel feature released in October has expanded testing permissions to all Max, Team, and Enterprise users. These updates are all made possible by the enhancements in computer operations, spreadsheet processing, and long-duration task management with Claude Opus 4.5.

Image 15

For users with access to Claude Opus 4.5 in Claude and Claude Code, the platform has lifted the exclusive limits on Opus. For Max and Team Premium users, overall usage limits have increased, meaning users can now utilize Opus tokens equivalent to the previous Sonnet allocation.

Conclusion: Focus on Long-Term, End-to-End Capabilities in Programming Model Upgrades

With the release of Claude Opus 4.5, programming models have reached a new benchmark. Its breakthroughs in complex task planning, multi-agent collaboration, and long-term task handling signify that AI is evolving from a “code completion tool” to an “end-to-end development partner.”

Recent developments in programming models by companies like Anthropic and OpenAI are increasingly focusing on the efficient execution of long-term tasks and the end-to-end completion of large-scale projects. With improvements in absolute performance and reductions in usage costs, the software development process may undergo profound changes.

Was this helpful?

Likes and saves are stored in your browser on this device only (local storage) and are not uploaded to our servers.

Comments

Discussion is powered by Giscus (GitHub Discussions). Add repo, repoID, category, and categoryID under [params.comments.giscus] in hugo.toml using the values from the Giscus setup tool.