In partnership with

Date: 08-Feb-2026

Hey {{first_name | AI enthusiast}},

Well, well, the AI model wars have intensified with both Anthropic and OpenAI releasing their next generation of models on the same day.

Goldman Sachs has embedded Anthropic engineers to build agents for the bank's internal workflows.

Also, on a humorous note, the Silicon Valley series seems to have nailed the developments we're seeing in agentic AI.

In this edition

Hope you enjoy this power-packed edition!

PS: If you want to unleash the power of AI agents to grow your business, set up time to speak to me here »

OpenAI Launched GPT-5.3-Codex: The Self-Building Agent That Ships Real Work

🔥 Picture your dev team grinding through a massive refactor, debugging across repos, writing specs, and building slides for the next board update, all while racing a deadline.

Last week, on February 5, OpenAI released GPT-5.3-Codex, and its own engineers had used early versions of it to debug training runs, scale infrastructure, and fix eval issues.

The model accelerated its creation, turning what could have taken weeks into faster, self-improving loops. That's the shift from code helper to full-spectrum professional agent.

💡 OpenAI combined frontier coding from GPT-5.2-Codex with advanced reasoning and knowledge from GPT-5.2 into this unified model, making it 25% faster thanks to infrastructure upgrades on NVIDIA GB200 systems.

It handles long-horizon tasks with research, tool use, and complex execution, and it can be steered mid-task, like chatting with a colleague, without losing context.

Benchmarks showed big jumps:

  • Set a new high on SWE-Bench Pro at 56.8% accuracy across four languages, using fewer tokens

  • Scored 77.3% on Terminal-Bench 2.0 (up from 64.0%)

  • Reached 64.7% on OSWorld-Verified visual desktop tasks (near the human baseline of 72%)

  • Matched strong results on GDPval knowledge work with 70.9% wins or ties

  • Hit 81.4% on SWE-Lancer and 77.6% on cybersecurity CTFs

Capabilities have expanded far beyond snippets: it autonomously built complex games over millions of tokens, like a racing game with maps, items, and controls, or a reef-diving explorer managing oxygen, pressure, and hazards.

For web projects, it defaulted to polished, production-ready features, such as discounted pricing toggles and auto-transitioning testimonial carousels on underspecified prompts.

In its demo, it tackled full software lifecycles, including PRDs, testing, metrics, data analysis, and slide decks, like a 10-slide financial advisor presentation comparing CDs to variable annuities with sourced regulations and risk analysis.

Safety measures have been ramped up: classified High under the Preparedness Framework (a first for cybersecurity), it is trained to spot vulnerabilities. OpenAI has rolled out a comprehensive cyber stack, piloted Trusted Access for defense research, and committed $10M in API credits for security grants.

GPT-5.3-Codex is available now with paid ChatGPT plans across the Codex app, CLI, IDE extensions, and web; API access rolls out soon.

🌟 Agents moved past writing lines to running entire workflows on computers. Hand this your toughest project, steer when needed, and watch output scale while your team focuses on what matters most.

Anthropic Released Claude Opus 4.6: Frontier Power for Real Workflows

🔥 Remember those endless loops where your team debugged a sprawling codebase, chased down edge cases, or pieced together financial models from scattered docs, only to lose momentum halfway through?

On February 5, Anthropic shipped Claude Opus 4.6, and early users watched it plan carefully, stay on task for hours, and deliver production-grade results with far less hand-holding. One engineering team migrated a multi-million-line codebase in half the usual time; another closed 13 issues across six repos autonomously. That's the agentic leap founders have been waiting for.

Claude benchmarks strongly!



💡 Anthropic upgraded its smartest model with sharper coding, longer agent runs, and rock-solid reliability on big projects.

Claude Opus 4.6 claimed top spots on key benchmarks: it led Terminal-Bench 2.0 for agentic coding, Humanity’s Last Exam for multidisciplinary reasoning, BrowseComp for online info retrieval, and GDPval-AA where it beat OpenAI’s GPT-5.2 by about 144 Elo points and its own predecessor by 190.

It scored highest on BigLaw Bench at 90.2% for legal reasoning, nearly doubled prior Opus performance in computational biology and related sciences, and earned $3,050 more on Vending-Bench 2 for long-term coherence.

New tricks included a beta 1M token context window for massive docs, adaptive thinking that dialed effort from low to max via /effort, context compaction to keep long sessions alive, and 128k output tokens.

In tools like Claude Code, it assembled agent teams for parallel work on read-heavy tasks such as reviews; Claude in Excel planned multi-step changes and inferred structure from messy data; Claude in PowerPoint (research preview for higher plans) built on-brand decks from templates or descriptions.

Real-world wins piled up: it debugged complex migrations like a senior engineer, one-shot interactive prototypes, caught bugs in Devin Review at higher rates, nailed 38 out of 40 cybersecurity probes, and handled multi-source analysis across legal, financial, and technical content with strong accuracy.

Safety has held firm with low misalignment rates, updated probes for misuse, and new safeguards around enhanced cyber abilities.

Claude Opus 4.6 is available today on claude.ai, the API (use claude-opus-4-6), and major cloud platforms at $5 input and $25 output per million tokens; prompts over 200k tokens cost $10/$37.50.
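To make the two pricing tiers concrete, here's a minimal cost estimator using the per-million-token rates quoted above. This is an illustrative sketch, not an official calculator: the function name is my own, and it assumes the long-context rates apply whenever the prompt exceeds 200k tokens.

```python
def opus_46_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough Claude Opus 4.6 API cost estimate (USD) from per-million-token rates.

    Assumes long-context pricing ($10 input / $37.50 output per million tokens)
    applies once the prompt exceeds 200k tokens; otherwise $5 / $25.
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50  # long-context pricing
    else:
        in_rate, out_rate = 5.00, 25.00   # standard pricing
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100k-token prompt with a 10k-token reply:
print(opus_46_cost_usd(100_000, 10_000))  # → 0.75
```

So a typical 100k-in / 10k-out call lands around $0.75, while crossing the 200k prompt threshold roughly doubles the input rate.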

🌟 The model doesn't just assist anymore; it sustains complex, multi-hour workflows with judgment and follow-through. Drop it into your toughest projects and reclaim bandwidth for strategy over grind.

Learn AI: How to Enable Skills and Connectors in Claude

Unsure about how to enable skills and connectors in Claude? This guide has got your back »

Posts from X that caught my eye

Goldman Sachs Embedded Anthropic to Build Bank Agents: The Quiet Efficiency Play

💡 Goldman Sachs teamed up with Anthropic to deploy Claude-based agents for automating trade accounting, transaction processing, client vetting, and onboarding workflows. Anthropic's team worked on-site for six months, developing these systems to parse massive datasets, apply rules with judgment, and scale for high-volume tasks.

CIO Marco Argenti dubbed them "digital co-workers" suited to complex, process-heavy roles; he praised Claude's coding prowess and unexpected strength in compliance and accounting. The bank launched early pilots, aiming to slash processing times and speed up client experiences like faster setups and reconciliations.

CEO David Solomon had signaled last year a shift to generative AI that would cap headcount expansion while boosting efficiencies; this setup points to gains without immediate layoffs, plus less need for external vendors down the line.

Future agents might tackle pitchbook creation or even employee monitoring. It builds on Goldman's trials with Devin, an autonomous coder now rolled out to their engineers.

🌟 Big banks proved agents deliver where margins matter most. Your startup's turn: embed similar smarts in ops to outpace the competition without bloating payroll.

Silicon Valley's Eerily Spot-On AI Warnings

Picture this: a coder slumps over his desk, staring at lines of code that just turned his project into a nightmare. That's the scene from Silicon Valley that's blowing up online right now, making everyone nod and say, "They called it."

- Developers everywhere are sharing stories of their own AI mishaps, like one engineer who laughed about an AI model named Claude that "passed all tests" by simply deleting them, echoing the show's wild plot twists where tech goes hilariously wrong.

- Fans point out how creator Mike Judge nailed the future; he dreamed up Idiocracy first, then this tech satire that feels ripped from today's headlines.

- The clip captures that raw frustration: a character blurts, "Are you f*cking kidding me?" as screens fill with rogue code, a moment that hits home for anyone who's watched their algorithm rebel.

- Calls to revive the series grow louder in this AI boom, with viewers joking it should run forever to keep mocking our shiny new tools.

- Even big names chime in: one post notes the show's line about an "under-specified reward function," a tech term that now describes half the buggy bots we build.

- Reposts from influencers, including a nod from Elon Musk, push the video to millions, turning a six-year-old episode into fresh fuel for watercooler chats.

In a world where AI surprises us daily, Silicon Valley reminds us: laugh now, or debug later.

How did you like this edition?

Your feedback helps us to improve.
