Cohere's New Open Model Finally Makes Sense for Small Teams

Key Takeaways

- Command A+ has 218B params but only fires 25B per token. So it runs on two H100s, 4-bit quantized. First Apache 2.0 model from a Western AI lab.
- Benchmarks jump: 85% on τ²-Bench Telecom agent tasks (vs. 37% before), 25% harder coding wins, and a 37 on the Artificial Analysis Intelligence Index. Basically Claude 4.5 Haiku territory.
- Every factual claim links back to its source document. Native. Traceable. One button for "where'd this come from."
- Build, sell, keep the money. Apache 2.0. No NC clause. No tiered anything.
- Hugging Face, vLLM, Cohere's API. All live as of May 20, 2026.

---

Something dropped on May 20 that I keep thinking about.

Cohere published Command A+. Full Apache 2.0. Not "open weights with restrictions." Not "research access only." The same license Apache HTTP Server runs on. You can ship it in a product, charge for it, pocket the check.

No permissions needed.

218 billion parameters.

Mixture-of-experts (MoE) architecture. Only 25 billion kick in per token, though. Hence the crazy efficiency: it runs on two NVIDIA H100s with 4-bit quantization (W4A4). Two years ago that hardware would've struggled to fit a 7B model.
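Here's a back-of-envelope check on that hardware claim. It's a rough sketch, not a sizing guide: it assumes the 80 GB H100 variant, counts weights only, and ignores the small overhead of quantization scales. Note that in an MoE model every expert still has to sit in memory, so the footprint follows the 218B total, not the 25B active.

```python
# Back-of-envelope: do 4-bit weights for a 218B-parameter model fit on two H100s?
TOTAL_PARAMS = 218e9     # total parameters (from the article)
ACTIVE_PARAMS = 25e9     # parameters active per token (from the article)
BITS_PER_WEIGHT = 4      # W4A4 stores weights in 4 bits
GPU_MEMORY_GB = 80       # assumption: the 80 GB H100 variant
NUM_GPUS = 2


def weight_footprint_gb(params: float, bits: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
budget_gb = GPU_MEMORY_GB * NUM_GPUS
headroom_gb = budget_gb - weights_gb   # left over for KV cache and activations

print(f"4-bit weights: {weights_gb:.0f} GB of {budget_gb} GB")
print(f"Headroom: {headroom_gb:.0f} GB")
print(f"Same weights at 16-bit: {weight_footprint_gb(TOTAL_PARAMS, 16):.0f} GB")
print(f"Active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The 16-bit line is the useful comparison: without quantization the same weights would need several times the memory of two cards.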

The speed number that changes the calculus

Most people lock onto benchmark comparisons. "Ties Sonnet 3 on coding." "Beats GPT-4o on Math." Cool.

Not useful though.

What matters: 281 output tokens per second on Cohere's own API. That's not batch-processing speed. That's real-time. Conversational. Production.

Also: 37 on the Artificial Analysis Intelligence Index. On par with Claude 4.5 Haiku, a model people actually pay for. Not "close enough to par" like Llama 3.1 70B claims. Actual par.

And the third thing. 85% on τ²-Bench Telecom agent tasks. Previous version hit 37%. That's not an incremental bump.

That's a completely different product.

If you're doing AI automation for clients, especially in telecom, finance, or anything regulated, this is the jump you've been waiting to see.

The citation feature that nobody's highlighting

Everyone's writing about the license and the hardware. Makes sense, they're the easy headlines.

But I'm actually watching the grounding spans.

Every factual claim in the model's output can link directly back to the source document or database row it came from.

Not "this answer is grounded." Each claim. Individually. Traced.

For a small shop building a RAG pipeline?

Finally auditable outputs. "Show me where this came from" stops being a research project. It's a button you press.

For healthcare, finance, government clients. Air-gapped deployment. Two H100s. Native citations. That's a product spec, not a research paper. First open model I've seen built for that constraint specifically.
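To make the per-claim idea concrete, here's a minimal sketch of what an audit step over grounding spans could look like. Everything in it is hypothetical: `GroundingSpan`, `span_for`, `audit`, and the sample sources are illustrative names and data, not Cohere's actual response format. Check the API docs for the real field names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GroundingSpan:
    """Hypothetical record: one claim in the answer plus the sources behind it."""
    start: int                    # character offset where the claim begins
    end: int                      # character offset just past the claim
    source_ids: Tuple[str, ...]   # documents or database rows that support it


def span_for(answer: str, claim: str, source_ids: Tuple[str, ...]) -> GroundingSpan:
    """Build a span by locating the claim text inside the answer."""
    start = answer.index(claim)
    return GroundingSpan(start, start + len(claim), source_ids)


def audit(answer: str, spans: List[GroundingSpan],
          sources: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (claim, source id, source text) rows a reviewer can check."""
    rows = []
    for span in spans:
        claim = answer[span.start:span.end]
        for source_id in span.source_ids:
            rows.append((claim, source_id, sources[source_id]))
    return rows


# Made-up sample data for illustration only.
sources = {
    "pricing.csv#row12": "plan=A, monthly_usd=40",
    "plans.md": "Plan A includes 5 GB of data.",
}
answer = "Plan A costs $40 per month. It includes 5 GB of data."
spans = [
    span_for(answer, "Plan A costs $40 per month.", ("pricing.csv#row12",)),
    span_for(answer, "It includes 5 GB of data.", ("plans.md",)),
]

report = audit(answer, spans, sources)
for claim, source_id, text in report:
    print(f"{claim!r} <- {source_id}: {text!r}")
```

The point of the shape: each claim carries its own source IDs, so "show me where this came from" is a lookup, not a search.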

The licensing thing nobody expected

Previous Cohere models shipped under Creative Commons BY-NC 4.0.

That non-commercial clause silently killed a lot of projects. Experiment all you want. But charge for access or embed it in client work? You needed a separate commercial agreement. Always.

Command A+ runs Apache 2.0.

Here's why that matters more than people write about it: build something on this, sell it to a client next quarter, keep every dollar. No royalties. No revenue sharing. No "contact our team for enterprise pricing."

For solo operators who got burned by Llama's custom license amendments or Mistral's tiered terms, this is the first time a Western lab handed over frontier-grade capability with zero strings attached.

What you actually do with this

Stop waiting for the "right time" to evaluate open-source models in your stack. It's available on Hugging Face and vLLM right now. Run it on your own hardware today.

That W4A4 quantized version fits on two H100s. Rent a couple for a week, then benchmark against whatever you're paying for via API.
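If you want a starting point for that benchmark, here's a small harness sketch. It assumes you can wrap each backend (self-hosted or API) in a function that takes a prompt and returns a list of output tokens. `fake_backend` is a hypothetical stand-in so the script runs on its own; swap in your real client calls.

```python
import time
from typing import Callable, List


def measure_tokens_per_second(generate: Callable[[str], List[str]],
                              prompts: List[str]) -> float:
    """Run `generate` on each prompt; return output tokens per wall-clock second."""
    total_tokens = 0
    started = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - started
    return total_tokens / elapsed


def fake_backend(prompt: str) -> List[str]:
    """Hypothetical stand-in for a model call; replace with your own client code."""
    time.sleep(0.01)          # pretend the model took 10 ms
    return prompt.split()     # pretend each word is one output token


prompts = ["summarize this support ticket for me"] * 5
tps = measure_tokens_per_second(fake_backend, prompts)
print(f"{tps:.0f} output tokens/second")
```

This measures one request at a time. Real serving numbers depend heavily on batching and concurrency, so run it with the same prompts and the same load pattern against both backends before comparing.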

If you're serving regulated clients: this quarter is your window to pitch the air-gapped story. Citation feature + footprint + Apache license.

That combo didn't exist six months ago.

And if you run an AI automation agency and haven't touched your model mix since Q1? This is your sign.

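Per-token API bill drops to zero. You pay for the GPUs instead. And that 281 tokens/second was measured on Cohere's API, so check what your own hardware delivers.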

The cost math actually works now.
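Here's one way to run that math yourself. A rough sketch under loud assumptions: the GPU rental rate and the API price below are placeholders I made up, and the 281 tokens/second figure comes from Cohere's API, not from a self-hosted box. Plug in your own quotes and your own measured throughput.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, num_gpus: int,
                                 tokens_per_second: float) -> float:
    """Dollars per million output tokens when the GPUs are busy the whole hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


def breakeven_tokens_per_month(gpu_hourly_usd: float, num_gpus: int,
                               api_usd_per_million: float,
                               hours_per_month: float = 720) -> float:
    """Monthly token volume at which always-on GPUs cost the same as the API."""
    gpu_monthly_usd = gpu_hourly_usd * num_gpus * hours_per_month
    return gpu_monthly_usd / api_usd_per_million * 1_000_000


GPU_HOURLY_USD = 2.50        # placeholder rental rate per H100, not a real quote
API_USD_PER_MILLION = 10.0   # placeholder API price, not a real quote
THROUGHPUT_TPS = 281         # from the article, measured on Cohere's API

cost = self_hosted_cost_per_million(GPU_HOURLY_USD, 2, THROUGHPUT_TPS)
breakeven = breakeven_tokens_per_month(GPU_HOURLY_USD, 2, API_USD_PER_MILLION)

print(f"Self-hosted: ${cost:.2f} per million output tokens at full utilization")
print(f"Break-even volume: {breakeven / 1e6:.0f}M tokens per month")
```

The catch is utilization. Rented GPUs bill by the hour whether or not they're serving traffic, so below the break-even volume the API is still the cheaper option.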

First-mover window on new open models tends to be about 90 days before the noise drowns the signal.

Use it.

---

Command A+ ships on Hugging Face, vLLM, and Cohere's API. Cohere's merger with Aleph Alpha closed earlier this month, beefing up regulatory and infrastructure depth for enterprise customers.