Qwen3-30B: The first AI model that is good enough for local deployment
Finally good enough
In the past few months, I’ve been asked by a number of New Zealand startups about the best local AI model. I’ve reluctantly recommended Llama 3.3 70B with 8-bit quantization, but I would always quickly follow up with the caveat that this setup is “really only a toy”. “If you want to do real work, you should at least use Claude Haiku 3.5 or even Sonnet 4.0”, I would say.
But with the arrival of Qwen3-30B-A3B-Instruct-2507, I think we finally have an AI model that is good enough for fast local interactive exploration and production use. There are two main reasons I feel this way:
Higher output quality. Based on my own benchmarks, Qwen3-30B’s performance on coding tasks is slightly behind GPT-4o, but noticeably ahead of Claude’s claude-3-5-haiku-20241022, a workhorse model that has been my go-to for most of my personal projects. Surpassing Haiku 3.5 means its core cognitive capabilities are now good enough to build real applications on.
Faster inference speed. Qwen3-30B runs at 78 tokens/second on an M4 Max with 128GB RAM (with MLX optimisation turned on). This feels a lot faster than Haiku 3.5 streamed from Claude’s API, which typically runs at 52-68 tokens/second.
Read on for a deeper dive into six reasons why you should consider Qwen3-30B-A3B for your local AI needs.
1. Benchmark results matching larger models
Despite having just 30B parameters, Qwen3-30B-A3B competes with models 4-30 times its size:
| Benchmark | Qwen3-30B-A3B | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro | Notes |
|---|---|---|---|---|---|
| ArenaHard | 91.0 | 85.3 | 87.1 | - | Complex reasoning & instruction following |
| AIME’24/25 | 80.4 | 41.4 | - | 52.7 | Advanced mathematical problem-solving |
| GPQA | 70.4 | 65.1 | 72.3 (Opus) | - | Graduate-level science questions |
| LiveBench | 69.0 | 68.2 (GPT-4) | - | 65.8 (Flash) | Real-world task performance |
| Creative Writing | 86.0 | 84.2 | 83.7 (Haiku) | - | Writing quality assessment |
These results hold up in practice: Simon Willison’s hands-on testing confirms them, noting performance “approaching GPT-4o and larger Qwen models.”
2. The speed advantage through MoE architecture
Qwen3-30B-A3B uses a Mixture of Experts (MoE) architecture. Think of it as having 128 specialist consultants on staff, but calling on only the 8 most relevant experts for each token. Because only 3.3B of the 30B parameters are active at any time, the model runs at the speed of a much smaller 3.3B-parameter system, yet it remains capable across a wide range of cognitively intensive tasks.
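To make the routing idea concrete, here is a toy sketch of top-k expert selection in plain NumPy. It is purely illustrative: the expert count and top-k match Qwen3-30B-A3B, but the gate and experts are random stand-ins, not the model’s actual implementation.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=8):
    """Toy top-k MoE routing: score all experts, but run only the top_k."""
    logits = x @ gate_w                                 # one routing score per expert
    top = np.argsort(logits)[-top_k:]                   # indices of the top_k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                            # softmax over selected experts only
    return sum(w * experts[i](x) for i, w in zip(top, weights))

# 128 tiny stand-in "experts"; only 8 run per token, so compute scales with 8, not 128
rng = np.random.default_rng(0)
d = 16
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d))) for _ in range(128)]
gate_w = rng.normal(size=(d, 128))
y = moe_forward(rng.normal(size=d), experts, gate_w)
print(y.shape)  # (16,)
```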
3. (Almost) one-click local deployment
The MLX 8-bit model is already available to download from Hugging Face. Simon Willison’s deployment guide provides additional details to get you started.
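If you prefer a script over a GUI, the mlx-lm package gives you a few-line Python path on Apple Silicon. A minimal sketch; the model ID is my assumption for the mlx-community 8-bit conversion, so check Hugging Face for the exact name:

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Model ID is an assumption; look up the exact mlx-community 8-bit conversion on Hugging Face
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-Instruct-2507-8bit")

messages = [{"role": "user", "content": "Write a haiku about local inference."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```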
4. Production deployment made affordable
For deployment, the easiest way is to run the model via LM Studio on a Mac Mini M4 Pro with 64GB RAM: 79 tokens/second for NZD $4,299. Not bad at all.
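LM Studio also exposes an OpenAI-compatible server (by default at http://localhost:1234/v1) once the model is loaded, so existing OpenAI-client code works against it unchanged. A minimal sketch; the model name below is an assumption, so copy the exact identifier LM Studio shows on your machine:

```python
from openai import OpenAI

# LM Studio ignores the API key; base_url points at its local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-30b-a3b-instruct-2507",  # assumed name; use the one LM Studio displays
    messages=[{"role": "user", "content": "Summarise MoE routing in two sentences."}],
)
print(resp.choices[0].message.content)
```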
Or, you can go with a single RTX 5090 GPU (NZD $6,799 on PBTech), which delivers about 48 tokens/second.
Note that even though the GPU approach appears slower than the M4 Pro, it opens up more scalable runtime/pipeline options, e.g. Unsloth. You also get to choose different sampling parameters for Thinking and Non-Thinking Mode (see the sketch below) or use GRPO for fine-tuning.
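For reference, here is how the thinking toggle looks with the original hybrid Qwen3-30B-A3B checkpoint via transformers (note the Instruct-2507 variant is non-thinking only). The sampling parameters follow the Qwen3 model card’s recommendations; treat this as a sketch, not a tuned setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"  # the hybrid checkpoint with the thinking switch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,  # False switches to fast, non-thinking responses
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Model-card suggestions: temperature=0.6/top_p=0.95 (thinking), 0.7/0.8 (non-thinking)
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```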
5. Tool use and hybrid thinking mode
Qwen3-30B-A3B brings excellent function calling capabilities to local deployment, a feature notably absent in Llama 3.3 70B. The Qwen-Agent framework simplifies this further (see the sketch after this list), providing built-in support for:
- MCP (Model Context Protocol) configuration for standardized tool definitions
- Native tool integration including code interpreters and API calls
- Hybrid thinking modes that switch between deep reasoning and fast response
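Here is a minimal sketch of that setup, adapted from the Qwen-Agent README pattern. The endpoint and model name assume a local OpenAI-compatible server (LM Studio, vLLM, etc.) is already serving the model, and the MCP time server is just an example tool:

```python
# pip install -U "qwen-agent[code_interpreter,mcp]"
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "qwen3-30b-a3b-instruct-2507",      # assumed local model name
    "model_server": "http://localhost:1234/v1",  # any OpenAI-compatible endpoint
    "api_key": "EMPTY",
}

tools = [
    "code_interpreter",               # built-in tool
    {"mcpServers": {                  # MCP server definitions
        "time": {"command": "uvx", "args": ["mcp-server-time"]},
    }},
]

bot = Assistant(llm=llm_cfg, function_list=tools)
messages = [{"role": "user", "content": "What is 2**32? Use the code interpreter."}]
for responses in bot.run(messages=messages):
    pass  # bot.run streams; the last value of `responses` holds the full reply
print(responses[-1]["content"])
```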
6. Apache 2.0 licensing
Qwen3-30B-A3B adopts the Apache 2.0 license, a much-welcomed change from the Llama 3.3 license. Apache 2.0 is one of the most permissive and widely accepted open-source licenses, allowing you to use, modify, and distribute the model with minimal restrictions. It’s the same license powering many open source projects, so your legal team probably already knows it. This contrasts sharply with Llama’s custom license, which imposes user count thresholds and revenue restrictions that can complicate commercial use.
But, isn’t Qwen3 from China?
Yes. But the beauty of local deployment lies in its complete neutrality. Whether a model comes from Silicon Valley, Beijing, or Paris becomes irrelevant when it runs exclusively on your hardware. Qwen3-30B-A3B offers the same data sovereignty guarantees as any locally-deployed software: your data stays on your servers, processed by your infrastructure, governed by your policies.
Models come and go
You see, the AI landscape changes weekly. New models, new capabilities, new price points. Without a robust evaluation framework, you’re flying blind — making decisions based on vendor marketing rather than measured performance that’s relevant to your specific use cases.
BTW, your evaluation framework should be the bedrock of your AI strategy, not just a tool for comparing models. It should help you answer three questions (a minimal harness sketch follows the list):
- Does this model solve our users’ actual problems?
- Can we measure improvement and progress objectively?
- How do we capture feedback to improve continuously?
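To make that less abstract, here is a minimal sketch of the smallest useful harness: a list of user problems, a naive substring grader, and a pass rate. Everything here (cases, model name, endpoint, grading) is a placeholder to replace with your own:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

CASES = [  # hypothetical examples; real cases should come from your users' actual problems
    {"prompt": "Extract the invoice total from: 'Total due: NZD 1,250.00'", "expect": "1,250.00"},
    {"prompt": "Translate 'kia ora' into English.", "expect": "hello"},
]

def run_eval(model: str) -> float:
    """Score a model on CASES with a naive substring check; swap in real graders over time."""
    passed = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        passed += case["expect"].lower() in resp.choices[0].message.content.lower()
    return passed / len(CASES)

print(f"pass rate: {run_eval('qwen3-30b-a3b-instruct-2507'):.0%}")
```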
It’s worth noting that these capabilities matter more than which model you choose today, because they determine how well you’ll adapt to whatever comes next. Models depreciate rapidly (today’s state-of-the-art becomes tomorrow’s baseline), while your evaluation framework appreciates with use. Each test case refined, each edge case captured, each performance metric validated adds to an irreplaceable asset.
So, while models come and go, your evaluation framework remains a long-term asset. If your business is serious about AI, invest in building a robust evaluation framework.