SCHEDULE A REPAIR APPOINTMENT in San Diego 858-342-6984 (TEXT or CALL)

The Whole Block At Once



Google DeepMind's new DiffusionGemma doesn't write one word at a time. It resolves a whole block of text at once — and NVIDIA tuned it to do that locally, on the same RTX hardware makers already build around. If you've watched an MSLA resin printer flash an entire layer while an FDM machine traces it bead by bead, you already understand the idea.


Figure: noise at step 0 resolves into text by the final step.

What dropped

A new open model that flips how text gets made

On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open model built for unusually fast text generation. NVIDIA published optimizations to run it across GeForce RTX GPUs, its RTX PRO workstation platform, and the DGX Spark desktop AI box — the company's pitch being that the whole thing runs on your own hardware rather than a rented cloud endpoint.

The weights are open under a permissive Apache 2.0 license, with support landing day one in Hugging Face Transformers, vLLM, and Unsloth. That "runs on a machine you own" framing is the part worth slowing down on, and it's where this otherwise pure-AI story crosses onto the workbench of anyone who fabricates locally.

The split that matters

One bead at a time, or the whole layer at once

Nearly every large language model in wide use today is autoregressive: it generates one token at a time, each token conditioned on all the tokens before it. That strict left-to-right dependency is exactly why a chatbot looks like it's typing. It's also why generation can feel like waiting: the model can't start word twelve until word eleven exists.

DiffusionGemma takes the route image-generators take. It starts from noise and refines a whole block of text in parallel, denoising up to 256 tokens per step instead of emitting one and recomputing. It thinks in blocks rather than in sequence. Makers have a perfect mental model for this difference already:

AUTOREGRESSIVE

FDM logic · sequential


An FDM toolhead lays one bead, then the next, then the next. The path is sequential and order-locked — each move depends on where the last one ended. Reliable, but you wait for the line to finish.

DIFFUSION

MSLA logic · parallel


An MSLA resin printer cures an entire layer in a single flash of the LCD, every pixel at once, regardless of how complex the layer is. A diffusion model resolves a whole block of tokens the same way: every position in parallel, refined over successive denoising steps rather than one token at a time.

// The analogy is conceptual, not mechanical — it's a way to picture parallel vs. sequential generation, not a claim that the math is identical.
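
To put rough numbers on the sequential-versus-parallel split, here is a minimal sketch that only counts network forward passes. It is not how DiffusionGemma is implemented. The 256-token block size comes from the announcement, but `steps_per_block` is a made-up value for illustration, since no denoising step count is given here.

```python
import math

BLOCK_SIZE = 256  # tokens resolved together per block, per the announcement


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, steps_per_block: int,
                     block_size: int = BLOCK_SIZE) -> int:
    """Each block of up to `block_size` tokens is refined for `steps_per_block` passes."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    steps = 16  # hypothetical step count, not a published figure
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes:", diffusion_passes(n, steps))
```

Each diffusion pass does more work than a single-token pass, so the pass count alone does not give you the speedup. It only shows where the parallelism comes from.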

Why the silicon likes it

Sequential is memory-bound. Parallel is compute-bound.

Generating one token at a time is mostly a waiting game: the GPU spends its time pulling weights through memory rather than doing math, so a lot of its compute sits idle. Diffusion inverts that. Pushing a full block of tokens through the network at once is a dense, compute-heavy workload — the kind of math GPUs are built to chew through. NVIDIA's argument is simply that a block-parallel model maps onto Tensor Cores far more cleanly than a one-token-at-a-time model does.
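
A back-of-the-envelope way to see the memory-bound versus compute-bound split is to count how much math the GPU does for each byte of weights it has to fetch. The sketch below is a simplification under stated assumptions, not NVIDIA's analysis: roughly 2 FLOPs per active parameter per token, 2-byte weights, and every active weight read once per pass. It ignores attention, caches, activations, and the fact that a mixture-of-experts model may touch different experts for different tokens.

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Rough FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (one multiply, one add) per active parameter per token,
    and that each active weight is read from memory once per pass. The
    parameter count cancels out, so it does not appear here.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


if __name__ == "__main__":
    print("one token per pass:", flops_per_weight_byte(1))
    print("256-token block per pass:", flops_per_weight_byte(256))
```

Under these assumptions a one-token pass does about one FLOP per byte fetched, while a 256-token block does about 256. Modern GPUs can do far more math than that per byte of memory traffic, which is why the single-token case leaves compute idle.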

That shows up in the figures NVIDIA published. Treat these as vendor-reported, single-user numbers until independent benchmarks land:

~1,000 tokens/sec on a single NVIDIA H100 data-center GPU.

~150 tokens/sec on the deskside DGX Spark (GB10, 128GB unified memory).

Up to ~2,000 tokens/sec on the larger DGX Station, roughly 4× an equivalent autoregressive model.

// Source: NVIDIA's own announcement. We report the claim and the 4× single-user figure; we haven't independently verified it.
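
To translate those vendor-reported figures into something you can feel, the sketch below works out how long a response of a given length would take at each reported rate, and what autoregressive baseline the "roughly 4×" claim implies on the DGX Station. The throughput values are the ones quoted above. The 3,000-token response length is an arbitrary example, and the implied baseline is our arithmetic, not a number NVIDIA published here.

```python
# Vendor-reported, single-user throughput figures quoted in this post.
REPORTED_TOKENS_PER_SEC = {
    "H100": 1000,
    "DGX Spark": 150,
    "DGX Station": 2000,
}


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to emit `num_tokens` at a steady rate."""
    return num_tokens / tokens_per_sec


def implied_baseline(tokens_per_sec: float, speedup: float) -> float:
    """Throughput the comparison model would need for the stated speedup to hold."""
    return tokens_per_sec / speedup


if __name__ == "__main__":
    response_tokens = 3000  # arbitrary example length
    for name, rate in REPORTED_TOKENS_PER_SEC.items():
        print(name, seconds_to_generate(response_tokens, rate), "seconds")
    print("implied autoregressive rate on DGX Station:",
          implied_baseline(REPORTED_TOKENS_PER_SEC["DGX Station"], 4.0))
```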

Base: Gemma 4, 26B MoE · Active: 3.8B params / step · Block: up to 256 tokens / step · License: Apache 2.0 · Runs on: RTX 5090, DGX Spark · Soon: llama.cpp on GeForce

Why a print shop is writing about this

The interesting word is "local"

A text model isn't a 3D printing tool, and we're not going to pretend otherwise. What makes DiffusionGemma relevant to this corner of the world is the hardware story under it: the push to run capable AI on a single GPU, on your desk, with no cloud and no per-token bill.

The GPU that runs a local language model is the same GPU you'd put in a rendering rig, a slicer-automation box, or an AI-assisted modeling workstation. One machine, many maker jobs.

That convergence is already visible. AI is creeping into the design half of fabrication — text-to-STL generators and shape-search tools that turn prompts and photos into printable geometry, physics-aware AI that optimizes a bracket before it's ever sliced, and vision systems that watch a print and correct it mid-job. The thread connecting all of it is local inference: the faster a model runs on hardware you own, the tighter your design-and-iterate loop gets, and the less of your workflow depends on someone else's server staying up.

The honest part

"No per-token cost" is real, but local AI isn't free — it's prepaid. You trade a cloud bill for a hardware purchase plus a power bill, and in San Diego that second number is not small: residential electricity runs around $0.35 per kWh, among the highest in the country. A 5090 under sustained inference load pulls real watts. The upside of a compute-bound model that finishes roughly 4× faster (per NVIDIA) is that it draws that power for less wall-clock time. Whether local beats cloud comes down to how much you actually run — which is exactly the kind of math we'll work through with you honestly rather than selling you a tower you don't need.

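The power side of that trade can be sketched in a few lines. This is a rough estimate under loud assumptions: the wattage and tokens-per-second values below are placeholders rather than measurements of any card, and the model assumes power draw stays flat while throughput changes, which a compute-bound workload may not honor. Only the $0.35/kWh rate comes from the text above.

```python
def energy_cost_per_million_tokens(watts: float, tokens_per_sec: float,
                                   usd_per_kwh: float = 0.35) -> float:
    """Electricity cost in USD to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_sec
    kwh = watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours
    return kwh * usd_per_kwh


if __name__ == "__main__":
    watts = 500.0      # hypothetical sustained draw, not a measured figure
    slow_rate = 100.0  # hypothetical tokens/sec
    fast_rate = 400.0  # the same box if it really ran 4x faster
    print("slow:", round(energy_cost_per_million_tokens(watts, slow_rate), 4))
    print("fast:", round(energy_cost_per_million_tokens(watts, fast_rate), 4))
```

At a fixed wattage, a model that finishes 4× faster costs a quarter as much electricity per token. Swap in your own measured draw and throughput before drawing conclusions.
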
If you want to poke at it

Where to actually try it

NVIDIA says the quickest path is Hugging Face Transformers, which runs DiffusionGemma on a GeForce RTX 5090 or a DGX Spark out of the box. For heavier throughput, vLLM provides day-one serving; for adapting the model to a narrow task, fine-tuning is available through Unsloth and NVIDIA's NeMo framework. GeForce RTX support via llama.cpp is listed as coming soon. You can also test it free through NVIDIA-hosted APIs before committing any local hardware to it.
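
If you go the vLLM route, a local server normally exposes an OpenAI-compatible HTTP API, so a first test can be a plain POST with the standard library. The sketch below assumes a vLLM server is already running on its default port and that DiffusionGemma is served through the usual `/v1/completions` endpoint, which we have not confirmed for this model. `MODEL_ID` is a placeholder, not the real model name.

```python
import json
import urllib.request

# Placeholder: replace with the model ID your server actually loaded.
MODEL_ID = "your-org/diffusiongemma-placeholder"


def build_completion_request(prompt: str, max_tokens: int = 256,
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL_ID, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text. Needs a running server."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Describe a bracket for a 20 mm rail.")` only works once the server is up. `build_completion_request` can be inspected offline to check the payload.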

None of that requires a $40,000 deskside supercomputer. The headline figures come from data-center and deskside systems, but the entry point is a consumer card a lot of makers already own, which is rather the point.

Questions we expect

Straight answers

Is DiffusionGemma a 3D modeling tool?

No. It's a text-generation model — it writes language, not geometry. It matters here because it's part of the broader move toward local AI running on a single GPU, and that same hardware is what powers the design copilots, slicer automations, and modeling assistants that do touch 3D work.

Do I need a DGX Spark or DGX Station to run it?

No. The eye-catching speed numbers come from data-center and deskside hardware, but NVIDIA states the model runs on a GeForce RTX 5090 out of the box through Hugging Face Transformers. Consumer cards are a supported entry point, with wider GeForce support via llama.cpp described as on the way.

Is running AI locally actually cheaper than the cloud?

It depends on volume. Local inference removes per-token cloud fees but adds an upfront hardware cost and an ongoing power cost — and San Diego's ~$0.35/kWh electricity makes that power line meaningful. Light, occasional use often favors the cloud; heavy, sustained, privacy-sensitive use often favors local. We're happy to run the breakeven with you instead of guessing.
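
For a sense of what "running the breakeven" involves, here is a minimal sketch. Every input in the example is hypothetical: the hardware price, the cloud price per million tokens, and the local electricity cost per million tokens are placeholders you would replace with real quotes and measurements. It also ignores resale value, maintenance, and your time.

```python
from typing import Optional


def breakeven_million_tokens(hardware_usd: float,
                             cloud_usd_per_million: float,
                             local_energy_usd_per_million: float) -> Optional[float]:
    """Millions of tokens at which local hardware pays for itself.

    Returns None when local electricity alone costs as much as the cloud,
    in which case the hardware never pays back on these inputs.
    """
    saving = cloud_usd_per_million - local_energy_usd_per_million
    if saving <= 0:
        return None
    return hardware_usd / saving


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(breakeven_million_tokens(3000.0, 2.00, 0.50))
```

With these placeholder inputs the machine pays for itself after 2,000 million tokens. Light users may never get there, which is why volume is the deciding variable.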

Can Dreaming3D build me a local-AI workstation?

Yes — custom PC builds are one of our core services, including AI inference and rendering machines alongside gaming rigs and creator workstations. We spec it, source parts at honest prices, assemble, cable-manage, benchmark, and hand you a tuned system. Same shop that repairs your printer, in Carmel Valley.

What is the resin-vs-FDM analogy really saying?

That generation can be sequential or parallel. FDM builds a line one bead at a time; an autoregressive model builds text one token at a time. MSLA cures a whole layer in one flash; a diffusion model resolves a whole block in parallel, refining every token together on each denoising step. It's a way to picture why the parallel approach can be faster, not a claim that printers and language models share any mechanism.

Build the machine. Print the part. Same bench.

Dreaming3D in Carmel Valley does FDM & resin printing, 3D scanning with the Revopoint MetroY, mobile printer repair across San Diego County, design tutoring, and custom PC builds — including the kind of local-AI workstation this whole story is about. Whichever way the chips fly, we're your San Diego fabrication partner.


📞 858-342-6984  ·  ✉ dreaming3dprinting@gmail.com  ·  📷 @dreaming3dprinting

