Categories
Artificial Intelligence Hardware Programming What I’m Up To

One last endorsement for the ZGX Nano AI workstation

Today’s my last day in my role as the developer advocate for HP’s GB10-powered AI workstation, the ZGX Nano. As I’ve written before, I’m grateful to have had the opportunity to talk about this amazing little machine.

Of course, you could expect me to talk about how good the ZGX Nano is; after all, I’m paid to do so — at least until 5 p.m. Eastern today. But what if a notable AI expert also sang its praises?

That notable expert is Sebastian Raschka (pictured above), author of a book I’m working my way through right now: Build a Large Language Model (from Scratch), and it’s quite good. He’s also working on a follow-up book, Build a Reasoning Model (from Scratch).

Sebastian has been experimenting with NVIDIA’s DGX Spark, which has the same specs as the ZGX Nano (as do a few other similar small desktop computers built around NVIDIA’s GB10 “superchip”), and he’s published his observations on his blog in a post titled DGX Spark and Mac Mini for Local PyTorch Development. He ran some AI benchmark programs comparing the DGX Spark with his Mac Mini M4 computer (a fine developer platform, by the bye) and the NVIDIA H100 GPU (or NVIDIA’s A100 GPU when an H100 wasn’t available), pictured below:

Keep in mind that the version of the H100 that comes with 80GB of VRAM sells for about $30,000, which is why most people don’t buy one, but instead rent time on it from server farms, typically at about $2/hour.
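
To put those two numbers side by side, here’s a quick back-of-the-envelope calculation of how many rental hours add up to the purchase price. It uses only the approximate figures above (about $30,000 to buy, about $2/hour to rent) and ignores power, hosting, and resale value, so treat it as a rough sketch rather than a real cost analysis:

```python
# Rough break-even between buying and renting an H100,
# using only the approximate figures quoted above.
PURCHASE_PRICE_USD = 30_000    # approximate price of an 80GB H100
RENTAL_RATE_USD_PER_HOUR = 2   # typical cloud rental rate

break_even_hours = PURCHASE_PRICE_USD / RENTAL_RATE_USD_PER_HOUR
break_even_years = break_even_hours / (24 * 365)  # if rented around the clock

print(f"Break-even: {break_even_hours:,.0f} hours")
print(f"That's about {break_even_years:.1f} years of nonstop use")
```

At those prices, you’d need to rent for 15,000 hours before buying came out ahead, which is why renting is the usual choice.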

Let me begin from the end of Raschka’s article, where he writes his conclusions:

Overall, the DGX Spark seems to be a neat little workstation that can sit quietly next to a Mac Mini. It has a similarly small form factor, but with more GPU memory and of course (and importantly!) CUDA support.

I previously had a Lambda workstation with 4 GTX 1080Ti GPUs in 2018. I needed the machine for my research, but the noise and heat in my office was intolerable, which is why I had to eventually move the machine to a dedicated server room at UW-Madison. After that, I didn’t consider buying another GPU workstation but solely relied on cloud GPUs. (I would perhaps only consider it again if I moved into a house with a big basement and a walled-off spare room.) The DGX Spark, in contrast, is definitely quiet enough for office use. Even under full load it’s barely audible.

It also ships with software that makes remote use seamless and you can connect directly from a Mac without extra peripherals or SSH tunneling. That’s a huge plus for quick experiments throughout the day.

But, of course, it’s not a replacement for A100 or H100 GPUs when it comes to large-scale training.
I see it more as a development and prototyping system, which lets me offload experiments without overheating my Mac. I consider it as an in-between machine that I can use for smaller runs, and testing models in CUDA, before running them on cloud GPUs.

In short: If you don’t expect miracles or full A100/H100-level performance, the DGX Spark is a nice machine for local inference and small-scale fine-tuning at home.

You might as well replace “DGX Spark” in his article with “ZGX Nano” — the hardware specs are the same. The ZGX Nano shines with HP’s exclusive ZGX Toolkit, a Visual Studio Code extension that lets you configure, manage, and deploy to the ZGX Nano. This lets you use your favorite development machine and coding environment to write code, and then use the ZGX Nano as a companion device / on-premises server.

The article features graphs showing his benchmarking results…

In his first set of benchmarks, he took a home-built 600 million parameter LLM — the kind that you learn how to build in his book, Build a Large Language Model (from Scratch) — and ran it on his Mac Mini M4, the ZGX Nano’s twin cousin, and an H100 from a cloud provider. From his observations, you can conclude that:

  • With smaller models, the ZGX Nano can match a Mac Mini M4. Both can crunch about 45 tokens per second with 20 billion parameter models.
  • The ZGX Nano has the advantage of coming with 128GB of unified memory, meaning that it can handle larger models than the memory-limited Mac Mini can.

Raschka’s second set of benchmarks tested how the Mac Mini, the ZGX Nano’s twin cousin, and the H100 handle two variants of a model that have been presented with MATH-500, a collection of 500 mathematical word problems:

  • The base variant, which was a standard LLM that gives short, direct answers
  • The reasoning variant, which was a version of the base model that was modified to “think out loud” through problems step-by-step

He ran two versions of this benchmark. The first was the sequential test, where the model was presented with one MATH-500 question at a time. From the results, you can expect the ZGX Nano to perform almost as well as the H100, but at a small fraction of the cost! It also runs circles around the Mac Mini.

In the second version of the benchmark, the batch test, the model was served 128 questions at the same time, to simulate serving multiple users at once and to test memory bandwidth and parallel processing.

This is a situation where the H100 would vastly outperform the ZGX Nano thanks to the H100’s much better memory bandwidth. However, the ZGX Nano isn’t for doing inference at production scale; it’s for developers to try out their ideas on a system that’s powerful enough to get a better sense of how they’d operate in the real world, and do so affordably.

Finally, with the third benchmark, Raschka trained and fine-tuned a model. Note that this time, the data center GPU was the A100 instead of the H100 due to availability.

This benchmark tests training and fine-tuning performance. It compares how fast you can modify and improve an AI model on the Mac Mini M4 vs. the ZGX Nano’s twin vs. an A100 GPU. He presents three scenarios in training and fine-tuning a 355 million parameter model:

  1. Pre-training (3a in the graphs above): Training a model from scratch on raw text
  2. SFT, or Supervised fine-tuning (3b): Teaching an existing model to follow instructions
  3. DPO (direct preference optimization), or preference tuning (3c): Teaching the model which responses are “better” using preference data
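
To make the third scenario a little more concrete, here’s a minimal sketch of the DPO loss for a single preference pair, written in plain Python rather than PyTorch. It follows the standard DPO formulation and is not necessarily the code Raschka benchmarked; the function name, the log-probability values, and the beta value are all made up for illustration:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the model being trained (policy) or the frozen
    reference model.
    """
    # How much more the policy favors each response than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin: the policy has shifted toward the chosen response
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities, for illustration only
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to reference
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy favors "chosen" more
print(loss_at_start, loss_improved)
```

When the policy is identical to the reference model, the loss is log(2). It drops as the policy assigns relatively more probability to the preferred response, which is the sense in which the model learns which responses are “better.”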

All these benchmarks say what I’ve been saying: the ZGX Nano lets you do real model training locally and economically. You get a lot of bang for your ZGX Nano buck.

As with a lot of development workflows, where there’s a development database and a production database, you don’t need production scale for every experiment. The ZGX Nano gives you a working local training environment that isn’t glacially slow or massively expensive.

Want to know more? Go straight to the source and check out Raschka’s article, DGX Spark and Mac Mini for Local PyTorch Development.

And with this article, I end my stint as the “spokesmodel” for the ZGX Nano. It’s not the end of my work in AI; just the end of this particular phase.

Keep watching this blog, as well as the Global Nerdy YouTube channel, for more!

Categories
Hardware Video

New video on the “Global Nerdy” YouTube channel: “How computers work ‘under the hood’”

Do you know how your computer works? If not, this video’s for you!

Here’s the video, which is the latest one on the Global Nerdy YouTube channel:

The video features the How Computers Work “Under the Hood” presentation that I gave at a Tampa Devs meetup on November 15, 2023.

In the presentation, I start by talking about the CPU chips in our computers, phones, and electronic devices:

…and then proceed to talk about the building blocks for these chips, transistors:

Then, after a quick introduction to the 6502 processor, which powered a lot of 1980s home computers…

…I introduced 6502 assembly language programming:

Watch the video, and learn how your computer works “under the hood!”

If you’d like to follow along with the video and try out the exercises I demonstrated, you can do so from the comfort of your own browser — just follow this guide!

Want the slides for my presentation? Here they are!

Categories
Artificial Intelligence Hardware What I’m Up To

Talking about HP’s ZGX Nano on the “Intelligent Machines” podcast

On Wednesday, HP’s Andrew Hawthorn (Product Manager and Planner for HP’s Z AI hardware) and I appeared on the Intelligent Machines podcast to talk about the computer that I’m doing developer relations consulting for: HP’s ZGX Nano.

You can watch the episode here. We appear at the start, and we’re on for the first 35 minutes:

A few details about the ZGX Nano:

  • It’s built around the NVIDIA GB10 Grace Blackwell “superchip,” which combines a 20-core Grace CPU and a GPU based on NVIDIA’s Blackwell architecture.

  • Also built into the GB10 chip is a lot of RAM: 128 GB of LPDDR5X coherent memory shared between CPU and GPU, which helps avoid the kind of memory bottlenecks that arise when the CPU and GPU each have their own memory (and usually, the GPU has considerably less memory than the CPU).
NVIDIA GB10 SoC (system on a chip).
  • It can perform up to about 1000 TOPS (trillions of operations per second), or 10¹⁵ operations per second, and can handle model sizes of up to 200 billion parameters.

  • Want to work on bigger models? By connecting two ZGX Nanos together using the 200 gigabit per second ConnectX-7 interface, you can scale up to work on models with 400 billion parameters.

  • ZGX Nano’s operating system is NVIDIA’s DGX OS, which is a version of Ubuntu Linux with additional tweaking to take advantage of the underlying GB10 hardware.

Some topics we discussed:

  • Model sizes and AI workloads are getting bigger, and developers are getting more and more constrained by factors such as:
    • Increasing or unpredictable cloud costs
    • Latency
    • Data movement
  • There’s an opportunity to “bring serious AI compute to the desk” so that teams can prototype their AI applications  and iterate locally
  • The ZGX Nano isn’t meant to replace large datacenter clusters for full training of massive models. It’s aimed at “the earlier parts of the pipeline,” where developers do prototyping, fine-tuning, smaller deployments, inference, and model evaluation
  • The Nano’s 128 gigabytes of unified memory gets around the bottlenecks that come with separate CPU memory and GPU memory, allowing bigger models to be loaded in a local box without “paging to cloud” or being forced into distributed setups early
  • While the cloud remains dominant, there are real benefits to local compute:
    • Shorter iteration loops
    • Immediate control, data-privacy
    • Less dependence on remote queueing
  • We expect that many AI development workflows will hybridize: a mix of local box and cloud/back-end
  • The target users include:
    • AI/ML researchers
    • Developers building generative AI tools
    • Internal data-science teams fine-tuning models for enterprise use-cases (e.g., inside a retail, insurance or e-commerce firm).
    • Maker/developer-communities
  • The ZGX Nano is part of the “local-to-cloud” continuum
  • The Nano won’t cover all AI development…
    • For training truly massive models, beyond the low hundreds of billions of parameters, the datacenter/cloud will still dominate
    • ZGX Nano’s use case is “serious but not massive” local workloads
    • Is it for you? Look at model size, number of iterations per week, data sensitivity, latency needs, and cloud cost profile

One thing I brought up that seemed to capture the imagination of hosts Leo Laporte, Paris Martineau, and Mike Elgan was the MCP server that I demonstrated a couple of months ago at the Tampa Bay Artificial Intelligence Meetup: Too Many Cats.

Too Many Cats is an MCP server that an LLM can call upon to determine if a household has too many cats, given the number of humans and cats.

Here’s the code for a Too Many Cats MCP server that runs on your computer and works with a local Claude client:

from typing import TypedDict
from mcp.server.fastmcp import FastMCP

mcp = FastMCP(name="Too Many Cats?")

class CatAnalysis(TypedDict):
    too_many_cats: bool
    human_cat_ratio: float  

@mcp.tool(
    annotations={
        "title": "Find Out If You Have Too Many Cats",
        "readOnlyHint": True,
        "openWorldHint": False
    }
)
def determine_if_too_many_cats(cat_count: int, human_count: int) -> CatAnalysis:
    """Determines if you have too many cats based on the number of cats and a human-cat ratio."""
    human_cat_ratio = cat_count / human_count if human_count > 0 else 0
    too_many_cats = human_cat_ratio >= 3.0
    return CatAnalysis(
        too_many_cats=too_many_cats,
        human_cat_ratio=human_cat_ratio
    )

if __name__ == "__main__":
    # Initialize and run the server
    mcp.run(transport='stdio')
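
If you’d like to try the decision logic without installing the MCP SDK or connecting a client, here’s a minimal standalone sketch of the same calculation. The function name cats_per_human_check is hypothetical and not part of the server; it simply mirrors the arithmetic in the tool above, including its treatment of a household with zero humans as a ratio of 0:

```python
def cats_per_human_check(cat_count: int, human_count: int) -> dict:
    """Standalone copy of the tool's logic (hypothetical helper, no MCP needed)."""
    # Same rule as the server: cats divided by humans, or 0 if there are no humans
    ratio = cat_count / human_count if human_count > 0 else 0
    return {"too_many_cats": ratio >= 3.0, "human_cat_ratio": ratio}

print(cats_per_human_check(cat_count=2, human_count=2))
print(cats_per_human_check(cat_count=7, human_count=2))
print(cats_per_human_check(cat_count=5, human_count=0))
```

Note that the last call reports “not too many cats” for five cats and no humans, since the zero-human case falls back to a ratio of 0. Whether that’s the right answer is left as an exercise for the cats.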

I’ll cover writing MCP servers in more detail on the Global Nerdy YouTube channel — watch this space!

Categories
Artificial Intelligence Hardware What I’m Up To

Specs for NVIDIA’s GB10 chip, which powers HP’s ZGX Nano G1n AI workstation

I’m currently working with Kforce as a developer relations consultant for HP’s new tiny desktop AI powerhouse, the ZGX Nano (also known as the ZGX Nano G1n). If you’ve wondered about the chip powering this machine, this article’s for you!

The chip powering the ZGX Nano is NVIDIA’s GB10, a combination CPU and GPU where “GB” stands for “Grace Blackwell.” The chip’s two names stand for each of its parts…

Grace: The CPU

The part named “Grace” is an ARM CPU with 20 cores, arranged in ARM’s big.LITTLE (DynamIQ) architecture, which is a mix of different kinds of cores for a balance of performance and efficiency:

    • 10 Cortex-X925 cores. These are the “performance” cores, which are also sometimes called the “big cores.” They’re designed for maximum single-thread speed, higher clock frequencies, and aggressive out-of-order execution. Their job is to handle bursty, compute-intensive workloads such as gaming and rendering, and on the ZGX Nano, they’ll be used for AI inference.
    • 10 Cortex-A725 cores. These are the “efficiency” cores, which are sometimes called the “little cores.” They’re designed for sustained performance per watt, running at lower power and lower clock frequencies. Their job is to handle background tasks, low-intensity threads, or workloads where power efficiency and temperature control matter more than peak speed.

Blackwell: The GPU

The part named “Blackwell” is NVIDIA’s GPU, which has the following components:

    • 6144 neural shading units, which are SIMD (single-instruction, multiple-data) processors that act as “generalists,” switching between standard graphics math and AI-style operations. They’re useful for AI models where the workloads aren’t uniform, or for irregular matrix operations that don’t map neatly into 16-by-16 blocks.
    • 384 tensor cores, which are specialized matrix-multiply-accumulate (MMA) units. They perform the most common operation in deep learning, C = A × B + C, across thousands of small matrix tiles in parallel. They do so using mixed-precision arithmetic, where there are different precisions for inputs, products, and accumulations.
    • 384 texture mapping units (TMUs). These can quickly sample data from memory and do quick processing on that data. In graphics, these capabilities are used to resize, rotate, and transform bitmap images, and then paint them onto 3D objects. When used for AI, these capabilities are used to perform bilinear interpolation (used by convolutional neural network layers and transformers) and sample AI data.
    • 48 render output units (ROPs). In a GPU, the ROPs are the final stage in the graphics pipeline — they convert computed fragments into final pixels stored in VRAM. When used for AI, ROPs provide a way to quickly write the processing results to memory and perform fast calculations of weighted sums (which is an operation that happens with all sorts of machine learning).
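
To illustrate the multiply-accumulate operation that the tensor cores perform, here’s a minimal pure-Python sketch of a single tile computation with mixed precision. It’s a teaching aid under stated assumptions, not a model of the actual hardware: the function names and the 2-by-2 tile size are made up, inputs are rounded to half precision using the struct module’s "e" format, and accumulation happens at Python’s ordinary float precision rather than whatever precision the GPU uses:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def mma_tile(a, b, c):
    """Compute D = A x B + C for small square tiles (lists of lists).

    Inputs A and B are rounded to half precision, while the products are
    accumulated at full Python float precision: low-precision inputs,
    higher-precision accumulation.
    """
    n = len(a)
    d = [row[:] for row in c]  # start from a copy of the accumulator C
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += to_fp16(a[i][k]) * to_fp16(b[k][j])
    return d

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 0.0], [0.0, 0.5]]
c = [[10.0, 10.0], [10.0, 10.0]]
result = mma_tile(a, b, c)
print(result)  # [[10.5, 11.0], [11.5, 12.0]]
```

A tensor core does this kind of tile operation in hardware, and the GPU runs many such tiles in parallel to build up a full matrix multiplication.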

128 GB of unified RAM

There’s 128GB of LPDDR5X-9400 RAM built into the chip, a mobile-class DRAM type designed for high bandwidth and energy efficiency:

  • The “9400” in the name refers to its data rate of 9.4 Gb/s per pin, which determines its memory bandwidth (the speed at which the CPU/GPU can move data between memory and on-chip compute units). Across a 256-bit bus, this provides about 300 GB/s of peak bandwidth.

  • LPDDR5X is more power-efficient than HBM but slower; it’s ideal for compact AI systems or edge devices (like the ZGX Nano!) rather than full datacenter GPUs.
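
The peak bandwidth figure follows directly from the per-pin rate and the bus width. Here’s the arithmetic as a short sketch, using only the two numbers quoted above (the constant names are mine):

```python
# Peak bandwidth = per-pin data rate x bus width, converted from bits to bytes
DATA_RATE_GBIT_PER_PIN = 9.4   # Gb/s per pin (the "9400" in LPDDR5X-9400)
BUS_WIDTH_BITS = 256           # bus width quoted above
BITS_PER_BYTE = 8

peak_bandwidth_gbytes = DATA_RATE_GBIT_PER_PIN * BUS_WIDTH_BITS / BITS_PER_BYTE
print(f"Peak bandwidth: {peak_bandwidth_gbytes:.1f} GB/s")
```

That works out to roughly 300 GB/s. This is a theoretical peak, so real-world throughput will be lower.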

As unified memory, the RAM is shared by both the Grace (CPU) and Blackwell (GPU) portions of the chip. That’s enough memory for:

  • Running large-language-model inference up to 200 billion parameters with 4-bit weights

  • Medium-scale training or fine-tuning tasks

  • Data-intensive edge analytics, vision, or robotics AI
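
The “200 billion parameters with 4-bit weights” figure is easy to sanity-check. The sketch below estimates the memory needed for the weights alone at a few precisions; it deliberately ignores everything else that needs memory at run time (the KV cache, activations, the operating system), so treat the results as lower bounds. The function name is made up for illustration:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

UNIFIED_MEMORY_GB = 128

for bits in (16, 8, 4):
    needed = weight_memory_gb(200e9, bits)
    fits = "fits" if needed <= UNIFIED_MEMORY_GB else "does not fit"
    print(f"200B parameters at {bits}-bit: {needed:.0f} GB ({fits})")
```

At 16-bit and 8-bit precision, a 200-billion-parameter model needs 400 GB and 200 GB for its weights, which is too much for 128 GB. At 4 bits per weight it needs about 100 GB, which is why the 200-billion figure is quoted with 4-bit weights.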

Because the memory is unified, the CPU and GPU share a single physical pool of RAM, which eliminates explicit data copies.

The RAM is linked to the CPU and GPU sections using NVIDIA’s C2C (chip-to-chip) NVLink, its low-power interconnect that lets CPU/GPU memory traffic move at up to 600 GB/s aggregate. That’s faster than PCIe 5! This improves latency and bandwidth for workloads that constantly exchange data between CPU preprocessing and GPU inference/training kernels.

Double the power with ConnectX

If the power of a single ZGX Nano isn’t enough, there’s NVIDIA’s ConnectX technology, which is based on a NIC that provides a pair of 200 GbE ports, enabling you to scale out a workload across two GB10-based units. This doubles the processing power, allowing you to run models with up to 400 billion parameters!

The GB10-powered ZGX Nano is a pretty impressive beast, and I look forward to getting my hands on it!

 

Categories
Artificial Intelligence Hardware

HP’s ZGX Nano G1n AI workstation: A sneak peek!

I’ll be talking about HP’s upcoming ZGX Nano G1n AI workstation soon, but in the meantime, here’s HP’s Brian Allen providing a sneak preview of the ZGX Nano at last week’s HP event in New York.

Categories
Hardware What I’m Up To

The room where it happens

Joey de Villa’s home office. It has a shiny hardwood floor, two desks in an L-shaped configuration, monitors, keyboards, synthesizers, and other gear. A large octopus art piece looms over the back wall.

For the curious, here’s a recent pic of my home office, a.k.a. “The Fortress of Amplitude.” The gear configuration changes every now and then, but it generally looks like this. It’s where the magic happens!

Categories
Artificial Intelligence Hardware What I’m Up To

Quick announcement: I’m doing developer relations for HP’s new ZGX Nano AI computer!

Just so you know: today’s my first day at Kforce doing developer relations for HP! More specifically, for HP’s ZGX Nano, a tiny computer designed specifically for running large AI models right on your desktop…and not on someone else’s computers!

The ZGX Nano packs a ridiculous amount of power into a tiny space…

Powered by NVIDIA’s GB10 GPU and a 20-core ARM CPU sharing 128GB of RAM, the ZGX Nano performs at 1,000 teraflops (1 petaflop), which is 10¹⁵ floating-point operations per second. It’ll support an AI model with up to 200 billion parameters, or 400 billion if you connect two ZGX Nanos together.

I’m getting set up for day one on the job as I write this, so I’m keeping this post short and ending with this gem from a little while back: HP’s Rules of the Garage: