
GPT OSS 120B

OpenAI · Multilingual · Thinking · Tool Calls · MXFP4 · nvidia-h100-80gb-350gb

Prerequisites

Ensure your GPU nodes have the NVIDIA Container Toolkit installed and configured:

ansible-playbook prositronic.infra.nvidia_container_toolkit

Deploy Command

helmfile --state-values-file <(curl -s https://prositronic.org/values/gpt-oss-120b/mxfp4/nvidia-h100-80gb-350gb.yaml) apply
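Once the release is up, the llama.cpp server behind this chart speaks the OpenAI-compatible chat completion API. A minimal client sketch using only the standard library; the base URL assumes you have port-forwarded the model Service to localhost:8080 (the service name, port, and forwarding step are assumptions, not taken from the chart):

```python
import json
import urllib.request

# Assumed local endpoint, e.g. after `kubectl port-forward` to the
# model Service; adjust to your cluster's actual address.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": "gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def chat(prompt: str) -> str:
    """POST a chat request and return the first choice's text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Show the payload shape without requiring a live server:
print(json.dumps(build_chat_request("Say hello in three languages."), indent=2))
```

Call `chat(...)` only against a running deployment; the payload printout above works offline.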

Generated values.yaml

/values/gpt-oss-120b/mxfp4/nvidia-h100-80gb-350gb.yaml

# Prositronic Model Card
# https://prositronic.org/deploy/gpt-oss-120b/mxfp4/nvidia-h100-80gb-350gb
#
# Model: GPT OSS 120B (MXFP4)
# Hardware: nvidia-h100-80gb-350gb

image:
  backend: cuda13
modelDownloads:
  # Split GGUF: all three shards are required, and the original shard
  # filenames must be preserved so llama.cpp can discover shards 2-3
  # from the path of shard 1.
  - name: gpt-oss-120b-00001
    url: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/resolve/main/gpt-oss-120b-mxfp4-00001-of-00003.gguf
    filename: gpt-oss-120b-mxfp4-00001-of-00003.gguf
  - name: gpt-oss-120b-00002
    url: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/resolve/main/gpt-oss-120b-mxfp4-00002-of-00003.gguf
    filename: gpt-oss-120b-mxfp4-00002-of-00003.gguf
  - name: gpt-oss-120b-00003
    url: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/resolve/main/gpt-oss-120b-mxfp4-00003-of-00003.gguf
    filename: gpt-oss-120b-mxfp4-00003-of-00003.gguf
models:
  gpt-oss-120b:
    m: /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf
    ngl: 36
    ctx-size: 131072
    flash-attn: true
    load-on-startup: true
resources:
  requests:
    nvidia.com/gpu: 1
    memory: 80Gi
  limits:
    nvidia.com/gpu: 1
    memory: 350Gi
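For reference, the keys under `models:` presumably map onto llama.cpp server flags. A hedged sketch of the equivalent standalone invocation, where `MODEL_GGUF` stands in for the path configured under `m:` (flag spellings follow upstream llama.cpp and may vary between versions):

```shell
# Sketch of the llama-server flags implied by the values above.
# Set MODEL_GGUF to the path from the card's `m:` key first.
llama-server \
  -m "$MODEL_GGUF" \
  -ngl 36 \
  --ctx-size 131072 \
  --flash-attn
```

Here `-ngl 36` offloads all layers to the H100 and `--ctx-size 131072` requests the full 128k context window, matching the card.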