Now accepting engagements for Q2 2026

Full control over how your AI behaves.

Fine-tuned LLMs on your data, your rules, your hardware. Delivered within 72 hours.

Pilot-Fit Guarantee
Your fine-tune outperforms your baseline on an eval set you supply — or the setup fee is refunded.
NVIDIA Grace Blackwell compute tray — the hardware that runs Pylox Forge.
Why fine-tune

Rented intelligence, or owned behavior.

Cloud LLMs are rented. Every call depends on a vendor's pricing, policy, and release schedule you don't control. A fine-tuned model is yours — you dictate the behavior.

What it knows

Trained on your documents, contracts, ticket history, and internal policies.

How it speaks

Your terminology, your format, your refusal phrasing, your brand voice.

Where it runs

On your hardware, our hardware, your cloud — anywhere you choose.

Who sees the data

Zero third-party calls, zero training-data exfiltration, zero vendor logging.

What it won't do

Your content policy, your compliance rules, your brand limits — not ours.

What it costs

One-time fine-tune plus marginal inference cost. Not per-token forever.

Built on infrastructure you trust

Open-weight foundation models, NVIDIA silicon, standard deployment targets.

NVIDIA
Meta Llama
Qwen
Hugging Face
RunPod
Lambda Labs
Process

From contract to production in 72 hours.

Every engagement follows the same pipeline. Every step is automated, every output auditable, every measurement reproducible. Nothing proprietary — you could run this pipeline yourself if you wanted. We just do it faster.

01

Data intake

JSONL, CSV, PDF, chat transcripts, Slack / Zendesk / Intercom exports. We handle the ingest.

02

Schema + PII redaction

Normalized to chat schema, PII redacted automatically before anything touches training.

03

Quality filter + dedup

MinHash deduplication, quality threshold enforcement, bad-row rejection with audit log.

04

Optional domain enrichment

Local on-prem models expand your corpus. Zero data to third parties. Opt in per engagement.

05

QLoRA training

On Grace Blackwell silicon, packed sequences, state-of-the-art training recipe.

06

DPO safety alignment

Refusal behavior and brand voice baked into the LoRA weights — not just a filter on top.

07

Automated benchmark

Academic, domain, performance, cost, and safety — all 6 sections run and reported.

08

Red-team verification

50-prompt attack suite. 70%+ block rate required before any adapter ships.

09

NVFP4 + EAGLE-3 deploy

State-of-the-art inference stack — every acceleration NVIDIA ships, running together.

10

Hugging Face push

Private or public repo. You own the weights. You can export, re-host, or modify forever.

11

Handoff

Endpoint keys delivered if Pylox-hosted — or adapter file shipped if you're running self-host.

Infrastructure

State-of-the-art inference stack.

Every acceleration NVIDIA ships, running together. Your fine-tune doesn't run on generic vLLM — it runs on the best inference path available on Grace Blackwell silicon.

Grace Blackwell silicon

NVIDIA's latest GB10 architecture (sm_121). Unified 128 GB memory, native 4-bit tensor cores.

NVFP4 weights

4-bit floating point native to Blackwell's tensor cores. Much smaller footprint, much faster inference at equivalent quality.

EAGLE-3 speculative decoding

NVIDIA / RedHatAI Blackwell-tested draft heads that predict multiple tokens ahead and verify in parallel. Same output, fewer forward passes.

FlashInfer kernels

Fastest attention and GEMM kernels shipping today. Paired natively with NVFP4 — not bolted on as an afterthought.

If you recognize those names, you know what this stack can do. If you don't, we walk through the benchmark during your consultation — inference speed depends on your prompt length, batch size, and traffic pattern. We measure yours on your workload and put the actual number in your engagement proposal.

Measured benchmarks

Throughput you can put in an SLA.

Every figure is measured on our own DGX Spark — same hardware your fine-tune trains and serves on. NVFP4 quantization paired with EAGLE-3 speculative decoding pushes every adapter far past its baseline.

Tokens per second · single user
Tier
Baseline
NVFP4 + EAGLE-3
Speedup
8B
12.5
Up to 98.9
Up to 7.9×
32B
36.6
70B
Benchmarking

Figures measured on a single-user, short-prompt workload against the NVFP4 + EAGLE-3 inference stack. Real throughput on your workload depends on prompt length, batch size, and traffic pattern — we measure yours during the consultation and put the actual number in your engagement proposal.

Cost per 1M tokens
Up to 100
×
Cheaper · output tokens

Self-hosted 8B runs at around $0.14 per 1M output tokens on owned hardware. GPT-5.2 bills $14.00 per 1M output — before your data ever leaves your network.

Pylox self-hosted
~$0.14
GPT-5.2 output
$14.00
Cost assumes workloads that saturate the hardware. Your actual break-even depends on traffic volume — quoted in the proposal.
The guarantee

Your model beats your current baseline, or your money back.

We agree on the evaluation set and the minimum delta before work begins. If the shipped adapter doesn't clear the bar on the agreed benchmark, the engagement fee is refunded in full. No hedging, no footnotes, no clawback period.

1
Benchmark agreed upfront
You pick the eval set. We commit to it in writing before training begins.
2
Delta in the contract
Minimum lift over your current baseline is stated in the SOW — not reverse-engineered after delivery.
3
Refund within 14 days
If the shipped model misses the bar, the engagement fee is wired back. No appeal process, no prorated clawback.
Engagement

Pick the silicon. Keep the weights.

All tiers ship with DPO safety alignment, a runtime safety gateway, and a red-team verification report. You own the adapter weights outright — no lock-in, no revenue share, no license recall.

Custom
You choose · Under 70B
Starting at
$1,000
One-time setup (training + deploy)
Refresh from $500 · hosting on request
  • Any base model under 70B — Llama, Qwen, Gemma, Mistral, or your choice
  • Any data, any use case, any format
  • Self-host or Pylox-hosted
  • Full pipeline — train, benchmark, safety, deploy
  • Perfect for experiments and exploratory fine-tunes
Book a consultation
Small
Llama 3.1 8B · 8B parameters
Starting at
$3,000
One-time setup (training + deploy)
Refresh from $1,500 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment baked in
  • Runtime safety gateway included
  • Refresh on-demand
Book a consultation
MOST POPULAR
Medium
Qwen 3 32B · 32B parameters
Starting at
$7,000
One-time setup (training + deploy)
Refresh from $2,500 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment included
  • Full domain benchmark harness + report
  • Red-team verification report
Book a consultation
Large
Llama 3.3 70B · 70B parameters
Starting at
$15,000
One-time setup (training + deploy)
Refresh from $5,000 · hosting on request
  • Self-host or Pylox-hosted
  • NVFP4 + EAGLE-3 inference stack
  • DPO safety alignment included
  • Extended red-team + S2 safety audit
  • Dedicated account engineer
Book a consultation

All tiers · data never leaves your hardware · adapters you own

Sovereign Edge

Your silicon, your server room, your data.

For clients who need true on-prem — we bring the hardware, install it in your server room, train your model, and walk out. All inference runs on your box, behind your firewall, forever.

DGX Spark installed on-site

Grace Blackwell GB10 with 128 GB unified memory. Runs up to 70B fine-tunes with NVFP4 + EAGLE-3 acceleration. Sits in your server room forever — not rented, not subscription-locked.

Air-gapped training handoff

Encrypted drive pickup from your site. Training on our Grace Blackwell — never touches the internet. Drive and fine-tune returned in person with chain-of-custody documentation and wipe certificate.

Nationwide coverage

South Florida (Miami-Dade, Broward, Palm Beach): no travel fee, one-hour on-site emergency response.
Anywhere else in the USA: installation included, travel billed at cost.

Law firms. Hospitals. Hedge funds. Family offices. Wealth managers. The "this can never touch OpenAI" crowd.

Book a Sovereign Edge consultation
Security

Defense in depth, documented per model.

Safety isn't a toggle — it's a stack. Every engagement ships with all three layers configured, tested, and reported in writing.

Layer 01
Weights

Training-time DPO alignment

Refusal behavior baked into the LoRA weights during fine-tune. The model is taught what to decline before it ever sees production traffic.

Layer 02
Gateway

Runtime safety gateway

Meta Prompt Guard 2 (GPU-pinned) plus Llama Guard 3 sit in front of every inference. Prompt-injection, jailbreaks, and category violations are blocked before your adapter is called.

Layer 03
QA

Red-team verification

A 50-prompt attack suite runs against every shipped model. We require a ≥70% block rate. The full report is handed to you with the adapter.

Support

SLA-backed operators, not a support queue.

Every ticket routes through the team that built your model. No offshored call center. No LLM chatbot triage. No tier-one handoff.

T1Starter
Included
with every engagement
  • Sev-1Next business day
  • Sev-23 business days
  • ChannelEmail only
  • Coverage9am–6pm ET · Mon–Fri
T2Business
On request
monthly add-on, disclosed in consultation
  • Sev-1Within 4 business hours
  • Sev-21 business day
  • ChannelEmail + Slack Connect
  • Coverage8am–8pm ET · 7-day sev-1
T3Enterprise
On request
custom tier, disclosed in consultation
  • Sev-11 hour · 24 / 7 / 365
  • Sev-24 hours
  • ChannelSlack + direct phone line
  • CoverageOn-call rotation · named architect
FAQ

Questions buyers always ask.

Direct line

Ready to forge your private model?

Send the dataset you want to fine-tune on, the compliance constraint you're trying to solve, or the compute budget you've already approved. You'll get a scoped response the same day.

Base
Miami, FL
Response
Same day
Intake
5-min scope