TL;DR
This 12 k-word field manual shows security engineers and infrastructure teams how to train, harden, and run their own LLMs—without leaking data, breaking the bank, or violating GDPR/HIPAA/FSTEC. Copy-paste configs, region-specific blueprints, and Colab-ready code included.


0. Why You Should Care

Commercial LLM APIs are toxic for high-sensitivity workloads:

| Pain Point | Real-World Impact |
| --- | --- |
| $0.06 / 1 k tokens | 1 M SOC alerts / mo ≈ $60 k |
| GDPR Art. 44 | EU SOC logs can’t leave the region |
| FedRAMP High | Only AWS GovCloud or C2S qualifies |
| Generic reasoning | “Block IP 10.0.0.12” turns into “Have you tried turning it off and on again?” |

The fix is Shift-Left AI:

  1. Domain-Specific Training on your logs, tickets, and threat intel.
  2. Integration Programming → strict JSON schemas, not prose.
  3. Compliance-by-Design → pick the right region, crypto, and tenancy.
  4. Cost Engineering → LoRA + spot GPUs + quantisation → 50–60 % cost cut.

1. Model Selection Matrix

| Model | Params | Strength | VRAM (4-bit) | Licence | Use-Case Fit |
| --- | --- | --- | --- | --- | --- |
| Llama 3 8B | 8 B | General reasoning | 6 GB | Meta (commercial OK) | Earnings calls, policy Q&A |
| Mistral 7B | 7 B | Fast/cheap LoRA | 5 GB | Apache-2.0 | Threat triage, log anomaly |
| Phi-3 3.8B | 3.8 B | Edge SOC boxes | 3 GB | MIT | Offline incident response |
| YaLM 100B (open) | 100 B | Multilingual | 60 GB | Apache-2.0 | Public research |
| YaLM-2 (gov) | 100 B | Russia FSTEC | 60 GB | Custom licence | Air-gapped Kremlin subnet |
| Gemma 2B/7B | 2–7 B | Lightweight | 2–5 GB | Google (commercial OK) | Ticket classification |

Rule of thumb: start with Mistral-7B + LoRA on a T4; graduate to Llama-3-70B only if reasoning depth is poor.


2. Data Engineering Playbook

2.1 Extraction

| Source | Tooling | Example Snippet |
| --- | --- | --- |
| Splunk | splunk-sdk → JSON | index=fw sourcetype=ids \| eval label="bruteforce" |
| CrowdStrike | FalconPy | get_alerts(limit=10000) |
| Confluence | atlassian-python-api | Strip macros, retain headings |
| Jira | REST API | Map summary + description → input, resolution → output |
| Slack | slack_sdk | Export #incident-* channels |
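
A minimal extraction sketch for the Splunk row, assuming a reachable search head on port 8089 and credentials supplied via environment variables; the index, labels, field names, and output path are placeholders, and the exact splunk-sdk call pattern should be checked against your SDK version:

# extract_splunk.py — hedged sketch, not a drop-in integration
import json, os
import splunklib.client as client  # pip install splunk-sdk

service = client.connect(
    host=os.environ.get("SPLUNK_HOST", "splunk.internal"),
    port=8089,
    username=os.environ["SPLUNK_USER"],
    password=os.environ["SPLUNK_PASS"],
)

# One-shot search; count=0 lifts the default event cap
query = 'search index=fw sourcetype=ids | eval label="bruteforce"'
response = service.jobs.oneshot(query, output_mode="json", count=0)
events = json.loads(response.read()).get("results", [])

with open("data/raw/splunk_ids.jsonl", "w") as fh:
    for ev in events:
        fh.write(json.dumps({"text": ev.get("_raw", ""), "label": ev.get("label")}) + "\n")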

2.2 Cleaning

pip install text-dedup langchain
python -m text_dedup.minhash \
  --path "data/raw/" \
  --output "data/dedup/" \
  --column "text"
  • Remove PII with presidio-analyzer (see the sketch after the JSONL example).
  • Expect deduplication to drop >30 % of records on typical SOC dumps.
  • Convert to conversational JSONL:
{"input": "SOC Alert: Brute-force on VPN (src_ip: 10.0.0.12)", "output": "{\"action\": \"block_ip\", \"target\": \"10.0.0.12\", \"confidence\": 0.92}"}

3. Fine-Tuning Recipes

3.1 LoRA (90 % of cases)

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor (alpha / r = 2)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
  • VRAM: 7 B model → 6 GB (batch=1, 4-bit).
  • Speed: ~10 samples/sec on an A100 80 GB (consistent with the convergence figure below).
  • Convergence: 3 epochs on 10 k samples ≈ 45 min.
  • Parameter delta: each adapted matrix adds r × (d_in + d_out) params; with r=16 over four attention projections in all 32 layers of a 7 B model that is roughly 14 M trainable params (≈ 0.2 % of the base weights); the sketch below prints the exact count.
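
A minimal sketch that applies the config above to a 4-bit base model and reports the trainable-parameter count. The checkpoint ID and the bitsandbytes settings (mirroring section 3.3) are assumptions, and this is not a full training loop:

# lora_wrap.py — hedged sketch: attach LoRA adapters and inspect trainable params
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

base_id = "mistralai/Mistral-7B-v0.1"      # assumed base checkpoint
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # expect roughly 0.2 % of all params to be trainable
# From here, hand `model` to transformers.Trainer or trl.SFTTrainer as usual.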

3.2 Full Fine-Tuning (high-stakes)

| Hyper-param | Value |
| --- | --- |
| Model | Llama-3-8B |
| GPUs | 8×A100 80 GB (NVLink) |
| Batch | 32 (DP=8, GA=4) |
| LR | 2e-5 |
| Time | 12 h / 50 k samples |
| Cost (spot) | ~$180 (AWS p4d.24xlarge @ $3.06/h) |

Reserve full fine-tuning for cases that need maximum fidelity (legal documents, medical records).

3.3 Quantisation for Edge

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load weights in 4-bit
    bnb_4bit_use_double_quant=True,        # quantise the quantisation constants too
    bnb_4bit_compute_dtype=torch.float16   # run matmuls in fp16
)
  • Jetson AGX Orin (32 GB unified memory) → ~40 tok/sec for 4-bit Mistral-7B.
  • Latency <500 ms for the SOC chat-bot (see the benchmark sketch below).
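
A quick way to sanity-check the ~40 tok/s figure on the target box. The checkpoint path is a placeholder (the same adapter the TGI deployment in section 6.1 mounts), and the measurement lumps prefill and decode together:

# edge_benchmark.py — hedged sketch: rough tokens/sec measurement for a 4-bit model
import time, torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "/mnt/models/mistral-7b-lora"   # placeholder path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

prompt = "SOC Alert: Brute-force on VPN (src_ip: 10.0.0.12). Recommend an action as JSON."
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")   # compare against the ~40 tok/s target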

4. Infrastructure Overhead Cheatsheet

4.1 Public Cloud (spot pricing 2025-08)

| Provider | GPU | RAM | $ / hr | Region Lock | Notes |
| --- | --- | --- | --- | --- | --- |
| AWS | g4dn.xlarge (T4) | 16 GB | $0.21 | Global | Egress $0.09/GB |
| AWS | p4d.24xlarge (8×A100) | 320 GB | $3.06 | us-east-1 / us-gov-west-1 | FedRAMP High |
| Azure | NC6s_v3 (V100) | 12 GB | $0.45 | Global | Private Link egress free |
| Azure | ND96amsr_A100_v4 | 900 GB | $2.97 | France Central (GDPR) | EU-only storage |
| GCP | n1-standard-4 + T4 | 16 GB | $0.35 | europe-west4 (GDPR) | VPC-SC |
| GCP | a2-ultragpu-8g (8×A100) | 320 GB | $2.89 | europe-west4 | CMEK |

Spot savings: 50–60 % (GPU) and up to 80 % on Azure Low-Priority VMs.

4.2 On-Prem / Air-Gapped

| Component | SKU | Unit Cost | 5-yr TCO |
| --- | --- | --- | --- |
| GPU Node | 2×A100 80 GB NVLink | $20 k | $40 k total → $0.82 /hr amortised |
| Storage | Ceph, 20 TB SSD | $8 k | $0.10 /GB |
| K8s | OpenShift + TGI | $0 | Runs offline |
| NVIDIA AI Ent. | License | $4 k / socket | Includes support |

Physical isolation eliminates egress costs and shrinks the compliance surface; it is mandatory for classified enclaves.


5. Regional Compliance Blueprints

5.1 EU GDPR – Finance Analytics

  • Location: GCP europe-west4
  • Storage: Cloud Storage bucket with EU_LOCATION constraint
  • Compute: Vertex AI with VPC Service Controls
  • Crypto: CMEK or Cloud HSM / external key (FIPS 140-2 Level 3)

5.2 HIPAA – US Healthcare

  • Training: SageMaker in AWS GovCloud (us-gov-west-1)
  • Inference: PrivateLink endpoint inside dedicated VPC
  • PHI Redaction: Lambda layer using presidio-anonymizer
  • Audit: CloudTrail + GuardDuty → Splunk

5.3 Israel Defense – Air-Gapped

  • Hardware: 2×A100 80 GB, no NIC to Internet
  • Stack: OpenShift + TGI container (ghcr.io/huggingface/text-generation-inference:1.4.2)
  • Model Signing: GPG-sign every LoRA adapter
  • Update Cycle: USB sneakernet every 30 days

5.4 China DSL – Threat Intelligence

  • Provider: Alibaba PAI (Ascend 910 NPUs)
  • Data Residency: MaxCompute in Beijing region
  • Encryption: SM4 for data at rest, TLS 1.3 CN-specific ciphers
  • Model: YaLM-100B fine-tuned on local SOC logs

5.5 Russia FSTEC – Sovereign Cloud

  • Provider: Yandex DataSphere
  • Encryption: GOST 28147-89
  • Hardware: A100 cluster in Moscow DC
  • Model: YaLM-100B or custom 70 B Llama

6. Deployment Patterns

6.1 Real-Time SOC Co-Pilot

# k8s/tgi-stack.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-triage
spec:
  replicas: 2
  selector:
    matchLabels: { app: llm-triage }
  template:
    metadata:
      labels: { app: llm-triage }
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:1.4.2
        args:
          - --model-id=/mnt/models/mistral-7b-lora
          - --quantize=bitsandbytes-nf4
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
        volumeMounts:
          - { mountPath: /mnt/models, name: model }
      volumes:
        - name: model
          persistentVolumeClaim: { claimName: pvc-model }
  • Latency: p95 < 400 ms
  • Auto-scale: KEDA on GPU utilisation > 80 %.
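
A hedged client-side sketch for the triage service above. It assumes the deployment is exposed inside the cluster as a Service named llm-triage on TGI's default container port, and uses TGI's non-streaming /generate endpoint; greedy decoding is the default when no sampling parameters are set.

# triage_client.py — hedged sketch: call the TGI deployment's /generate endpoint
import json
import requests

TGI_URL = "http://llm-triage/generate"   # assumed in-cluster Service name and port

alert = "SOC Alert: Brute-force on VPN (src_ip: 10.0.0.12)"

payload = {
    "inputs": alert,
    "parameters": {"max_new_tokens": 128},
}

resp = requests.post(TGI_URL, json=payload, timeout=5)
resp.raise_for_status()

generated = resp.json()["generated_text"]   # TGI returns {"generated_text": "..."}
print(json.loads(generated))                # expect {"action": "block_ip", ...}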

6.2 Batch Earnings-Call Pipeline

# lambda_handler.py (AWS)
import os
import sagemaker

sess = sagemaker.Session()
role = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]   # IAM role ARN for the endpoint (placeholder env var)

model = sagemaker.model.Model(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.1.0-transformers4.40-gpu-py310-cu121-ubuntu22.04",
    model_data="s3://artifacts/llama3-earnings.tar.gz",
    role=role,
    sagemaker_session=sess)

model.deploy(
    initial_instance_count=2,
    instance_type="ml.g4dn.xlarge",
    endpoint_name="earnings-batch")
  • Throughput: 600 calls / hour
  • Cost: $0.012 per call (spot g4dn)

7. Monitoring & Guardrails

| Layer | Tool | Check |
| --- | --- | --- |
| Drift | Weights & Biases | Perplexity ↑ > 10 % → retrain |
| Hallucinations | Eval dataset (1 k golden samples) | F1 < 95 % → roll back |
| PII Leak | Presidio | Regex post-filter |
| Output Schema | jsonschema | Invalid JSON → retry w/ temperature=0 |
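
A hedged sketch of the schema guardrail in the last row. The schema fields mirror the JSONL example in section 2.2, the enum values are illustrative assumptions, and `generate(prompt, temperature)` is a stand-in for whatever client wraps your endpoint:

# schema_guard.py — hedged sketch: validate model output, retry once at temperature 0
import json
from jsonschema import validate, ValidationError

TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["block_ip", "escalate", "ignore"]},
        "target": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["action", "target", "confidence"],
}

def guarded_triage(alert: str, generate) -> dict:
    """`generate(prompt, temperature)` is a placeholder for your inference client."""
    for temperature in (0.2, 0.0):          # second pass is the temperature=0 retry
        raw = generate(alert, temperature)
        try:
            parsed = json.loads(raw)
            validate(instance=parsed, schema=TRIAGE_SCHEMA)
            return parsed
        except (json.JSONDecodeError, ValidationError):
            continue
    raise ValueError("Output failed schema validation twice; route to a human analyst")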

8. Cost Calculator (copy-paste)

# cost.py
def training_cost(gpus, hours, spot_discount=0.55, rate=3.06):
    """Spot training cost; `rate` is the assumed on-demand price per GPU-hour."""
    on_demand = gpus * hours * rate
    return on_demand * (1 - spot_discount)

def inference_cost(req_per_month, per_1k=0.012):
    """Inference cost assuming a flat price per 1,000 requests."""
    return req_per_month * per_1k / 1000

print("Training:", training_cost(8, 12), "USD")
print("Inference:", inference_cost(1_000_000), "USD/month")

9. Quick-Start Colab Notebook

https://colab.research.google.com/github/unattributed/llm-guide/blob/main/domain_llm_quickstart.ipynb

The notebook currently lives in a private repo; request access or clone the repo locally.
It runs on a free T4 and fine-tunes a Mistral-7B LoRA in ~25 min on 5 k SOC alerts.


10. Checklist Before Go-Live

  • Data cleaned + deduped
  • GPU spot quota approved
  • VPC-SC / PrivateLink tested
  • PII filter passes pen-test
  • JSON schema enforced
  • Drift job scheduled (weekly)
  • Cost budget + alerts set

11. Roadmap for Advanced Teams

| Phase | Milestone |
| --- | --- |
| Q3 | Multi-model routing (Phi-3 for edge, Llama-3 for deep reasoning) |
| Q4 | RLHF on analyst feedback |
| Q1 '26 | Federated learning across 3 regions |
| Q2 '26 | Signed SBOM + reproducible builds |

12. References & Credits

  • HuggingFace PEFT docs
  • AWS “HIPAA on SageMaker” whitepaper
  • Google “VPC Service Controls Best Practices”
  • NVIDIA AI Enterprise Deployment Guide
  • unattributed.blog threat-hunting primers
