The client ran ML models on dedicated GPU instances with no auto-scaling, leaving capacity 10x overprovisioned during off-peak hours. Each model deployment took two days of manual work, and there was no visibility into per-inference cost or model performance.
We deployed an EKS cluster with Karpenter for just-in-time GPU node provisioning and built an LLM gateway for unified model serving. On top of that, we implemented KServe for standardized model deployment, Prometheus-based per-model cost attribution, and canary rollouts for safe model updates.
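
To make the provisioning piece concrete, here is a minimal sketch of registering a Karpenter GPU NodePool through the Kubernetes API, so GPU nodes launch only when pending pods request them and are reclaimed once idle. The pool name, instance family, `EC2NodeClass` name, idle timeout, and GPU limit are illustrative assumptions, not the client's actual values.

```python
# Sketch: register a Karpenter NodePool so GPU nodes are provisioned
# just-in-time and consolidated away when empty. Values are illustrative.
from kubernetes import client, config

GPU_NODEPOOL = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "gpu-inference"},  # hypothetical pool name
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Assumed instance family; choose per workload.
                    {"key": "karpenter.k8s.aws/instance-family",
                     "operator": "In", "values": ["g5"]},
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["on-demand", "spot"]},
                ],
                # Taint so only GPU workloads schedule onto these nodes.
                "taints": [{"key": "nvidia.com/gpu", "value": "true",
                            "effect": "NoSchedule"}],
                "nodeClassRef": {"group": "karpenter.k8s.aws",
                                 "kind": "EC2NodeClass",
                                 "name": "gpu"},  # assumed EC2NodeClass
            }
        },
        "disruption": {
            # Scale down: delete nodes five minutes after they empty out.
            "consolidationPolicy": "WhenEmpty",
            "consolidateAfter": "5m",
        },
        # Cap the total GPUs the pool may provision (illustrative limit).
        "limits": {"nvidia.com/gpu": "16"},
    },
}

def main() -> None:
    config.load_kube_config()  # or load_incluster_config() in-cluster
    api = client.CustomObjectsApi()
    # NodePool is cluster-scoped, hence the cluster-level call.
    api.create_cluster_custom_object(
        group="karpenter.sh", version="v1",
        plural="nodepools", body=GPU_NODEPOOL,
    )

if __name__ == "__main__":
    main()
```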
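
And for the rollout piece, a sketch of a KServe canary update: setting `canaryTrafficPercent` on an InferenceService predictor makes KServe split traffic between the last-ready revision and the updated model, so a bad model version only ever sees a slice of requests. The service name, namespace, model format, and storage URI below are assumptions for illustration.

```python
# Sketch: shift 10% of traffic to a new model revision via KServe's
# canaryTrafficPercent. Identifiers are illustrative assumptions.
from kubernetes import client, config

def canary_update(name: str, namespace: str, storage_uri: str,
                  canary_percent: int = 10) -> None:
    """Point an existing InferenceService at a new model artifact and
    route only `canary_percent` of traffic to the new revision."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    patch = {
        "spec": {
            "predictor": {
                # KServe keeps the previous ready revision serving the
                # remaining (100 - canary_percent)% of traffic.
                "canaryTrafficPercent": canary_percent,
                "model": {
                    "modelFormat": {"name": "huggingface"},  # assumed format
                    "storageUri": storage_uri,
                },
            }
        }
    }
    api.patch_namespaced_custom_object(
        group="serving.kserve.io", version="v1beta1",
        namespace=namespace, plural="inferenceservices",
        name=name, body=patch,
    )

if __name__ == "__main__":
    # Hypothetical service and artifact location.
    canary_update("llm-chat", "ml-serving", "s3://models/llm-chat/v2")
```

Once the canary's metrics look healthy, promotion is another patch that sets `canaryTrafficPercent` to 100 (or removes the field), after which the new revision takes all traffic.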