A high-throughput inference gateway for serving open-weight LLMs with token streaming, request batching, and per-tenant rate limiting. Built to run on Kubernetes with autoscaling backed by GPU node pools.
- Python
- vLLM
- FastAPI
- Kubernetes
- Triton
Selected work across AI infrastructure, inference platforms, and MLOps tooling. Most are open source — explore the code on GitHub.
Reference architecture and Terraform modules for an end-to-end MLOps platform on AWS: feature store, model registry, CI/CD for models, and automated rollout with shadow deployments.
A monitoring stack that attributes GPU utilization and cloud spend to individual models and teams, with Grafana dashboards and Prometheus exporters for inference workloads.
A modular toolkit for building real-time computer vision pipelines with hardware-accelerated decoding, model ensembling, and ONNX/TensorRT export for edge deployment.
A semantic search microservice with pluggable vector stores, hybrid retrieval, and a thin caching layer to keep tail latencies predictable under load.
A developer-friendly CLI that packages, validates, and rolls out models to staging and production with reproducible builds and automatic canary checks.