Nimesh Kumar

Reduced cloud infrastructure costs by 10% annually through strategic rightsizing of Kubernetes nodes and targeted GCP cost optimization initiatives.
Increased system reliability by tuning Kubernetes configurations, automating GKE cluster management via Terraform, and deploying Prometheus monitoring—leading to a 35% reduction in downtime and faster incident resolution.
Implemented GitOps workflows and Terraform pipelines, reducing infrastructure provisioning time by 70% and ensuring consistency across multi-cloud environments (GCP & Azure).
Mitigated security threats at scale by configuring NGINX rate limiting and edge-level DDoS defenses, maintaining >99.99% availability during high-traffic periods.
Maintained service reliability through participation in 24/7 on-call rotations.

Migrated 100% of CI/CD workflows from legacy systems to GCP Cloud Build and Azure DevOps, streamlining 500+ monthly deployments and reducing build times by 40%.
Strengthened security posture by implementing Teleport for role-based access control (RBAC), upgrading HashiCorp Vault, and automating secret synchronization with Kubernetes—cutting access-related incidents by 60%.

Made the homepage load 3x faster through advanced caching (CDN tuning, layered service-level caching), leading to higher engagement and lower bounce rates.
Implemented prompt caching for AI applications, reducing input token costs by 40% while preserving accuracy and response times.

Automated end-to-end model training and deployment pipelines in AWS SageMaker, ensuring reproducibility and accelerating delivery of ML models.
Deployed and managed trained SageMaker models at scale for low-latency, production-grade inference.
Built a Python Streamlit app to evaluate prompts against AWS Bedrock, enabling faster iteration and refinement of LLM workflows.
Integrated Weights & Biases (W&B) to track evaluation metrics, providing visibility into model performance and experiment outcomes.
Monitored AWS Bedrock usage metrics, improving AI service reliability and optimizing resource consumption.
Architected a scalable ML streaming service for automated curation of menu items, driving personalization and reducing manual overhead.

Boosted engineering productivity by 20% through automation of Terraform documentation, DataProc lifecycle management, and internal workflow tools.
Enhanced design system efficiency by automating dynamic design token workflows, reducing manual release cycles by 30%.
Served as a consultant and technical advisor for SRE and ML application development, contributing architectural insights and hands-on guidance to accelerate delivery.

Staff SRE