Cloud Infrastructure & Reliability
-
Reduced cloud infrastructure costs by 10% annually through strategic rightsizing of Kubernetes nodes and targeted GCP cost optimization initiatives.
-
Reduced downtime by 35% and sped up incident resolution by tuning Kubernetes configurations, automating GKE cluster management via Terraform, and deploying Prometheus monitoring.
-
Implemented GitOps workflows and Terraform pipelines, reducing infrastructure provisioning time by 70% and ensuring consistency across multi-cloud environments (GCP & Azure).
-
Mitigated security threats at scale by configuring NGINX rate limiting and edge-level DDoS defenses, maintaining >99.99% availability during high-traffic periods.
-
Participated in 24/7 on-call rotations to maintain service reliability.
CI/CD & Security
-
Migrated all CI/CD workflows from legacy systems to GCP Cloud Build and Azure DevOps, streamlining 500+ monthly deployments and reducing build times by 40%.
-
Cut access-related incidents by 60% by implementing Teleport for role-based access control (RBAC), upgrading HashiCorp Vault, and automating secret synchronization with Kubernetes.
Performance & Cost Optimization
-
Made the homepage load 3x faster through advanced caching (CDN tuning, layered service-level caching), leading to higher engagement and lower bounce rates.
-
Implemented prompt caching for AI applications, reducing input token costs by 40% while preserving accuracy and response times.
MLOps & AI Integration
-
Automated end-to-end model training and deployment pipelines in AWS SageMaker, ensuring reproducibility and accelerating delivery of ML models.
-
Deployed and managed trained SageMaker models at scale for low-latency, production-grade inference.
-
Built a Python Streamlit app to evaluate prompts against AWS Bedrock, enabling faster iteration and refinement of LLM workflows.
-
Integrated Weights & Biases (W&B) to track evaluation metrics, providing visibility into model performance and experiment outcomes.
-
Monitored AWS Bedrock usage metrics, improving AI service reliability and optimizing resource consumption.
-
Architected a scalable ML streaming service for automated curation of menu items, driving personalization and reducing manual overhead.
Tooling & Team Productivity
-
Boosted engineering productivity by 20% by automating Terraform documentation, Dataproc lifecycle management, and internal workflow tools.
-
Enhanced design system efficiency by automating dynamic design token workflows, reducing manual release cycles by 30%.
-
Served as a consultant and technical advisor for SRE and ML application development, providing architectural and hands-on guidance to accelerate delivery.