AI Platform Engineer | LLM Systems Architect

OVERVIEW
AI Platform Engineer building production-grade LLM systems with strong reliability guarantees. 10+ years in distributed systems and high-availability infrastructure, now focused on AI control planes, RAG pipelines, routing, evaluation frameworks, and observability. Designed LLM orchestration systems with explicit validation, retry, rollback, and commit semantics to reduce non-determinism and enforce correctness. Brings deep experience in scaling, failure modeling, and SLO-driven production system design.

SKILLS
AI Platform & LLM Systems: RAG architectures, LLM orchestration, control planes, routing & fallback logic, evaluation frameworks (Langfuse, deepeval), trace-level observability, prompt/version management, context engineering, agentic workflows
Distributed Systems & Infrastructure: High-availability design, autoscaling concepts, failure modeling, SLO/SLA enforcement, load balancing (HAProxy/Nginx), Kubernetes, Docker, cloud-native architectures (AWS/GCP/Azure), Rackspace, Terraform, Ansible, Chef, Linux, Solaris, ZFS
Systems Design & Leadership: Architecture Reviews, Design RFCs, Failure Modeling, Incident Postmortems, Cross-team platform enablement
Languages: Python, SQL, Bash
Databases: PostgreSQL (expert: performance tuning, query optimization, replication, partitioning, migrations, security), MySQL (strong: performance tuning, optimization), CockroachDB, Cassandra, MS SQL Server, Oracle

EXPERIENCE
NetApp (Acquired), Remote | Senior Systems Architect | May 2022 - Present
● Designed and deployed PostgreSQL-backed LLM platform architectures using RAG and Model Context Protocol (MCP), treating the database as an AI control plane rather than a passive datastore. Improved internal operational workflows and reduced failure rates.
● Built and productionized multi-agent LLM pipelines using LangGraph, modeling workflows as explicit state machines with validation, retry, correction, and commit phases. This increased reliability and contributed to ~2x YoY revenue growth.
● Integrated end-to-end LLM observability and evaluation pipelines (Langfuse, deepeval), enabling trace-level introspection, automated scoring, regression detection, and feedback-driven iteration.
● Defined system invariants, routing gates, and validation probes to enforce correctness, reproducibility, and safe deployment of AI-assisted database tooling in enterprise SaaS environments.
● Applied distributed systems principles (idempotency, transactional boundaries, rollback semantics, auditability) to AI workflows, reducing silent failures and improving deployment safety.
● Designed API-layer abstractions and LLM gateway components to support routing, validation, and model versioning strategies.
● Deployed and iterated on LLM systems in cloud-native environments (AWS/Azure), integrating storage, compute, and monitoring layers to support production workloads.

Instaclustr (Acquired), Remote | Senior Database Reliability Engineer | March 2021 - May 2022
● PostgreSQL scalability expert for DoorDash, helping them scale their Postgres RDS & Aurora systems to support rapid traffic growth during the COVID-19 pandemic surge.
● Led the team of database experts working on all aspects of DoorDash database scalability, performance, and infrastructure overhaul.
● Provided ongoing schema design, query optimization, live production schema change processes with minimal disruption, and disaster recovery consulting, playing a pivotal role in the company’s growth & operations leading to a successful IPO.

Credativ (Acquired), Remote | Senior Database Reliability Engineer | January 2019 - March 2021
● PostgreSQL & MySQL specialist for enterprise clients including DoorDash, Etsy, National Geographic, Follow Up Boss, Gilt, Leafly, and Paperless Post, among others, maintaining up to 99.999% uptime for mission-critical operations.
● Led large-scale PostgreSQL schema changes, migrations, major upgrades, security patching, and performance optimization in high-transaction, high-availability environments.
● 24/7 on-call: prevented critical outages during peak traffic through proactive tuning, capacity planning, and failure modeling, contributing to long-term contract renewals.
● Conducted comprehensive database performance & security audits for numerous clients, providing surgical, concrete solutions to complex issues & pain points.

OmniTI Computer Consulting, Remote | Data Engineer | June 2013 - January 2019
● Re-designed OLAP ETL pipeline and data sanitization project for Gilt Japan, saving them $2+ million per year on infrastructure and resource costs.
● Designed and implemented HA/DR strategies that reduced processing time by 65% and recovery time objectives (RTO) by 70%.
● 24/7 PagerDuty on-call for multiple clients: debugging and resolving production-critical performance and availability issues at all hours of the night.
● Built and contributed to several open-source projects for database monitoring, security, replication, observability, reporting, partitioning, sharding, and migration, and employed these tools to better serve clients.
● Performed in-depth configuration tuning for Postgres and MySQL as well as Linux & Solaris OS to improve database performance and efficiency for multiple clients. Clients like Locally and Click2Ship saw drastic scaling improvements (4x traffic with no issues) during holidays & random spikes as a direct result of this work.

SELECTED PROJECTS
● Enterprise AI platform for automated DBMS migration and code conversion using multi-agent orchestration with full auditability, evaluation pipelines, and feedback loops. Designed for safe migration of heterogeneous workloads under strict correctness and traceability guarantees.
● Designed and implemented RAG + MCP financial analytics platform with routing logic, persistence layer, and evaluation-driven pipelines for reproducible AI-assisted analysis.
● Built real-time streaming RAG + MCP pipeline as proof-of-concept for reliable LLM deployment under latency constraints, integrating observability and controlled rollout mechanisms.
● In-memory enterprise data sanitization pipeline for PCI-compliant PostgreSQL environments.

CONFERENCE TALKS
Spoke at SCaLE 16x, 18x, 23x (most recent), pgCon.dev (Ottawa), Percona Live, NYCPUG, CPOSC on various topics:
● Postgres as an AI Control Plane: Building RAG + MCP Workflows Inside the Database (2026)
● Provisioning & Automating High Availability Postgres on AWS
● Securing Your Data on Postgres
● Data Mining for Beginners with Pandas
● Think Your Postgres Backups & Disaster Recovery Are Safe? Let’s Talk.

EDUCATION
University of Maryland Baltimore County, Baltimore County, MD | Master's in Computer Science | September 2011 - December 2014
● Research Assistant for NSF-funded massive data processing & visualization project for geologists

Panjab University, Chandigarh (India) | Bachelor of Engineering in Computer Science | July 2007 - June 2011

CERTIFICATIONS
AWS Certified Solutions Architect
AWS Certified Cloud Practitioner

TECHNICAL BLOGS & LINKS
https://medium.com/@reliable-by-design
https://penningpence.blogspot.com
https://github.com/payals