- AI Platform Engineer | LLM Systems Architect
- OVERVIEW
- AI Platform Engineer building production-grade LLM systems with strong
- reliability guarantees.
- 10+ years in distributed systems and high-availability infrastructure, now
- focused on AI control planes, RAG pipelines, routing, evaluation frameworks,
- and observability.
- Designed LLM orchestration systems with explicit validation, retry, rollback,
- and commit semantics to reduce non-determinism and enforce correctness. Brings
- deep experience in scaling, failure modeling, and SLO-driven production system
- design.
- SKILLS
- AI Platform & LLM Systems: RAG architectures, LLM orchestration, control
- planes, routing & fallback logic, evaluation frameworks (Langfuse, deepeval),
- trace-level observability, prompt/version management, context engineering,
- agentic workflows
- Distributed Systems & Infrastructure: High-availability design, autoscaling
- concepts, failure modeling, SLO/SLA enforcement, load balancing
- (HAProxy/Nginx), Kubernetes, Docker, cloud-native architectures
- (AWS/GCP/Azure), Rackspace, Terraform, Ansible, Chef, Linux, Solaris, ZFS
- Systems Design & Leadership: Architecture Reviews, Design RFCs, Failure
- Modeling, Incident Postmortems, Cross-team platform enablement
- Languages: Python, SQL, Bash
- Databases: PostgreSQL (expert: performance tuning, query optimization,
- replication, partitioning, migrations, security), MySQL (strong: performance
- tuning, optimization), CockroachDB, Cassandra, MS SQL Server, Oracle
- EXPERIENCE
- NetApp (Acquired), Remote — Senior Systems Architect
- May 2022 - PRESENT
- ●
- Designed and deployed PostgreSQL-backed LLM platform architectures using
- RAG and Model Context Protocol (MCP), treating the database as an AI
- control plane rather than a passive datastore. Improved internal
- operational workflows and reduced failure rates.
- ●
- Built and productionized multi-agent LLM pipelines using LangGraph,
- modeling workflows as explicit state machines with validation, retry,
- correction, and commit phases — increased reliability and contributed to
- ~2x YoY revenue growth.
- ●
- Integrated end-to-end LLM observability and evaluation pipelines
- (Langfuse, deepeval), enabling trace-level introspection, automated
- scoring, regression detection, and feedback-driven iteration.
- ●
- Defined system invariants, routing gates, and validation probes to
- enforce correctness, reproducibility, and safe deployment of AI-assisted
- database tooling in enterprise SaaS environments.
- ●
- Applied distributed systems principles (idempotency, transactional
- boundaries, rollback semantics, auditability) to AI workflows, reducing
- silent failures and improving deployment safety.
- ●
- Designed API-layer abstractions and LLM gateway components to support
- routing, validation, and model versioning strategies.
- ●
- Deployed and iterated on LLM systems in cloud-native environments (AWS/Azure),
- integrating storage, compute, and monitoring layers to support production
- workloads.
- Instaclustr (Acquired), Remote — Senior Database Reliability
- Engineer
- March 2021 - May 2022
- ●
- PostgreSQL scalability expert for DoorDash, helping them scale their
- Postgres RDS & Aurora systems to support rapid traffic growth during the
- COVID-19 pandemic surge.
- ●
- Led the team of database experts working on all aspects of DoorDash
- database scalability, performance, and infrastructure overhaul.
- ●
- Provided ongoing consulting on schema design, query optimization, live
- production schema changes with minimal disruption, and disaster recovery,
- playing a pivotal role in the company’s growth & operations leading to a
- successful IPO.
- Credativ (Acquired), Remote — Senior Database Reliability
- Engineer
- January 2019 - March 2021
- ●
- PostgreSQL & MySQL specialist for enterprise clients including DoorDash,
- Etsy, National Geographic, Follow Up Boss, Gilt, Leafly, and Paperless
- Post among others, maintaining up to 99.999% uptime for mission-critical
- operations
- ●
- Led large-scale PostgreSQL schema changes, migrations, major upgrades,
- security patching, and performance optimization in high-transaction,
- high-availability environments
- ●
- 24/7 on-call: prevented critical outages during peak traffic through
- proactive tuning, capacity planning, and failure modeling, contributing
- to long-term contract renewals
- ●
- Conducted comprehensive database performance & security audits for
- numerous clients, providing targeted, concrete solutions to complex
- issues & pain points
- OmniTI Computer Consulting, Remote — Data Engineer
- June 2013 - January 2019
- ●
- Re-designed OLAP ETL pipeline and data sanitization project for Gilt
- Japan, saving them $2+ million per year in infrastructure and resource
- costs
- ●
- Designed and implemented HA/DR strategies that reduced processing time by
- 65% and recovery time objectives (RTO) by 70%
- ●
- 24/7 PagerDuty on-call for multiple clients: debugged and resolved
- production-critical performance and availability issues at all hours of
- the night
- ●
- Built and contributed to several open source projects for database
- monitoring, security, replication, observability, reporting,
- partitioning, sharding, and migration, and employed these tools to better
- serve clients
- ●
- Performed in-depth configuration tuning for Postgres and MySQL as well
- as Linux & Solaris OS to improve database performance and efficiency for
- multiple clients. As a direct result, clients like Locally and Click2Ship
- scaled through holiday and unexpected traffic spikes (4x traffic with no
- issues)
- SELECTED PROJECTS
- ●
- Enterprise AI platform for automated DBMS migration and code conversion
- using multi-agent orchestration with full auditability, evaluation
- pipelines, and feedback loops. Designed for safe migration of
- heterogeneous workloads under strict correctness and traceability
- guarantees.
- ●
- Designed and implemented RAG + MCP financial analytics platform with
- routing logic, persistence layer, and evaluation-driven pipelines for
- reproducible AI-assisted analysis.
- ●
- Built real-time streaming RAG + MCP pipeline as proof-of-concept for
- reliable LLM deployment under latency constraints, integrating
- observability and controlled rollout mechanisms.
- ●
- In-memory enterprise data sanitization pipeline for PCI-compliant
- PostgreSQL environments
- CONFERENCE TALKS
- Spoke at SCaLE 16x, 18x, 23x (most recent), pgCon.dev (Ottawa), Percona Live,
- NYCPUG, CPOSC on various topics:
- ●
- Postgres as an AI Control Plane: Building RAG + MCP Workflows Inside the
- Database (2026)
- ●
- Provisioning & Automating High Availability Postgres on AWS
- ●
- Securing Your Data on Postgres
- ●
- Data Mining for Beginners with Pandas
- ●
- Think Your Postgres Backups & Disaster Recovery Are Safe? Let’s Talk.
- EDUCATION
- University of Maryland Baltimore County, Baltimore County MD —
- Master’s in Computer Science
- September 2011 - December 2014
- ●
- Research Assistant for NSF-funded massive data processing & visualization
- project for geologists
- Panjab University, Chandigarh (India) — Bachelor of Engineering
- in Computer Science
- July 2007 - June 2011
- CERTIFICATIONS
- AWS Certified Solutions Architect
- AWS Certified Cloud Practitioner
- TECHNICAL BLOGS & LINKS
- https://medium.com/@reliable-by-design
- https://penningpence.blogspot.com
- https://github.com/payals
