
Summary

Brad Chapin is a highly skilled Senior Software Architect with expertise in Site Reliability best practices, methodologies and principles with a strong background in Distributed Systems, Hybrid Cloud Computing, and the Software Development Process. He has a proven track record of success in managing large-scale projects and teams, including his role as Senior Lead Architect/Senior Manager at Lumen Technologies, where he was responsible for Site Reliability, Configuration Management, Deployment, and Automated Failure Recovery for the Lumen Content Distribution Network (CDN). Brad's accomplishments include designing an authenticated RESTful API (FastAPI) to determine network disposition, implementing a GitOps solution to execute upgrade and trial configurations, and architecting and implementing staged deployment for 20K operational CDN servers. He has also played a key role in the migration from home-grown network automation systems to Salt and the implementation of GitLab CI/CD for network automation code. Brad's expertise extends to Continuous Integration and Continuous Delivery (CI/CD), VM/container integration, and high performance clustered storage solutions. He holds a Bachelor of Engineering in Computer Science from the University of Illinois Urbana-Champaign. Brad is seeking advanced opportunities to apply his skills and contribute to the success of complex projects in the field of Site Reliability.

Work experience

Lumen Technologies, Broomfield, US
2015 - present

Senior Lead Architect/Senior Manager - CDN/Network Automation

Technical lead and manager responsible for architecting, designing, and allocating work for a team of 3-6 responsible for Site Reliability, Configuration Management, Deployment, and Automated Failure Recovery for the Lumen CDN, managing 20K servers, VMs, and containerized workflows. The CDN produced $100M+ in yearly recurring revenue while delivering more than 100 Tbps.

CMAPI - Architected and Designed an authenticated RESTful API (FastAPI) to aggregate and cache CM data from 10+ systems. Currently used by monitoring, config management and repair automation as the source of truth for network disposition. The system evolves constantly as we add new data sources and receive new requests for custom aggregations for both internal and external uses. It is the foundation and source of record for all projects listed below.

Trial ROD - Implemented a GitOps model using REST APIs through GitLab CI/CD to identify clusters/servers for upgrade, apply new trial configurations through CMAPI interfaces, and utilize Release On Demand to deploy new applications/configurations to the network (bare metal and VMs). The resulting project gave ownership of candidate releases to the engineers responsible for validating performance and increased the pace of candidate deployments from monthly to 1-5 per week.

Release On Demand - Architected and Implemented staged deployment for 20K operational CDN servers based on the disposition of the network. RoD removed the onus for generating an upgrade schedule from operations, while taking into account traffic levels, metros, server health and network blackout dates. It increased release cadence for the primary distribution network from 1-2 GA releases per year to our targeted goal of once per month.

SectionIO Compute Engagement - Operationalized a PoC for a partnership with SectionIO to repurpose CDN hardware that was no longer viable for edge delivery due to network constraints and use it for a multi-tenant, customer-facing compute product. Utilized Canonical MAAS to build and configure machines to hand over to SectionIO for proprietary application install and configuration. We were able to reallocate 800 out-of-date and impaired servers at no cost other than space, power and networking.

Salt Versioned Formulas - Extended the Salt concept of environments to support multiple versions of CM code based on "Release States" and server role by rebuilding the environments on the fly prior to running a job. One control file defines which role and release state get which versions of code, allowing us to trial new formulas on a targeted network segment while knowing we would not affect other segments. Presented our results at SaltConf 2019.

RA/CM Replacement - Documented existing Architecture and Defined/Implemented next-gen architecture for migration from a wholly home-grown network automation system to a hybrid Salt solution. Incorporating Salt allowed us to incorporate changes to our growing network more consistently and quickly, including but not limited to new OSes, containers and virtualization.

CI/CD - Incorporated and mandated the concept of CI/CD within the network automation code base. Provided templates and frameworks for consistent automated testing utilizing VMs and containers with the pytest and Test-Kitchen frameworks. Each new GitLab project is created with a set of configurations for virtual envs (Poetry, RVM), testing virtualization, linting, and deployment models. The developer just needs to add tests, either from scratch or utilizing shared custom pytest libraries with common fixtures. This initiative has minimized the need for regression testing because tests grow naturally as bugs are found, fixed, and validated.

VM/Container Integration - Architected, Designed, and Implemented frameworks for deploying and configuring VMs (VMware) and Containers (Docker/Kubernetes) in the production environment. Containers allow us to abstract away the OS layer and maintain some of our legacy applications while still addressing security concerns. Virtualization has reduced the time to grow and shrink our network from the weeks needed to purchase and allocate hardware to the minutes needed to define a new server in inventory and let automation stand it up and configure it.

Edge Portal - Implemented container-based deployment of a secure, publicly available interface utilizing Traefik with TLS certificates, custom Golang/Node.js applications, KrakenD, Postgres DBs, and Elasticsearch/Kibana monitoring using Filebeat and Metricbeat.

Lumen Technologies, Broomfield, US - Then Level3 Communications
2007 - 2015

Senior Network Engineer/Senior Manager - CDN/Origin Storage

Technical lead and key contributor who Architected, Designed, Implemented, and Supported a 60PB multi-site origin storage product with $20M in yearly recurring revenue supporting 3000+ customers with over 30 billion objects, achieving egress delivery in excess of 150 Gbps across four gateway locations.

Live Storage PSVue - Architected, Designed, and Implemented storage for the live ingest and cloud DVR components of the short-lived OTT Sony PSVue offering. Responsible for multi-bitrate live ingest in 5-second chunks and on-demand API requests for later DVR viewing. Utilized an A/B segment combination to uniquely identify each stream, adhering to the broadcast ownership legal ruling while still minimizing storage.

OSP 3.0 Auditing - Designed and implemented dynamic decision tree functionality for auditing and recording 10K+ distinct values on 3K+ live servers across the globe utilizing distinguishing characteristics like region and role to identify any deviations from the expected configuration.

OSP 3.0 - Architected, Designed and Deployed, with a team of 5, a 60PB geo-redundant storage solution with 100% durability and 99.999% ingest SLAs utilizing GPFS (IBM) on DDN hardware. Defined and implemented a migration strategy to perform a live cutover of 1500 customers (including industry leaders like Hulu, Sony and Microsoft) and 1B files without service impact.

OSP 2.0 - Architected, Designed and Deployed, with a team of 7, a 20PB geo-redundant storage solution built on the Lustre filesystem with bidirectional continuous cross-site synchronization. Our team was responsible for all product systems including but not limited to secure shared ingress, secure egress, API-driven customer service management, billing, inventory and data security, reliability and durability.

Lumen Technologies, Broomfield, US
2000 - 2007

Senior Software Engineer - Viper VoIP/Managed modem softswitch

Key individual contributor for both infrastructure and signaling proxy for a VoIP softswitch producing $430M in yearly recurring revenue.

Standard Lib - Implemented STL extensions within C++ for a standardized development framework across multiple applications including parsing, communications, logging and alerting. Also worked with a team to use a network sniffer to process and visually represent call tracing per call leg with decoded SS7 and SIP messaging for each network hop.

Signaling Proxy - Implemented a communications proxy parsing SS7, SIP and internal control plane messages to support Managed Modem and VoIP products. Specifically tasked with implementing features like SIP-T parsing, Local Number Portability, and E911.

TRW, Aurora, US
1998 - 2000

Software Engineer - DoD contract

Individual contributor on a DoD project responsible for providing infrastructure libraries for a team of ~50 developers.

Infrastructure Team - Implemented C++ tooling for standard messaging, logging, alerting, and signal handling libraries. Required Top-Secret clearance.

Skills

Distributed Systems

CDN - 100 System Roles, 20K servers, 300 Geographic sites

Origin Storage - 80 System Roles, 3K servers, 60PiB clustered filesystem

ManagedModem/VoIP - 10 System Roles, 3K servers

Config Deploy Upgrade Migrate: Change Management

SaltStack

CI/CD - pytest, Test-Kitchen, Testinfra, GitLab CI/CD

Puppet

Package/Container building - rpm, deb, docker

Repo management - yum, apt, docker registry, npm

Hybrid Cloud Computing

Virtualization - VMware, Vagrant with VirtualBox, Salt-Cloud

Containerization - Docker, Kubernetes

Bare Metal - CentOS, Ubuntu - including imaging over IPMI with Red Hat Anaconda, MAAS, cloud-init, curtin

Software Development Process

CI/CD for all applications - utilizing containers and VMs for automated testing

Consistent Interfaces - REST APIs adhering to OpenAPI and JsonAPI specs

Documentation Templates - for software designs and projects

Merge/Pull Requests - reviewed by a second developer

Commitments - provide clear LOEs and deliver on commitments

Programming: Languages

Python

C, C++

Scripting - Perl, Bash, Ksh, PHP

Ruby/Rails

Programming: API

RESTful APIs - using FastAPI, Java

OpenAPI with Swagger

JsonAPI spec

Nginx, Traefik, Apache Httpd, KrakenD, Apigee

Networking

Router/Switch ACLs

Firewall rules - both on-box and appliance based

DNS

Protocols - HTTP(S), TCP/IP, SSL/TLS, ICMP

Databases

Postgres

MySQL

Redis

Qualifications

Collaboration and Leadership - Managed teams of 3-9 within an organization of 6 primary teams, gathering requirements from Product, Architecture, Engineering, and Operations to design and implement consistent reliable solutions.
Linux - A long history of managing tens of thousands of servers in hundreds of sites across the globe
SaltStack - Architected and Deployed tiered infrastructure to support 20K servers
GitLab - Admin with emphasis on CI/CD
Virtualization - VMware, Vagrant with VirtualBox, Salt-Cloud
Containerization - Docker, Kubernetes
Staged Deployments - Safely deliver the right software and configurations to the right segment of the network at the right time
Troubleshooting - 15 years as tier 4 support identifying and addressing issues between systems operating in more than 300 sites around the globe
Mentorship - A strong desire to mentor both junior- and senior-level contributors, because it is easier to develop a rock star than to hire one

Education

University of Illinois Urbana-Champaign
1994 - 1998

Bachelor of Engineering - Computer Science - GPA 3.8