Do not re-distribute this resume without obtaining my consent.
Authoritative source for this resume: http://www.rage.net/~greg/resume.txt

Greg Retkowski / 408-544-0437 / greg@rage.net
Principal Engineer, Leader, Network & Systems Architect, Cloud, DevOps

# Executive Summary:

Demonstrated history of adding organizational value through the successful execution of innovative technological projects - delivering within schedule and budget constraints while exceeding expectations - both as an individual contributor and as a leader of teams. Successful outcomes in organizations ranging from fast-paced startups to large public companies. Adept at all aspects of network, server, & service management, with a focus on cloud and operations automation projects. Strong problem solving, communications, technical, and leadership skills. Delivered valuable results for Internet startups such as Nebula, OnLive, Avvenu, Mediaplex, & SpeedEra - and established brands such as SilverSpring, Upwork, & 8x8.

# Values:

Commit and Deliver, Candor, Hands-On Technical Leadership, Adaptability, Innovative, Customer Focused, Metrics Driven

# Technical Skills:

Customer Experience Monitoring: Anomaly Detection, Real User Monitoring, Synthetics
Logging: Logstash, OpenSearch, Syslog, Splunk, fluentd, GELF, CloudWatch Logs
Observability: Grafana, Prometheus/Thanos, Graphite, Icinga, SNMP
Observability SaaS: PagerDuty, Datadog, Dynatrace
Amazon/AWS: Athena/Glue, CloudFormation, CloudWatch, EC2, ECS, EKS, ElastiCache, IAM, OpenSearch/Elasticsearch, OpsWorks, RDS, Route53, S3, SNS, SQS, VPC
OpenStack: Nova (compute), Cinder (volume), Ceph, Neutron (network), Trove (DBaaS), LBaaS, Heat (orchestration), Keystone (identity)
Python: flask, numpy, jupyter, pip, venv, pytest
Configuration Management: Chef, Puppet, Ansible, Terraform
Security: Firewalls, SAML/SSO, MFA/2FA, Vault, SSH, Sudo, SSL/PKI, VPN, ZTA
Containers: Docker, K8S/Kubernetes, ECS, EKS, compose, microservice arch
Methodologies: Kanban, TPM, CI/CD, Atlassian/Jira, DevOps, Agile
Web: Apache, haproxy, nginx, HTTP & HTTPS, REST, SOAP
Other Languages: Java, Perl, PHP, HTML, CSS, Shell, SQL, Ruby
Dev Tools: Git, GitHub, GitLab, Bitbucket; APT/Dpkg & RPM/Yum, Jenkins
LLM/GenAI: Falcon, Mistral, Vicuna, LangChain, ChromaDB, OpenAI ChatGPT
Big Data & Analytics: Retool, Redash, Athena/Glue
HW/OS: Intel/Linux, Ubuntu, Red Hat

# Experience:

2022-Current Principal Engineer, Manager EHM, Upwork Inc; Remote

Founded and managed the Experience Health Management team, focused on Customer Experience Monitoring to address gaps in incident detection. Our team made outsized contributions to improving incident detection.

Responsible for the launch of three new initiatives: Synthetic Monitoring, Anomaly Detection on critical metrics, and Real User Monitoring. These initiatives doubled the number of incidents detected within the 10-minute SLO. A very low percentage of incidents now fall outside of this SLO due to these efforts.

In addition to hands-on engineering work, responsibilities included all TPM effort and bootstrapping/management for the team. Track record of engineering mentorship: team members showed constant improvement, and two are now among the top 10 most productive engineers in the company.

Deployed a Generative AI Large Language Model (LLM) 'on-prem' (in-VPC), including comparison/evaluation of several models (Mistral, Vicuna, Llama, Falcon) and integration work using Python/LangChain.

2018-2022 Team Lead, Automation (DevOps) Team, Upwork Inc; Remote

Diverse responsibilities including architecture guidance and delivery of many DevOps projects using tools including Chef, Python, Jenkins, and Terraform.

To highlight one of many projects during this period: tasked with remediation of the logging pipeline, which corrupted a large percentage of log entries, impacting the overall availability of the customer-facing service.
Rapidly triaged several failing portions of the system and applied fixes to address the scaling problems that were the root cause. Became the steward of the logging system and made several improvements which enabled the log volume to increase 4x to dozens of TB per day. Launched a project to vastly increase the retention and query capabilities by 26X. The logging system now provides the ability to query petabytes of data with a query response time measured in seconds.

2016-2018 Systems Architect, Silver Spring Networks; Remote

Primary architect and implementer of the project to convert the company from Perforce to Git/GitLab. Provided engineering management a cost/benefit analysis and project plan for converting to Git/GitLab and migrating existing Perforce code. Deployed multiple GitLab servers and integrated them with existing systems and workflows for user authentication, continuous integration, and ticket tracking.

Tasked with evaluating suitable options for a private cloud within the organization. Deployed a 22-node pilot. Spec'ed, configured, and managed an OpenStack deployment which included compute, network, volume, orchestration, DBaaS, and LBaaS services, and supported multi-region and IPv6 features. This included deploying monitoring, metrics, & logging aggregation. Provided demonstrations, training, and documentation for end-user use of the private cloud.

Created a configuration management workflow for the Technical Operations department. Designed a workflow based on Ansible & GitLab with development & testing using test-kitchen & Docker containers and secret management. Wrote a Python library to integrate Ansible with a Remedy asset database.

Developed a training curriculum, including hands-on classes, training videos, documentation, and sandbox environments for the OpenStack, Ansible, and Git/GitLab projects.

2014-2015 DevOps, Upwork (Elance); Mountain View, CA

Created a front/back-end microservices monitoring system.
Built on Ruby, it leveraged Icinga as an execution engine. It queried metadata from each microservice to generate monitors and thresholds, and to route email and pager alerts to the appropriate team. This monitored hundreds of microservices and automatically adjusted to additions or removals of services as the topology evolved.

Managed AWS costs using resource tags. Costs were tracked by team, department, and tier. Wrote a Ruby library and several tools to manage tags for resources. Deployed and modified Netflix ICE for on-demand and scheduled reporting.

Designed and built the disaster recovery environment. Performed the system archaeology on the undocumented/cloned server image and ported its configuration into Chef recipes. Created a tool, 'cloudmanager', which allowed operators to define the entire configuration for an environment in YAML and then launch/provision/manage/teardown all or subsets of hosts in an environment. The hosts would do an unattended configuration on boot via Chef.

2012-2013 DevOps, Nebula, Inc; Palo Alto, CA

2008-2011 OpsEng Team Lead, OnLive, Inc; Palo Alto, CA

In charge of the five-member senior systems administration team at OnLive. The team was responsible for all server & network automation projects surrounding the launch of the OnLive game service. The team built out the service from fewer than 100 nodes to more than two orders of magnitude larger.

Implemented a team workflow using Kanban, which allowed the team to simultaneously complete large projects, increase team productivity, and react to changing priorities. It made it possible for the team to deploy the large-scale production service while meeting an aggressive schedule.

Introduced the operations software-quality initiative, which ensured systems-configuration code was thoroughly vetted via phased release, continuous integration, and unit tests. Used VMs for CI and unit testing of system configuration code.
Defined and created the Release-to-Customer processes and deployed all production releases for a period of about six months.

Performed all of these team-leading activities while also performing my responsibilities as a primary individual contributor.

Key designer/implementer of the service software deployment mechanism, which managed over half a terabyte of data spread across thousands of build-server-produced RPMs. Designed and implemented a multicast software distribution system that enabled software to be deployed across the network quickly with little server impact.

Primary maintainer of the Puppet automation system, which automatically provisioned and maintained the UNIX/Linux servers.

Primary maintainer/contributor to the service configuration database written in Ruby on Rails. The software was responsible for maintaining & updating configurations of the hosts, monitoring, trending, logging, reporting, and install automation, and needed to work in a mixed (Windows & Linux) environment.

2004-2007 Senior Network Architect, Avvenu, Inc; Palo Alto, CA

Was the senior-most member of the technical operations staff. Designed and deployed lights-off resilient network infrastructure for the roll-out of a remote access service.

Created a self-healing ability in the network utilizing Nagios and cfengine. Automated all levels of management of the network, from host installation through application upgrade, via cfengine and other tools. Evaluated network hardware, colocation facilities, operating systems, and management tools based on requirements of service and software.

Wrote and deployed a Ruby on Rails application to manage shared music for subscribed users. Supported Avvenu's Facebook application by deploying development and production environments and working with the development team.

Deployed the new corporate website, which included re-coding CSS/HTML, deploying and customizing WordPress, and writing a database-backed management system for press releases.
Set up openads (OpenX) to manage rotating content internal to the Avvenu application.

2002-2004 Project Implementer, Chip Express Corp; Santa Clara, CA
2004-2005 Infrastructure Consultant, Integrated Devices; San Diego, CA
2004 Infrastructure Consultant, TruSonic Inc; San Diego, CA
2000-2001 IT Manager, Netergy Networks (8x8); Santa Clara, CA
2000 IT Manager, SpeedEra Networks (now Akamai); Santa Clara, CA
1999 Move Coordinator, Bamboo.Com (now IPIX); Palo Alto, CA
1999 Network Architect, MediaPlex; Cupertino, CA
1998-1999 Senior SysAdmin, SaveSmart Inc; Mountain View, CA
1997 Operations Manager, PocketScience Inc; Santa Clara, CA

1995-1997 Network Administrator, Safari Internet; Ft Lauderdale, FL

First employee - grew the network from inception to a 1,500-user mixed (consumer & business) Internet Service Provider. Responsible for all aspects of network administration in addition to Internet-related programming and consulting for company customers.

1991-1995 LWV (HMMWV) Mechanic, US Army; Schofield Bks, HI

# Personal Achievements:

Co-designed, tested, and implemented a homebuilt wireless router with a 1 megabit capacity at 13+ miles in 1998. Prior art to patent #7035281.

Responsible in whole or part for several Open Source software initiatives: ported LDAP nameservices to Linux, wrote an authentication hash library for TCL/TK, wrote several HOWTOs including Wireless-Router and LDAP-authentication, and worked on several web-based projects including a search-engine submittal tool and a re-work of the Gnats system web frontend.

Ported the 'Pygame Learning Environment' Reinforcement Learning environment to Python 3.

Deployed an 8-node, 2-segment 100BASE-T enterprise-class network to provide remote management and monitoring facilities for the house lava lamps. (1998)

Published an article with O'Reilly, titled "Self Healing Networks with NAGIOS & Cfengine".
Presented to the 'Large Scale Production Engineering' group on operational lessons learned building large-scale production services.

Published "Migrating servers into OpenStack" for SysAdvent 2012.

Eagle Scout, Army Veteran

Code samples available at GitHub: https://github.com/gregretkowski

FAA Certs: Private Pilot (SEL/Glider), Repairman, Remote Pilot

If you google me, I'm the technologist / sailor / pilot. The German Elvis impersonator is someone else. :)

###

Do not re-distribute this resume without obtaining my consent. Email (don't call) when establishing initial contact with me. Thank you.