WHAT YOU'LL DO
- Take on ambiguous reliability, scalability, and efficiency challenges and drive solutions across SRE and development teams.
- Build and run large-scale, massively distributed, fault-tolerant systems that keep the Genesis platform reliable and performant for our customers.
- Optimize existing systems, build infrastructure, and eliminate toil through automation to continuously improve uptime and rate of change.
- Cultivate a culture of reliability throughout the organization, guiding technical decisions that balance system health with fast-moving product priorities.
- Ensure the long-term health, maintainability, and reliability of services through capacity planning, performance analysis, and proactive incident prevention.

WHAT YOU'LL BRING
- Strong software engineering skills (e.g., in Python, Go, or similar) with extensive experience designing, analyzing, and troubleshooting distributed systems.
- Deep expertise with cloud computing platforms (e.g., Kubernetes, Cloud Functions) and Non-Abstract Large Systems Design (NALSD).
- Experience leading complex, large-scale technical projects and providing technical leadership across teams.
- Ability to apply coding, algorithms, and complexity analysis to solve ambiguous problems at scale with minimal disruption.
- A collaborative, intellectually curious mindset: comfortable working across a wide variety of backgrounds and bringing cross-team perspective to build robust, reusable solutions.

Member of Technical Staff, Site Reliability Engineer

Job Description