WHAT YOU'LL DO
- Take on ambiguous reliability, scalability, and efficiency challenges and drive solutions across SRE and development teams.
- Build and run large-scale, massively distributed, fault-tolerant systems that keep the Genesis platform reliable and performant for our customers.
- Optimize existing systems, build infrastructure, and eliminate toil through automation to continuously improve uptime and rate of change.
- Cultivate a culture of reliability throughout the organization, guiding technical decisions that balance system health with fast-moving product priorities.
- Ensure the long-term health, maintainability, and reliability of services through capacity planning, performance analysis, and proactive incident prevention.
WHAT YOU'LL BRING
- Strong software engineering skills (e.g., in Python, Go, or similar) with extensive experience designing, analyzing, and troubleshooting distributed systems.
- Deep expertise with cloud computing platforms and technologies (e.g., Kubernetes, Cloud Functions) and with Non-Abstract Large Systems Design (NALSD).
- Experience leading complex, large-scale technical projects and providing technical leadership across teams.
- Ability to apply coding, algorithms, and complexity analysis to solve ambiguous problems at scale with minimal disruption.
- A collaborative, intellectually curious mindset: comfortable working with people from a wide variety of backgrounds and bringing a cross-team perspective to building robust, reusable solutions.