Site Reliability Engineer
San Francisco, CA
Coinbase’s mission is to create an open financial system for the world. As the largest bridge into this new system, we need to provide a highly reliable foundation to drive this shift. Our services have grown 100x in the past 2 years, and we’re now in 30 countries and scaling faster than ever. We’ve invested heavily in an immutable, fully codified cloud infrastructure that allows us to move quickly and confidently while we continue to scale.
Site Reliability Engineers work across all product teams to help design, build and operate reliable services. Where our infrastructure team emphasizes self-service, you embed within teams and work to integrate the right systems and processes in a reliable way. You identify the technical and cultural shortcomings that prevent improved reliability and, alongside engineering leadership, lead by example to fix them. You mentor junior engineers effectively, and your experience raises the bar across the company.
Reliability at Coinbase
- Work with product teams to identify and measure SLIs and corresponding SLOs
- Identify anything preventing engineers from richly monitoring their applications and prioritize fixes
- Increase the transparency of production service design and operation to the engineering teams
- Emphasize automation and engineering over toil and operations
- Work with leadership and customer support to ensure we’re optimizing the right metrics
- Embed early within engineering teams and make reliability a first-class consideration of services
- Define what it means to be an SRE at Coinbase
- Build a team over the next 12 months to sustainably integrate SRE with our team
- Work with product teams to judiciously roll out performance budgets on critical systems
- Work through incidents and improve the overall incident response and post-mortem workflows
- Educate and mentor the engineering team on improving our systems
- Work closely with the infrastructure team and help improve systems
- Split time between hands-on engineering and building up our teams and their processes
Requirements
- Proven experience building, scaling and operating distributed systems in AWS
- Experience working with thousands of running virtual machines
- A working understanding of Docker & Container Linux/Gentoo in AWS
- Ability to debug complex systems and evolve a running environment without downtime
- Demonstrated ability to prioritize and work in a complex environment
- Understanding of cloud computing and cloud-first environments
- A bias for action and ability to separate good from perfect
What to send
- A resume or LinkedIn profile
- A brief answer to one of the following questions:
  - Tell us about a production outage for which you led the response. Describe how your team restored service, and reflect on the overall incident.
  - Walk us through a system that you helped scale and the decisions you made that most affected the reliability of that system.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.