Job Description:
• Lead the definition, advocacy, and adoption of SRE principles across engineering teams
• Partner with leadership to shape long-term reliability, resiliency, and observability strategies
• Champion distributed tracing, real user monitoring (RUM), and key performance metrics such as Largest Contentful Paint (LCP) to improve system visibility and user experience
• Build and scale self-healing systems to minimize manual intervention and reduce downtime
• Drive enterprise-wide improvements to incident response processes, including those related to Machine Learning systems
• Collaborate closely with Development Productivity and Quality teams to improve engineering velocity without sacrificing reliability
• Influence technical and operational roadmaps through data-driven insights and hands-on technical contributions
• Own and deliver cross-functional initiatives from concept through execution, applying program management skills to align stakeholders and achieve results
Requirements:
• 10+ years of combined experience across Software Engineering and Site Reliability Engineering, with a balanced background in both disciplines
• Proven track record as an SRE thought leader and evangelist, driving adoption of reliability best practices across organizations
• Strong communication and mentoring skills to influence engineers across disciplines
• Proficiency in Python, Go, and JavaScript/TypeScript
• Proficiency with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
• Experience building internal tooling from scratch in agile development environments
• Expertise with observability, distributed tracing, RUM, LCP, and performance monitoring tools (e.g., Datadog, Prometheus)
• Experience with on-call and incident management, including large-scale or ML-related incidents
• Strong background in automation and building self-healing systems
• Hands-on experience applying LLM/GenAI to improve SRE efficiency and processes
• Program management skills, including the ability to propose innovative solutions, influence leadership, improve processes, and drive cross-functional projects to completion
Benefits:
• Competitive compensation, including base pay, bonus opportunities, and annual equity grants that vest quarterly
• Generous 401(k) plan with Upstart matching $2 for every $1 contributed, up to $15,000 per year
• Employee Stock Purchase Plan (ESPP) with discounted stock purchase options for eligible employees
• Affordable medical, dental, and vision coverage with multiple plan options; Upstart covers 90% to 100% of the cost, depending on the plans you choose
• Health Savings Account contributions from Upstart for eligible plans
• Income protection benefits, including company-paid Basic Life, AD&D, and Short- and Long-Term Disability coverage, with options to purchase supplemental coverage
• Paid time off, sick and safe time, and company holidays
• Paid family and parental leave to support caregiving and major life moments
• Family-centered benefits through Carrot and Cleo, supporting fertility, parenthood, and caregiving
• Employee Assistance Program (EAP) offering mental health support and life-centered resources
• Financial wellness resources, including access to financial planning tools and a financial concierge service
• Annual wellness allowance to support your physical and emotional well-being and personal development, based on what matters most to you
• Annual productivity allowance to invest in the tools and resources you need to do your best work, no matter where you work from
• Connection and community through team events and onsites, all-company updates, and employee resource groups (ERGs)
• Onsite perks, including catered lunches and fully stocked micro-kitchens when working from one of our four offices, located in the Bay Area, Austin, Columbus, and New York City (opening Summer 2026!)