Staff Site Reliability Engineer
Company: Gradle Technologies
Location: New York City
Posted on: February 16, 2026
Job Description:
Who We Are

Develocity is a
first-of-its-kind toolchain observability and acceleration platform
that helps software teams adopt and improve DORA capabilities
(including continuous delivery) in order to achieve software
delivery excellence. It combines build and test acceleration with
deep observability for builds and tests with Gradle Build Tool,
Apache Maven™, sbt, npm, and Python, and applies to both CI and
local builds and tests. Ultimately, Develocity provides an
operational layer across an organization's toolchains to speed up,
troubleshoot, and optimize local developer and remote CI feedback
loops. Our software is used by some of the world's leading software
organizations, such as Netflix, Airbnb, SAP, several top ten banks,
and many other major customers across all verticals. We regularly
collaborate with these and other users to make our products
continuously better. We have partnered with the Apache Software
Foundation, the Commonhaus Foundation, the Scala Center, the
Micronaut Foundation, and other OSS projects like Spring, Quarkus,
Kotlin, JUnit, AndroidX, and many more to bring the values of
Develocity to the OSS community as well.

Our Values

- Seek to Understand: Everything starts with listening and
  understanding, and we strive to understand different viewpoints,
  problems, and motivations. Before we take action, we ensure we
  truly grasp the challenges, perspectives, and goals.
- Know the Why: We approach our work with a clear sense of purpose,
  ensuring every step is deliberate and focused. We take meaningful
  action with urgency, but never at the expense of thoughtful
  consideration.
- Innovate & Iterate: We embrace challenges and are not afraid to try
  new things, even if they might fail. With deep understanding and a
  clear purpose, we can develop creative and bold solutions to tackle
  challenges.
- Own the Outcome: We are empowered to take initiative and we
  maintain transparency in our work and its outcomes. When we
  execute, we take responsibility for our decisions, measure the
  success of our innovations, and learn from the results.

Who You Are

We're building a new SRE team and looking for founding members to
help shape how we operate. As a Lead SRE, you'll be a technical and
operational leader for reliability across Develocity. You'll help
define our SRE vision, set standards for how we operate production
services, and mentor other SREs as the team grows. This is a
hands-on role with broad influence across engineering, cloud
platform, and customer-facing teams. The SRE team will be
responsible for the reliability, performance, and availability of
Develocity instances serving paying customers, open-source
projects, and public-facing services, plus supporting
infrastructure like artifact registries. You'll work on our
internally built Cloud Application Platform, Kubernetes on AWS, and
develop deep expertise in it. When incidents happen, you'll
troubleshoot issues across the stack, from application to
infrastructure. You'll collaborate with the Cloud Platform team to
improve the tooling you depend on, and with engineering teams to
build reliability into how we ship software. If you like automating
things and hate doing the same task twice, you'll fit in well.
You'll be part of a distributed, remote-first team that values
asynchronous communication and written documentation. Strong
self-direction and clear communication across time zones are
essential.

Responsibilities

- Operate and maintain all Develocity instances and supporting
  services in production.
- Define and evolve SRE standards, practices, and operating models,
  including on-call, incident response, postmortems, and SLOs.
- Participate in a follow-the-sun on-call rotation, acting as a
  technical escalation point for complex or high-severity incidents.
- Lead incident response and blameless retrospectives, ensuring
  learnings result in measurable reliability improvements.
- Set reliability priorities using risk, customer impact, business
  goals, SLOs, and error budgets.
- Identify systemic reliability risks and continuously evolve
  Develocity's SaaS operations as the platform and customer base
  grow.
- Lead and influence architectural and design reviews to ensure
  reliability, scalability, and operability.
- Drive automation across deployment, upgrades, monitoring,
  self-healing, recovery, and operational workflows.
- Build and maintain comprehensive observability for all managed
  services, including logging, metrics, tracing, and alerting.
- Own disaster recovery, backups, and business continuity planning
  and execution.
- Partner with engineering leadership to balance feature delivery
  with reliability and operational excellence.
- Mentor and coach SREs, supporting technical growth and strong
  operational practices.
- Help onboard new SREs and contribute to hiring by defining and
  assessing SRE excellence at Develocity.
- Communicate clearly with customers during incidents and
  maintenance windows.
- Optimize performance, resource utilization, and operational costs.

Minimum qualifications

- 7 years in SRE, DevOps, or an equivalent role operating production
  services at scale.
- Experience leading reliability initiatives across multiple teams
  or services.
- Demonstrated ability to influence technical direction without
  direct authority.
- Experience designing and operating systems with SLOs and error
  budgets, and exercising strong judgment in balancing reliability,
  velocity, and cost.
- Strong Kubernetes experience in production environments.
- Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2).
- Proficiency with observability tools (Prometheus, Grafana) and
  Infrastructure as Code (Terraform).
- Track record of incident management and response in a 24/7 on-call
  environment.
- Scripting proficiency (Python, Bash) for automation.
- Strong written and verbal English communication skills.

Preferred qualifications

- Experience as a founding or early SRE establishing practices in a
  growing SaaS organization.
- Familiarity with Develocity.
- JVM language experience (Java, Kotlin).
- Experience with customer-facing and executive-level incident
  communications.

What We Offer

- A ground-floor role in a new SRE team - you'll shape how we do
  things, not inherit someone else's decisions.
- Real ownership of production systems used by engineers at
  companies you've heard of.
- Direct interaction with customers when things go wrong (and when
  they go right).
- A culture that values automation over heroics.
- In-person meetings, such as our annual company offsite and team
  meetings.
- Work from home in a remote-first environment.
- Competitive salaries and equity grants.

Compensation

The US salary range for
this position is $180k-$220k, which reflects the target ranges for all
US locations. Within this range, individual pay is determined by
geographic location and additional factors including but not
limited to experience, relevant skills, qualifications, seniority,
performance, and travel requirements. Our recruiting team can share
more information about the specific salary range for your location
during the hiring process.

Location

Remote from anywhere in the EST time zone. While our team works
remotely and is spread across the globe, we deeply value daily
interactions and collaboration.