Site Reliability Engineer II
Company: The Walt Disney Company
Location: New York City
Posted on: April 4, 2026
Job Description:
Job Posting Title: Site Reliability Engineer II
Req ID: 10143234

Department/Group Overview
Our engineering fleet is a horizontal set of teams providing
engineering services across the organization. Our specific team
provides reliability engineering and operational support to backend
service development teams.

Disney Entertainment and ESPN Product & Technology
Technology is at
the heart of Disney’s past, present, and future. Disney
Entertainment and ESPN Product & Technology is a global
organization of engineers, product developers, designers,
technologists, data scientists, and more – all working to build and
advance the technological backbone for Disney’s media business
globally. The team marries technology with creativity to build
world-class products, enhance storytelling, and drive velocity,
innovation, and scalability for our businesses. We are Storytellers
and Innovators. Creators and Builders. Entertainers and Engineers.
We work with every part of The Walt Disney Company’s media
portfolio to advance the technological foundation and consumer
media touch points serving millions of people around the world.

Here are a few reasons why we think you’d love working here:

Building the future of Disney’s media: Our Technologists are
designing and building the products and platforms that will power
our media, advertising, and distribution businesses for years to
come.

Reach, Scale & Impact: More than ever, Disney’s technology and
products serve as a signature doorway for fans' connections with
the company’s brands and stories. Disney. Hulu. ESPN. ABC. ABC
News…and many more. These products and brands – and the unmatched
stories, storytellers, and events they carry – matter to millions
of people globally.

Innovation: We develop and implement groundbreaking products and
techniques that shape industry norms and solve complex and
distinctive technical problems.

Job Description
The Streaming SRE squad drives improvements in
performance, resiliency, and operational excellence. We take a
consultative approach to reliability engineering—partnering with a
variety of cross-functional teams to provide guidance, automation,
education, and best practices that elevate the reliability and
scalability of services that support our products and brands. We
are seeking a Site Reliability Engineer who will contribute to the
stability and scalability of critical systems by building
automation, improving operational workflows, enhancing
observability, and participating in incident response. The ideal
candidate has a strong understanding of distributed system
fundamentals, cloud-native resources and operations, and
performance optimization. This role requires a collaborative
mindset and the ability to work closely with engineering teams to
implement SRE principles across the organization. Fostering
innovation is a critical component of success here at Disney
Entertainment and ESPN Product & Technology. Therefore, the ideal
candidate will also need to be highly adaptable to change and be
able to pivot when required.

Responsibilities:
- Contribute to the design, implementation, and improvement of
  systems to enhance reliability, scalability, and performance.
- Build and maintain automation for deployment, monitoring,
  alerting, and operational workflows.
- Collaborate with software engineering teams to implement SRE best
  practices, including SLIs, SLOs, error budgets, and automated
  remediation.
- Support CI/CD pipelines and participate in optimizing the
  software delivery lifecycle.
- Develop tools, dashboards, and instrumentation to improve
  observability across metrics, logs, and distributed tracing.
- Participate in incident response, root cause analysis (RCA), and
  corrective actions to prevent recurrence.
- Assist in capacity planning, performance tuning, and scaling
  strategies for distributed systems.
- Maintain and improve Infrastructure-as-Code (IaC) definitions and
  cloud environment configurations.
- Contribute to documentation, runbooks, architectural diagrams,
  and operational standards.
- Collaborate with cross-functional teams to identify reliability
  risks and recommend improvements.
- Participate in incident-based escalations and rotations to
  support high-availability production systems.
- Continuously evaluate system architecture, tools, and practices
  to drive operational excellence and efficiency.

Basic Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related
  field (or equivalent experience).
- 3 years of experience in Site Reliability Engineering, DevOps,
  Platform Engineering, or a related discipline.
- Hands-on experience with cloud platforms: AWS (preferred), GCP,
  Azure.
- Proficiency in Python, Go, JavaScript, Bash, or equivalent
  scripting languages.
- Working knowledge of Linux or Unix-based systems.
- Experience with CI/CD systems (e.g., GitHub Actions, GitLab CI,
  Jenkins).
- Familiarity with Infrastructure-as-Code (Terraform,
  CloudFormation, etc.).
- Experience with containerization technologies such as Docker and
  Kubernetes.
- Understanding of networking fundamentals, distributed systems,
  and system design basics.
- Strong analytical and troubleshooting skills, including the
  ability to diagnose complex system issues.
- An ability to work both independently and collaboratively.
- Strong communication skills and the ability to collaborate
  effectively with cross-functional teams.

Preferred Qualifications
- Hands-on experience with observability stacks (Prometheus,
  Grafana, ELK/EFK, Datadog, Splunk, New Relic).
- Exposure to GitOps tooling (Argo CD, Flux).
- Experience contributing to SLO/SLI frameworks and implementing
  error budgets.
- Knowledge of service mesh architectures (Istio, Linkerd).
- Familiarity with performance testing and load testing tools.
- Experience with message queues, event-driven systems, or
  distributed data platforms.
- Cloud or DevOps-related certifications (AWS Associate/Specialty,
  GCP Professional, Kubernetes CKA/CKS).
- Experience working in large-scale enterprise environments or with
  distributed global teams.
- Experience using modern AI-assisted development tools (e.g.,
  Copilot, Cursor, or similar) to improve code quality, accelerate
  development, and enhance documentation.
- Understanding of foundational AI/ML concepts, familiarity with
  cloud-native AI services such as model hosting, and/or ability to
  use AI tools to automate cloud operations tasks.

The hiring range for this position in New York
City is $123,000 - $165,000. The base pay actually offered will
take into account internal equity and also may vary depending on
the candidate’s geographic region, job-related knowledge, skills,
and experience among other factors. A bonus and/or long-term
incentive units may be provided as part of the compensation
package, in addition to the full range of medical, financial,
and/or other benefits, dependent on the level and position offered.

Job Posting Segment: PE - Sports, News & Entertainment, Enablement
Job Posting Primary Business: PE - Sports, News & Entertainment,
Enablement - Infrastructure Engineering
Primary Job Posting Category: Site/System Reliability Engineer
Employment Type: Full time
Primary City, State, Region, Postal Code: New York, NY, USA
Alternate City, State, Region, Postal Code:
Date Posted: 2026-02-26
Keywords: The Walt Disney Company, Hempstead, Site Reliability Engineer II, IT / Software / Systems, New York City, New York