
Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems

Rivian
$199,300 - 249,100
United States, Illinois, Normal
Oct 23, 2025
About Rivian

Rivian is on a mission to keep the world adventurous forever. This goes for the emissions-free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract.

As a company, we constantly challenge what's possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.


Role Summary

This Site Reliability Engineer (SRE) role owns reliability outcomes for factory digital systems spanning compute, network, and application layers. The work is split across Platform Engineering, Observability, and Tiger Team incident response. This position will be located in Normal, IL and report to our Sr. Manager, Software Infrastructure/DevOps.


Responsibilities

Platform Engineering
  • Design and evolve reliable, scalable, and secure platform foundations across hybrid/on-prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails.
  • Codify production-readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform's production readiness checklist.
  • Advance Infrastructure-as-Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety.
  • Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns.
  • Lead or contribute to reliability initiatives (e.g., self-healing automation, safe rollouts/canaries, rollback strategies) appropriate to level.
Observability
  • Raise the bar on end-to-end telemetry for factory systems: high-signal metrics, logs, traces, and SLO-driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk).
  • Establish consistent dashboards and service health views for shop/line-level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters).
  • Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services.
  • Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response.
Tiger Team / Incident Response
  • Act as technical incident responder for factory-impacting events; lead fast triage and stabilize services.
  • Drive post-incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements.
  • Drill on-call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations.
  • Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level).

Qualifications

  • Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services.
  • Strength in several of: Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization; Linux administration (and working knowledge of Windows Server); service discovery, load balancing, and DNS.
  • Observability across metrics/logs/traces, SLO/error-budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk.
  • Production change safety: GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy-as-code.
  • Infrastructure automation: Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least-privilege patterns.
  • Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes.
  • Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate tradeoffs simply and drive decisions.
Nice to have
  • Industrial/OT-adjacent experience (line-side HMIs, MES/SCADA integrations, PLC interfaces, ruggedized compute) and shop-floor networking constraints.
  • Experience building or integrating exporters (e.g., vSphere) or consolidating factory telemetry into plant-wide health views.
  • DR playbooks, capacity modeling, and cost/performance optimization for hybrid environments.

Pay Disclosure

Salary Range: $199,300 - 249,100 (actual compensation will be determined based on experience, location, and other factors permitted by law).

Benefits Summary: Rivian provides robust medical/Rx, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and children up to age 26. Coverage is effective on the first day of employment, and Rivian covers most of the premiums.



Equal Opportunity

Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.

Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.

Candidate Data Privacy

Rivian may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information. Rivian may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law.

Rivian may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian affiliates; and (iii) Rivian's service providers, including providers of background checks, staffing services, and cloud services.

Rivian may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, the United Kingdom, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.

Please note that we are currently not accepting applications from third party application services.
