Position Summary
The Senior Reliability Engineer (Infrastructure) is responsible for ensuring the reliability, availability, and recoverability of JetBlue's critical infrastructure platforms. This role applies engineering discipline to operational challenges, leads response to complex incidents, and drives improvements that reduce operational risk over time. The Senior Reliability Engineer works closely with cloud, platform, network, and application teams to ensure infrastructure systems are observable, resilient, and safe to operate in production, while exhibiting the JetBlue values of Safety, Caring, Integrity, Passion, and Fun.
Essential Responsibilities
- Own reliability outcomes for critical infrastructure platforms supporting JetBlue production systems.
- Define and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for infrastructure capabilities.
- Lead response, diagnosis, and resolution of complex infrastructure incidents as Incident Commander or senior technical authority.
- Participate in a 24x7 on-call rotation and help improve incident response practices.
- Diagnose and mitigate failures across Linux systems, Kubernetes platforms, Azure cloud infrastructure, and networking layers.
- Review and approve high-risk infrastructure changes with consideration for blast radius, rollback readiness, and dependency impact.
- Identify and mitigate capacity, scaling, and saturation risks across infrastructure systems.
- Improve monitoring, alerting, and dashboards to reflect real system health and customer impact.
- Reduce operational toil through automation, tooling, and reliability-focused engineering improvements.
- Develop and maintain operational documentation, runbooks, and recovery procedures.
- Lead blameless post-incident reviews and drive corrective actions to prevent repeat incidents.
- Mentor engineers on operational excellence, reliability practices, and incident response.
- Collaborate with cloud, platform, network, and security teams to ensure reliable and secure infrastructure operations.
- Ensure infrastructure platforms meet regulatory, compliance, and security requirements as applicable.
- Other duties as assigned.
Minimum Experience and Qualifications
- Bachelor's Degree in Computer Science or a related discipline; OR demonstrated capability to perform job responsibilities with a combination of a High School Diploma/GED and at least four (4) years of relevant experience.
- Five (5) or more years of experience in Site Reliability Engineering, infrastructure operations, DevOps, or production engineering roles.
- Demonstrated experience operating and supporting large-scale production infrastructure.
- Strong Linux troubleshooting skills across CPU, memory, disk, and process behavior.
- Strong understanding of networking fundamentals including TCP/IP, DNS, load balancing, and failure modes.
- Hands-on experience operating Kubernetes clusters, including troubleshooting, scaling, and failure recovery.
- Experience operating infrastructure in a public cloud environment (Azure preferred).
- Experience with observability tools including metrics, logs, tracing, and alerting.
- Proficiency in at least one programming or scripting language (such as Python, Go, Java, or similar) used to automate operations and improve reliability.
- Experience using infrastructure-as-code and automation to reduce operational toil.
- Ability to make sound decisions under pressure and communicate clearly during incidents.
- Able to work flexible hours and participate in on-call rotations.
- Available for occasional overnight travel (10%).
- Must pass a pre-employment drug test.
- Must be legally eligible to work in the country in which the position is located.
- Authorization to work in the US is required. This position is not eligible for visa sponsorship.
Preferred Experience and Qualifications
- Seven (7) or more years of experience in Site Reliability Engineering, infrastructure operations, DevOps, or production engineering roles.
- Experience defining and operationalizing SLOs and using error budgets to guide reliability decisions.
- Experience with capacity planning and demand forecasting.
- Experience operating highly available, distributed systems.
- Experience mentoring engineers or acting as a technical lead.
- Experience with additional cloud platforms or hybrid environments.
Crewmember Expectations:
- Regular attendance and punctuality.
- Potential need to work flexible hours and be available to respond on short notice.
- Able to maintain a professional appearance.
- Must be an appropriate organizational fit for the JetBlue culture, that is, exhibit the JetBlue values of Safety, Caring, Integrity, Passion and Fun. Promote JetBlue's #1 value of safety as a Safety Ambassador.
- Identify safety and/or security concerns, issues, incidents, or hazards and report them whenever possible, by any means necessary, including JetBlue's confidential reporting systems (Aviation Safety Action Program (ASAP) or Safety Action Report (SAR)).
- The use of ChatGPT or any other automated tool during the interview process will disqualify a candidate from being considered for the position.
Equipment:
- Computer and other office equipment
Work Environment:
- Traditional office environment
Physical Effort:
- Generally not required, or up to 10 pounds occasionally, 0 pounds frequently. (Sedentary)
Compensation:
- The base pay range for this position is between $105,600.00 and $150,400.00 per year. Base pay is one component of JetBlue's total compensation package, which may also include access to healthcare benefits, a 401(k) plan and company match, crewmember stock purchase plan, short-term and long-term disability coverage, basic life insurance, free space available travel on JetBlue, and more.