Data Reliability Engineer

BridgeView
Littleton, Colorado
Full Time

The Data Reliability Engineer ensures the reliability, stability, and operational excellence of an AWS-based data platform. The role owns production data pipelines, monitors SLAs, diagnoses incidents, and implements durable fixes, and collaborates with engineering teams to enhance design and operational practices.

Key Responsibilities:

Own the reliability and stability of production data pipelines and platform services.
Define and enforce data SLAs/SLOs for batch and streaming products.
Diagnose and resolve pipeline failures, delays, and data quality issues in production.
Investigate issues across distributed data systems, including Spark/EMR, ingestion pipelines, and warehouse performance.
Lead or support incident response, including triage, mitigation, and long-term resolution.
Perform root cause analysis and implement durable fixes to prevent recurrence.
Design and enhance monitoring, alerting, and observability for data systems.
Develop automation and tooling to reduce operational toil and improve resilience.
Contribute to disaster recovery planning, including backup validation and recovery workflows.
Partner with engineering teams to improve pipeline design, reliability, and readiness.
Create and maintain runbooks, SOPs, and operational documentation.
Participate in occasional off-hours support for production data systems when required.

Qualifications:

Bachelor's degree in Computer Science, Information Systems, Data Science, or a related field.
5+ years in data engineering or analytics platform roles, with 3+ years operating production cloud data warehouses (Redshift, Snowflake, etc.).
3+ years building AWS data pipelines and managing them through production.
3+ years working with production data platforms in AWS, focusing on anomaly detection, reconciliation, and end-to-end validation.
3+ years of experience with Python and SQL in real data systems.
Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming systems.
Proven ability to debug and resolve production issues in data pipelines and platforms.
Experience with AWS data services (EMR, Redshift, DynamoDB, S3, or similar).
Proven ability to handle production incidents and perform root cause analysis.
Strong problem-solving mindset and ability to work through ambiguous production issues.

Preferred Skills:

Experience handling real-world data issues such as pipeline delays or failures.
Experience with data backfills and reprocessing.
Experience influencing or guiding data pipeline reliability and operational practices.
Exposure to streaming/event-driven systems (Kafka, Kinesis, CDC patterns).

Qualifications:

You have 5+ years of data engineering experience in production AWS environments. You troubleshoot Spark/EMR pipelines and ensure data reliability. You design monitoring, alerting, and automation for data platforms.

Job ID: 522292503

Originally Posted on: 5/23/2026
