The Data Reliability Engineer ensures the reliability, stability, and operational excellence of an AWS-based data platform. The role owns production data pipelines, monitors SLAs, diagnoses incidents, and implements durable fixes, and collaborates with engineering teams to improve design and operational practices.

Key Responsibilities:
Own the reliability and stability of production data pipelines and platform services.
Define and enforce data SLAs/SLOs for batch and streaming products.
Diagnose and resolve pipeline failures, delays, and data quality issues in production.
Investigate issues across distributed data systems, including Spark/EMR, ingestion pipelines, and warehouse performance.
Lead or support incident response, including triage, mitigation, and long-term resolution.
Perform root cause analysis and implement durable fixes to prevent recurrence.
Design and enhance monitoring, alerting, and observability for data systems.
Develop automation and tooling to reduce operational toil and improve resilience.
Contribute to disaster recovery planning, including backup validation and recovery workflows.
Partner with engineering teams to improve pipeline design, reliability, and readiness.
Create and maintain runbooks, SOPs, and operational documentation.
Participate in occasional off-hours support for production data systems when required.

Qualifications:
Bachelor's degree in Computer Science, Information Systems, Data Science, or a related field.
5+ years in data engineering or analytics platform roles, with 3+ years operating production cloud data warehouses (Redshift, Snowflake, etc.).
3+ years building AWS data pipelines and managing them through production.
3+ years working with production data platforms in AWS, focusing on anomaly detection, reconciliation, and end-to-end validation.
3+ years of experience with Python and SQL in real data systems.
Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming systems.
Proven ability to debug and resolve production issues in data pipelines and platforms.
Experience with AWS data services (EMR, Redshift, DynamoDB, S3, or similar).
Proven ability to handle production incidents and perform root cause analysis.
Strong problem-solving mindset and ability to work through ambiguous production issues.

Preferred Skills:
Experience handling real-world data issues such as pipeline delays or failures.
Experience with data backfills and reprocessing.
Experience influencing or guiding data pipeline reliability and operational practices.
Exposure to streaming/event-driven systems (Kafka, Kinesis, CDC patterns).
Qualifications:
You have 5+ years of data engineering experience with AWS in production. You troubleshoot Spark/EMR pipelines and ensure data reliability. You design monitoring, alerting, and automation for data platforms.
Job ID: 522292503
Originally Posted on: 5/23/2026