Site Reliability Engineer II

  • Alibaba Cloud US LLC
  • Bellevue, Washington
  • Full Time

Posted: 5/28/2026 - Expires: 7/2/2026

Job ID: 293447539

Job Description

Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (k8s) cluster operations for database services.

Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).

Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.

Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.

1. Research and Development of Database Platform Infrastructure

Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.

Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.

Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.

2. Large-Scale Distributed Systems Management & Tooling

Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.

Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.

Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.

3. Network Architecture and Cloud-Native Optimization

Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.

Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.

4. Incident Management and Security Performance

Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.

Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).

Telecommuting may be permitted. When not telecommuting, must report to worksite.

Requirements:

  • Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or a related field.
  • 2 years of experience as a Site Reliability Engineer II or in a related occupation or position.

Worksite Address:

205 108th Ave NE, Suite 400, Bellevue, WA, 98004

Job Summary

Company Details

  • Company: Alibaba Cloud US LLC
  • Industry: All Other Professional, Scientific, and Technical Services
  • Contact method: Email (Apply by Email)

Job Information

  • Location: Bellevue, WA
  • Job Type: Full Time Employee
  • Education Level: Bachelor's degree
  • Job Position: 1 Position(s) Open
  • Salary/Wage: $144,000.00 - $172,800.00 /year
  • Duration: Over 150 Days

Additional Information

  • Reference Code: 9849968
  • Federal Contractor: No
  • Affirmative Action Plan: No

Job ID: 522908561
Originally Posted on: 5/29/2026
