Location: Riverwoods, Illinois
Employment Type: Contract
Job ID: 559
Date Added: 03/22/2023
Overview:
Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply engineering principles, operational discipline, and mature automation to our operating environments.
An SRE specializes in systems (operating systems, networks, observability) while continuously implementing best practices to improve availability, reliability, and scalability.
As an SRE, you will:
- Develop and run SRE tooling and observability using automation like CI/CD and Kubernetes.
- Build monitoring that alerts on symptoms rather than on outages.
- Document every action, so your findings turn into repeatable actions and then into automation.
- Debug production issues across services and levels of the stack.
- Plan the growth and reliability of services.
- Use your on-call shift to prevent incidents from ever happening.
- Be on an on-call rotation to respond to “Code Red” incidents to help restore customer-impacting service.
- Have the urge to deliver quickly and effectively and iterate fast.
- Think about systems: edge cases, failure modes, behaviors, and specific implementations.
- Be the kind of engineer who, on seeing something broken, cannot help but fix it.
- Have the urge to document everything, so you do not need to learn the same thing twice.
- Strong knowledge of SDLC (System Development Life Cycle).
- Strong knowledge of Git, Docker, Kubernetes, Jenkins, AWS (Amazon Web Services), or similar technologies.
- Know how to use configuration management systems such as Chef and Ansible.
- Have strong programming skills in one or more of the following languages: C, Ruby, Python, or Java.
- Good understanding of hybrid infrastructure.
- Automation like CI/CD, self-healing of services, end-to-end or performance testing.
- Improve monitoring (Datadog, AppD, etc.) and build new smart metrics.
- Develop a relationship with a product group and help define their SLO/SLI.
- Work directly with AppDev to improve the product through non-functional requirements and production readiness.
- Improve operability, latency, capacity planning, change management, and MTTR (Mean Time to Repair).
Technical
- Configuration management: use Chef and Ansible to manage our infrastructure effectively.
- Infrastructure as code: use Terraform and GitLab CI/CD for automation, containerize the environments (Kubernetes), and leverage cloud technologies to meet our goals.
- Systems: manage, configure, and troubleshoot operating systems, storage (block and object), and networking, including VPC (Virtual Private Cloud), proxies, and CDN (Content Delivery Network); administer high-availability PostgreSQL and Redis clusters.
- Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations.
- Engineering practices: availability, reliability, and scalability, as well as disaster recovery.
- Use Git and contribute code through it.
- Experience coding in one or more languages: C, Ruby, Python, Shell, or Java.
- Planning: familiar with agile methodologies; use epics and issues to drive projects.
- Organization: workload organization, OKR (Objective and Key Result) leadership.
- Management: a manager of one, able to self-organize and report asynchronously.
- Lead and contribute to scope and designs for issues, epics, and OKRs.
- Contribute to the Handbook, creating and updating runbooks, general documentation, and writing blogs.
- Completing Root Cause Analysis (RCA) investigations and performing readiness reviews.
- Improving team practices through code reviews, handoffs of work, and incidents.
- Knowledge sharing and mentoring.
- Self-awareness, handling conflict in the team, and providing and receiving feedback.
- Maintaining good relationships with other engineering teams that help improve the product.
- Accountability: willing to proactively step in and do the right thing while providing candid and constructive feedback.
Technical
- General knowledge of 4 technical expertise areas, with deep knowledge in 1 area.
- AWS Cloud Practitioner; resource provisioning and configuration through CLI/API.
- Chef (basic syntax, recipes, cookbooks) or Ansible (basic syntax, tasks, playbooks).
- Working knowledge of CI/CD, Jenkins, Nexus, pipelines, and jobs.
- Basic understanding of Kubernetes: CLI (Command Line Interface), service re-provisioning.
- Provision and set up metrics in AppD, Grafana, or Datadog.
- Provision and set up logs and queries for frequent questions.
- Networking VPC, proxies, and CDN (Content Delivery Network).
- Working knowledge of Git.
- Provide emergency response by being on call, reacting to symptoms surfaced by monitoring, and escalating when needed.
- Propose ideas and solutions to debug, optimize code, and automate tasks.
- Plan, design, and execute solutions within Card/Bank to reach specific goals agreed upon within the team.
- Plan and execute configuration change operations at the application and infrastructure levels.
- Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Experience designing, analyzing, and debugging distributed systems.
- Self-organize through issues and epics.
- Improve documentation, whether application documentation or runbooks, explaining the why rather than stopping at the what.
- Root cause analysis and corrective actions.
- Share learnings publicly through issues, runbooks, documentation, and blog posts.
- Contribute to the hiring process by reviewing questionnaires or being part of the interview team to qualify SRE candidates.
- Act as a reliability champion.
- 5+ years of experience and a BE/B.Sc. degree.