At AccelByte, our mission is to empower game creators by providing them with the backend platform and tools required to make scalable, reliable AAA-quality games. The company was founded in 2016 by industry veterans with decades of experience engineering online systems for some of the largest game and distribution platforms in the world, including Fortnite, Epic Store, Xbox Live, PlayStation Network, and EA Origin. We are backed by top investors including SoftBank, Sony Interactive Entertainment, Galaxy Interactive, NetEase, and Krafton. Our latest Series B funding has solidified our place as a top player in the gaming industry.
We believe that the best companies empower employees to make decisions, obsess about the best user experience, and are not afraid to make and learn from their mistakes. Our culture is based on humility, openness to feedback, drive, and collaboration, which we feel results in the best-performing teams. As a company that values diversity, inclusion, and employee growth, we give our employees opportunities to work with and learn from teams all over the world. We offer competitive salaries, a full range of health benefits, social activities, career growth opportunities, and an amazing team. Come join us!
**Position Summary**
AccelByte is seeking an SRE/Cloud Engineer I - Incident Response for our 24x7 operations team dedicated to AAA multiplayer video games. This position requires a driven individual who can maintain high service reliability and identify and mitigate service-impacting problems. Coding knowledge is necessary for routine task automation and root cause analysis.
**Essential Functions/Responsibilities**
The SRE/Cloud Engineer I - Incident Response is accountable for the following functions and responsibilities:
- Collaborate within a LiveOps/L3 support team, covering rotating shift schedules.
- Proactively ensure production uptime, stability, and resiliency while providing constructive feedback on coworkers' changes.
- Ensure the continuous availability, performance, security, and scalability of infrastructure components in line with the platform SLA.
- Assist in Root Cause Analysis and identify solutions to production events.
- Apply modern Infrastructure as Code (IaC) principles, and identify opportunities for efficiency through automation and process improvement.
- Contribute to the development of automation solutions, streamlining tasks, enhancing efficiency, and minimizing manual effort.
- Engage in direct communication with clients, understanding their needs and providing valuable support as a team member.
- Meet requirements for engineering excellence.
- Perform other duties as assigned.
**Qualifications/Experience Required**
- Bachelor's degree, or equivalent relevant work experience, certifications, or coursework.
- At least 1 year of experience specializing in operations and reliability automation, with a focus on a variety of modern infrastructure and operational technologies, including Linux and AWS Cloud Infrastructure.
- Basic experience in incident management, emphasizing prompt service restoration after incidents, alongside adept problem-solving during production events and compliance with incident management processes.
- Basic experience in performing cloud system operations in an AWS environment.
- Basic experience in cloud monitoring, logging, and APM solutions, with exposure to monitoring tools such as Prometheus, Grafana, and Datadog.
- Basic experience with Kubernetes and Docker, and hands-on experience with AWS services such as EC2, EKS, S3, ELB, RDS, DocDB, OpenSearch, ElastiCache, EBS, CloudFront, CloudWatch, and CloudTrail.
- Practical knowledge of scripting or programming in languages such as Python, Bash, and Go.
- Practical knowledge of support ticketing solutions such as Jira Helpdesk and Zendesk, along with effective communication and collaborative problem-solving skills.
- Ability to solve problems under pressure during production events while complying with incident management processes.
- Practical knowledge of Infrastructure as Code (IaC) using Terraform and/or CloudFormation.
- Practical knowledge of CI/CD tooling and pipelines, primarily GitLab, Jenkins, and Flux.
- Practical knowledge of products or services similar to those offered by AccelByte, preferably gained at a AAA game studio or software product company. You will be expected to learn how AccelByte's products are hosted within the infrastructure after joining.
- A solid understanding of security best practices, and experience implementing them, is a big plus.
- A good understanding of DevSecOps, Cloud, microservices, and containers is a big plus.
- Familiarity with web services patterns/architectures (REST, SOAP