I joined Klarna to build products at scale and found exactly what I was looking for. My team provided the platform for running containerized workloads for the entire company, serving nearly 2000 engineers, running 1000 microservices, and maintaining 30000 containers in production. Every engineer interacted with their deployments via our API, leaving the rest to us.
The API, Control Plane, and supporting services were built in TypeScript and ran on AWS Lambda and Step Functions. Underlying infrastructure was managed by CloudFormation, mostly using CDK. To ensure a frictionless experience for engineers, we used a plethora of AWS services such as ECS, ECR, CodeDeploy, ASG, EC2, ALB, Route 53, CloudWatch, ACM, SSM, IAM, and many more.
Because the platform handled billions of dollars, the SLA bar was set high. Everything was meticulously monitored, audited, and tightly secured. Designed to be self-healing, the platform resolved issues before engineers even noticed them.
By default, the platform provided scalability, observability, security, logging, and compliance worthy of the biggest fintech in Europe. Datadog alerts, dashboards, and metrics were automatically provided for each deployment. We invested considerable effort into training and onboarding engineers to ensure metrics, dashboards, and alerts were both meaningful and actionable. I even gave a talk on our platform at Klarna's conference, which you can watch here.
We managed EC2 instances with ASGs and rotated the entire production fleet of 1000 EC2 instances each week to apply the latest patches and security updates. Changing Linux distributions without downtime was a fun challenge. With thousands of containers, upgrading the kernel by a major version required significant SRE magic. The scale of our underlying infrastructure occasionally exposed cracks in AWS itself, necessitating close collaboration with the AWS ECS team.
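To give a feel for what a weekly fleet rotation involves, here is a minimal sketch of the batching step only: splitting a fleet into groups so that no more than a fixed share of instances is replaced at once. The function name `planRotationBatches`, the 5% limit, and the instance IDs are all hypothetical and chosen for illustration; this is not the code we ran, and a real rotation also has to drain containers, wait for health checks, and handle failures.

```typescript
// Hypothetical sketch: the names and the batching policy are assumptions for illustration.

/**
 * Split a fleet into rotation batches so that at most `maxUnavailablePercent`
 * of the instances is being replaced at any one time.
 */
function planRotationBatches(instanceIds: string[], maxUnavailablePercent: number): string[][] {
  if (maxUnavailablePercent <= 0 || maxUnavailablePercent > 100) {
    throw new RangeError("maxUnavailablePercent must be in (0, 100]");
  }
  // Always replace at least one instance per batch so the rotation makes progress.
  const batchSize = Math.max(1, Math.floor((instanceIds.length * maxUnavailablePercent) / 100));
  const batches: string[][] = [];
  for (let i = 0; i < instanceIds.length; i += batchSize) {
    batches.push(instanceIds.slice(i, i + batchSize));
  }
  return batches;
}

// Example: a 1000-instance fleet, replacing at most 5% at a time.
const fleet: string[] = [];
for (let i = 0; i < 1000; i++) {
  fleet.push(`i-${i}`); // made-up instance IDs
}
const batches = planRotationBatches(fleet, 5);
console.log(batches.length, batches[0].length); // prints "20 50"
```

With these assumed numbers, the fleet is replaced in 20 batches of 50 instances each. In practice the batch size is a trade-off between how quickly patches roll out and how much spare capacity the cluster needs while instances are being replaced.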
As a financial institution, everything had to be end-to-end encrypted and secured. We managed the PKI and public certificates, and provided engineers with tools to manage their secrets.
Of course, not everything ran smoothly all the time. Because our team sat at the crossroads of the entire company, our on-call rotations were intense. We served as the go-to SRE team when incidents were challenging to mitigate.