At Roblox, we’ve adopted the microservices paradigm and built a container-based development platform. Our International Growth team pioneered this effort by implementing services that embraced the new approach and introduced practices aimed at high concurrency and scalability. We then turned our attention to improving our overall performance metrics, and after some iteration we have delivered consistent results.
Once Roblox decided to move forward with the development of a microservice- and container-oriented platform, a lot of open questions were raised regarding how to:
- Reach maturity with the platform, and navigate the learning curve to get there.
- React to “the unknown” (such as frequent dependency updates, tool upgrades, and other changes).
- Define simplification and rewrite strategies, since we were coming from a monolithic code base that addressed too many concerns at once.
- Deal with production uncertainty and readiness of the infrastructure.
Our International Growth team was the ideal spearhead to find these answers. Our projects were already exploring uncharted territory to make Roblox a reality in international locales, such as supporting legal compliance and screen-time regulations. At the same time, our then-future, now-present services did not have strong dependencies on the rest of the platform. Practical experiments with our development process, from planning to execution, were therefore a natural way to reach a new level of understanding.
Facing “the Unknown”
After initial planning and meetings with the Microservice Platform (MSP) team, we quickly realized that a new mindset was needed to plan for “the unknown.” Such a mindset should address all the concerns above while driving progress each sprint.
As software development moved forward, we decided to invest in understanding all layers, from the code up to the infrastructure cluster, with this performance-driven development approach. Similar to test-driven development, we were not just looking to pass test cases or find defects; we were also looking for performance improvements.
Each concern was paired with metrics to track success frequently, and with concrete actions to be taken every sprint, for every deliverable. Those concerns surfaced as opportunities to improve our process as follows:
Continuously measuring success
Building performance is the process of continuously observing, understanding, and correcting what happens on all layers (code, container, and cluster) in order to reach optimal operation with each iteration.
During the early development of our first microservice, we identified that a load testing framework was crucial to measure success. With that goal, we put together a project using Gatling to model basic simulations. We also found that collecting QA metrics per build could be simplified by using SonarQube.
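Gatling models load as scenarios of virtual users hitting a service concurrently and then reports latency statistics. The snippet below is not Gatling (and not code from our services); it is a minimal, stdlib-only Python sketch of the same idea: fire many concurrent requests and summarize latency percentiles. The `fire_request` body is a placeholder where a real harness would issue an HTTP call.

```python
import concurrent.futures
import statistics
import time

def fire_request(i):
    """Placeholder for one request; a real harness would issue an HTTP call here."""
    start = time.perf_counter()
    time.sleep(0.001)  # stand-in for the latency of a hypothetical endpoint
    return time.perf_counter() - start

def run_load(total_requests=100, concurrency=10):
    """Fire total_requests requests across concurrency workers and summarize latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(fire_request, range(total_requests)))
    return {
        "count": len(latencies),
        "mean_ms": statistics.mean(latencies) * 1000,
        "p95_ms": latencies[int(len(latencies) * 0.95)] * 1000,
    }

report = run_load()
print(report)
```

A real tool like Gatling adds the parts this sketch omits: ramp-up and injection profiles, assertions on percentiles, and HTML reports per run.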
After a few iterations, these ideas grew into a more coherent containerized framework capable of simulating thousands of requests and producing real-time dashboards as builds and tests run. More tools came into the equation to simplify performance tracking. Today, we have the means to capture dashboards from multiple sources and assemble a report for every characterization run, load test, or significant event. These are just a few examples of how continuous iteration with the “building performance” mindset has already delivered results.
When “the Unknown” Becomes Familiar
After several sprints, continuous profiling, characterization, and load testing paid off. We found multiple issues on different layers. A few of them were easy to address, like log optimizations. Others were less trivial, like server tuning and connection management. Every hot spot we address improves at least one metric: either reducing average response time or serving more requests with the same resources.
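As an illustration of the kind of log optimization mentioned above, one common fix is avoiding eager string formatting for messages below the active log level. This is a generic Python sketch, not code from our services, and `payload` is a hypothetical value:

```python
import logging

logger = logging.getLogger("perf_example")
logger.setLevel(logging.INFO)  # DEBUG messages will be discarded

payload = {"user": "u1", "items": list(range(1000))}  # hypothetical request data

# Eager: the message string (including the repr of payload) is built even
# though DEBUG is disabled, so the formatting work is wasted on every call.
logger.debug("request payload: %s" % payload)

# Lazy: logging defers %-formatting until after the level check, so no
# formatting happens while DEBUG is off.
logger.debug("request payload: %s", payload)

# For truly expensive derived values, guard the call explicitly:
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("sorted items: %s", sorted(payload["items"]))
```

On a hot request path, small per-call savings like this add up directly to the metrics above: lower average response time for the same resources.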
Our confidence increased after our first service hit big traffic numbers in production. Our velocity has improved significantly sprint by sprint as new services arrive or existing ones are polished. There is still room for improvement, but initial gains were evident after only a few months.
Building knowledge is also important, and distributing features from multiple services among different engineers has produced good results. This provided value not only because of the knowledge sharing, but also because every time we revisit performance, new ideas come to light.
How to Make Performance a Priority
Detecting performance issues as part of the development cycle has proven crucial to the future of our team and company. The tweaks we made to our agile process and development practices are minimal compared to the benefits we have observed. It is as natural as any traditional test-driven development approach, but with the advantage of better scalability forecasting and resource utilization.
Some of the work we have been doing is still semi-automatic, but the future is promising. Our plan is to embrace full automation of the process and embed it in our CI/CD pipeline. The new platform is beginning to adopt some of these practices, and very soon all backend teams will benefit from the tools we built.
The biggest takeaway is that investing in a performance-driven development mindset has taken us to the right place, with many further opportunities ahead. Having the means to track performance at hand is strongly contributing to the maturity of our new platform. Using containers for services and tools adds a coherent way to standardize this methodology and reproduce it everywhere.
Making performance part of your day-to-day is key to the future of any highly scalable service.
Neither Roblox Corporation nor this blog endorses or supports any company or service. Also, no guarantees or promises are made regarding the accuracy, reliability or completeness of the information contained in this blog.
This blog post was originally published on the Roblox Tech Blog.