This is a continuation of the series of articles on the valuable lessons I've learned while working at Flite.
You can find a list of the articles in the series here.
No one loves ads but they are everywhere
When I worked there, Flite served ads (or creative media as they were called). A lot of ads. Millions upon millions a day. While we definitely weren't the largest ad platform (cough Doubleclick cough), I think we did a great job scaling to deal with huge traffic spikes and I'd like to share some points that I took away for delivering fast and consistent service.
Starting at the database level, we had regional clusters of MySQL servers. Considering the large volume of reads we were dealing with and how little there was in the way of writes, we decided to focus on providing an AP system (Availability and Partition tolerance, in CAP terms). This meant that instead of extensively sharding, we went with a large number of slave systems using asynchronous replication.
While the danger of lost writes grows with this much replication, writes were infrequent and a lost ad change had little impact, so it was more useful to render ads with old data than not at all.
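The read/write split this implies can be sketched as below. This is illustrative, not Flite's actual code: writes always go to the master (the single source of truth for replication), while reads round-robin across replicas and tolerate slightly stale data. The `ReplicaRouter` name and its methods are my own invention.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of routing over a master plus many async replicas:
// writes hit the master, reads round-robin over the replicas.
public class ReplicaRouter {
    private final String master;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReplicaRouter(String master, List<String> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    // All writes go to the master so replication has one source of truth.
    public String hostForWrite() {
        return master;
    }

    // Reads tolerate slightly stale data, so any replica will do.
    public String hostForRead() {
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}
```

In a real deployment the strings would be connection pools, and you would also want health checks so a lagging or dead replica drops out of the rotation.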
Data Access Layer
In order not to completely hammer our databases into the ground with requests, we used memcached to cache frequent requests. While I do agree that caching everything is bad, I think that caching appropriately is important. My suggestion is to profile the queries on your platform and cache those with either the highest use or the highest latency.
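The pattern behind this is cache-aside: check the cache first, fall back to the database only on a miss, then populate the cache. A minimal sketch, with a `ConcurrentHashMap` standing in for memcached (a real client would also set a TTL); the `QueryCache` name is illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside sketch: hit the cache, fall back to the database on a
// miss, then store the result for next time.
public class QueryCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int misses = 0; // exposed so you can profile hit rates

    public String get(String key, Function<String, String> dbLookup) {
        String hit = cache.get(key);
        if (hit != null) {
            return hit;
        }
        misses++;
        String value = dbLookup.apply(key); // the expensive query
        cache.put(key, value);
        return value;
    }

    public int misses() { return misses; }
}
```

Counting misses per key is exactly the kind of profiling signal that tells you which queries deserve caching in the first place.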
Another strategy we used was turning our caching system into a separate library using Java annotations and Spring's AOP facilities. Taking the time to create a caching strategy that works in a modular and flexible manner is probably the most important thing you can do. There is little worse than slapping memcached or Redis into your platform and then having to SSH into a box to flush the cache manually. That does not scale.
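To show the shape of annotation-driven caching without pulling in Spring, here is a sketch using a JDK dynamic proxy instead of Spring AOP. The `@Cached` annotation, `Cacher` class, and `AdService` interface are all illustrative, not Flite's actual library: the point is that callers opt into caching with an annotation and never touch the cache directly.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Annotation-driven caching sketch: a dynamic proxy intercepts calls
// and caches results only for methods marked @Cached.
public class Cacher {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Cached {}

    // Example service interface; only script() opts into caching.
    public interface AdService {
        @Cached String script(String url);
    }

    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> iface, T target) {
        Map<String, Object> cache = new ConcurrentHashMap<>();
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] {iface},
            (proxy, method, args) -> {
                // Pass through any method that did not opt in.
                if (method.getAnnotation(Cached.class) == null) {
                    return method.invoke(target, args);
                }
                String key = method.getName() + Arrays.deepToString(args);
                return cache.computeIfAbsent(key, k -> {
                    try {
                        return method.invoke(target, args);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            });
    }
}
```

Spring's AOP does essentially this interception for you (and its own `@Cacheable` annotation exists for the same purpose); the win either way is that eviction and flushing live in one library instead of being scattered across the codebase.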
Transactions! All the logically grouped operations we perform on the data layer should be transacted. I've found that if you have enough separation between your data modeling and data access layers, you should make sure those service calls are properly transactional. While setting up transactions may take some work with the database (and/or ORM, if you're using one), the alternative is manually deleting rows from the database, or asking your DBA to do it (which could be potentially terrifying).
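The property you're after looks like this: every step in the group commits, or none of them do. A toy sketch with an in-memory list standing in for the database (names are mine); with JDBC the same boundary is `setAutoCommit(false)` plus `commit()`/`rollback()`, and with Spring it's `@Transactional`.

```java
import java.util.ArrayList;
import java.util.List;

// Transaction-boundary sketch: changes are staged on a working copy
// and only published if every step succeeds.
public class TxSketch {
    private final List<String> rows = new ArrayList<>();

    public interface Step { void run(List<String> pending) throws Exception; }

    public boolean inTransaction(Step step) {
        List<String> pending = new ArrayList<>(rows); // working copy
        try {
            step.run(pending);
            rows.clear();
            rows.addAll(pending); // "commit": publish all changes at once
            return true;
        } catch (Exception e) {
            return false;         // "rollback": rows is untouched
        }
    }

    public List<String> rows() { return rows; }
}
```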
Ads need assets (no matter how crappy you might think those banner ads look). These assets varied in size from fonts to pictures to high-definition videos. Flite hosted all of the files, but the number of times they were served made it a pain in the butt to query the database to find where they were stored. It was much easier to cache their locations in our asset system with Redis and move on.
A CDN is the obvious step if you are serving world-wide. We used Edgecast and it worked great for us. This is another important thing to think about, since CDN invalidation can be quite a pain in the butt. Fortunately, we had a system developed in-house to invalidate CDN content on the publishing of a new ad version. These kinds of systems can be a bit complex to build but pay incredible dividends.
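The core of such a system is small: when a new ad version publishes, derive the URLs the CDN may still be holding for the old version and hand them to a purge call. The sketch below is purely illustrative (the URL scheme, paths, and `CdnInvalidator` name are invented); the actual purge would be a request to the CDN provider's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Publish-time CDN invalidation sketch: compute the stale URLs for the
// previous ad version and pass each one to a purge callback.
public class CdnInvalidator {
    private final Consumer<String> purge;

    public CdnInvalidator(Consumer<String> purge) {
        this.purge = purge;
    }

    public void publish(String adId, int newVersion) {
        // Purge every flavor of the previous version the CDN may hold.
        for (String path : List.of("script.js", "assets.json")) {
            purge.accept("/ads/" + adId + "/v" + (newVersion - 1) + "/" + path);
        }
    }
}
```

A common alternative that avoids purging altogether is versioned URLs: if every publish produces a brand-new path, old content simply ages out of the cache.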
The three cases we had for compression were assets, reports, and general outgoing responses. For assets, we transcoded all hosted video to more efficient MP4 and HLS versions using FFmpeg / AWS Elastic Transcoder, and for JPEG/PNG/BMP images we used ImageMagick. For the general case, as pretty much everyone does nowadays, we set up NGINX with gzip compression for outgoing responses.
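For reference, the NGINX side of this is only a few directives. A minimal sketch (the type list and thresholds are illustrative defaults, not our production values):

```nginx
# Minimal gzip setup for outgoing responses.
gzip on;
gzip_comp_level 5;
gzip_min_length 1024;   # don't bother compressing tiny responses
gzip_types text/css application/javascript application/json image/svg+xml;
gzip_vary on;           # emit Vary: Accept-Encoding for downstream caches
```

Note that `gzip_types` intentionally omits image and video MIME types: those are already compressed by the transcoding step, and gzipping them again just burns CPU.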
Proxies and Domain Load Balancing
Nowadays, ads use a ridiculous amount of data. Clients regularly use DMP services to customize people's ad experiences. This means a large variety of URLs due to custom data parameters. To deal with this, we created a proxy funnel. The basic idea is that since we uniquely build an ad script for each URL, each unique URL is cached with its appropriate ad script in the Redis instances I previously mentioned.
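One detail worth getting right in a URL-keyed cache is canonicalization: the same data parameters can arrive in a different order, and you don't want each ordering to be a separate cache entry. A sketch of deriving a stable key (the `AdUrlKey` name and URL shape are illustrative, not Flite's actual scheme):

```java
import java.util.Arrays;

// Derive a stable cache key from an ad request URL by sorting the
// query parameters, so equivalent URLs map to one cached ad script.
public class AdUrlKey {
    public static String keyFor(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q) + "?" + String.join("&", params);
    }
}
```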
In addition, depending on the type of media used by the ad, the proxy app directs requests to a different subdomain like s.flite, v.flite, or r.flite. The subdomain servers sat behind an HAProxy load balancer with round-robin so they wouldn't get overwhelmed (which can easily happen with constantly streaming video proxies).
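In HAProxy terms, that setup is roughly the following. This is a sketch only; the hostnames, addresses, and backend names are made up:

```haproxy
# Route by subdomain, round-robin within each backend.
frontend ads
    bind *:80
    use_backend video  if { hdr(host) -i v.flite.com }
    use_backend script if { hdr(host) -i s.flite.com }
    default_backend script

backend video
    balance roundrobin
    server v1 10.0.1.10:8080 check
    server v2 10.0.1.11:8080 check

backend script
    balance roundrobin
    server s1 10.0.2.10:8080 check
    server s2 10.0.2.11:8080 check
```

For long-lived video streams, `balance leastconn` is often a better fit than round-robin, since connection counts (not request counts) are what actually overwhelm a streaming proxy.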
This almost seems like a foregone point to mention, but we made copious use of AWS EC2, S3, CloudFront, and SQS, to name a few services. At the risk of sounding like a marketing shill, AWS auto-scaling groups and elastic load balancing are a godsend. While it is definitely possible to run an in-house setup with NGINX / HAProxy, not having to think about it at all is even better.
Using ASGs, the boxes we used to capture metrics and serve ads would grow and shrink as needed without a thought. I won't argue that it makes me feel a bit lazy, but that's something you should definitely automate away as soon as you can.
Dealing with scale at Flite was a fun endeavor. It definitely wasn't Netflix but it had its own interesting challenges. I didn't mention all of the strategies that we used (like message queues and copious amounts of microservices) and I'm sure I didn't learn all of the techniques we could have. Scale and distributed systems are an incredibly interesting space that I'll continue to study and absorb.