This is a continuation of the series of articles on the valuable lessons I've learned while working at Flite.
You can find a list of the articles in the series here.
"It's a killer feature"
Six months ago, I was in a product meeting for a new application we were going to roll out. People wrote down features they thought would be great in the new product and then we reviewed them as a group. We got to a feature that said "object and face detection". The engineering manager looked around and said "Does anyone know anything about object detection?"
"Does anyone want to try building this?"
"It would be a killer feature."
That was the phrase that told you people wanted this feature regardless of whether it was a good idea.
Since I knew this wasn't going away and I had a lot of experience working with video (editing and transcoding music videos and video lessons for a number of years), I figured I was probably the best bet for this. That, and it actually seemed like a fun (albeit difficult) project.
Looking for a cheap wheel
Any engineer worth their salt is going to do one thing and one thing only when given this kind of responsibility.
Find a service you can use already.
I knew that building this system, testing it and deploying it by myself was going to be a mountain of work. Do yourself a favor and build on someone else's molehill if you can.
It turns out there is a great, easily found resource that lists most of the services out there (both free and commercial). I found it to be of great use. (I should note that facial recognition is finding a particular face, while detection is just finding a face. These terms are often used interchangeably.)
So my checklist was to find a way to:
1. stream video from a proxy service
2. transcode video if needed
3. if face detection chosen, identify all the faces in the video and track each face separately
4. if object detection chosen, identify the object the user has chosen and track it through the scene
5. return that data
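To make the checklist concrete, here is a minimal sketch of those five steps as a single pipeline. Everything here is hypothetical — the `DetectionJob` structure, function names, and transcode policy are illustrative stand-ins, not the system I actually built:

```python
from dataclasses import dataclass, field

@dataclass
class DetectionJob:
    """Hypothetical job description covering the checklist steps."""
    video_url: str
    mode: str                 # "face" or "object"
    target: object = None     # (x, y, w, h) reticle when mode == "object"
    results: list = field(default_factory=list)

def needs_transcode(job):
    # Placeholder policy: pretend anything not already MP4 gets transcoded.
    return not job.video_url.endswith(".mp4")

def run_pipeline(job):
    # Step 1: stream the video from a proxy service (stubbed out here).
    # Step 2: transcode only if needed.
    if needs_transcode(job):
        job.video_url = job.video_url.rsplit(".", 1)[0] + ".mp4"
    # Steps 3/4: dispatch to the detector the user chose (stubbed with labels).
    if job.mode == "face":
        job.results.append("face-tracks")
    elif job.mode == "object":
        job.results.append("object-track")
    # Step 5: return the detection data.
    return job.results
```

The useful part of this shape is that each step is independently swappable, which matters once you start comparing third-party services against building pieces yourself.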
After a lot of research, it turns out that almost all the services out there either operate on single frames only or make video upload prohibitively expensive and time-consuming.
To Build or Not to Build (hopefully not the question)
In my opinion, it is incredibly important as an engineer that you exhaust all possible avenues for using someone else's work before you embark on your own. There is a reason this sort of thinking has its own page on Wikipedia.
I would consider three situations in which you should roll your own feature:
- It is not economically viable to use another service or resource
- This means that at scale you would likely incur far too much cost for the utility
- The only resource is an unmaintained project
- This means you will likely incur a lot of maintenance cost (and just as much developer time)
- The combination of requirements of your feature don't properly align with current offerings
- This means the Frankenstein monster you would create would still need just as much maintenance
- You need to bestow your coding majesty on the filthy developer peons of the world
Unfortunately, the feature I was working on satisfied both the first and third criteria. It turns out that basically all computer vision services out there don't operate on video, and the ones that do can't handle both face detection and object detection in a general form.
I learned a great deal about computer vision over the following months. Here are some important takeaways.
- OpenCV - Please contribute to it if you have the chance. It is an amazing library and should be your first stop when learning computer vision.
- There are a multitude of books out there on this topic that use OpenCV. These resources can provide a lot of hands-on experience instead of abstract academics.
- There are few examples in the wild that are not a complete mess or simple toy projects.
- Object detection is much, much easier than face detection.
- Face detection is very much not a solved problem. There is a reason why almost all the face detection pictures you see are straight on.
- The bottleneck of any video detection process is typically marshalling video frames into a data structure.
- Detection (face or object) is very CPU intensive (regularly pegging a processor at 50-80%, depending on resolution). Separating and distributing these processes goes a long way toward making this kind of work scalable.
- Selectively transcoding can help immensely to speed up video marshalling and detection times.
- This is possibly the best video I've found to use for detection. Simple trajectory, always forward facing and high contrast. Also, it's amazing. I spent so much time using this video for tuning that people kept asking me why I was constantly looking at shirtless men.
I had a lot of fun working on this project. In the end, I delivered a system with roughly the following properties:
- Upload a video (or use an existing hosted one)
- Specify either face-focused detection or select an object via a targeting reticle
- Send video analysis jobs to a distributed queue for processing
- Transcode videos selectively depending on size and/or container/codec
- Detect faces in video and attempt to track/match them
- Detect objects in video and track them
- Store results of detection for retrieval via client side application
- Healthcheck, testing and profiling systems for long term maintenance
- Play back the video with an overlay targeting a designated object or face
Overall the project was a great amount of work and a great amount of fun. Computer vision is a turbulent field, and I look forward to new breakthroughs that allow better uses of the technology.