Kaggle Case Study – Michael Xiao
Kaggle Veteran Builds an Image-Screening Model to Better Predict Cervical Cancer
Over the past few years, Society of Actuaries member Michael Xiao has participated in multiple Kaggle challenges, both on his own and with fellow actuaries and actuarial candidates. His teams have frequently placed in the top 25%, and Michael is ranked in the top one percent of experts across all Kaggle competitions. As anyone who is familiar with the competitive heat of Kaggle data science competitions knows, that is an extremely impressive batting average.
“Actuaries naturally have a good understanding of data, as it’s part of our day-to-day jobs,” said Xiao, who himself leads a data science team at Blue Cross and Blue Shield of Illinois, Montana, New Mexico, Oklahoma & Texas. “We understand how data can be used in ways that purely technical people can’t—by bridging both sides of the business and technical world.”
Kaggle competitions are analytics and data science challenges in which competitors build models to solve real-world machine learning problems for organizations ranging from the Department of Homeland Security to Intel. This year, the SOA created the Kaggle Involvement Program to incentivize actuaries participating in Kaggle competitions.
“Doing [the Kaggle challenges] makes you better at your job and a better data scientist,” said Xiao. “It’s fun to work with new data sets—you get to see things like image data sets, text data sets, even data sets from other industries. It’s fantastic for cross training.”
This year Xiao participated in a number of Kaggle competitions with different teams, including an image-screening challenge intended to help determine the most effective treatment for cervical cancer. Seeking to improve the cervical screening process, Intel and MobileODT asked participants to develop an algorithm that uses images to identify a woman’s cervix type. Such screening has the potential to help women who live in remote areas, or who lack easy access to modern medical care, receive accurate diagnoses.
As with most of the Kaggle image challenges, Xiao’s team approached this problem using artificial neural networks (ANNs), which tend to perform best on most types of image problems.
It was Xiao’s practical experience working as an actuary that helped his team rank highly. Actuaries work with large sets of data every day, which allows them to approach the huge data sets seen in these challenges with comfort. “Actuaries start off by asking the right questions,” said Xiao.
Although Xiao’s team took an approach similar to that of the other teams working on the cervical screening challenge, they added some pre-processing steps, such as flipping images to generate large amounts of additional training data for the ANNs, which helped the team land in the top 10% for the challenge. As actuaries, they knew that this approach would serve them well.
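To make the image-flipping idea concrete, here is a minimal sketch in Python. It is not the team's actual code, which the article does not show. It assumes each image is a small grid of pixel values stored as a list of rows, and the function names and the "Type_1" label are hypothetical, chosen only for illustration. Each original image yields four training examples: the original, a left-right mirror, a top-bottom mirror, and both flips combined.

```python
def flip_horizontal(image):
    """Mirror an image left-to-right (image is a list of pixel rows)."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror an image top-to-bottom."""
    return [list(row) for row in reversed(image)]


def augment_with_flips(images, labels):
    """Return each image plus its flipped copies, repeating the label for each copy.

    Illustrative helper (hypothetical name): flipping does not change which
    class an image belongs to, so every variant keeps the original label.
    """
    aug_images, aug_labels = [], []
    for image, label in zip(images, labels):
        variants = [
            [list(row) for row in image],           # untouched copy
            flip_horizontal(image),                 # left-right mirror
            flip_vertical(image),                   # top-bottom mirror
            flip_vertical(flip_horizontal(image)),  # both flips
        ]
        aug_images.extend(variants)
        aug_labels.extend([label] * len(variants))
    return aug_images, aug_labels


# Usage: one tiny 2x2 "image" becomes four training examples.
images = [[[1, 2], [3, 4]]]
labels = ["Type_1"]  # hypothetical label name
aug_images, aug_labels = augment_with_flips(images, labels)
```

In practice, image libraries provide these operations directly. The point here is only that simple, label-preserving transformations can multiply the amount of training data available.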
Another step the team took was a method called ensembling, which combines the predictions of many models, in this case through a weighted average. ANNs are also considered a “black box” approach, because of the incredibly complex set of calculations an ANN performs in the process of image analysis and recognition.
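As a rough illustration of weighted-average ensembling, the sketch below blends the class probabilities predicted by several models for a single image. The probabilities, the weights, and the function name are assumptions made up for this example. The article does not say how many models the team combined or how their weights were chosen.

```python
def weighted_ensemble(model_probs, weights):
    """Blend per-class probabilities from several models with a weighted average.

    model_probs: one probability list per model, all for the same image.
    weights:     one non-negative weight per model (normalized here).
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    n_classes = len(model_probs[0])
    blended = [0.0] * n_classes
    for probs, weight in zip(model_probs, weights):
        if len(probs) != n_classes:
            raise ValueError("all models must predict the same classes")
        for i, p in enumerate(probs):
            blended[i] += (weight / total) * p
    return blended


# Usage: two hypothetical models score one image across three cervix types.
# The first model is trusted three times as much as the second.
model_probs = [
    [0.6, 0.3, 0.1],  # hypothetical model A
    [0.2, 0.5, 0.3],  # hypothetical model B
]
blended = weighted_ensemble(model_probs, weights=[3, 1])
predicted_class = blended.index(max(blended))  # index of the most likely type
```

A common way to choose the weights is to favor models that score better on held-out validation data, though other schemes are possible.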
Despite the effort that goes into the Kaggle challenges, Xiao finds the process rewarding. Participants receive immediate feedback that helps them learn if they’re taking the right approach, an experience that doesn’t always happen on the job. “You can build a predictive model—and you think you’ve built the best thing ever, but how do you know for sure? Kaggle helps you know right away,” said Xiao.
At Blue Cross Blue Shield, Xiao rallied the data science team he leads to participate in the Kaggle challenges. But he encourages anyone who works as an actuary to get involved, both for the learning experience it promises and the intellectual fun it provides. “It’s not that difficult—just jump in and get started. You’ll probably do terribly at first, but you learn a lot, and learn quickly,” said Xiao.