Harnessing Machine Learning to Analyze Unstructured Data

By Josh Hammerquist

Health Watch, May 2024

Data is arguably the most valuable currency of the moment and is being collected constantly, but the usefulness of that information often depends on its accessibility. We know how to interpret data that arrives in spreadsheets or comes from drop-down box selections; free-flowing text is much harder. Through machine learning, which enables computers to learn from the data they take in, using pattern recognition and statistical algorithms to make predictions and produce readable results, seemingly unorganizable data is becoming manageable enough to use in actuarial analysis. I spoke with Supriya Ramachandra about her work creating a computational model that answers yes/no questions more accurately from customizable text input fields.

Josh Hammerquist (JH): Thank you for speaking with me today. I would like to start with your background as well as the foundations of the project.

Supriya Ramachandra (SR): I have a master’s in computer science and have worked as a data analyst for health care companies in the past. I am currently working as a data scientist at SIKKA, and we have been working on this project for the past few years. We take unstructured data, specifically dental notes formatted in a variety of ways, and preprocess them using natural language processing techniques before feeding them into our long short-term memory (LSTM) model. The model then produces answers to various yes/no questions such as “Is the patient a tobacco user?” or “Does the patient have indicators of cardiovascular disease?” We also have a rules-based engine that we added as an additional layer to address any misclassifications made by the model. While in the past the only way of evaluating this data was to look at patient notes one by one, with this method we can classify a large amount of unorganized data at once. We then can use these final organized results to give us information about key health indicators.
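
For readers who want a concrete picture of that workflow, here is a highly simplified Python sketch: clean the note, turn it into numbers, score it with a classifier and let a rules layer overrule the model when needed. Every function below is an illustrative stand-in (a keyword check rather than a trained LSTM), not SIKKA's actual code.

```python
# Stand-in pipeline: preprocess -> vectorize -> model prediction -> rules-based review.
# All functions are simplified placeholders for illustration only.

def preprocess(note: str) -> str:
    return note.lower().strip()                      # stand-in NLP cleanup

def vectorize(note: str) -> list[float]:
    # crude numeric representation: does the note mention any keyword of interest?
    return [float(word in note) for word in ("tobacco", "smoke", "cigarette")]

def lstm_predict(vector: list[float]) -> bool:
    return sum(vector) > 0                           # stand-in for the trained LSTM

def apply_rules(note: str, model_answer: bool) -> bool:
    if "no tobacco use" in note:                     # a rule can override the model
        return False
    return model_answer

def classify_note(note: str) -> bool:
    cleaned = preprocess(note)
    return apply_rules(cleaned, lstm_predict(vectorize(cleaned)))

print(classify_note("Patient reports no tobacco use; cleaning completed."))  # False
```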

JH: Could you go into more detail about natural language processing and LSTM modeling, as well as why you decided to use these techniques and models?

SR: The foundation of these techniques is the neural network. Neural networks are computational models inspired by the human brain. They power a type of machine learning called deep learning, using interconnected nodes and their respective activation functions in a multilayered structure to learn from data by adjusting connection strengths (weights). Through iterative training, these networks fine-tune their parameters, enabling them to recognize patterns; make predictions; and perform complex tasks such as image/speech recognition, natural language processing and decision-making. Natural language processing uses neural networks to give computers the ability to understand and interpret text and speech much as a human would. Because we were working with text data, natural language processing was the natural choice. The LSTM is a type of recurrent neural network (RNN) architecture, and it is implemented in many deep learning frameworks, most of which are open source. We tested many models on our text data, and the LSTM had the best accuracy of the ones we tried, at about 90%. When we added the rules-based engine at the end, we increased the accuracy to 98%.
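
The article does not specify which framework SIKKA uses, but as a rough illustration, a minimal yes/no LSTM text classifier in Keras (one of the open-source frameworks mentioned above) might be defined like this; the vocabulary and layer sizes are assumptions for the sketch.

```python
# A minimal binary (yes/no) LSTM text classifier in Keras; sizes are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64),   # word IDs -> dense vectors
    layers.LSTM(64),                                      # reads the note as a sequence
    layers.Dense(1, activation="sigmoid"),                # probability of "yes"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```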

JH: Could you clarify what these nodes, activation functions and layers are?

SR: In neural networks, nodes, activation functions and layers are fundamental components. Layers in a neural network are composed of nodes/neurons that perform computations. Each node typically applies an activation function to the weighted sum of its inputs before passing the result to the next layer. Activation functions introduce nonlinearities into the neural network, allowing it to learn complex patterns and relationships in the data.
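
As a small illustration of that description, the following Python snippet shows what a single node does: take a weighted sum of its inputs, add a bias and pass the result through an activation function (here a sigmoid). The numbers are made up for the example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))        # squashes any value into (0, 1)

inputs  = [0.5, 1.0, 0.0]                     # outputs from the previous layer
weights = [0.8, -0.4, 0.3]                    # connection strengths learned in training
bias    = 0.1

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
output = sigmoid(weighted_sum)                # the nonlinearity that lets the network learn complex patterns
print(round(output, 3))                       # ~0.525, passed on to the next layer
```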

JH: Can you explain what you mean by “accuracy”? How did you determine the LSTM model had the accuracy level it did?

SR: The first step in machine or deep learning is typically data collection and preprocessing. This involves gathering relevant data that will be used to train the model. The model is then optimized by fine-tuning its hyperparameters. After training, the model is evaluated on a separate set of data called the “test set.” Accuracy is a metric that measures how often a model correctly predicts the outcome; it is especially meaningful in binary classification tasks where the classes are balanced. We input the test dataset into the trained model to evaluate the model’s accuracy. For the LSTM model, we had a dataset of 500,000 dental notes and split it 80-20—that is, 400,000 notes became the training set and the remaining 100,000 notes the test set.
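
A brief sketch of that evaluation step, using scikit-learn and a toy dataset in place of the 500,000 labeled notes: split the data 80-20, predict on the held-out test set and measure accuracy as the share of correct predictions. The stand-in "model" below is a simple keyword check, not the LSTM.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny placeholder dataset standing in for the 500,000 labeled dental notes.
notes  = ["pt uses tobacco daily", "no tobacco use reported", "smokes socially",
          "denies smoking", "chews tobacco", "non-smoker, cleaning done"] * 100
labels = [1, 0, 1, 0, 1, 0] * 100             # 1 = "yes", 0 = "no"

# 80-20 split; in the article's case, 400,000 training notes and 100,000 test notes.
X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.20, random_state=42
)

# Stand-in "model": a keyword check instead of the trained LSTM, just to show
# how accuracy is computed on the held-out test set.
y_pred = [1 if ("tobacco" in n and "no tobacco" not in n) or "smokes" in n else 0
          for n in X_test]
print("accuracy:", accuracy_score(y_test, y_pred))
```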

JH: Does this mean you had staff who went through all 100,000 notes of the test set and stated whether the model was correct or not by reading the notes themselves?

SR: Yes. For all 100,000 notes, we compared the correct classification, established by manually reading the note, with the model’s prediction to determine the accuracy rate.

JH: I understand how, as people, we read language and can comprehend it. You mentioned natural language processing is occurring for the computer to understand the text, but could you explain in terms specific to the dental notes you are using how the model interprets this information?

SR: In the preprocessing stage, clinical notes that contain specific keywords like “tobacco,” “smoke” and so on are identified. These clinical notes are refined using various natural language preprocessing techniques. The model itself does not directly understand text, so we use the Word2vec algorithm to generate a distributed representation of the words in the clinical notes as numerical vectors, capturing the semantics of and relationships between words. These numerical representations of the text are what we feed into the model for training.
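
To make that concrete, here is a small Word2vec example using gensim (one common open-source implementation; the article does not say which library SIKKA uses). Each word is mapped to a numeric vector, and words that appear in similar contexts end up with similar vectors.

```python
from gensim.models import Word2Vec

# Tokenized, preprocessed notes (toy examples standing in for real dental notes).
sentences = [
    ["patient", "reports", "daily", "tobacco", "use"],
    ["patient", "denies", "tobacco", "and", "alcohol", "use"],
    ["smokes", "one", "pack", "per", "day"],
    ["routine", "cleaning", "no", "findings"],
] * 50

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1, seed=1)

vector = model.wv["tobacco"]                      # 50-dimensional numeric representation
print(vector[:5])                                 # first few components
print(model.wv.most_similar("tobacco", topn=3))   # words used in similar contexts
```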

JH: Can you further expand upon what all occurs in this preprocessing stage?

SR: Notes often contain repetitive information that is not relevant for analysis, such as months, punctuation like semicolons or words like “signature.” These filler words are removed at this stage. There are also many variations in how people write their notes: uppercase versus lowercase letters, slang terms, contractions and abbreviations. We may need to add missing spaces or correct punctuation. We also see text that does not follow a typical format—for example, the name O’Malley could be interpreted as one or two names depending on how the computer reads the apostrophe. These deviations from the norm are addressed at this stage. While people can usually see past these differences and still understand the meaning, a computer is very literal and will, for example, treat “a” and “A,” or one space and two spaces, as different text. Natural language processing performs pruning procedures to address this by converting all text to lowercase, removing unnecessary words, expanding contractions and so on, cleaning up the text and making it more uniform before it is input into the LSTM model.
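
A minimal sketch of that kind of cleanup in Python: lowercasing, expanding contractions, stripping punctuation and extra spaces, and dropping filler words. The word lists are illustrative examples, not SIKKA's actual rules.

```python
import re

CONTRACTIONS = {"doesn't": "does not", "won't": "will not"}   # illustrative examples
FILLER_WORDS = {"signature", "jan", "feb", "mar"}              # examples of removed tokens

def clean_note(text: str) -> str:
    text = text.lower()                                        # "A" and "a" become the same token
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)                       # expand contractions
    text = re.sub(r"[;:,.!?]", " ", text)                      # drop punctuation
    tokens = [t for t in text.split() if t not in FILLER_WORDS]
    return " ".join(tokens)                                    # single spaces, uniform text

print(clean_note("Signature: Dr. O'Malley.  Pt doesn't use  tobacco;"))
# -> "dr o'malley pt does not use tobacco"
```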

JH: Do you perform any modifications on the model now?

SR: We do not alter the model itself. However, we do perform quality assurance and adjust both the preprocessing before executing the model and the rules-based engine executed after the model to maintain and improve this tool.

JH: Is there a reason you do not change the model?

SR: As we obtain more data to run through the model, we have not seen a decrease in accuracy significant enough to warrant retraining, and going down that path would most likely worsen the accuracy. Instead, to maintain and improve the tool, we edit the rules-based engine to address most of the misclassifications we see, because they often come from very specific, highly nuanced text instances.

JH: Can you give me an example of that as well as expound upon this rules-based engine?

SR: The rules-based engine comes in after the model output and acts as a judge of sorts: it evaluates whether the model output can stand as the final classification or whether some specific situation the model does not handle needs to be addressed, and it adjusts the final classification if necessary. For example, we can adjust the rules engine to correctly interpret a commonly misspelled word, an unconventional format like tobacco/marijuana/cigarette, a word with multiple meanings such as stroke (cardiovascular or toothbrush), or a misinterpreted term like v-pen (which is really a dental tool called a vista pen, but the computer thinks it refers to a vape pen). From there, the final classification is produced, and that is what we consider our result for the text-based input.
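
As a rough illustration of how such a layer might work, the sketch below applies a few hand-written rules after the model's yes/no output and overrides it when a rule fires. The specific rules, the question label and the function shape are assumptions for the example, not SIKKA's engine.

```python
def rules_engine(note: str, model_says_yes: bool, question: str) -> bool:
    note = note.lower()

    if question == "tobacco_user":
        if "v-pen" in note and "vape" not in note:
            return False                       # dental tool (vista pen), not a vape pen
        if "tobbaco" in note or "tobaco" in note:
            return True                        # common misspellings the model may miss
        if "tobacco/marijuana/cigarette" in note and "denies" not in note:
            return True                        # unconventional slash format

    if question == "cardiovascular" and "brush stroke" in note:
        return False                           # "stroke" here is about a toothbrush

    return model_says_yes                      # otherwise keep the model's answer

print(rules_engine("used v-pen during procedure", True, "tobacco_user"))   # False
print(rules_engine("hx of tobbaco use", False, "tobacco_user"))            # True
```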

JH: How did you identify these specific rules that need to be addressed?

SR: We do quality assurance quarterly, which means every three months we get a new set of data to validate/test the model on, checking its accuracy and adjusting the rules-based engine as needed. We also do this on an ad hoc basis; if we see something in retro studies that looks off or the error rates seem to be higher than normal, we will perform quality assurance.

JH: During this validation/testing process, are you manually validating/testing the model like you did when you initially trained it?

SR: Yes. For example, we did a validation/testing session outside our normal quality assurance that contained 13,000 records, so we manually read all 13,000 notes and calculated the error rates by comparing the final classification from the model and rules-based engine with the true classification.

JH: Where do you get the data to perform this quality assurance?

SR: We have internal data, and we also get data from carriers from retro studies we can use for validation/testing.

This interview yielded a wealth of knowledge and shows, through practical application, how machine learning can turn large quantities of unsorted data into interpretable findings quickly and accurately. This technology can give actuaries access to untapped datasets that were previously too onerous for practical use. Just as the human brain lends itself to creative thinking, neural networks are expanding the horizons of what data can be used to shape our future predictions. I thank Supriya for taking the time to share her work with the actuarial community.

Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries, the editors, or the respective authors’ employers.


Josh Hammerquist, FSA, MAAA, is a vice president and principal with Lewis & Ellis. Josh can be reached at jhammerquist@lewisellis.com.