Actuaries Can Excel® at Data Science (Pun Absolutely Intended)

We explore the use of mito, a Python package that lets users work with large datasets in Python through an Excel-like point-and-click interface.

Most actuaries in today’s world have some degree of exposure to Data Science, and this is increasing every day. As a result, many actuaries are somewhere in the Excel/Python/R Venn Diagram. Recent trends suggest Python may be overtaking R as the language of choice for Data Science. Since most actuaries are in the Excel ∩ Rᶜ ∩ Pythonᶜ region of this Venn Diagram (that is, Excel only, with neither R nor Python), it stands to reason that there would be great value in a tool that lets us extend our Excel knowledge to be able to write working Python code. Mito is a tool that can do just that, at least for common data analysis tasks.

To be clear, mito was not designed as a “teaching tool,” exactly. According to mito’s homepage (https://trymito.io/), mito grew from the common frustration of data being “too big” for handy point-and-click tools like Microsoft Excel®, while other tools like Python or SQL can be cumbersome and require more esoteric technical knowledge. We can think of the mito package as a scaled-down spreadsheet-like app that lives in Jupyter Lab (a very popular user interface for Python).

So, what does it do? Firstly, it loads a “sheet” of data that can be read directly via point-and-click from, say, a .csv file. It can also use a DataFrame stored in memory, since mito is built around the pandas package (a DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure[1]; pandas is an extremely popular package that implements DataFrames). Upon loading, it will display the table in a spreadsheet-like app. This allows the user to do many tasks that are common in a spreadsheet program: adding or deleting columns, making a pivot table, merging (joining) tables, removing duplicates, and plotting. Secondly, as the user executes the various tasks, the Python code that actually performs the analysis is output to the cell below the table for easy editing or reuse. This code is also auto-documented. The user gets the results of their analysis and the code that generates it (whether they intend to learn from this code or use it directly).
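To make those tasks concrete, here is a minimal pandas sketch of the same operations written by hand. It is not the code mito generates; the table and column names are hypothetical toy data, not the CDC data in the repo. In a notebook with mito installed, calling mitosheet.sheet(deaths) should open a DataFrame like this one in the spreadsheet view.

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration only).
deaths = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "sex": ["F", "M", "F", "M", "M"],
    "deaths": [100, 120, 110, 130, 130],
})
population = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "sex": ["F", "M", "F", "M"],
    "population": [10000, 10000, 10000, 10000],
})

# Remove duplicates: the last two rows of `deaths` are identical.
deaths = deaths.drop_duplicates()

# Merge (join) the two tables on their shared key columns.
merged = deaths.merge(population, on=["year", "sex"], how="left")

# Add a column: a crude mortality rate.
merged["rate"] = merged["deaths"] / merged["population"]

# Pivot table: one row per year, one column per sex.
pivot = merged.pivot_table(index="year", columns="sex", values="rate", aggfunc="mean")
print(pivot)
```

Each step corresponds to one point-and-click action in the spreadsheet, which is why the generated code tends to read as a list of small, separate operations.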

Then, what does it not do? Importantly, it is not a replacement for true Python training. It will not teach you “slick” and/or “pythonic” coding methods. It will not eliminate redundant or inefficient code. It generates very “procedural” code as opposed to cleaner “functional” code. It does not even delve into the object-oriented programming features of the Python language. Another challenge for those using this as a learning tool is that it does not exclusively use functions from the “base” Python libraries or industry-standard libraries like pandas. Instead, it uses some of its own functions for more concise code (much of the trickier code, say for handling exceptions, is buried within these functions). This is fine: the code still works and it can still run (as long as you’ve imported the mito library). However, most data scientists would not be familiar with those functions and their syntax. Fortunately, Python is very human-readable, and the functions pretty much do exactly what you’d expect them to do by name.
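The difference between the two styles is easier to see side by side. The sketch below is my own illustration on hypothetical toy data, not actual mito output: the first version overwrites a variable one step at a time, and the second expresses the same steps as a single chain of pandas methods.

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration only).
df = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "deaths": [2, 5, 11, 30],
    "exposure": [1000, 1000, 1000, 1000],
})

# Step-by-step ("procedural") style: one statement per action.
step = df.copy()
step["rate"] = step["deaths"] / step["exposure"]
step = step[step["age"] >= 40]
step = step.sort_values(by="rate", ascending=False)
step = step.reset_index(drop=True)

# Chained style: the same actions as one expression, leaving df untouched.
chained = (
    df.assign(rate=lambda d: d["deaths"] / d["exposure"])
      .query("age >= 40")
      .sort_values(by="rate", ascending=False)
      .reset_index(drop=True)
)

# Both approaches should produce the same table.
print(step.equals(chained))
```

Neither style is wrong. The step-by-step version is easier to follow for a newcomer, while the chained version avoids intermediate variables and is closer to what an experienced pandas user might write.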

The bottom line: If you are simply using this to replace Excel when you have more than 1,048,576 rows, none of the above matters. If you want to expand your learning, you may want to delve into mito’s codebase to see some of the techniques it uses to “automagically” clean, format, and make decisions about data.
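As a rough illustration of the row limit, the following sketch builds a table with twice as many rows as an Excel worksheet can hold and summarizes it with pandas. The data is randomly generated and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet

# Randomly generated toy data, twice the size Excel could open.
n_rows = 2 * EXCEL_MAX_ROWS
rng = np.random.default_rng(seed=0)
big = pd.DataFrame({
    "age_band": rng.integers(0, 10, size=n_rows),  # integers 0 through 9
    "claim": rng.random(n_rows),                   # uniform values in [0, 1)
})

# A pivot-table-like summary: row count and average claim per age band.
summary = big.groupby("age_band")["claim"].agg(["count", "mean"])

print(len(big) > EXCEL_MAX_ROWS)
print(summary)
```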

Mito’s homepage makes the bold statement “Analyze your data quickly. No Googling. No Stack Overflow.” If you have little to no Python experience, this will likely not be true. Still, once you are up and running, it’s fairly intuitive and most Excel users should be able to start pointing and clicking away.

I have made a GitHub repository (or “repo”[2], as the cool kids call it) that hosts a Jupyter notebook that will take us through a sample use case with CDC population mortality data. This is simply a use case to demonstrate simple data analysis; I chose this CDC data since it resembles mortality data that actuaries may work with on a day-to-day basis. Rather than restate the contents of that notebook, I encourage you to follow the link, read the notebook, and perhaps clone the repo so that you can try out the code at home. I have exported examples as .gifs to preserve still graphics, and animated .gifs of the images are in the /images folder. I should mention that mito is intended to work on Jupyter Lab, which allows you to write an .ipynb file (as opposed to a .py python script) which can contain both the python code and markdown text; these two will combine to make a very slick report. R users will note this is very similar to RStudio.

If you find this list of tools (GitHub, Jupyter, etc.) intimidating or confusing, I have made a YouTube video that gives a brief intro to getting set up with Anaconda (the suite that includes Python, Jupyter, etc.) and GitHub (a git-driven site that hosts code and handles version control). It concludes with the step of “cloning” the repo (and there are instructions at the end of this article). You can find the video at my YouTube channel, “The Hacktuary.”

If you want to try the code and are a “noob,” as the kids say these days, I recommend downloading Anaconda, which includes the Python interpreter, conda (to manage packages … trust me, it is useful), and Anaconda Navigator, which allows you to use Jupyter Notebooks/Lab.

Rather than go over the example that’s available on GitHub, I’ll use the remaining space to help those who are new to Python/Jupyter/mito get up and running. I will assume Anaconda is already installed successfully.

Environment Setup

It is good practice to set up virtual environments using conda and install what you need into each environment so that different projects can draw from their own codebases. For this project, you should only need to install mito; the other packages it needs should install as dependencies (and those I added to my notebook were only to create images for the article itself). Below are two ways: one to install from scratch, and one to build the environment from a .yml file in my repository.

A note for advanced users: Some users, myself included, often run their Jupyter notebook in a stable “base” environment, and then use ipykernel to use different conda environments from within this base environment. I have had difficulty doing this with mito. I advise launching Jupyter from your mito environment, but your mileage may vary.

Create Env and Install Mito Only

With Anaconda installed, the following lines in the command line should be all you need:

conda create -n mitoenv python=3.8 #creates env named mitoenv and
                                   #specifies the use of python v3.8

conda activate mitoenv #makes mitoenv active instead of base

python -m pip install mitoinstaller #uses pip to install the installer

python -m mitoinstaller install #uses the installer to install mito


It is not standard for a package to ship its own installer; more commonly, you can install a package directly by running conda install package (preferred) or pip install package.

Then, the command jupyter lab or jupyter notebook will open Jupyter in an environment with mito installed.

Create Env From .yml File

This method may be easier, but it is a bit less flexible. I have stored my own environment as config/mito-env.yml in the GitHub repo. You can create the environment in a single line, but you are stuck with the exact packages and environment name I used (the env name is taken from the .yml file itself). The code is below. I have added a line to highlight that we have to navigate to the root of the repository before running the command. My code uses Linux bash, but for Windows users the command (cd) is the same.

cd <path to soa-mito folder>

conda env create -f ./config/mito-env.yml

conda activate <env name from mito-env.yml> #makes the new env active instead of base

jupyter lab # or jupyter notebook
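Once Jupyter is open, a quick way to confirm that the notebook is running in the right environment is to check whether the packages can be found. This is a small sketch using only the standard library; the helper name is_installed is my own, and it assumes mito is imported under the module name mitosheet.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the active environment."""
    return importlib.util.find_spec(module_name) is not None


# Report on the packages this article relies on.
for name in ("pandas", "mitosheet"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If mitosheet shows as missing, the notebook is most likely running from a different environment than the one where mito was installed.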

Final Thoughts

I hope you find this package useful. If you have trouble with installation or getting started, I recommend searching for your issue with your search engine of choice, as it’s likely other users have hit the same errors. You can also reach out to me at nhanewinckel@scor.com for ideas. Finally, I would imagine that any data scientist on your team would be happy to help you get set up, since it will help you analyze their work quickly. Happy coding, and I hope this serves as a friendly intro to Python for those who find it as confusing at first as I did.

Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries, the newsletter editors, or the respective authors’ employers.

[1] From https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

[2] A repo is just like a folder in a computer drive for one specific project. GitHub will host files here, version control them with git, and allow users to work on their own versions, or “branches.”