I was introduced to the NPPES database last week and decided to carry the NPPES exploration into the weekend. The dataset is quite large, and it’s definitely another box of woe.
For those interested, the dataset can be found at the following web address: https://nppes.cms.hhs.gov.
It is over 7 gigabytes and refused to open.
Since I was having such difficulty exploring the data, I did a quick evaluation to determine whether what little I could get from the set was worth further learning and exploration. Strictly speaking, the answer was no. The data lacks a much-needed relation that cannot be established without a great deal of research and/or cataloging. We were looking for a form of organizational recognition, and that could not be achieved with ease.
That said, I still felt compelled to explore the data. I wanted to build a map, learn how many organizations there are, and identify some key metrics or correlations that could show which of the individuals (entity type 1) belonged to each of these organizations (entity type 2). Hope almost always falls short.
I moved my work over to JupyterLab to perform some data cleaning tasks. Initially, my computer locked up. I learned that I ought to use the pandas chunksize parameter, and how to work with the Python reader. Once I remembered list comprehensions, it was merely a matter of converting the data into a normal form. Once the data was cleaned up and exported into several CSV files, I was able to use Tableau to create a dashboard that can be found here.
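For anyone hitting the same wall, here is a minimal sketch of the chunked approach. It is not the notebook's actual code. It assumes the file has columns named `NPI`, `Entity Type Code`, and `Healthcare Provider Taxonomy Code_1`, `_2`, and so on, so check those against the real header. The function name `normalize_taxonomies` and the sample rows (including the NPIs) are made up for illustration.

```python
import io

import pandas as pd

# Assumed column names; verify them against the header of the real file.
ID_COL = "NPI"
ENTITY_COL = "Entity Type Code"
TAXONOMY_PREFIX = "Healthcare Provider Taxonomy Code_"


def wanted(col):
    """Keep only the handful of columns needed, so each chunk stays small."""
    return col in (ID_COL, ENTITY_COL) or col.startswith(TAXONOMY_PREFIX)


def normalize_taxonomies(source, chunksize=100_000):
    """Read a wide CSV in chunks and return one row per (NPI, taxonomy code)."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file into memory at once.
    reader = pd.read_csv(source, usecols=wanted, dtype=str, chunksize=chunksize)
    out_cols = [ID_COL, ENTITY_COL, "taxonomy_code"]
    pieces = []
    for chunk in reader:
        # List comprehension: pick out the repeated taxonomy columns.
        tax_cols = [c for c in chunk.columns if c.startswith(TAXONOMY_PREFIX)]
        # Wide to long: one row per provider per taxonomy slot.
        long = chunk.melt(
            id_vars=[ID_COL, ENTITY_COL],
            value_vars=tax_cols,
            var_name="slot",
            value_name="taxonomy_code",
        )
        # Empty slots are read as missing values; drop them.
        long = long.dropna(subset=["taxonomy_code"]).drop(columns="slot")
        if not long.empty:
            pieces.append(long)
    if not pieces:
        return pd.DataFrame(columns=out_cols)
    return pd.concat(pieces, ignore_index=True)


# Tiny made-up sample standing in for the multi-gigabyte file.
sample = io.StringIO(
    "NPI,Entity Type Code,Healthcare Provider Taxonomy Code_1,"
    "Healthcare Provider Taxonomy Code_2,Other\n"
    "1000000001,1,207Q00000X,,x\n"
    "1000000002,2,261QP2300X,207Q00000X,y\n"
    "1000000003,1,,,z\n"
)
result = normalize_taxonomies(sample, chunksize=2)
print(result)
```

On the real file you would pass the CSV path instead of the `StringIO` sample. If even the narrow result is too big to hold in memory, one option is to append each piece to a CSV inside the loop rather than collecting them in a list.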

The Python notebook that cleaned the data can be found here. Did I still run into issues while working with the data? Yes, but it was worth the bottle of wine.
I learned a lot from the data. I didn’t quite get to tell a story with it, but it is in Tableau and can be used now, which is exciting. Also, I think I might like to explore the taxonomies a little more.