Much like Alice, the data grows, the data shrinks, it goes in full of complexity and comes out simplified. This is a tale wherein we find our authors based on a few sentences.
First, meet Project Gutenberg: a website that hosts a collection of works whose US copyright has expired. A few of these books are bundled with NLTK's corpus module, but Project Gutenberg itself is a much larger resource, and that is where the books are housed.
The author classification project was built to classify as many authors as possible using natural language processing (NLP) and predictive models. The potential is that we can identify the author of an unknown work from a sentence or, as a friend indirectly pointed out, use it to detect plagiarism.
I had a hard time with this originally, due to space and time constraints. At the time, I wanted to work with as much data as possible, just to see what it would look like, and because I knew I had to run a lot of models in a small amount of time. I had to skimp on something, and data cleaning was it. The results of that choice can be seen in this notebook.
Here is the classification report for what I considered to be the best model at the time.
A 53% accuracy is significantly below my 68% goal. I had to run with it anyway due to the time constraint; of the models I tried, Naive Bayes delivered the best runtime with the highest result, so it was my answer. There were a lot of perils with that first run: I had to change the performance settings on my PC, the models took a long time to run, and I hit a lot of memory errors.
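For readers unfamiliar with the approach, here is a sketch of the kind of sentence-to-author pipeline described here: TF-IDF features fed into Multinomial Naive Bayes via scikit-learn. The tiny toy corpus is purely illustrative; the real project trains on Gutenberg books.

```python
# Illustrative sketch: TF-IDF + Multinomial Naive Bayes for author prediction.
# The four training sentences and labels below are made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "It is a truth universally acknowledged that a single man wants a wife.",
    "The clocks were striking thirteen as Winston slipped through the doors.",
    "She could not repress a smile at the absurdity of the proposal.",
    "Big Brother is watching you, said the poster on the wall.",
]
authors = ["austen", "orwell", "austen", "orwell"]

# One object holds both the vectorizer and the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(sentences, authors)

print(model.predict(["A single man in possession of a fortune must want a wife."]))
```

Naive Bayes is a common first choice here precisely because it trains fast on high-dimensional sparse text features, which matters when runtime is the bottleneck.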
I got my models, though, and at the time, that was the goal. After I met my deadline, I knew I still had more work to do. I set the project aside with some issues to be worked at a later date, then came back to it. When I returned, my task was to run a few unsupervised learning models, do a little topic modeling, and work with a deep learning model, to get a full picture of the product that's still being built.
Is it a product? I don’t know.
The thing is, in order to get a good unsupervised model going, I had to get my data cleaned properly. I also, flat out, wasn't happy with that score. So, I changed my approach.
- Instead of attempting to build and deliver all of my research in one notebook, I broke it down into several, organized by category.
- Instead of pulling thousands of books directly from the website, I downloaded a select few books to a hard drive: twelve instead of a thousand.
- Instead of filtering and cleaning on the fly, I filtered and cleaned the data once and saved it to a drive to use with all models, eventually moving to a parquet file.
- I moved the project to the Google Colab environment, to prevent myself from letting a model run for over six hours.
- Using the same technique I learned the first time around, I built the notebooks out so that I only had to rerun a specific few cells in case of timeouts.
The steps above made the project easier to manage. I was able to complete each portion in chunks; a few of the notebooks still need to be run, or fixed to run, in sections, but it made things a lot easier. And when it was easier, I was happier in my exploration and in my data cleaning process.
I still have quite a few things to do in the cleaning process, but, eager, I moved forward once I felt comfortable that I could get a good result from my model. And I still can't believe my results.
Would you look at that? I still need to clean up those various author names! The data cleaning is still a mess?!? There are only 13 sentences by Charlotte Bronte? Those might be paragraphs. That's fine; I wasn't concerned with how much text I feed the model at any given time. It could be two words or it could be 700, so I'm fine with that.
I’m skeptical of these results. These models were run on only a few works! How do I know for certain that this is going to work on a larger scale?
While I’m off exploring, check out my author classification project on GitHub. It is a work in progress. Please feel free to contribute.