Jason Sypniewski

Data scientist in training, professional sarcasm engineer, advocate for human decency.

It's hard to fathom that it's already been three weeks on this data science journey, bootcamp style. I don't think it's so much that time flies when you're having fun (which I'm having tons of), but more that time is relative when you're completely consumed by learning something you're passionate about. Certainly, it's not all roses: long nights, endless deliverables, stress over project timelines, and having to sacrifice perfectionism to get something out the door all make one question their decision process. The subtlety of how well this bootcamp prepares you for the real world, beyond the technical skills, is certainly not lost on our cohort.

Fresh off the heels of our Week 1 projects, there is no rest for the wicked as we dive right into Project 2, a two-week endeavor focused on an analysis of the movie industry. That's pretty much all the background you get, except that the project needs to focus on a problem that can be solved (or approached, which is a better word) using linear regression, i.e. the statistical approach for modeling the relationship between a dependent variable y and one or more independent variables X. Simple enough, especially when you consider that Python has a slew of modules, like statsmodels and scikit-learn, that are very good at fitting models and testing predictions with said models. However, let's stop and think about the problem itself. First, the movie industry is inherently volatile and difficult to predict; one man's bomb is another's blockbuster. Second, most movie statistics are categorical (e.g. genre, rating, Oscar nominations), which can present challenges when modeling via linear regression. It's not impossible, since there are ways to "dummy code" categoricals, but other supervised learning approaches may be more suitable...but I digress, that's Week 4 ;)
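To make "dummy coding" a little more concrete, here is a minimal sketch of turning a categorical column into 0/1 indicator columns and fitting an ordinary least squares line to the result. Everything in it is an assumption for illustration: the column names (budget_musd, genre, gross_musd) are hypothetical, and the numbers are invented from a simple known rule rather than taken from real box-office data. It uses pandas and NumPy instead of statsmodels or scikit-learn just to keep the moving parts visible.

```python
import numpy as np
import pandas as pd

# Toy data, NOT real movie figures. The gross column was constructed as
#   gross = 5 + 2 * budget + 10 (if comedy) - 20 (if drama)
# so that we know what the fit should recover.
movies = pd.DataFrame({
    "budget_musd": [10.0, 50.0, 100.0, 20.0, 80.0, 150.0],
    "genre": ["comedy", "action", "action", "drama", "comedy", "drama"],
    "gross_musd": [35.0, 105.0, 205.0, 25.0, 175.0, 285.0],
})

# Dummy-code the categorical column. drop_first=True drops one level
# ("action", the first alphabetically) to serve as the baseline, which
# keeps the dummies from being collinear with the intercept.
X = pd.get_dummies(
    movies[["budget_musd", "genre"]],
    columns=["genre"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "intercept", 1.0)
y = movies["gross_musd"].to_numpy()

# Ordinary least squares: find the coefficients minimizing ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
fitted = dict(zip(X.columns, coef))

for name, value in fitted.items():
    print(f"{name:>14}: {value:8.2f}")
```

Each genre coefficient reads as a shift relative to the dropped baseline genre, which is the main thing to keep in mind when interpreting a model with dummy-coded inputs.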

I already mentioned the statistical tools we were introduced to for this project. But before we can start doing analysis, we need to do some data munging. We all know where to search for movie data, but how do you get data from layers of HTML tags and tables into a usable format? Scrape Baby Scrape, that's how! And so we were served steaming hot bowls of Python's web scraping library, BeautifulSoup (which I'll affectionately call BS4). Yes, BS4 does make web scraping easier, but it also makes you realize how painstakingly tedious the process is, which is why one of our weekly guest speakers stated "...I don't scrape, we have an intern do that." It's also why several folks in our cohort quickly found web-based APIs to do their scraping for them. As for me, I'm all for learning to crawl before you go running for the APIs, so I bought into BS4 for the most part...which leads me to Part II...
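For a taste of what that tedium looks like, here is a small sketch of pulling rows out of an HTML table with BS4. The HTML snippet, the table id, the movie titles, and the parse_movie_table helper are all made up for illustration; a real movie site will have its own (messier) markup, and this is not the exact code I used for the project.

```python
from bs4 import BeautifulSoup

# A hypothetical page fragment standing in for a real movie listing.
html = """
<table id="movies">
  <tr><th>Title</th><th>Genre</th><th>Gross</th></tr>
  <tr><td>Example Movie A</td><td>Comedy</td><td>$1,234,567</td></tr>
  <tr><td>Example Movie B</td><td>Drama</td><td>$89,000</td></tr>
</table>
"""


def parse_movie_table(page_html):
    """Return one dict per data row of the (hypothetical) movies table."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id="movies")
    rows = []
    for tr in table.find_all("tr"):
        # Header rows use <th>, so they yield no <td> cells and are skipped.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:
            continue
        title, genre, gross = cells
        rows.append({
            "title": title,
            "genre": genre,
            # Strip the currency formatting so the value is usable as a number.
            "gross_usd": int(gross.replace("$", "").replace(",", "")),
        })
    return rows


records = parse_movie_table(html)
for record in records:
    print(record)
```

Most of the work is in the cleanup: skipping header rows, trimming whitespace, and converting strings like "$1,234,567" into numbers before any modeling can happen.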