Intro

Some time ago I had decided that I wanted to learn more about data science. I didn’t have a particular sub-topic of data science that I wanted to delve into so I settled upon undergoing a personal project that involved (or had the potential to involve) many different areas.

What Data

To give you some background, I’m a drummer in a band and we recently began the quest for new members. Now something that peeked my interest whilst I was hunting through band advertisements was, what are the statistics of band advertising? Or more specifically I wanted to be able to answer questions like: which city has the greatest number of adverts in relation to its population or what’s the most popular music genre being advertised?

Where I got the data from was an easy choice. Pretty much of all my experience of searching for musicians is from one site called Join My Band. Its a good choice since no login is required to access all the posts and they are already categorised into region. An example ad is shown in the screenshot below:

Band ad

How to Get the Data

Looking back on it, it probably would be have been easier just to ask the site owner for a copy of the data. But I wouldn’t have learnt as much so I guess it worked out just as well.

I knew that there is technique used to obtain data from a web site called Web Scraping, but I didn’t know anything about it. So I decided to read a book. I had recently learnt the basics of Python so I wanted to do something to cement my knowledge. I found a book on Safari Book Online called Web Scraping with Python which seemed to fit quiet nicely with what I wanted to do. The book lead to me to learn about Scrapy, which is a Python framework for performing web scraping.

With a web scraping tool found I sought out how to store this advertisement data in order to manipulate it later. I had plenty of options on how, but I decided to use MongoDB as I found that saving a python dictionary to MongoDB is really simple and since I didn’t know much about MongoDB it was a good opportunity to learn. Though at the time of writing, Scrapy did not have built in functionality to export the scraped data to a MongoDB database, but I discovered you can add your own item pipeline. Using the Scrapy doc and this helpful blog post I was able to save the data of each band advertisement into a MongoDB database. For each advertisement I was able to scrape:

  • Region of ad
  • User entered location, often a city
  • Text of ad
  • The username who posted the ad
  • Time and date of when the ad was posted
  • Title of the ad

Run Scrapy and Wait.

With Scrapy all setup, I set it off running. Luckily the site displays how many ads are currently in each region so I could roughly work out how long it would take to scrape them. It was configured to request a new ad every 3 seconds which meant it took me about 2 weeks on a single machine to gather close to 300,000 ads.

What’s Next

Now that I have the raw data, part 2 of my project will involve summarising the data to produce statistics like a time plot of ads posted per month.