
Craigslist Auto Ad Scraping into a Google Sheets Document

Let me just be up front about the topic: Craigslist is not fond of people scraping their site. They purposely make their site annoying to scrape (well, that or they are terrible at making decent templates), and it’s against their TOS, and if you do it and make money, they will very likely sue you as they have many others.  You have been warned.  Tread carefully.

All that said, buying a car isn’t fun.  Comparing hundreds or thousands of Craigslist car ads is also not fun, and keeping track of all of them is just as bad.  Even with RSS feeds, email notifications, and all the other tools out there to help with this, I still felt the situation was pretty terrible.  I thought it would be nice if I could get a spreadsheet periodically updated with new/changed listings without having to do any copy-paste, and just have the information from the ads I really cared about in front of me.  So I decided to scrape CL anyway, since my use was pretty low volume and I wasn’t going to be making any money off of it.

The technology and tools I used to do this are Python, Scrapy, the Google Drive API, the Google Sheets API, and Google Sheets itself.

Ok, right, enough intro; the code is on GitHub.

Scrapy really makes this all pretty easy.  I first started out using Beautiful Soup 4 and a lot of custom code to do most of what Scrapy has already figured out very well.  My initial rough version of a scraper was working fine (with concurrency, even), but it couldn’t hold a candle to what Scrapy does with less effort.  If you want to use Python to scrape things, please give Scrapy a try.

Side note:
I considered using something akin to PhantomJS for this project, but realized that CL doesn’t really have anything “fancy” I need to worry about in their UI.  It’s certainly not a JS-dependent SPA or anything close to it.

So, what exactly does this do?  Well, given some search params you enter at the command line, it scrapes each page of search results for all the resulting ads, and the “detail” page of each result “item”/ad.  It scrapes out all the information I cared about from the ads, and then exports the entire set of scraped ad results as a .csv file.
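To make the flow above concrete, here is a minimal stdlib-only sketch of two pieces of it: building a search URL from command-line-style params, and pulling a couple of fields out of a result snippet.  The query-string keys (`query`, `min_price`, `max_price`), the `/search/cta` path, and the `result-title`/`result-price` class names are assumptions about CL's markup at the time, not guarantees; the real project uses Scrapy's selectors rather than `html.parser`.

```python
from urllib.parse import urlencode
from html.parser import HTMLParser


def build_search_url(site, query, min_price=None, max_price=None):
    """Build a Craigslist cars+trucks search URL from search params.
    The path and query keys here are assumptions -- verify them against
    the live site before relying on this."""
    params = {"query": query}
    if min_price is not None:
        params["min_price"] = min_price
    if max_price is not None:
        params["max_price"] = max_price
    return f"https://{site}.craigslist.org/search/cta?{urlencode(params)}"


class AdFieldParser(HTMLParser):
    """Tiny stdlib parser that grabs a title and price from markup shaped
    like the (assumed) CL result snippets."""

    def __init__(self):
        super().__init__()
        self._field = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "result-title" in classes:
            self._field = "title"
        elif "result-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        # Capture the text node immediately following a matched tag.
        if self._field:
            self.fields[self._field] = data.strip()
            self._field = None
```

In the real spider, the equivalent of `AdFieldParser` runs once per search-result page, and each result's detail page is requested and scraped the same way before everything is written out to the .csv.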

I made another python script which uses the Google Drive API and the Google Sheets API to import the .csv data into a Google Sheets document.  This part requires a google developer account, and a project set up to use these APIs.
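The Sheets side of that import boils down to reshaping the .csv into the list-of-lists "values" payload the API expects.  A sketch of that step, with the actual (credentialed) API call shown only as a comment since the spreadsheet ID and range below are hypothetical placeholders:

```python
import csv
import io


def csv_to_values(csv_text):
    """Convert exported .csv text into the list-of-lists 'values' payload
    that a Sheets API spreadsheets.values.update call expects."""
    return [row for row in csv.reader(io.StringIO(csv_text))]


# With google-api-python-client and OAuth credentials from the developer
# project mentioned above, the upload looks roughly like this (IDs and the
# "RawData" range are placeholders, not the project's real values):
#
#   service = googleapiclient.discovery.build("sheets", "v4", credentials=creds)
#   service.spreadsheets().values().update(
#       spreadsheetId=SPREADSHEET_ID,
#       range="RawData!A1",
#       valueInputOption="RAW",
#       body={"values": csv_to_values(csv_text)},
#   ).execute()
```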

That’s all fine and good, but what if you want a custom view/sheet into this yummy data?  It’d get destroyed every time you import new data!
The easiest thing I found to do was use the “=IMPORTRANGE()” function in another document to pull from the raw data document, so that your analysis of the data is not broken/interrupted when the data changes.  
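For example, in the analysis document it looks something like this (the URL and range are placeholders for your own raw-data document):

```
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/RAW_DATA_SHEET_ID", "Sheet1!A:Z")
```

The first time you use it against a given document, Sheets will prompt you to allow access between the two spreadsheets.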

Of course, you could use some other, more sophisticated program (Tableau, for instance) to analyse the data as well, but this gets me where I wanted to be for the purpose of finding the best car deals against my personal criteria.
