Craigslist is known as a place where you can buy and sell anything from a used car to a chicken, but what you might not know is that it is notoriously difficult to harvest data from. This comes down to how the site is set up: there’s no easy way to scrape its data.
On most social sites, the developers offer an API for users to scrape data and export it in a format that they want. A good example of this is Facebook. This means that you can take data from any page that you own on Facebook, and you can also access data on pages that you don’t own, as long as it’s public. All of this is surprisingly simple.
However, when it comes to Craigslist, things are slightly different. They do have an API, but it works in reverse. With Facebook’s API, you can extract data, but you can’t post. That is what the Facebook app is for, which you can use to post content.
With the Craigslist API, you can post, even in bulk if you like, but it doesn’t give you any read-only access for pulling data out. This means that everything is backwards, but from Craigslist’s point of view, it does make sense.
This is because they benefit from allowing businesses, particularly in the real estate industry, to post listings in bulk through their simple API. However, they don’t gain anything by allowing third parties to extract data and display it on another website that isn’t theirs. So, even if all you are wanting to do is run some basic data analysis, you are going to come up against it.
Craigslist does include RSS feeds which you can subscribe to. Of course, you can use these for personal use, but if you try to use them to scrape data on a greater scale, then your access will be blocked. Craigslist even explains this in their terms of service.
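For small-scale personal use, reading a feed you have subscribed to takes only a few lines. The sketch below uses Python's standard library on a made-up feed; the sample XML, its element names, and the `parse_feed` helper are our own assumptions for illustration, and a real feed may be structured differently or use XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content for illustration only. A real feed may use
# different element names or XML namespaces.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example search feed</title>
    <item>
      <title>Road bike - $250</title>
      <link>https://example.org/post/1</link>
    </item>
    <item>
      <title>Kids bike - $40</title>
      <link>https://example.org/post/2</link>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one {'title', 'link'} dict per <item> element in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


for entry in parse_feed(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```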
So, what does this all mean? Let’s break it down a little bit further.
- You can only access Craigslist through an email client or a web browser.
- You can only post to Craigslist using their bulk posting API, or a web browser.
- You cannot use a script, bot, crawler, or spider to scrape data.
- You can’t scrape contact information or personal data from users.
Legality Around Craigslist Scraping
So, why are we mentioning this? For two main reasons. The first one is pretty obvious: we review and recommend proxies to our readers, and proxies are of course essential to being able to scrape data from Craigslist. The other is a word of caution. Anything you do while following instructions like the ones below is on you.
Is It Legal to Scrape Craigslist Data?
In the past, Craigslist has taken legal action over the scraping of its data. Of course, this all depends on the scale of the scraping, as well as what you use the scraped data for. Simple data analysis is more or less OK, but commercial use, especially if you are working directly with the competition, is not going to be taken lightly.
The biggest mistake that businesses make when scraping data from Craigslist is ignoring the warnings when Craigslist sends a cease and desist letter and bans their IP addresses. At this point, you should definitely think about slowing down, if not stopping completely, but a lot of businesses out there ignore these actions and continue to extract data. Therefore, if you get a cease and desist letter from Craigslist, then we recommend that you stop all activity.
Issues with Craigslist
The thing about Craigslist is that it’s a website with a lot of issues. It first started back in 1995, but how much has really changed since then? There have been a few significant updates over the years, but the actual design of the website has hardly changed. The user interface looks much the same as it always did, yet more data is obscured than used to be. These days, you get to see three different types of ads posted.
- Ads with Plaintext Contact Info: Most of the time these types of ads are posted by brands looking for people to contact them. These brands will have staff who can answer the phone, and therefore are able to avoid unwanted callers.
- Ads with Obfuscated Contact Info: These are typically personal ads posted by individuals, who write their phone numbers in a slightly different format, using both numbers and words. This is so that humans can work out the phone number, but a robot won’t be able to.
- Ads with Zero Contact Info: If you want to contact the person who posted the ad, then you will need to send an email to the anonymous email address that Craigslist has provided as a forwarding address. You’ll see no personal information on the post, but the user will see your return address which means that they will be able to respond to you if they want.
When it comes to Craigslist, there are also issues around what is and what isn’t allowed when it comes to ads. Post titles are of course free to include all sorts of different symbols, and in the world of Craigslist, it’s actually more effective to use symbols to stand out. However, this kind of format does pose a problem to scrapers, who either need to figure out how to decode these special characters or get rid of them completely.
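As a rough illustration of the "get rid of them completely" option, here is a minimal Python sketch that reduces a decorated title to plain ASCII words. The `clean_title` helper and the sample title are our own inventions, and the set of characters it keeps is an assumption you would tune for your own data. Note that this approach also throws away any non-Latin text.

```python
import re
import unicodedata


def clean_title(title):
    """Reduce a decorated post title to plain ASCII words."""
    # NFKD splits accented letters into a base letter plus a combining mark
    # and maps many stylised characters to their plain equivalents.
    decomposed = unicodedata.normalize("NFKD", title)
    # Dropping everything that is not ASCII removes stars, hearts, emoji
    # and the combining marks left over from the previous step.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace runs of decorative punctuation with a space, keeping the
    # characters that commonly carry meaning in a listing (prices, hyphens).
    no_decoration = re.sub(r"[^A-Za-z0-9$.,'\- ]+", " ", ascii_only)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", no_decoration).strip()


print(clean_title("★★ Café table ~~ $40 ★★"))  # prints: Cafe table $40
```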
Another common problem on Craigslist is spam. Of course, you won’t find this kind of issue in the more serious sections, like the property section, where everything is heavily moderated. However, you will find spam in personal sections, including jobs and free listings.
Craigslist does have anti-spam measures, and sometimes they require phone verification from their users. They also have a posting limit, and an automated system that can ban or suspend somebody who breaks the rules. Does any of this work? No.
Craigslist attempted to make moves to improve the viability and flexibility of their website a few years ago. This meant that you could use a lot of HTML to customize your posts, and to provide more information to people looking at your post. However, back in 2013, Craigslist got rid of these features, returning to the basic black and white aesthetic.
They actually called this look Hurricane Craig, and there’s only really one benefit that we can see. This is the fact that it standardized a lot more data and posts. This made it a lot easier for robots to extract the data from browser windows, instead of needing to decode it first. This means that Craigslist is inadvertently making it easier for people like you to scrape data.
Why You Might Want to Scrape Craigslist
So, why on earth would you want to scrape data from Craigslist? We think that there are many different reasons for this.
From an Analytical Standpoint
There’s always the chance that you might want to scrape data so that you can write a report. Investigative journalism is still a thing, even if it’s rare. This means that you might want to scrape all of the posts in a particular section and look at things like the frequency of posting and the median prices for products.
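To make that concrete, here is a small Python sketch of this kind of summary, computed over a handful of invented records. The field names `posted` and `price` and the `summarize` helper are assumptions for the example, not the output format of any particular scraper.

```python
from collections import Counter
from datetime import date
from statistics import median

# Invented sample records; a real scraper's output will have its own fields.
posts = [
    {"posted": date(2024, 3, 1), "price": 120},
    {"posted": date(2024, 3, 1), "price": 80},
    {"posted": date(2024, 3, 2), "price": 200},
    {"posted": date(2024, 3, 4), "price": 95},
]


def summarize(records):
    """Return the median price and how many posts appeared on each day."""
    per_day = Counter(r["posted"] for r in records)
    return {
        "median_price": median(r["price"] for r in records),
        "posts_per_day": dict(per_day),
        "busiest_day": per_day.most_common(1)[0][0],
    }


summary = summarize(posts)
print(summary["median_price"])  # prints: 107.5
print(summary["busiest_day"])   # prints: 2024-03-01
```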
You might even want to compare the kind of item with how difficult it is to get in touch with the user. Of course, none of this work is profitable, it’s just information that you can use for different things.
Honestly, we think that Craigslist would probably be fine with this kind of activity, which means that you’re probably going to be safe doing so. They most likely wouldn’t win a court case over it. However, you’ve still got to be careful, because sites like this can be unpredictable.
From a Personal Standpoint
You might want to use a scraper for information that you want to use personally. This means that if you are shopping for a used car, you might want to harvest data on all used cars so that you can correlate locations, prices, model and make information, and any other data about the cars so that you can get a way better idea of what you’re looking for. As helpful as Craigslist can be, their filtering and browsing is average at best.
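Here is a sketch of what that comparison could look like once you have the data, again in Python. The listings are made up, and the field names and the `cheapest_by_model` helper are hypothetical. The idea is simply to group cars by make and model and surface the cheapest one in each group, which Craigslist’s own filters won’t do for you.

```python
from collections import defaultdict

# Made-up listings for illustration.
listings = [
    {"make": "Honda", "model": "Civic", "year": 2012, "price": 6500, "location": "Oakland"},
    {"make": "Honda", "model": "Civic", "year": 2014, "price": 8200, "location": "San Jose"},
    {"make": "Toyota", "model": "Corolla", "year": 2013, "price": 7000, "location": "Oakland"},
    {"make": "Honda", "model": "Civic", "year": 2011, "price": 5900, "location": "Berkeley"},
]


def cheapest_by_model(rows):
    """Group listings by (make, model) and return the cheapest in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["make"], row["model"])].append(row)
    return {key: min(cars, key=lambda car: car["price"]) for key, cars in groups.items()}


for (make, model), car in cheapest_by_model(listings).items():
    print(make, model, car["year"], car["price"], car["location"])
```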
From a Profitable Standpoint
There is also the opportunity to scrape data for something that you would like to purchase and resell. One common target of course is event and concert tickets. You can monitor events that have been sold out, scrape the information from Craigslist to locate tickets for those in advance, buy the tickets, and then resell them for more, on other websites like eBay. Of course, this type of activity does require a lot of effort, but if the margins are good, it can end up paying off.
From a Commercial Standpoint
You can use Craigslist to generate leads, and for this you would want to scrape information from The Wanted section, for anyone who is searching for an item or service that you provide. This way, you can reach out to them to sell your service or product. It’s not the most efficient way of generating leads, but the option is there.
How to Scrape Data from Craigslist
The exact approach that you choose for scraping data is going to depend a lot on the tool you sign up for. However, the general process is going to look something like this:
1. Pick a Tool
Of course, the first thing you want to do is choose a scraping tool that you can use to scrape from Craigslist. If you have had experience in the coding and developing industry, then you could even try your hand at developing one yourself. If you haven’t, then there’s really no need to do so, because there are so many tools out there that already exist. Let’s take a look at a few options that we think could be worth your time.
Cloud Crawler
This web spider works specifically within the Cloud, which makes it potentially difficult to use. There’s not too much documentation for it, but it is good if you want to experiment with coding and you don’t want to develop your own scraper from scratch. Another bonus is that it is free.
Visual Web Ripper
Visual Web Ripper is not as difficult to use as Cloud Crawler, and it helps you point directly at the information you want to scrape, meaning that the program does everything else. It even comes with video tutorials and has a user-friendly website. However, like everything, it comes with its limitations.
The free trial only lets you scrape 100 elements from the website, which can come with a lot of code and scripts that you don’t need. Also, the free trial is only available for 15 days. It’s very expensive, so you need to have the budget for it. The license for using this web scraper is $350.00.
Python Craigslist Scraper
Another open source scraper is the Python Craigslist Scraper, and when compared to some of the other web scrapers we have talked about already, it’s definitely a lot easier to use. It’s also free, and written in one of the easiest languages for you to learn. This makes it potentially one of the most popular scrapers out there.
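To give a feel for what a small Python scraper does under the hood, here is a minimal sketch that pulls titles, links and prices out of a page using only the standard library. It is not the code of any tool mentioned here. The HTML snippet, the `result` and `price` class names, and the `ListingParser` class are all invented for the example, and a real page’s markup will differ.

```python
from html.parser import HTMLParser

# Invented markup for illustration; real pages use different tags and classes.
SAMPLE_HTML = """
<ul>
  <li class="result"><a href="/post/1">Road bike</a> <span class="price">$250</span></li>
  <li class="result"><a href="/post/2">Kids bike</a> <span class="price">$40</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect one dict per <li class="result"> element."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "result":
            self.listings.append({"title": "", "url": "", "price": ""})
        elif tag == "a" and self.listings:
            self.listings[-1]["url"] = attrs.get("href", "")
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "price" and self.listings:
            self._field = "price"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.listings:
            self.listings[-1][self._field] += data.strip()


def parse_listings(html):
    """Parse an HTML string and return the listings found in it."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


for listing in parse_listings(SAMPLE_HTML):
    print(listing)
```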
We think that the last web scraper on our list is potentially one of the most legitimate ones. It markets itself as an all-purpose web crawler, which means that you can use it for a lot more than just Craigslist. It’s also a lot less limited, it’s really easy to configure, and it’s free to use.
It offers its clients tutorials around scraping data from specific areas, where you won’t get unnecessary information. When you first check it out, you might think that it looks a little bit overwhelming, but when you get to know it, you will realize that it’s not that bad.
2. Use Proxies When You Can
Remember how we mentioned above that Craigslist is always on the hunt for scrapers? The solution to this is using a proxy when you scrape Craigslist. The only way that Craigslist has to identify a scraper is to observe that the same IP address is accessing various pages again and again, in a really short amount of time.
They won’t even be able to tell what the user is doing, which means that you could just be browsing Craigslist. However, if they think that your IP address is accessing too many web pages at once, they will limit or restrict you. This is why using a proxy is so fundamental, as it will funnel your traffic through a rotating selection of servers, which masks where the requests originally come from.
This means that it’s much harder for Craigslist to keep track of the IP address that you’re using, because it is changing all the time. There is much less chance of getting restricted or banned when using a proxy for your Craigslist scraping. We recommend using High Proxies or Blazing SEO Proxy for all of your Craigslist scraping needs.
One thing to note here is that you will need to work out how to route your scraper’s traffic through a proxy.
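How you do this depends on the tool, but if you are writing your own scraper in Python, a rough sketch might look like the following. It uses the standard library’s `urllib.request.ProxyHandler` and cycles through a list of proxies, one per request. The proxy addresses are placeholders from a reserved documentation range, and the `make_opener`, `next_opener` and `fetch` helpers are our own names. A paid proxy service will usually give you its own addresses and credentials to plug in.

```python
import itertools
import urllib.request

# Placeholder addresses from a reserved documentation range, not real proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def make_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Endless round-robin iterator over the proxy list.
proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return the next proxy in the rotation together with its opener."""
    proxy_url = next(proxy_cycle)
    return proxy_url, make_opener(proxy_url)


def fetch(url, timeout=10):
    """Fetch a URL through the next proxy in the rotation."""
    proxy_url, opener = next_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Nothing is requested until you call `fetch` with a URL, and with the placeholder addresses above the call would simply fail to connect.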
3. Collate and Harvest Data
Once you have set your web scraper up, and decided which proxy to use with it, then you will be ready to scrape. All you’ve got to do is run it and collect the data. There’s a good chance that the output will be into a CSV file which you can open on any spreadsheet program, including Google Sheets or Excel.
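If you would rather work with the CSV in code than in a spreadsheet, Python’s built-in `csv` module can read it. The sketch below assumes a file with `title`, `price` and `location` columns and prices written like `$250`. Those column names and the `load_rows` helper are assumptions, so adjust them to whatever your tool actually produces.

```python
import csv
import io

# Stand-in for a scraper's output file; the column names are assumptions.
SAMPLE_CSV = """title,price,location
Road bike,$250,Oakland
Kids bike,$40,Berkeley
"""


def load_rows(csv_file):
    """Read rows from an open CSV file and convert prices to integers."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["price"] = int(row["price"].lstrip("$").replace(",", ""))
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(sum(row["price"] for row in rows))  # prints: 290
```

With a real file, you would pass an open file object instead of the `io.StringIO` wrapper, for example `open("results.csv", newline="")`, where `results.csv` is a hypothetical file name.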
Now, all that’s left to do is to go through the data and use it for whatever you want. However, we again recommend that you don’t put it to public commercial use. Remember, Craigslist is much more likely to send a cease and desist to you if you do. This is why it’s a lot safer to use Craigslist data for personal use, as the worst they can do is block your IP address, which isn’t going to matter much if you are using a proxy. Good luck!