So recently I had to scrape some information off a website and found out that it is actually quite easy. The only problem I encountered was running out of heap space, because one of the webpages I was scraping had a lot of junk HTML in it. My approach was to use the PHP Simple HTML DOM library, which took care of all the hard work.
Let me show you how easy it is to grab elements with Simple HTML DOM, using Slashdot.org as an example. Here I want all the headings for the news on Slashdot. I know this could be done by grabbing the RSS feed, but this is just to demonstrate.
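<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);
if (!$html) {
    die("Could not load $url\n"); // file_get_html returns false on failure
}

// Strip leading whitespace, collapse runs of whitespace, strip trailing whitespace
$pat = array('/^\s+/', '/\s{2,}/', '/\s+$/');
$rep = array('', ' ', '');

foreach ($html->find('h2') as $heading) { // for each header
    $link = $heading->find('span a', 0);
    if ($link === null) {
        continue; // skip any h2 that has no span/link inside
    }
    echo preg_replace($pat, $rep, $link->plaintext) . "\n"; // echo the text inside
}

$html->clear(); // release the memory held by the parsed DOM
?>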
In the example above, I am finding all the H2 occurrences in the HTML code, which is how Slashdot presents the headings for its news entries. Inside each H2 tag there is a span and a link, and the link contains the heading. I am just grabbing that heading and removing all unnecessary whitespace, because I don't need it. The output of running the script is shown below.
5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
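If you want to see what the whitespace cleanup does on its own, here is a minimal sketch that needs no network access and no library. The function name clean_heading and the sample string are made up for illustration; the three patterns are the same ones used in the script.

```php
<?php
// Hypothetical helper name; the patterns are the three from the script.
function clean_heading($text) {
    $pat = array(
        '/^\s+/',    // whitespace at the start of the string
        '/\s{2,}/',  // any run of two or more whitespace characters
        '/\s+$/',    // whitespace at the end of the string
    );
    $rep = array('', ' ', '');
    // With arrays, preg_replace applies each pattern in order to the
    // result of the previous one.
    return preg_replace($pat, $rep, $text);
}

// Made-up sample with the kind of padding scraped HTML tends to have.
$raw = "\n   Origins of Lager    Found In Argentina  \n";
echo clean_heading($raw) . "\n";
```

Note that a single tab or newline between two words is left alone, since the middle pattern only matches runs of two or more whitespace characters.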
I call that easy! The script would of course need to be modified if you are trying to scrape something else. Simple HTML DOM can find all sorts of information for you, for example paths to images from img tags or the content of specific divs. Be warned: if you load a lot of data to scrape, you might run into Zend memory manager heap errors, meaning PHP has run out of memory. If that happens, I suggest doing your scraping in iterations.
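Here is one way scraping in iterations could look. It is only a sketch under my own assumptions: the function scrape_in_batches, its parameters and the stub callback are hypothetical and not part of Simple HTML DOM. The idea is to handle a few URLs at a time, keep only the small extracted result for each page, and let PHP reclaim memory between batches.

```php
<?php
// Hypothetical helper: walks the URL list in small batches and stores only
// what the callback returns for each page, never the parsed page itself.
function scrape_in_batches(array $urls, $scrapeOne, $batchSize = 5) {
    $results = array();
    foreach (array_chunk($urls, $batchSize) as $batch) {
        foreach ($batch as $url) {
            // $scrapeOne should load one page, extract what it needs,
            // free the page, and return just the extracted data.
            $results[$url] = call_user_func($scrapeOne, $url);
        }
        // Ask PHP to collect circular references left over from this batch.
        gc_collect_cycles();
    }
    return $results;
}

// Stub callback for demonstration only; it does not touch the network.
$fake = function ($url) {
    return strlen($url);
};
print_r(scrape_in_batches(array('http://a.example/', 'http://b.example/x'), $fake, 1));
```

In a real run the callback would call file_get_html($url), pull out the strings it needs, and then call $html->clear() and unset($html) before returning, so the parsed DOM of one page is gone before the next one is loaded.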