I recently had to scrape some information off a website and found that it is actually quite easy. The only problem I encountered was running out of heap space, because one of the pages I was scraping contained a lot of junk HTML. My approach was to use the PHP Simple HTML DOM library, which took care of all the hard work.
Here is an example using Slashdot.org that shows how easy it is to grab elements with Simple HTML DOM. I want all the news headlines on Slashdot. I know this could be done by grabbing the RSS feed, but this is just a demonstration.
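<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);
if (!$html) { // stop early if the page could not be loaded
    die("Could not load $url\n");
}

// Strip leading whitespace, collapse runs of whitespace, strip trailing whitespace
$pat = array('/^\s+/', '/\s{2,}/', '/\s+$/');
$rep = array('', ' ', '');

foreach ($html->find('h2') as $heading) { // for each headline
    $link = $heading->find('span a', 0); // first link inside the span
    if ($link) { // skip any h2 that lacks the expected span and link
        echo preg_replace($pat, $rep, $link->plaintext) . "\n"; // echo the cleaned text
    }
}
?>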
In the example above, I find every H2 element in the HTML, which is how Slashdot marks up the headlines of its news entries. Inside each H2 there is a span containing a link, and the link text is the headline. I grab that text and strip the unnecessary whitespace, because I don't need it. Running the script produces output like the following.
5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
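The whitespace cleanup in the script can also be pulled out into a small helper. This is only a sketch: the function name `normalize_whitespace` is made up for illustration and is not part of Simple HTML DOM. It also behaves slightly differently from the three patterns in the script, since it turns a single tab or newline between words into a space as well.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): collapse every run of
// whitespace into a single space, then strip leading and trailing whitespace.
function normalize_whitespace($text)
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// prints "Zombie Cookies Just Won't Die"
echo normalize_whitespace("  Zombie   Cookies\n Just Won't Die  "), "\n";
```

With a helper like this, the echo inside the loop could pass the link's `plaintext` through `normalize_whitespace()` instead of calling `preg_replace` with the pattern arrays.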
I call that easy! The script would of course need to be modified if you want to scrape something else. Simple HTML DOM can find all sorts of information for you, for example image paths from img tags or the content of specific divs. Be warned: you might run into Zend memory manager (zend mm) heap problems if you try to load a lot of data to be scraped at once. If that happens, I suggest doing your scraping in iterations.
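Scraping in iterations could look roughly like the sketch below, which collects the image paths from a list of pages one page at a time and releases each parsed document before loading the next. The function name `scrape_image_paths` and its `$loader` parameter are assumptions made for this example; the loader is expected to be something like `file_get_html` from simple_html_dom.php, and `clear()` is the library's method for freeing a parsed document.

```php
<?php
// Hypothetical helper: scrape pages one at a time so that only a single
// parsed document is held in memory at any moment.
// $loader is any callable that takes a URL and returns a Simple HTML DOM
// object (or false on failure), for example 'file_get_html'.
function scrape_image_paths(array $urls, callable $loader)
{
    $images = array();
    foreach ($urls as $url) {
        $html = $loader($url);
        if (!$html) {
            continue; // loading failed, skip this page
        }
        $images[$url] = array();
        foreach ($html->find('img') as $img) {
            $images[$url][] = $img->src; // path from the img tag
        }
        // Release the parsed document before moving on to the next page
        $html->clear();
        unset($html);
    }
    return $images;
}
```

After including simple_html_dom.php, a call such as `scrape_image_paths(array('http://slashdot.org/'), 'file_get_html')` would return the image paths keyed by URL. Swapping the `'img'` selector for a div selector would collect div content in the same way, and whether this is enough to avoid the heap problems depends on how large the individual pages are.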
