PHP

Matching two arrays and removing values from one of them

Nothing too exciting but here it comes.

While doing some scraping, I ran into a small problem: I already had some data that did not exactly match the new data I had scraped, but parts of it were contained in the new data. What I needed to do was find every occurrence of the old data in the new data and remove it, because I already had it lying around.

I had two files with the two data sets to be matched, each with its entries separated by line breaks:

<?php
//load data
$dataset1 = file_get_contents('data1.csv');
$dataset2 = file_get_contents('data2.csv');

//make data into arrays
$dataarr1 = explode("\n", $dataset1);
$dataarr2 = explode("\n", $dataset2);

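foreach($dataarr1 as $d1) { //shoop da loop!
        if ($d1 === "") continue; //skip blank lines, they would match everything
        foreach($dataarr2 as $d2key => $d2) {
                //escape the old entry so it is matched literally and not as a regex
                if (preg_match("/".preg_quote($d1, "/")."/", $d2)) {
                        unset($dataarr2[$d2key]); //remove from the array, not from the raw string
                }
        }
}

//write the data
$fileres = fopen("outdata.csv", "w+");
fwrite($fileres, implode("\n", $dataarr2));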
fclose($fileres);
?>

What this essentially does is take two datasets and match every entry in the first against every line in the second. If an entry from the first is contained in a line of the second, that line is deleted from the second dataset. Once all combinations have been tried, what is left of the second dataset is written to a file.
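
The same idea can be sketched as a small function that works on arrays instead of files, which makes it easy to try out. This is only a sketch under a few assumptions: the function name removeContained and the sample data are made up for illustration, and it uses a plain substring check with strpos instead of a regular expression.

```php
<?php
// Hypothetical helper: returns the lines of $new that contain none of the
// non-empty entries of $old as a substring.
function removeContained(array $old, array $new): array
{
    // Blank entries would match every line, so drop them first.
    $needles = array_filter($old, function ($entry) {
        return $entry !== '';
    });

    $kept = array_filter($new, function ($line) use ($needles) {
        foreach ($needles as $needle) {
            if (strpos($line, $needle) !== false) {
                return false; // already known, drop the line
            }
        }
        return true; // nothing matched, keep the line
    });

    return array_values($kept); // reindex from 0
}

// Made-up sample data
$old = array('alice', 'bob');
$new = array('1,alice,DK', '2,carol,SE', '3,bob,NO', '4,dave,DK');
print_r(removeContained($old, $new));
```

Like the script above, this compares every old entry with every new line, so it is fine for small files but will get slow on large ones.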

Scraping web pages with PHP

So recently I had to scrape some information off a website and found out that it is actually quite easy. The only problem I encountered was that I ran out of heap space, because one of the web pages I was scraping had a lot of junk HTML in it. My approach was to use the PHP Simple HTML DOM library, which took care of all the hard work.

Here is an example using Slashdot.org that shows how easy it is to grab elements using Simple HTML DOM. I want all the news headings on Slashdot. I know this could be done by grabbing the RSS feed, but this is just to demonstrate.

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach($html->find('h2') as $heading) { //for each header
        echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n"; //echo the text inside
}
?>

In the example above I find all the H2 occurrences in the HTML code, which is how Slashdot presents the headings for its news entries. Inside each H2 tag there is a span and a link, and the link contains the heading. I grab that heading and remove all unnecessary whitespace: the three patterns strip leading whitespace, collapse runs of two or more whitespace characters into a single space, and strip trailing whitespace. The output of running the script looks like this:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

I call that easy! The script would of course need to be modified if you are trying to scrape something else. Simple HTML DOM can find all sorts of information for you, e.g. paths to images from img tags, content in specific divs and so on. Be warned, you might run into Zend MM heap problems (PHP running out of memory) if you are trying to load a lot of data to be scraped. If so, I suggest doing your scraping in iterations.
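
To give an idea of what grabbing image paths looks like, here is a minimal sketch. Note that it does not use Simple HTML DOM: it uses PHP's built-in DOMDocument class instead, so it runs without any extra files (it assumes the DOM extension is enabled). The function name imageSources and the HTML snippet are made up for illustration.

```php
<?php
// Hypothetical helper: collects the src attribute of every img tag in an HTML string.
function imageSources(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) { // skip images without a src
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

// Made-up HTML snippet
$html = '<div><img src="/a.png" alt="A"><p>text</p><img src="/b.jpg"><img alt="no source"></div>';
print_r(imageSources($html));
```

With Simple HTML DOM the equivalent would be a find('img') call and reading the src of each element, in the same way the headings are read in the Slashdot script.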

PHP script for updating DDNS entries at GratisDNS.dk

Some of you might use the DDNS service over at GratisDNS.dk, which is a pretty cool service if you do not have a static IP. For the purpose of updating the DNS entries automagically I have made a PHP cURL script, which calls the URL GratisDNS provides for updating DDNS entries. It is a pretty simple script, though if you are paranoid you should modify it so that it takes the SSL certificate GratisDNS uses and uses it along with CURLOPT_SSL_VERIFYPEER. More information on that at the unitstep blog.

As said, this is a pretty simple script, which just takes the entries in an array and calls a URL with the information from each entry. Here it goes:

<?php
//Enter the entries to be updated here, along with the GratisDNS username and DDNS password.
//Each entry is appended to the list, so several hosts can share the same domain.
//Format is: $domains[] =
//                      array(
//                              "domain" => "<your domain entry (primary DNS)>",
//                              "username" => "<gratisDNS username>",
//                              "password" => "<gratisDNS DDNS password>",
//                              "host" => "<the host you want to update under the domain entry>",
//                      );
$domains = array();
$domains[] = array("domain" => "ostebaronen.dk", "username" => "myusername", "password" => "mypassword", "host" => "ostebaronen.dk");
$domains[] = array("domain" => "ostebaronen.dk", "username" => "myusername", "password" => "mypassword", "host" => "test.ostebaronen.dk");

// create a new cURL resource
$ch = curl_init();
// Do not output all HTML when executing
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Do not verify SSL
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Go through all entries
foreach($domains as $value) {
        // URL-encode the values so special characters in the password do not break the query string
        $url = "https://ssl.gratisdns.dk/ddns.phtml?u=".urlencode($value["username"])."&p=".urlencode($value["password"])."&d=".urlencode($value["domain"])."&h=".urlencode($value["host"]);
        // Set the URL to request
        curl_setopt($ch, CURLOPT_URL, $url);
        // Execute
        curl_exec($ch);
}

// close cURL resource, and free up system resources
curl_close($ch);
?>

The code should explain itself. Feel free to use it, and questions regarding it are welcome.
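
For the paranoid, here is a rough sketch of what the stricter variant could look like. It builds the update URL with http_build_query, so every value gets URL-encoded, and sets up a cURL handle that verifies the server certificate. The function names buildDdnsUrl and createVerifiedHandle are made up for illustration, and the path to the CA bundle is an assumption: you have to point it at a certificate file you trust yourself.

```php
<?php
// Hypothetical helper: builds the GratisDNS update URL with every value URL-encoded.
function buildDdnsUrl(string $username, string $password, string $domain, string $host): string
{
    $query = http_build_query(array(
        'u' => $username,
        'p' => $password,
        'd' => $domain,
        'h' => $host,
    ), '', '&');
    return 'https://ssl.gratisdns.dk/ddns.phtml?' . $query;
}

// Hypothetical helper: a cURL handle that checks the server certificate
// against the CA bundle at $caFile (the path is up to you, it is an assumption here).
function createVerifiedHandle(string $caFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true); // verify the certificate chain
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);    // verify the host name in the certificate
    curl_setopt($ch, CURLOPT_CAINFO, $caFile);      // certificates to trust
    return $ch;
}

// Only the URL is built here, no request is sent.
echo buildDdnsUrl('myusername', 'p&ss word', 'ostebaronen.dk', 'test.ostebaronen.dk') . "\n";
```

To actually send the update, you would create the handle once with createVerifiedHandle, set CURLOPT_URL to the built URL for each entry and call curl_exec, just like in the script above.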