From time to time, developers need to parse content to extract data from it. Usually these are plain HTML pages, but sometimes you have to scrape more advanced sites that call for more powerful tools. In this series of blog posts I want to show you how to do that. I'll describe the approaches one by one and go over their pros and cons. First, we will look at what PHP offers out of the box for working with XML (SimpleXML and DOM). Then we will explore more and more powerful libraries such as CssSelector, DomCrawler, Goutte and CasperJs, which can help you reach your goals and make your life much easier and more pleasant. Are you ready to dive in? Let's go.
Challenge
For example, say we have to scrape all Blu-ray players from here. We're interested in the following data:
- title
- current price
- discount
- old price
- url
The approach we take depends on the library's capabilities: it may support only XML traversal, or also XPath, CSS selectors and so on. So now we need to choose our first tool and scrape the data.
Instrument: SimpleXML
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
This extension is built on top of libxml and has been enabled by default since PHP 5.1.2, so you can use it in your application right away. Note that you won't enjoy processing big HTML documents with SimpleXML. I'm showing this solution for educational purposes, so that you can compare it with the other approaches later and appreciate their benefits.
Here we define the `$domain` variable (we will need it later to build absolute URLs to the product pages) and `$targetPage` (the product list).
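To make "property selectors and array iterators" concrete, here is a minimal sketch on a tiny XML document. The `<catalog>` markup and all names in it are made up for illustration and have nothing to do with the target site.

```php
<?php
// Hypothetical sample document, not taken from the scraped site.
$xmlString = <<<XML
<catalog>
    <player id="1">
        <title>Player A</title>
        <price currency="EUR">99.90</price>
    </player>
    <player id="2">
        <title>Player B</title>
        <price currency="EUR">149.00</price>
    </player>
</catalog>
XML;

$catalog = simplexml_load_string($xmlString);

$titles = [];
// Sibling elements with the same name can be iterated like an array.
foreach ($catalog->player as $player) {
    // Child elements are read as properties, attributes with array syntax.
    $titles[(string) $player['id']] = (string) $player->title;
}

// A numeric index picks one of the same-named siblings (zero-based).
$secondPrice = (float) (string) $catalog->player[1]->price;
$currency = (string) $catalog->player[1]->price['currency'];
```

Every value comes back as a `SimpleXMLElement`, so the explicit `(string)` casts are what turn nodes into plain PHP values.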

    $domain = 'http://bgbstudio.com';
    $playersCategory = 'proizvodi/blu-ray-plejeri';
    $targetPage = $domain . '/' . $playersCategory;

The next step is to create a SimpleXMLElement. Since we are working with an HTML page, not XML, we can't load its content into a SimpleXMLElement directly. Instead, we first load the HTML into a DOMDocument object and then create a SimpleXMLElement from it with the `simplexml_import_dom()` function:

    $dom = new DOMDocument();
    // We use @ here to suppress HTML parsing warnings and keep the demo short.
    @$dom->loadHTMLFile($targetPage);
    // Create a SimpleXMLElement from the DOMDocument object
    $xml = simplexml_import_dom($dom);

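If you would rather not silence errors with `@`, one alternative (not what this post uses) is to let libxml collect parse errors internally. The sketch below assumes an inline HTML string in place of the downloaded page, so it runs without network access. The markup is hypothetical.

```php
<?php
// Hypothetical inline markup standing in for the downloaded page.
$html = '<html><body><ul>'
      . '<li><p class="price">10</p></li>'
      . '<li><p>no closing tag</li>'
      . '</ul></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Whatever the parser complained about is available here for logging.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling mode.
libxml_use_internal_errors($previous);

// Same import step as in the post.
$xml = simplexml_import_dom($dom);
$itemCount = count($xml->body->ul->li);
```

The import step is unchanged. Only the error handling differs, and `$errors` lets you decide later whether a broken page is worth a log entry.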
OK, now we have a simple XML structure. What's next? Keep in mind that SimpleXML has no CSS selector support. It does provide `SimpleXMLElement::xpath()`, but here we stick to its plain property and array access, so our hands are tied quite a lot: we can only descend the DOM one element at a time. Let's check the source code of the page with the player list and figure out how to traverse to the DOM elements we need.
Here you can see that the page has a lot of nested HTML elements. The `<html>` tag is the root element, and we have to go through many levels to reach the `<li>` elements, the containers we will iterate over later to get each product's info.

    $productsInfo = [];
    // As <html> is the root element, we can access <body> directly from it
    $body = $xml->body;
    // With this DOM structure, the only way to reach the products is to walk down level by level
    $productOption = $body->div[2]->div->div->div[1]->form->div[2]->ul->li;

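If the index chain looks cryptic, the following sketch shows how positional access behaves on a much shallower, made-up page. The markup and the `id` values are assumptions for illustration only.

```php
<?php
// Hypothetical markup, much shallower than the real page.
$html = <<<HTML
<html><body>
  <div id="header"></div>
  <div id="menu"></div>
  <div id="content">
    <ul>
      <li><h2>Player A</h2></li>
      <li><h2>Player B</h2></li>
    </ul>
  </div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

// div[2] is the third <div> child of <body>; indexes are zero-based.
$content = $xml->body->div[2];
$contentId = (string) $content['id'];

// Without an index, ->ul means the first <ul>; ->li iterates over all its <li> children.
$titles = [];
foreach ($content->ul->li as $li) {
    $titles[] = (string) $li->h2;
}

// isset() is a cheap guard against a path that no longer exists.
$hasFourthDiv = isset($xml->body->div[3]);
```

Because every step depends on position, inserting one extra `<div>` before the content block would silently point `div[2]` at the wrong element. Guarding the path with `isset()` at least makes such a change detectable.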
Now we have the product wrappers and can iterate over them to get all the information in a more or less convenient way. Let's check the DOM structure of a product:
In each `<li>` element we just need to find the div with the class `product-block-content` and then read the data according to the DOM tree above:

    foreach ($productOption as $li) {
        // div.product-block-content is the closest wrapper around the product
        $wrapper = $li->div;
        $currentPrice = $oldPrice = 'N/A';
        // Two <p> tags carry the prices; we tell them apart by class
        foreach ($wrapper->p as $p) {
            $class = (string) $p['class'];
            if ($class === 'product-block-price') {
                $currentPrice = (string) $p;
            } elseif ($class === 'product-block-price-old') {
                $oldPrice = (string) $p;
            }
        }
        // Collect the rest of the information (see the DOM tree above)
        $productsInfo[] = [
            'discount' => (string) $wrapper->div ?: 0,
            'url' => $domain . $wrapper->a['href'],
            'title' => (string) $wrapper->h2->a,
            'current_price' => $currentPrice,
            'old_price' => $oldPrice,
        ];
    }
    print_r($productsInfo);

Result:
Pros
+ quite helpful with XML documents
+ easy to use with shallow trees
Cons
- can't be used directly with HTML/JS documents
- inconvenient for deep DOM structures, and the resulting code is hard to follow
- XPath is available only through the separate `xpath()` method; property access can't select by class or attribute
- doesn't support CSS selectors
- hard to maintain (you will need to change a lot of code if the site changes the parent-child relations of its DOM elements)
Conclusion
I suppose SimpleXML can be helpful for processing small XML documents (that's what it was designed for :) ), but please don't use it for big HTML documents. There are many more pleasant ways.
What's next?
Next time we will see what the PHP DOM extension has to offer and compare it to SimpleXML.