Step 3. The easier way to traverse DOM: CssSelector

We have already looked at the SimpleXML and DOM extensions. Now I want to show how you can make working with the DOM easier using the Symfony2 component called CssSelector.

As you may remember, the DOM extension supports XPath queries, but they can get really long and complicated. This is where the CssSelector component really shines. Since every web developer knows CSS selectors, we can use this component to convert short, convenient selectors into bulky XPath expressions.

Instrument: CssSelector + DOM extension

The CssSelector component converts CSS selectors to XPath expressions.

The beauty of Symfony2's components is that you can easily use any of them in any PHP project. Let's install it, check what it can do for us, and then use it with the DOM extension to get the best of both.

Installation

CssSelector's installation is really simple; you just need to be a little familiar with Composer:

composer require symfony/css-selector

That is simple.

CssSelector usage example

Now that the component is installed in our project, let's check what it can do:

// We need to add Composer's autoload or another one
require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelector;

// toXPath accepts a CSS selector and converts it to an XPath expression
echo CssSelector::toXPath('p.product-block-price');

// The result is:
// descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' product-block-price ')]
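As a quick check of what this expression actually selects, here is a minimal sketch that runs it through a plain DOMXPath using only the DOM extension. The HTML fragment is made up for the illustration, and the expression is copied by hand from the output above, so nothing here depends on the component being installed.

```php
<?php
// Hypothetical markup, made up for this illustration (not taken from the target site).
$html = '<div class="product-block-content">'
      . '<p class="product-block-price special">100</p>'
      . '<p class="product-block-price-old">120</p>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The expression printed by CssSelector::toXPath('p.product-block-price') above.
$expression = "descendant-or-self::p[@class and contains("
    . "concat(' ', normalize-space(@class), ' '), ' product-block-price ')]";

// Collect the text of every matching node.
$matches = [];
foreach ($xpath->query($expression) as $node) {
    $matches[] = $node->nodeValue;
}

print_r($matches);
```

Only the first paragraph is matched. Padding the class attribute with spaces makes the expression compare whole class names, so `product-block-price-old` is not picked up, while an element carrying several classes still is.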

We can use this method and feed DOMXPath::query with the result. Great! Wouldn't it be nice to have an upgraded DOMXPath with CSS support? I bet it would, so let's create one. The first thing we need to do is configure autoloading with Composer:

"autoload": {
        "psr-4": {
            "FDevs\\": "src/"
        }
}

All you need to do is add an `autoload` section to your `composer.json` file. In my case, I store all code in the `src` directory under the `FDevs` namespace. Now we can create an improved DOMSelector:

<?php

// src/DOMSelector.php
namespace FDevs;

use Symfony\Component\CssSelector\CssSelector;

class DOMSelector extends \DOMXPath
{
    /**
     * @param string $selector
     * @param \DOMNode|null $contextNode
     * @return \DOMNodeList
     */
    public function queryCss($selector, $contextNode = null)
    {
        return $this->query(CssSelector::toXPath($selector), $contextNode);
    }
}

It was that simple. Now we have a DOMSelector which supports both XPath and CSS selectors. Let's recall the challenge we had and rewrite our old code using what we've just learned:

// Configure endpoint
$domain = 'http://bgbstudio.com';
$playersCategory = 'proizvodi/blu-ray-plejeri';
$targetPage = $domain . '/' . $playersCategory;

// Next we need to load HTML and create DOMDocument
$dom = new DOMDocument();

// @ suppresses parser warnings to keep the example short; for demo purposes only
@$dom->loadHTMLFile($targetPage);
$productsInfo = [];

// Now we can create our improved DOMSelector
$domSelector = new \FDevs\DOMSelector($dom);

/** @var DOMNodeList $productWrappers */
$productWrappers = $domSelector->queryCss('.product-block-content');

foreach ($productWrappers as $productWrapper) {
    $info = [];

    // discount
    /** @var DOMElement $discount */
    $discount = $domSelector->queryCss('.badge-sale', $productWrapper)->item(0);
    $info['discount'] = $discount ? $discount->nodeValue : 0;

    // url
    /** @var DOMElement $link */
    $link = $domSelector->queryCss('a', $productWrapper)->item(0);
    // href is relative here, so we need to prepend the host ($domain, defined above)
    $info['url'] = $domain . $link->getAttribute('href');

    // title
    // a huge XPath expression becomes a simple, easy-to-read CSS selector
    $info['title'] = $domSelector->queryCss('h2 a', $productWrapper)->item(0)->nodeValue;

    // current price
    /** @var DOMElement $currentPrice */
    $currentPrice = $domSelector->queryCss('p.product-block-price', $productWrapper)->item(0);
    $info['current_price'] = $currentPrice ? $currentPrice->nodeValue : 'N/A';

    // old price
    /** @var DOMElement $oldPrice */
    $oldPrice = $domSelector->queryCss('p.product-block-price-old', $productWrapper)->item(0);
    $info['old_price'] = $oldPrice ? $oldPrice->nodeValue : 'N/A';

    $productsInfo[] = $info;
}

// Here we have all needed information:
print_r($productsInfo);

It seems that we've moved forward again. This approach is quite powerful: it supports both XPath and CSS selectors, and you can target elements almost as you do with jQuery.
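One caveat if you are on a newer version of the component: the static `CssSelector` class used in this post was deprecated in Symfony 2.8 and removed in 3.0, where `CssSelectorConverter` takes its place. Below is a minimal sketch of the same idea on top of the converter. It assumes symfony/css-selector 2.8 or newer is installed and that Composer's autoloader sits at the same path as earlier in this post. The class name `ConverterDOMSelector` and the HTML fragment are made up for the illustration.

```php
<?php
// Assumption: symfony/css-selector >= 2.8, installed with Composer as shown earlier.
require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

// Hypothetical counterpart of FDevs\DOMSelector built on the non-static converter.
class ConverterDOMSelector extends DOMXPath
{
    /** @var CssSelectorConverter */
    private $converter;

    public function __construct(DOMDocument $document)
    {
        parent::__construct($document);
        $this->converter = new CssSelectorConverter();
    }

    /**
     * @param string $selector
     * @param DOMNode|null $contextNode
     * @return DOMNodeList
     */
    public function queryCss($selector, $contextNode = null)
    {
        return $this->query($this->converter->toXPath($selector), $contextNode);
    }
}

// Made-up fragment that mimics the structure scraped in this post.
$dom = new DOMDocument();
$dom->loadHTML('<div class="product-block-content"><h2><a href="/p/1">Player</a></h2></div>');

$selector = new ConverterDOMSelector($dom);

// Collect the text of every product title link.
$titles = [];
foreach ($selector->queryCss('.product-block-content h2 a') as $link) {
    $titles[] = $link->nodeValue;
}

print_r($titles);
```

The rest of the scraping loop can stay as it is, because `queryCss` keeps the same signature as in the version above.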

The Result:

The results of data scraping using php simplexml

Pros

+ supports both XPath and CSS selectors

+ easy to target needed elements

+ no need to construct bulky XPath expressions; you can use familiar CSS selectors

+ easy to support

+ really flexible

Cons

- hmm... any? :)

Conclusion

We spent less than 10 minutes and now have quite a powerful tool for data scraping with far fewer cons than we had before. Now you can be proud of yourself, but only a little :). That's because we're still quite limited in what we can do. We can scrape data from HTML pages, but it's hard to imagine imitating a user's behaviour if we need to achieve something more complicated. Also, we can't even dream about parsing heavily asynchronous sites with tons of JavaScript for now. Nothing is impossible when you are well-motivated, so stay tuned and check this blog for new articles.

What's next?

Next time I'll introduce DomCrawler, an even more interesting Symfony2 component with a bunch of syntactic sugar.

See also:

Parsing content using SimpleXML

From time to time developers need to parse content to extract the data they need from it. Usually it's just HTML pages, but sometimes you need to scrape data from more advanced sites where you have to use more powerful tools. In this series of blog posts I want to show you how you can accomplish this. I'll describe the approaches one by one and show their pros and cons. First of all, we will check together what PHP offers out of the box for working with XML (SimpleXML and DOM). Then we will explore more and more powerful libraries like CssSelector, DomCrawler, Goutte and CasperJs that can help you achieve your goals and make your life much easier and more pleasant. Are you ready to dive in? Let's go then.