Step 2: Parsing content using the DOM extension

In the previous blog post I described how we can scrape data using SimpleXML, along with its pros and cons. Now I want to go one step further and introduce a slightly more convenient way to do this: the DOM extension.

If you don't know what data we want to scrape in this series of blog posts, please read the related parts of the SimpleXML article. From here on I'll assume you are familiar with it, so we can explore the DOM extension.

Instrument: DOM extension

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.

This extension gives us two useful ways to process DOM elements: the DOM API and XPath selectors. Both have their limitations, and we will look at each of them.

// Configure endpoint
$domain = 'http://bgbstudio.com';
$playersCategory = 'proizvodi/blu-ray-plejeri';
$targetPage = $domain . '/' . $playersCategory;

// Next we need to load HTML and create DOMDocument
$dom = new DOMDocument();

// @ suppresses parser warnings here only to keep the example short (for demo purposes only)
@$dom->loadHTMLFile($targetPage);
$productsInfo = [];
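
Outside of a demo you may prefer not to silence everything with @. Here is a minimal sketch of one alternative, assuming you are fine with collecting libxml warnings in memory and inspecting them yourself. The markup in $html is made up for illustration, and which warnings (if any) get reported depends on your libxml version:

```php
<?php
// Hypothetical, deliberately sloppy markup (unclosed <p>, unknown <foo> tag)
$html = '<html><body><div id="box"><p>Hello<foo>bar</foo></div></body></html>';

// Collect libxml warnings internally instead of printing them
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Read what the parser complained about, then clear the buffer
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling mode
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with message, line and column
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

The same pattern should work with loadHTMLFile(); I use loadHTML() with an inline string here only so the snippet doesn't depend on a remote page.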

OK, we loaded the HTML and created a DOMDocument instance. Let's look at the two ways we can work with it now:

DOM API

There are a few frequently used methods that are really convenient. For example, we can get an element by id:

/** @var DOMElement $domElement */
$domElement = $dom->getElementById('some-wrapper-id');

Also we can get elements by tag names:

/** @var DOMNodeList $nodeList */
$nodeList = $dom->getElementsByTagName('div');

This method returns an instance of DOMNodeList, which implements the Traversable interface, so you can pass the result to `foreach` and iterate through all its elements.
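
To make this concrete, here is a small sketch of working with a DOMNodeList. The list markup is hypothetical and only there so the snippet runs on its own:

```php
<?php
// Hypothetical markup, used only for illustration
$html = '<ul><li>First</li><li>Second</li><li>Third</li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$items = $dom->getElementsByTagName('li');

// DOMNodeList exposes its size through the length property
$count = $items->length;

// foreach visits the nodes in document order
$texts = [];
foreach ($items as $item) {
    $texts[] = $item->nodeValue;
}

// item() returns null when the index does not exist
$missing = $items->item(10);

print_r($texts);
```

The null returned by item() for a missing index is the reason the scraping code further down checks a node before reading its attributes.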

One more useful method reads an attribute of a DOMElement:

/** @var DOMElement $domElement */
$domElement = $dom->getElementById('test');
$class = $domElement->getAttribute('class');
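
One detail worth knowing: getAttribute() returns an empty string when the attribute is missing, so hasAttribute() is the way to tell "missing" from "empty". A short sketch with made-up markup:

```php
<?php
// Hypothetical markup: the link has a class but no title attribute
$dom = new DOMDocument();
$dom->loadHTML('<a id="test" class="btn primary" href="/cart">Buy</a>');

/** @var DOMElement $link */
$link = $dom->getElementsByTagName('a')->item(0);

// An existing attribute is returned as a plain string
$class = $link->getAttribute('class');

// A missing attribute yields an empty string, not null
$title = $link->getAttribute('title');

// hasAttribute() distinguishes a missing attribute from an empty one
$hasTitle = $link->hasAttribute('title');
```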

Let's collect all products' data using this approach:

// Get all products container by id
$container = $dom->getElementById('lista-proizvoda-na-akciji');

// Each product itself holds its data in div with class "product-block-content"
$divs = $container->getElementsByTagName('div');
foreach ($divs as $div) {
    /** @var DOMElement $div */
    if ($div->getAttribute('class') == 'product-block-content') {
        $info = [];
        // Just to make it a little bit more readable
        $productWrapper = $div;

        // discount
        /** @var DOMElement $discount */
        $discount = $productWrapper->getElementsByTagName('div')->item(0);
        $info['discount'] = $discount && $discount->getAttribute('class') == 'badge-sale'
            ? $discount->nodeValue
            : 0
        ;

        // url
        /** @var DOMElement $link */
        $link = $productWrapper->getElementsByTagName('a')->item(0);
        // href is relative here, so we need to add host to it
        $info['url'] = $domain . $link->getAttribute('href');

        // title
        $info['title'] = $productWrapper
            ->getElementsByTagName('h2')
            ->item(0)
            ->getElementsByTagName('a')
            ->item(0)
            ->nodeValue
        ;

        // current price
        /** @var DOMElement $currentPrice */
        $currentPrice = $productWrapper->getElementsByTagName('p')->item(1);
        $info['current_price'] = $currentPrice && $currentPrice->getAttribute('class') == 'product-block-price'
            ? $currentPrice->nodeValue
            : 'N/A'
        ;

        // old price
        /** @var DOMElement $oldPrice */
        $oldPrice = $productWrapper->getElementsByTagName('p')->item(2);
        $info['old_price'] = $oldPrice && $oldPrice->getAttribute('class') == 'product-block-price-old'
            ? $oldPrice->nodeValue
            : 'N/A'
        ;

        $productsInfo[] = $info;
    }
}

// Here we have all needed information:
print_r($productsInfo);

As you can see, this approach is powerful, but not as flexible as we would like. We can only look elements up by id or tag name and then filter them by hand, so we can't point directly at the exact elements we want.
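
One example of that manual filtering: comparing the whole class attribute with == only matches elements that carry exactly one class. Below is a sketch of a workaround. hasClass() is a hypothetical helper of mine, not part of the DOM extension, and the markup is invented:

```php
<?php
/**
 * Hypothetical helper: checks whether an element carries a class token,
 * regardless of other classes or extra whitespace in the attribute.
 */
function hasClass(DOMElement $element, $class)
{
    $tokens = preg_split('/\s+/', trim($element->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);

    return in_array($class, $tokens, true);
}

// Hypothetical markup: the second block has an additional class
$html = '<div class="product-block-content">A</div>'
      . '<div class="product-block-content featured">B</div>'
      . '<div class="sidebar">C</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$exact = [];
$byToken = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    // Exact string comparison, as in the loop above
    if ($div->getAttribute('class') == 'product-block-content') {
        $exact[] = $div->nodeValue;
    }
    // Token-based comparison via the helper
    if (hasClass($div, 'product-block-content')) {
        $byToken[] = $div->nodeValue;
    }
}
```

With this markup the exact comparison finds only block A, while the token-based check finds A and B.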

XPath queries

XPath is a language for addressing parts of an XML document.

Another option is to create a DOMXPath object and then traverse DOM elements via XPath queries. You can do it like this:

// Create DOMDocument and load content into it
$dom = new DOMDocument();
$dom->loadHTMLFile($targetPage);

// Create DOMXPath from DOMDocument
$xpath = new DOMXPath($dom);

// Now you can query for DOM elements via XPath queries
$linkXPath = 'descendant-or-self::a';
$nodeList = $xpath->query($linkXPath);
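
The second argument of query() is a context node, and relative expressions are evaluated from it. A small sketch of the difference, with hypothetical markup:

```php
<?php
// Hypothetical markup for illustration
$html = '<div id="menu"><a href="/a">A</a><a href="/b">B</a></div>'
      . '<div id="footer"><a href="/c">C</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Without a context node the query runs against the whole document
$allLinks = $xpath->query('//a');

// Pick one element to use as the context node
$menu = $xpath->query('//div[@id="menu"]')->item(0);

// With a context node, a relative query only sees that node's subtree
$menuLinks = $xpath->query('descendant-or-self::a', $menu);

$hrefs = [];
foreach ($menuLinks as $link) {
    $hrefs[] = $link->getAttribute('href');
}
```

This is the mechanism the product loop relies on: each field is queried relative to its own product wrapper.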

Let's try this approach too:

// Create DOMXPath from DOMDocument
$xpath = new DOMXPath($dom);

// Long, but still easy to understand, XPath selector for a product container
$productWrapperXPath = "descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' product-block-content ')]";

/** @var DOMNodeList $productWrappers */
$productWrappers = $xpath->query($productWrapperXPath);

foreach ($productWrappers as $productWrapper) {
    $info = [];

    // discount
    $discountXPath = "descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' badge-sale ')]";
    /** @var DOMElement $discount */
    $discount = $xpath->query($discountXPath, $productWrapper)->item(0);
    $info['discount'] = $discount ? $discount->nodeValue : 0;

    // url
    $linkXPath = 'descendant-or-self::a';
    /** @var DOMElement $link */
    $link = $xpath->query($linkXPath, $productWrapper)->item(0);
    // href is relative here, so we need to add host to it
    $info['url'] = $domain . $link->getAttribute('href');

    // title
    $titleXPath = 'descendant-or-self::h2/descendant-or-self::*/a';
    $info['title'] = $xpath->query($titleXPath, $productWrapper)->item(0)->nodeValue;

    // current price
    $currentPriceXPath = "descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' product-block-price ')]";
    /** @var DOMElement $currentPrice */
    $currentPrice = $xpath->query($currentPriceXPath, $productWrapper)->item(0);
    $info['current_price'] = $currentPrice ? $currentPrice->nodeValue : 'N/A';

    // old price
    $oldPriceXPath = "descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' product-block-price-old ')]";
    /** @var DOMElement $oldPrice */
    $oldPrice = $xpath->query($oldPriceXPath, $productWrapper)->item(0);
    $info['old_price'] = $oldPrice ? $oldPrice->nodeValue : 'N/A';

    $productsInfo[] = $info;
}

// Here we have all needed information:
print_r($productsInfo);

With this approach we can select the elements we need more precisely, but you have to know XPath, and such selectors can get really long and complicated. I think that's a big blocker for many web developers, because they're more familiar with CSS selectors than with XPath.
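
One way to tame the repetition is to hide the class-matching expression behind a small function. This is only a sketch: classXPath() is a hypothetical helper, it assumes the class name contains no quotes or spaces, and the markup is invented:

```php
<?php
/**
 * Hypothetical helper: builds the XPath counterpart of the CSS selector "tag.class".
 * Assumes $class is a plain class name without quotes or spaces.
 */
function classXPath($class, $tag = '*')
{
    return sprintf(
        "descendant-or-self::%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );
}

// Hypothetical markup for illustration
$html = '<div class="product-block-content featured">'
      . '<p class="product-block-price">100</p>'
      . '<p class="product-block-price-old">150</p>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Find the wrapper first, then query each field relative to it
$wrapper = $xpath->query(classXPath('product-block-content'))->item(0);
$price = $xpath->query(classXPath('product-block-price', 'p'), $wrapper)->item(0);
$oldPrice = $xpath->query(classXPath('product-block-price-old', 'p'), $wrapper)->item(0);

echo $price->nodeValue, ' / ', $oldPrice->nodeValue, "\n";
```

The padding with spaces on both sides is what keeps product-block-price from also matching product-block-price-old.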

Result in both cases:

[Image: the results of data scraping using PHP SimpleXML]

Pros

+ quite easy to traverse even deeply nested trees of DOM elements

+ it's possible to get elements by id or tag name

+ you can describe elements with even higher precision using XPath selectors

+ easy to maintain

+ not tied so tightly to the parent-child relations between DOM elements

Cons

- XPath selectors can be quite long and complicated

- doesn't support CSS selectors

Conclusion

We went through a slightly more powerful tool: the DOM extension. It's easier to work with and it provides more API methods to traverse the DOM. At the same time it still has serious cons, like the bulky XPath selectors. I want to point out that you don't have to choose one approach or the other: you can combine them and get the best of both.
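
As a rough sketch of such a combination, you could jump to a container with getElementById() and then run a short XPath query relative to it. The markup below is a made-up, simplified imitation of the product list, not the real page:

```php
<?php
// Hypothetical, simplified markup; the last block sits outside the container
$html = '<div id="lista-proizvoda-na-akciji">'
      . '<div class="product-block-content"><h2><a href="/p/1">Player 1</a></h2></div>'
      . '<div class="product-block-content"><h2><a href="/p/2">Player 2</a></h2></div>'
      . '</div>'
      . '<div class="product-block-content"><h2><a href="/p/3">Elsewhere</a></h2></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// DOM API: jump straight to the container by id
$container = $dom->getElementById('lista-proizvoda-na-akciji');

// XPath: select only the title links inside that container
$titles = [];
if ($container !== null) {
    foreach ($xpath->query('descendant::h2/a', $container) as $link) {
        $titles[$link->getAttribute('href')] = $link->nodeValue;
    }
}

print_r($titles);
```

The id lookup keeps the XPath expression short, and the context node keeps products outside the container out of the result.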

What's next?

Next time I'll introduce CssSelector, one of Symfony2's components, which can solve our problems with bulky XPath selectors and make our work easier.

See also:

Parsing content using SimpleXML

From time to time developers need to parse content to extract the data they need from it. Usually it's just HTML pages, but sometimes you need to scrape data from more advanced sites where you have to use more powerful tools. In this series of blog posts I want to show you how you can accomplish this. I'll describe the approaches one by one and show their pros and cons. First of all, together we will check what PHP offers out of the box for working with XML (SimpleXML and DOM). Then we will explore more and more powerful libraries like CssSelector, DomCrawler, Goutte and CasperJs that can help you achieve your goals and make your life much easier and more pleasant. Are you ready to dive in? Let's go then.