Step 4. Pleasant work with the DOM - meet DomCrawler

We're moving forward on our way to understanding the best tools for web scraping. In this blog post I want to introduce the Symfony2 DomCrawler component. It provides even more flexibility, and you will definitely love it.

In the previous blog post about the Symfony2 CssSelector component we figured out how to create a flexible crawler which supports both XPath and CSS selectors on top of the DOM extension. We've got quite a good solution, but how can we process forms, links, and other DOM elements better? DomCrawler is the answer to this question.

Instrument: DomCrawler

The DomCrawler component eases DOM navigation for HTML and XML documents.

Let's install this component, check what it can do and update our old code using it.

Installation

As always, we can use composer to easily install DomCrawler:

composer require symfony/dom-crawler

If you look at the `composer.json` file of the just-installed component, you will see the following sections:

"require": {
    "php": ">=5.3.3"
},
"require-dev": {
    "symfony/phpunit-bridge": "~2.7",
    "symfony/css-selector": "~2.3"
},
"suggest": {
    "symfony/css-selector": ""
}

Now it's clear that by default DomCrawler doesn't require the CssSelector component (check the `require` section). In our case, CSS selectors are a must-have, so we need to check the `suggest` section and install the CssSelector component too:

composer require symfony/css-selector

Great. Now DomCrawler supports both XPath and CSS queries. You can check it in the `Crawler` class:

<?php

// vendor/symfony/dom-crawler/Symfony/Component/DomCrawler/Crawler.php

namespace Symfony\Component\DomCrawler;

use Symfony\Component\CssSelector\CssSelector;

class Crawler extends \SplObjectStorage
{
    // ...
    
    // Using the `filter` method we can search for elements using CSS selectors
    public function filter($selector)
    {
        if (!class_exists('Symfony\\Component\\CssSelector\\CssSelector')) {
            throw new \RuntimeException('Unable to filter with a CSS selector as the Symfony CssSelector is not installed (you can use filterXPath instead).');
        }

        return $this->filterRelativeXPath(CssSelector::toXPath($selector));
    }
    
    // ...
    
    public function filterXPath($xpath)
    {
        $xpath = $this->relativize($xpath);

        if ('' === $xpath) {
            return new static(null, $this->uri, $this->baseHref);
        }

        return $this->filterRelativeXPath($xpath);
    }
    
    // ...
}

The Structure

The DomCrawler component structure

As you can see, DomCrawler sits on top of CssSelector and the DOM extension, so you can treat it as a wrapper which provides a feature set from both. But DomCrawler isn't just a wrapper: it also provides a few handy methods plus a few useful classes:

The DomCrawler classes

You can check their public methods in Symfony2's API.
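To see for yourself that DomCrawler is a thin layer over the DOM extension, you can pull the underlying `\DOMElement` out of any crawler with `getNode`. Here is a small sketch (the HTML snippet is made up for illustration):

```php
<?php

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<html><body><p id="intro">Hello</p></body></html>');

// `getNode` returns the raw \DOMElement behind the crawler,
// so all the familiar DOM extension APIs keep working
$node = $crawler->filter('#intro')->getNode(0);

echo $node->nodeName;           // p
echo $node->getAttribute('id'); // intro
```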

DomCrawler usage example

Now I want to go through the most useful features and describe them a little. The Crawler class:

<?php

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Crawler
$crawler = new Crawler();

// Fill with content
// \DOMNodeList|\DOMNode|array|string|null
$crawler->add($content);

// Target elements with the XPath expression
$subsetCrawler = $crawler->filterXPath('descendant-or-self::a');

// Or query with the CSS selector
$subsetCrawler = $crawler->filter('a');

You can also get an element by its position:

// Get the number of elements in the set
$count = $crawler->count();

$firstElementCrawler = $crawler->first();
$lastElementCrawler = $crawler->last();

// Get an element by its position
$posCrawler = $crawler->eq($positionNum);

And, of course, you have a way to get both text and HTML:

// Get the HTML of the first node in the set
$crawler->html();

// Get the text of the first node in the set
$crawler->text();
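Alongside `html` and `text`, the `attr` method reads an attribute of the first node in the set; we will rely on it later to grab `href` values. A quick sketch (the HTML snippet is made up for illustration):

```php
<?php

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<ul><li><a href="/first">First</a></li><li><a href="/second">Second</a></li></ul>'
);

// `attr` works on the first node of the current set
echo $crawler->filter('a')->attr('href'); // /first

// Combine it with `eq` to reach the other nodes
echo $crawler->filter('a')->eq(1)->attr('href'); // /second
```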

Sometimes we need an even more precise way to filter elements and DomCrawler has the `reduce` method for that:

$reducedSubsetCrawler = $crawler->reduce(function (Crawler $crawler, $i) {
    // Return `false` to remove an element from the set;
    // cast to bool, because `reduce` only drops a node on a strict `false`
    return (bool) preg_match('/^Symfony2/', $crawler->text());
});

This way we retrieved all elements whose text starts with the word `Symfony2`. You can also build data iteratively for each element in a set:

$data = $crawler->each(function (Crawler $crawler, $orderNum) {
    return [
        'text' => $crawler->text(),
    ];
});    

In the end you will get an array with each node's text. Referring back to our data-scraping challenge, let's improve our code one more time:

<?php

// Challenge: get information about each Blu-ray player

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$baseEndPoint = 'http://bgbstudio.com';
$blueRayPlayersCategory = 'proizvodi/blu-ray-plejeri';
$target = $baseEndPoint . '/' . $blueRayPlayersCategory;

// Crawler
$crawler = new Crawler();
$crawler->add(file_get_contents($target));

// Collect info
$productsInfo = $crawler
    ->filter('.product-block-content')
    ->each(function (Crawler $nodeCrawler) use ($baseEndPoint) {
        $discount = $nodeCrawler->filter('.badge-sale');
        $oldPrice = $nodeCrawler->filter('p.product-block-price-old');

        return [
            'discount' => $discount->count() ? $discount->text() : 0,
            'url' => $baseEndPoint . $nodeCrawler->filter('a')->attr('href'),
            'title' => $nodeCrawler->filter('h2 a')->text(),
            'current_price' => $nodeCrawler->filter('p.product-block-price')->text(),
            'old_price' => $oldPrice->count() ? $oldPrice->text() : 'N/A',
        ];
    })
;

// Here we have all needed information:
print_r($productsInfo);

Personally, I can't recall an easier way to parse the DOM.

The Result:

The results of data scraping using DomCrawler

We've updated our example, but we haven't checked the `Form` and `Link` classes yet. There are a few additional methods for finding forms and links:

// Selects links by their text or the alt value of an enclosed image
$linkCrawler = $crawler->selectLink('Symfony2 API');

// Selects buttons by the same principle as links
$buttonCrawler = $crawler->selectButton('The DomCrawler component');

Moreover, you can convert certain node elements to more specialized classes:

// The Link class
$link = $crawler->link();

// Then we can retrieve the URI:
$link->getUri();

// The Form class
$form = $crawler->form();

// For `Form` we have the following useful methods:
$methodAttr = $form->getMethod();

// Get the form values
$arrayOfFormValues = $form->getPhpValues();

// The FormField class
$formField = $form->get('fieldName');
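Putting the `Form` and `FormField` methods together, here is a small sketch of reading and changing a form's values. The HTML and the `q` field name are made up for illustration; note that the `Crawler` needs the current URI (the second constructor argument) so that `Form` can resolve the relative `action`:

```php
<?php

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<form action="/search" method="get">
    <input type="text" name="q" value="symfony" />
    <input type="submit" value="Go" />
</form>
HTML;

// The second argument is the current URI; `Form` uses it
// to turn the relative `action` into an absolute one
$crawler = new Crawler($html, 'http://example.com/page');

$form = $crawler->selectButton('Go')->form();

echo $form->getMethod(); // GET
echo $form->getUri();    // http://example.com/search?q=symfony

// Change a field through the `FormField` class
$form->get('q')->setValue('domcrawler');

print_r($form->getPhpValues()); // ['q' => 'domcrawler']
```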

With this in mind we can improve our example above a little:

// ...

// Replace
$crawler = new Crawler();
$crawler->add(file_get_contents($target));

// With
$crawler = new Crawler(file_get_contents($target), $target, $baseEndPoint);

// ...

// And replace
'url' => $baseEndPoint . $nodeCrawler->filter('a')->attr('href'),

// With:
'url' => $nodeCrawler->filter('a')->link()->getUri(),

// ...

Pros

+ CSS and XPath support

+ really flexible

+ very precise

+ easy to support

Cons

- doesn't help us imitate a user's behaviour

- doesn't support JavaScript

Conclusion

I've shown only the basics of DomCrawler, and I recommend checking the Symfony2 docs and playing with this component a little to get more comfortable with it. We've done a great job, but, as always, there is room for further improvement.

What's next?

Next time we will address the user-behaviour con using Goutte, a browser emulator.