1 November 2011

Website screenshots with PhantomJS

Introduction

PhantomJS is a headless (i.e. without a graphical frontend) web browser based on WebKit, with a JavaScript API. It is mainly used for headless testing of web pages. It can also capture a web page to a PNG, JPEG, or PDF file.

Compiling PhantomJS

Please follow the instructions here.

sudo apt-get install libqt4-dev libqtwebkit-dev qt4-qmake
git clone git://github.com/ariya/phantomjs.git && cd phantomjs
git checkout 1.3
qmake-qt4 && make

Rasterize a website

The script to render a website to a file is provided as an example of PhantomJS. The code and explanations can be found here.

var page = new WebPage(),
    address, output, size;

if (phantom.args.length < 2 || phantom.args.length > 3) {
    console.log('Usage: rasterize.js URL filename');
    phantom.exit();
} else {
    address = phantom.args[0];
    output = phantom.args[1];
    page.viewportSize = { width: 600, height: 600 };
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
        } else {
            window.setTimeout(function () {
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}

Now run:

bin/phantomjs rasterize.js http://aventures-logicielles.blogspot.com test.png

And magically you will see the file test.png appear on your disk.

This is absolutely awesome: we just created a screen capture of the website. However, this captures the whole page, which for a blog means a very long image. The command above produces an image of 1000 by 15,000+ pixels (2.3 MB).
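The introduction mentions PNG, JPEG, and PDF output, and as far as I know page.render() picks the format from the extension of the file name it is given. A script may therefore want to check that extension before loading the page. Here is a minimal sketch in plain JavaScript; outputFormat is a hypothetical helper for illustration, not part of the PhantomJS API, and it only covers the three formats named above.

```javascript
// Hypothetical helper (not a PhantomJS function): maps an output file name
// to one of the formats mentioned in the introduction, or returns null
// when the extension is missing or not recognised.
function outputFormat(filename) {
    var match = /\.([A-Za-z0-9]+)$/.exec(filename);
    if (!match) {
        return null;
    }
    switch (match[1].toLowerCase()) {
    case 'png':
        return 'png';
    case 'jpg':
    case 'jpeg':
        return 'jpeg';
    case 'pdf':
        return 'pdf';
    default:
        return null;
    }
}
```

In rasterize.js this could be called on the second argument, printing the usage message and exiting when it returns null.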

Getting only the head of the page

With a small modification to the script we can render only the top of the page. This is done by setting the clip rectangle before rendering. Let's set the page.clipRect property just before calling render, as explained on StackOverflow.

var page = new WebPage(),
    address, output, size;

if (phantom.args.length < 2 || phantom.args.length > 3) {
    console.log('Usage: rasterize.js URL filename');
    phantom.exit();
} else {
    address = phantom.args[0];
    output = phantom.args[1];
    page.viewportSize = { width: 800, height: 600 };
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
        } else {
            window.setTimeout(function () {
                // ----- CHANGE HERE ---------------------------------------
                page.clipRect = { top: 0, left: 0, width: 800, height: 600 };
                // ---------------------------------------------------------
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}

This will produce images of 800 x 600 pixels with an acceptable file size (here 360 KB).
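In the script above the clip rectangle repeats the viewport dimensions by hand, so the two can drift apart when one of them is edited. A small helper can derive the rectangle from the viewport instead. This is only a sketch: headClipRect is a hypothetical name of mine, not something PhantomJS provides, and the optional height parameter is an addition for illustration.

```javascript
// Hypothetical helper (not a PhantomJS function): builds a clip rectangle
// anchored at the top-left corner of the page. It is as wide as the
// viewport; its height defaults to the viewport height unless an explicit
// height is passed.
function headClipRect(viewportSize, height) {
    return {
        top: 0,
        left: 0,
        width: viewportSize.width,
        height: (height === undefined) ? viewportSize.height : height
    };
}
```

Inside the page.open callback, the hard-coded line would then become page.clipRect = headClipRect(page.viewportSize);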

Here we go

Let's try the script on my three blogs.







First observation: the third screenshot contains a big grey area where there was originally a YouTube video. Apparently the rasterize script is not able to capture it.

Create a screenshot from Apache


The next step is to grab a screenshot from within a PHP script running in Apache. The following snippet expects the PhantomJS binary and the rasterize script to be in the same directory as the PHP file.

// The PhantomJS binary and the rasterize script sit next to this PHP file.
$phantomjs = __DIR__.'/phantomjs';
$script = __DIR__.'/rasterize.js';
$url = 'http://aventures-logicielles.blogspot.com';
$outfile = 'webss_' . uniqid() . '.png';

// Redirect stderr to stdout so that error messages end up in $output.
$command = "$phantomjs $script $url $outfile 2>&1";
$output = shell_exec($command);
header('Content-Type: image/png');
echo file_get_contents($outfile);

If you try to run the above script you will not get any result. To understand what's going on you have to var_dump the $output of the shell_exec command. You will get the following error message: "phantomjs: cannot connect to X server".

Strange, isn't PhantomJS supposed to be a headless web browser? The answer is easily found by googling the error string. There is an open issue with PhantomJS on *nix systems. Ariya Hidayat, the creator of PhantomJS, says: "PhantomJS is not 'pure headless' yet. On Unix, it still needs X Server for font stuff, etc.".

Fortunately there is a fairly simple workaround using Xvfb. See here for the full explanation.

Once you have set up Xvfb following the above explanation, you can modify the PHP script to use Xvfb as the X server.

$phantomjs = __DIR__.'/phantomjs';
$script = __DIR__.'/rasterize.js';
$url = 'http://aventures-logicielles.blogspot.com';
$outfile = 'webss_' . uniqid() . '.png';

// DISPLAY must match the display number Xvfb was started on.
$command = "DISPLAY=:0 $phantomjs $script $url $outfile 2>&1";
$output = shell_exec($command);
header('Content-Type: image/png');
echo file_get_contents($outfile);


The sequel


I wanted to play a bit more with PhantomJS and the rasterize script, so I created a small Symfony2 bundle that uses it to create web page screenshots on the fly. You will find the sources on GitHub.

Take a look here!


Portability


The above code works well on my Ubuntu workstation, but a friend of mine just reported the following:

  • The rasterize script causes a segmentation fault on a Mac
  • On Debian 5 (lenny) the Qt libraries are too old to use the rasterize script
  • On Debian 6 (squeeze) everything runs fine

Links