At work, we make and sell software written in a number of languages; our flagship product is written in PHP. During pre-sales, we’ve always had to field questions about our choice of PHP – usually from IT staff with a preference for Java or, lately, .NET – and we normally manage to convince the potential customer that PHP isn’t the bad choice that they’ve been led to believe.

But one of the unfortunate side-effects of Stefan Esser’s much-publicised (self-publicised? :) ) departure from the PHP Security Team has been an increase in the number of IT staff we’re coming across who “believe” both that open-source is inherently insecure, and that PHP in particular has incurable problems. These “beliefs” hurt ISVs trying to sell PHP-based applications into skeptical organisations.

Why isn’t there a central resource that answers “Why PHP?” in a business-oriented way? Something that ISVs can refer their clients to, which not only promotes the advantages of PHP (and includes success stories from vertical markets), but also includes substantial rebuttals to the FUD that ISVs have to deal with during the pre-sales process.

I’m not surprised that PHP.net doesn’t contain such a resource (it’s not really the place for it, one could argue), but it’s disappointing to see that Zend doesn’t provide one. What’s good for ISVs should be good for Zend, after all, and this is an area where they could help all the ISVs that they want to sell their products to :)

Is there interest from other folks in having a resource like this? Or maybe in working together to build one?


I’ve recently switched my blog from b2evolution back to WordPress. The good news is both “no more spam :)” and “the admin panel works in Safari”, but on the downside I missed the multiblog feature that attracted me to b2evolution in the first place. There is WordPress MU, I suppose, but after coming across a few plugins that warned they didn’t work with WordPress MU, that option didn’t look very appealing.

Ah ha – thinks I – I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5 :) Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.

Getting an RSS 2 feed into SimpleXML is trivial:

 
$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXMLElement($rawFeed);

Extracting the information from the RSS ‘channel’ is equally trivial:

$channel['title'] = $xml->channel->title;
$channel['link']  = $xml->channel->link;

… and so on. Getting at the individual articles starts off just as easy:

foreach ($xml->channel->item as $item) 
{     
    $article = array();
    $article['title'] = $item->title;
    $article['link'] = $item->link; 
}

… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.

Some of the elements in an item sit inside different XML namespaces. The most direct way to get at them is to use the children() method on a SimpleXMLElement, passing in the namespace URI:

 
$dc = $item->children('http://purl.org/dc/elements/1.1/');
$article['creator'] = $dc->creator;
// an item can carry several <dc:subject> elements, so collect each one
foreach ($dc->subject as $subject)
    $article['subject'][] = $subject;

That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:

// The following does NOT work!
$article['creator'] = $item->dc->creator;

… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is shorthand defined in the opening tag at the top of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone defining the ‘dc’ component as ‘dublinCore’ instead if they wanted to).
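
To see that the prefix really is arbitrary, here is a minimal sketch using a made-up feed (the XML below is an assumption for illustration, not a real feed) that binds the Dublin Core URI to ‘dublinCore’ rather than ‘dc’. Looking the elements up by URI still works:

```php
<?php
// Hypothetical mini-feed: the Dublin Core namespace is bound to the
// prefix 'dublinCore' instead of the usual 'dc'.
$rawFeed = <<<FEED
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dublinCore="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <item>
      <title>First post</title>
      <dublinCore:creator>Stuart</dublinCore:creator>
    </item>
  </channel>
</rss>
FEED;

$xml  = new SimpleXMLElement($rawFeed);
$item = $xml->channel->item[0];

// looking the children up by URI works whatever prefix the feed chose
$dc      = $item->children('http://purl.org/dc/elements/1.1/');
$creator = (string) $dc->creator;

// plain property access only sees un-namespaced children, so this is empty
$plain = (string) $item->creator;

echo $creator, "\n";
```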

Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The only way to safely do this is to define an array of shortcuts yourself:

// define the namespaces that we are interested in
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);

// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;

You can get a list of the namespaces like this:

$ns = $xml->getNamespaces(true);
$dc = $item->children($ns['dc']);

… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
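
Here is a small sketch of that failure mode, again with a made-up feed that declares Dublin Core under a non-standard prefix. The variable names are my own; the point is only that the keys returned by getNamespaces() are whatever prefixes the feed author picked:

```php
<?php
// Hypothetical feed that declares Dublin Core under a non-standard prefix
$rawFeed = <<<FEED
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dublinCore="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <dublinCore:creator>Stuart</dublinCore:creator>
    </item>
  </channel>
</rss>
FEED;

$xml = new SimpleXMLElement($rawFeed);

// maps prefix => URI for the namespaces used in the document
$docNs = $xml->getNamespaces(true);

// the 'dc' key we were hoping for is not there ...
$hasDcPrefix = isset($docNs['dc']);

// ... because this feed calls the same namespace something else
$actualPrefix = array_search('http://purl.org/dc/elements/1.1/', $docNs, true);

var_dump($hasDcPrefix);
echo $actualPrefix, "\n";
```

Checking isset($ns['dc']) before using it is the least you would want if you go down this route.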

That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).

If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works.

$content = $item->children($ns['content']);
// cast to string first, then trim the surrounding whitespace
$article['content'] = trim((string) $content->encoded);
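
As a self-contained sketch of the same thing (the feed below is invented for illustration, and this assumes a PHP 5 release recent enough to keep CDATA sections), casting the element to a string hands back the text inside the CDATA section:

```php
<?php
// Hypothetical item whose body is wrapped in a CDATA section
$rawFeed = <<<FEED
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[
        <p>Hello, <em>world</em>!</p>
      ]]></content:encoded>
    </item>
  </channel>
</rss>
FEED;

$xml     = new SimpleXMLElement($rawFeed);
$item    = $xml->channel->item[0];
$content = $item->children('http://purl.org/rss/1.0/modules/content/');

// the string cast returns the raw markup held in the CDATA section;
// trim() removes the whitespace around it
$html = trim((string) $content->encoded);

echo $html, "\n";
```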

With that, the final code to read an RSS 2 feed looks like this:

// define the namespaces that we are interested in
$ns = array
(
        'content' => 'http://purl.org/rss/1.0/modules/content/',
        'wfw' => 'http://wellformedweb.org/CommentAPI/',
        'dc' => 'http://purl.org/dc/elements/1.1/'
);

// obtain the articles in the feeds, and construct an array of articles

$articles = array();

// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';

$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXmlElement($rawFeed);

// step 2: extract the channel metadata

$channel = array();
$channel['title']       = $xml->channel->title;
$channel['link']        = $xml->channel->link;
$channel['description'] = $xml->channel->description;
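// these elements live inside <channel>, not directly under <rss>
$channel['pubDate']     = $xml->channel->pubDate;
$channel['timestamp']   = strtotime((string) $xml->channel->pubDate);
$channel['generator']   = $xml->channel->generator;
$channel['language']    = $xml->channel->language;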

// step 3: extract the articles

foreach ($xml->channel->item as $item)
{
        $article = array();
        $article['channel'] = $channel;
        $article['title'] = $item->title;
        $article['link'] = $item->link;
        $article['comments'] = $item->comments;
        $article['pubDate'] = $item->pubDate;
        $article['timestamp'] = strtotime((string) $item->pubDate);
        $article['description'] = trim((string) $item->description);
        $article['isPermaLink'] = $item->guid['isPermaLink'];

        // get data held in namespaces
        $content = $item->children($ns['content']);
        $dc      = $item->children($ns['dc']);
        $wfw     = $item->children($ns['wfw']);

        $article['creator'] = (string) $dc->creator;
        foreach ($dc->subject as $subject)
                $article['subject'][] = (string)$subject;

        $article['content'] = trim((string) $content->encoded);
        $article['commentRss'] = $wfw->commentRss;

        // add this article to the list
        $articles[$article['timestamp']] = $article;
}

// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
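
Because $articles is keyed by timestamp, putting the newest article first for a homepage is one krsort() away. This is only a sketch with invented articles, and it inherits a limitation of the listing above: two articles published in the same second share a key, so the later one overwrites the earlier.

```php
<?php
// Hypothetical articles, keyed by timestamp as in the listing above
$articles = array(
    strtotime('2006-12-01 10:00:00 UTC') => array('title' => 'Older post'),
    strtotime('2007-01-15 09:30:00 UTC') => array('title' => 'Newer post'),
);

// sort by key (the timestamp), newest first
krsort($articles);

foreach ($articles as $timestamp => $article) {
    echo gmdate('Y-m-d', $timestamp), ' ', $article['title'], "\n";
}
```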

Don’t forget to add error handling :)
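
As a starting point, here is one possible sketch. parseFeed() is a hypothetical helper of my own, not part of SimpleXML; it assumes you want null back for anything that is not a parseable RSS 2 document, including the false that file_get_contents() returns when the download fails.

```php
<?php
/**
 * Hypothetical helper: returns a SimpleXMLElement for an RSS 2 feed,
 * or null if the input is missing, malformed, or has no <channel>.
 */
function parseFeed($rawFeed)
{
    // file_get_contents() returns false on failure, so check before parsing
    if (!is_string($rawFeed) || $rawFeed === '') {
        return null;
    }

    // collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawFeed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // also reject well-formed XML that does not look like an RSS 2 feed
    if ($xml === false || !isset($xml->channel)) {
        return null;
    }

    return $xml;
}

$good = parseFeed('<rss version="2.0"><channel><title>Example</title></channel></rss>');
$bad  = parseFeed('<rss><channel>');

var_dump($good !== null);
var_dump($bad === null);
```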

I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.
