Using SimpleXML To Parse RSS Feeds
Posted by Stuart Herbert @ 9:49 PM, Sun 07 Jan 07
Filed under: 1 - Beginner, Examples
I’ve recently switched my blog from b2evolution back to Wordpress. The good news is both “no more spam :)” and “the admin panel works in Safari”, but on the downside I missed the multiblog feature that attracted me to b2evolution in the first place. There is Wordpress MU, I suppose, but after coming across a few plugins that warned they didn’t work with Wordpress MU, that option didn’t look very appealing.
Ah ha - thinks I - I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5.
Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.
Getting an RSS 2 feed into SimpleXML is trivial:
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXMLElement($rawFeed);
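If you want to try this without hitting the network, here is a minimal sketch that parses a made-up feed from a string. The sample feed and the variable names $rootName, $version and $title are my own assumptions for illustration; a real script would fill $rawFeed from file_get_contents() as above.

```php
<?php
// Hypothetical sample feed, standing in for file_get_contents($feedUrl).
$rawFeed = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
  </channel>
</rss>
XML;

$xml = new SimpleXMLElement($rawFeed);

// The object we get back represents the root <rss> element.
$rootName = $xml->getName();

// Attributes are read with array syntax, child elements with property syntax.
// Both return SimpleXMLElement objects, so cast them to get plain strings.
$version = (string) $xml['version'];
$title   = (string) $xml->channel->title;

echo "$rootName $version: $title\n";
```

Casting to string early is a design choice rather than a requirement, but it avoids surprises later when the values are compared, used as array keys, or serialized.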
Extracting the information from the RSS ‘channel’ is equally trivial:
$channel['link'] = $xml->channel->link;
… and so on. Getting at the individual articles starts off just as easy:
foreach ($xml->channel->item as $item)
{
    $article = array();
    $article['title'] = $item->title;
    $article['link'] = $item->link;
}
… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.
Some of the elements in an item sit inside different XML namespaces. The only way to get at them is to use the children() method on a SimpleXMLElement:
$dc = $item->children('http://purl.org/dc/elements/1.1/');
$article['creator'] = $dc->creator;
foreach ($dc->subject as $subject)
    $article['subject'][] = $subject;
That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:
$article['creator'] = $item->dc->creator;
… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is shorthand defined in the opening tag at the top of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone defining the ‘dc’ component as ‘dublinCore’ instead if they wanted to).
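To make the point about prefixes concrete, here is a small sketch. The helper creatorOf() and the two sample items are hypothetical, invented for this illustration; only children() and the Dublin Core URI come from the discussion above.

```php
<?php
// The namespace URI is the real identifier; the prefix is only a local alias.
define('DC_URI', 'http://purl.org/dc/elements/1.1/');

// Hypothetical helper: returns the Dublin Core creator of an <item> string.
function creatorOf($itemXml)
{
    $item = new SimpleXMLElement($itemXml);
    $dc = $item->children(DC_URI);
    return (string) $dc->creator;
}

// Two made-up items: same namespace URI, different prefixes.
$usual = '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
       . '<dc:creator>Stuart</dc:creator></item>';
$unusual = '<item xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
         . '<dublinCore:creator>Stuart</dublinCore:creator></item>';

// Looking up by URI works for both documents.
$fromUsual   = creatorOf($usual);
$fromUnusual = creatorOf($unusual);

// Plain property access only sees elements outside any namespace,
// so it finds nothing here and casts to an empty string.
$item = new SimpleXMLElement($usual);
$withoutNamespace = (string) $item->creator;

var_dump($fromUsual === $fromUnusual);
```

Because the lookup is keyed on the URI, the code keeps working even if a feed generator picks an unusual prefix.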
Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The only way to safely do this is to define an array of shortcuts yourself:
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);
// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;
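Alternatively, you can ask the document itself for its list of namespaces with getNamespaces(), and look the URI up by prefix:
$namespaces = $item->getNamespaces(true);
$dc = $item->children($namespaces['dc']);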
… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
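If you do want to read the prefixes from the document, one way to hedge the risk is to check that the prefix exists and fall back to the known URI when it doesn’t. This is only a sketch: the sample item deliberately uses a made-up ‘dublinCore’ prefix, and the variable names are my own.

```php
<?php
$dcUri = 'http://purl.org/dc/elements/1.1/';

// Made-up item that uses an unusual prefix for the Dublin Core namespace.
$sampleItem = '<item xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
            . '<dublinCore:creator>Stuart</dublinCore:creator></item>';
$item = new SimpleXMLElement($sampleItem);

// prefix => URI for the namespaces used by this element and its descendants
$namespaces = $item->getNamespaces(true);

// Trusting the 'dc' prefix would fail for this document,
// so fall back to the URI we already know.
$uri = isset($namespaces['dc']) ? $namespaces['dc'] : $dcUri;

$dc = $item->children($uri);
$creator = (string) $dc->creator;

echo $creator, "\n";
```

In practice the fallback means you still need the URI in your code, which is why the array of shortcuts is the safer of the two approaches.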
That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).
If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works: cast the element to a string, and you get back the text held inside the CDATA section.
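Here is a minimal sketch of reading a CDATA section through a string cast. The sample item is made up; the ‘content’ namespace URI is the one used in the rest of this article.

```php
<?php
// Made-up item with the article body wrapped in a CDATA section.
$sampleItem = <<<XML
<item xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <content:encoded><![CDATA[
    <p>Hello, <b>world</b>!</p>
  ]]></content:encoded>
</item>
XML;

$item = new SimpleXMLElement($sampleItem);
$content = $item->children('http://purl.org/rss/1.0/modules/content/');

// Cast first to get the raw text out of the CDATA section,
// then trim the surrounding whitespace.
$html = trim((string) $content->encoded);

echo $html, "\n";
```

The markup inside the CDATA section comes back as plain text, not as child elements, so it is up to you to decide whether to escape it or output it as HTML.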
With that, the final code to read a RSS 2 feed looks like this:
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);
// obtain the articles in the feeds, and construct an array of articles
$articles = array();
// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXMLElement($rawFeed);
// step 2: extract the channel metadata
// (cast to string so that we store plain values, not SimpleXMLElement objects)
$channel = array();
$channel['title'] = (string) $xml->channel->title;
$channel['link'] = (string) $xml->channel->link;
$channel['description'] = (string) $xml->channel->description;
$channel['pubDate'] = (string) $xml->channel->pubDate;
$channel['timestamp'] = strtotime($channel['pubDate']);
$channel['generator'] = (string) $xml->channel->generator;
$channel['language'] = (string) $xml->channel->language;
// step 3: extract the articles
foreach ($xml->channel->item as $item)
{
    $article = array();
    // remember which feed this article came from
    $article['channel'] = $blog_url;
    $article['title'] = (string) $item->title;
    $article['link'] = (string) $item->link;
    $article['comments'] = (string) $item->comments;
    $article['pubDate'] = (string) $item->pubDate;
    $article['timestamp'] = strtotime($article['pubDate']);
    $article['description'] = trim((string) $item->description);
    $article['isPermaLink'] = (string) $item->guid['isPermaLink'];
    // get data held in namespaces
    $content = $item->children($ns['content']);
    $dc = $item->children($ns['dc']);
    $wfw = $item->children($ns['wfw']);
    $article['creator'] = (string) $dc->creator;
    $article['subject'] = array();
    foreach ($dc->subject as $subject)
        $article['subject'][] = (string) $subject;
    $article['content'] = trim((string) $content->encoded);
    $article['commentRss'] = (string) $wfw->commentRss;
    // add this article to the list
    // (note: articles with identical timestamps will overwrite each other)
    $articles[$article['timestamp']] = $article;
}
// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
Don’t forget to add error handling.
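As a starting point, here is one possible sketch of that error handling. The helper name parseFeed() is hypothetical, and treating a missing channel element as an error is my own assumption. It relies on two documented behaviours: file_get_contents() returns false on failure, and the SimpleXMLElement constructor throws an Exception when the string is not well-formed XML.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement for a usable RSS 2 feed,
// or null if the download failed or the XML could not be parsed.
function parseFeed($rawFeed)
{
    // file_get_contents() returns false when the download fails.
    if ($rawFeed === false || trim($rawFeed) === '') {
        return null;
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        $xml = new SimpleXMLElement($rawFeed);
    } catch (Exception $e) {
        $xml = null;
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: a feed without a <channel> element is not worth processing.
    if ($xml === null || !isset($xml->channel)) {
        return null;
    }
    return $xml;
}

$good = parseFeed('<rss version="2.0"><channel><title>t</title></channel></rss>');
$broken = parseFeed('<rss><channel');
$failedDownload = parseFeed(false);

var_dump($good !== null, $broken === null, $failedDownload === null);
```

With a helper like this, the main script can call parseFeed(@file_get_contents($blog_url)) and skip any blog whose feed comes back as null, rather than letting one bad feed break the whole homepage.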
I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.
