Using SimpleXML To Parse RSS Feeds

Posted by Stuart Herbert @ 9:49 PM, Sun 07 Jan 07

Filed under: 1 - Beginner, Examples

18 Comments

I’ve recently switched my blog from b2evolution back to WordPress. The good news is both “no more spam :)” and “the admin panel works in Safari”; the downside is that I miss the multiblog feature that attracted me to b2evolution in the first place. There is WordPress MU, I suppose, but after coming across a few plugins that warned they didn’t work with WordPress MU, that option didn’t look very appealing.

Ah ha - thinks I - I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5 :) Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.

Getting an RSS 2 feed into SimpleXML is trivial:

$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXMLElement($rawFeed);

Extracting the information from the RSS ‘channel’ is equally trivial:

$channel['title'] = $xml->channel->title;
$channel['link']  = $xml->channel->link;

… and so on. Getting at the individual articles starts off just as easy:

foreach ($xml->channel->item as $item)
{
    $article = array();
    $article['title'] = $item->title;
    $article['link'] = $item->link;
}

… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.

Some of the elements in an item sit inside different XML namespaces. The most direct way to get at them is to use the children() method on a SimpleXMLElement, passing in the namespace URI:

$dc = $item->children('http://purl.org/dc/elements/1.1/');
$article['creator'] = $dc->creator;
foreach ($dc->subject as $subject)
    $article['subject'][] = $subject;

That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:

// The following does NOT work!
$article['creator'] = $item->dc->creator;

… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is shorthand defined in the opening tag at the top of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone defining the ‘dc’ component as ‘dublinCore’ instead if they wanted to).
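
To make that point concrete, here is a minimal sketch. The feed fragment is made up for illustration and deliberately binds the Dublin Core namespace to ‘dublinCore’ instead of ‘dc’; children() still finds the elements, because it is given the URI and never sees the prefix.

```php
<?php
// Hypothetical feed fragment: the Dublin Core namespace is bound to
// 'dublinCore' rather than the usual 'dc'.
$rawFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dublinCore="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <dublinCore:creator>Stuart</dublinCore:creator>
      <dublinCore:subject>PHP</dublinCore:subject>
      <dublinCore:subject>XML</dublinCore:subject>
    </item>
  </channel>
</rss>
XML;

$xml  = new SimpleXMLElement($rawFeed);
$item = $xml->channel->item[0];

// Look the elements up by namespace URI; the prefix never appears here.
$dc = $item->children('http://purl.org/dc/elements/1.1/');

$creator  = (string) $dc->creator;
$subjects = array();
foreach ($dc->subject as $subject) {
    $subjects[] = (string) $subject;
}

echo $creator, ': ', implode(', ', $subjects), "\n";   // Stuart: PHP, XML
```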

Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The only way to safely do this is to define an array of shortcuts yourself:

// define the namespaces that we are interested in
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);

// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;

You can get a list of the namespaces like this:

$ns = $xml->getNamespaces(true);
$dc = $item->children($ns['dc']);

… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
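
If you would rather check than trust the prefix, one option is to go the other way: keep the URI hard-coded and ask the document which prefix, if any, it uses for that URI. The sketch below assumes a made-up feed fragment and uses getDocNamespaces(), which returns an array of prefix => URI pairs for the namespaces declared in the document.

```php
<?php
// Hypothetical feed fragment, for illustration only.
$rawFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>Stuart</dc:creator>
    </item>
  </channel>
</rss>
XML;

$xml = new SimpleXMLElement($rawFeed);

// prefix => URI for every namespace declared in the document
$declared = $xml->getDocNamespaces(true);

// The URI is the real identity of the namespace, so search by URI.
$dcUri  = 'http://purl.org/dc/elements/1.1/';
$prefix = array_search($dcUri, $declared, true);   // false if not declared

if ($prefix === false) {
    echo "This feed does not declare Dublin Core\n";
} else {
    echo "Dublin Core is bound to the prefix '$prefix'\n";
}

// Either way, children() is still called with the URI, not the prefix.
$creator = (string) $xml->channel->item[0]->children($dcUri)->creator;
```

The hard-coded array of URIs stays the source of truth here; the lookup only tells you what the feed happens to call each namespace.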

That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).

If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works.

$content = $item->children($ns['content']);
$article['content'] = trim((string) $content->encoded);
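
Here is the same idea as a small self-contained sketch, using a made-up item rather than a live feed. The cast to string is the step that pulls the text out of the CDATA section; trim() just removes the surrounding whitespace.

```php
<?php
// Hypothetical feed fragment with the article body in a CDATA section.
$rawFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[
        <p>Hello, <em>world</em>!</p>
      ]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$xml     = new SimpleXMLElement($rawFeed);
$item    = $xml->channel->item[0];
$content = $item->children('http://purl.org/rss/1.0/modules/content/');

// Cast first, then trim: the cast is what extracts the CDATA text.
$html = trim((string) $content->encoded);

echo $html, "\n";   // <p>Hello, <em>world</em>!</p>
```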

With that, the final code to read an RSS 2 feed looks like this:

// define the namespaces that we are interested in
$ns = array
(
        'content' => 'http://purl.org/rss/1.0/modules/content/',
        'wfw' => 'http://wellformedweb.org/CommentAPI/',
        'dc' => 'http://purl.org/dc/elements/1.1/'
);

// obtain the articles in the feeds, and construct an array of articles

$articles = array();

// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';

$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXMLElement($rawFeed);

// step 2: extract the channel metadata

$channel = array();
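// cast to string so that we store plain values, not SimpleXMLElement objects
$channel['title']       = (string) $xml->channel->title;
$channel['link']        = (string) $xml->channel->link;
$channel['description'] = (string) $xml->channel->description;
$channel['pubDate']     = (string) $xml->channel->pubDate;
$channel['timestamp']   = strtotime($channel['pubDate']);
$channel['generator']   = (string) $xml->channel->generator;
$channel['language']    = (string) $xml->channel->language;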

// step 3: extract the articles

foreach ($xml->channel->item as $item)
{
        $article = array();
        // record which feed this article came from
        $article['channel'] = $blog_url;
        $article['title'] = (string) $item->title;
        $article['link'] = (string) $item->link;
        $article['comments'] = (string) $item->comments;
        $article['pubDate'] = (string) $item->pubDate;
        $article['timestamp'] = strtotime($article['pubDate']);
        $article['description'] = trim((string) $item->description);
        $article['isPermaLink'] = (string) $item->guid['isPermaLink'];

        // get data held in namespaces
        $content = $item->children($ns['content']);
        $dc      = $item->children($ns['dc']);
        $wfw     = $item->children($ns['wfw']);

        $article['creator'] = (string) $dc->creator;
        foreach ($dc->subject as $subject)
                $article['subject'][] = (string) $subject;

        $article['content'] = trim((string) $content->encoded);
        $article['commentRss'] = (string) $wfw->commentRss;

        // add this article to the list
        $articles[$article['timestamp']] = $article;
}

// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
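
For a homepage you will probably want the articles newest-first. Because $articles is keyed by timestamp, krsort() should be enough. The sketch below uses made-up sample data in the same shape as the array built above; the titles and dates are hypothetical.

```php
<?php
// Hypothetical sample data, shaped like the $articles array built above:
// keys are Unix timestamps, values are article arrays.
$articles = array(
    strtotime('2007-01-05 10:00:00 UTC') => array('title' => 'Older post'),
    strtotime('2007-01-07 21:49:00 UTC') => array('title' => 'Newest post'),
    strtotime('2007-01-06 08:30:00 UTC') => array('title' => 'Middle post'),
);

// Sort by key (timestamp), highest first, keeping keys attached to values.
krsort($articles);

$titles = array();
foreach ($articles as $timestamp => $article) {
    $titles[] = gmdate('Y-m-d', $timestamp) . ' ' . $article['title'];
}

echo implode("\n", $titles), "\n";
```

One caveat with keying by timestamp: two articles published in the same second will overwrite each other, which matters more once several feeds are merged into one array.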
 

Don’t forget to add error handling :)
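
As a starting point, here is a rough sketch of what that error handling might look like. The loadFeed() and parseFeed() helpers are hypothetical names of my own, not part of SimpleXML, and returning null on failure is just one possible convention.

```php
<?php
// Hypothetical helper: fetch a feed and parse it.
// Returns a SimpleXMLElement, or null if anything went wrong.
function loadFeed($url)
{
    // @ suppresses the PHP warning; the return value is checked instead
    $rawFeed = @file_get_contents($url);
    if ($rawFeed === false) {
        return null;   // bad URL, network failure, and so on
    }
    return parseFeed($rawFeed);
}

// Hypothetical helper: parse a raw feed string.
// Returns a SimpleXMLElement, or null if it is not a usable RSS 2 feed.
function parseFeed($rawFeed)
{
    // collect libxml errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawFeed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // reject malformed XML, and XML with no <channel> element
    if ($xml === false || !isset($xml->channel)) {
        return null;
    }
    return $xml;
}

$xml = parseFeed('<rss version="2.0"><channel><title>Test</title></channel></rss>');
echo ($xml === null) ? "feed unavailable\n" : (string) $xml->channel->title . "\n";
```

simplexml_load_string() is used here instead of new SimpleXMLElement() because it returns false on malformed XML rather than throwing an exception; catching the exception would work just as well.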

I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.


18 Comments

  1. vb says:
    January 8th, 2007 at 1:14 am

    Thanks for everything!

    I was looking for help with my tag all day!

    At last, I’ve found a good article and… it works!

  2. Benjamin Klaile says:
    January 8th, 2007 at 8:44 am

    Hi,
    you made a little mistake in your first code section


    $feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
    $rawFeed = file_get_contents($blog_url);

    $feedUrl should be $blog_url, else the $blog_url variable is not defined.

    Nice article though.
    Regards, Ben.

  3. Stu says:
    January 8th, 2007 at 9:09 am

    Hi Ben,
    Thanks for spotting the mistake; it should be fixed now :)
    Best regards,
    Stu

  4. nick says:
    January 8th, 2007 at 4:59 pm

    but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.

    You can avoid this altogether - getDocNamespaces() returns an associative array where the key is the namespace prefix. For instance, run the code from the example on the manual page for getNamespaces(). Just search the array values for your namespace URI if you really want to know the document’s prefix for that namespace.

    Another good way to deal with namespaces in SimpleXML is to use XPath to get your nodes, registering the namespaces with registerXPathNamespace(). Once a namespace is registered, you can query for nodes in it using SimpleXMLElement::xpath().

  5. nick says:
    January 8th, 2007 at 5:22 pm

    Also you may want to cast your atomic SimpleXML element values as strings (or whatever) because as-written, the $articles sub-array elements will be of type SimpleXMLElement. Not sure that you want that later on.

    $channel['title'] = (string)$xml->channel->title;

  6. Richard@Home says:
    January 18th, 2007 at 5:10 pm

    Have you checked out MagpieRSS? ( http://magpierss.sourceforge.net/ )

  7. rob ganly says:
    February 22nd, 2007 at 5:16 pm

    hey there, nice work.

    i’m using simplexml to get a business bbc newsfeed for a site that i’m working on but for some reason it doesn’t return it by published order… in fact, i can see no logic to the order the items are returned.

    i would’ve thought it’d be natural for the rss feed to return the items in published date/time order (chronologically descending).

    i can do the ordering myself but want to know if there’s a sweet way of doing it or making the rss return it as desired.

  8. Yelena says:
    October 27th, 2007 at 12:44 am

    Hi
    Thank you for this parser. It is really good
    The only problem I have now is ugly characters: Microsoft quotes etc.
    for example ““predators”” on my page should be “predators”. How to fix it?
    Thank you

  9. Bojan says:
    December 18th, 2007 at 5:26 pm

    Great stuff! you helped me a lot with the CDATA “children” !

  10. Gwyneth Llewelyn says:
    February 16th, 2008 at 3:27 am

    Thanks for the many precious tips. This was pretty useful!

  11. Glou says:
    February 17th, 2008 at 12:12 am

    Hello,

    I don’t understand, i try it there :http://www.spx.be/XML/rss2.php5 but it’s doesn’t show !

    Can you understand why ?

    Thanks

  12. Michael says:
    February 22nd, 2008 at 9:41 pm

    This info is very good. Thanks for your works :)

  13. Vivek says:
    February 25th, 2008 at 1:24 pm

    Hi, it’s very useful. Thanks!

  14. Vivek says:
    February 25th, 2008 at 1:27 pm

    Can you use the same code for Atom feed also? I guess not. How do we tweak it to accept atom feed?

  15. Dave says:
    February 25th, 2008 at 6:02 pm

    An excellent and very useful article, especially with regards to retrieving information from namespaces. Many thanks!

  16. Karpacz Noclegi says:
    March 16th, 2008 at 3:30 pm

    Good work! It is very useful. I hope that I understand everything, and your advice will work :)

  17. » 用php5的simplexml解析各种feed - 某人的栖息地 says:
    April 1st, 2008 at 7:37 am

    [...] Using SimpleXML To Parse RSS Feeds [...]

  18. Mike says:
    May 10th, 2008 at 5:11 pm

    Thanks for the code - it’s been a great help!

    Another way to get the CDATA is to load the xml using
    $xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
    (PHP >= 5.1.0)

    Then you just get the CDATA elements without complications.
    See: http://us.php.net/manual/en/function.simplexml-load-string.php
