Using SimpleXML To Parse RSS Feeds

Posted by Stuart Herbert @ 9:49 PM, Sun 07 Jan 07

Filed under: 1 - Beginner, Examples

32 Comments

I’ve recently switched my blog from b2evolution back to Wordpress. The good news is both “no more spam :)” and “the admin panel works in Safari”, but on the downside I missed the multiblog feature that attracted me to b2evolution in the first place. There is Wordpress MU, I suppose, but after coming across a few plugins that warned they didn’t work with Wordpress MU, that option didn’t look very appealing.

Ah ha - thinks I - I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5 :) Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.

Getting an RSS 2 feed into SimpleXML is trivial:

$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXMLElement($rawFeed);

Extracting the information from the RSS ‘channel’ is equally trivial:

$channel['title'] = $xml->channel->title;
$channel['link']  = $xml->channel->link;

… and so on. Getting at the individual articles starts off just as easy:

foreach ($xml->channel->item as $item)
{
    $article = array();
    $article['title'] = $item->title;
    $article['link'] = $item->link;
}

… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.

The first problem is that some of the elements in an item sit inside different XML namespaces. The most direct way to get at them is to call the children() method on a SimpleXMLElement, passing in the namespace URI:

$dc = $item->children('http://purl.org/dc/elements/1.1/');
$article['creator'] = $dc->creator;
foreach ($dc->subject as $subject)
    $article['subject'][] = $subject;

That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:

// The following does NOT work!
$article['creator'] = $item->dc->creator;

… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is shorthand defined in the opening tag at the top of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone defining the ‘dc’ component as ‘dublinCore’ instead if they wanted to).
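To make that point concrete, here is a minimal sketch. The two XML fragments are made up for illustration (they are not taken from a real feed); they bind the same Dublin Core URI to two different prefixes. Passing the URI to children() finds the creator in both, while plain property access finds nothing.

```php
<?php
// Two hypothetical <item> fragments: same namespace URI, different prefixes.
$withDc   = '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
          . '<dc:creator>Stu</dc:creator></item>';
$withLong = '<item xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
          . '<dublinCore:creator>Stu</dublinCore:creator></item>';

$uri = 'http://purl.org/dc/elements/1.1/';

$creators = array();
foreach (array($withDc, $withLong) as $rawXml) {
    $item = new SimpleXMLElement($rawXml);
    // look the element up by namespace URI, not by prefix
    $creators[] = (string) $item->children($uri)->creator;
}

// plain property access does not see the namespaced child at all
$plain = (string) $item->creator;
```

Both entries in $creators come out the same, which is why the URI (and not the prefix) is the thing to rely on.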

Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The only way to safely do this is to define an array of shortcuts yourself:

// define the namespaces that we are interested in
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);

// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;

You can get a list of the namespaces like this:

$ns = $xml->getNamespaces(true);
$dc = $item->children($ns['dc']);

… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
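If you do want to check what a feed actually declares, one option is to search the array returned by getNamespaces() for the URI you care about. The sketch below assumes a made-up feed that uses the prefix ‘dublinCore’ instead of ‘dc’; the feed content and variable names are illustrative only.

```php
<?php
// Hypothetical feed that binds Dublin Core to an unusual prefix.
$rawFeed = '<rss version="2.0" xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
         . '<channel><item><dublinCore:creator>Stu</dublinCore:creator></item></channel>'
         . '</rss>';

$xml = new SimpleXMLElement($rawFeed);

// prefix => URI for every namespace used in the document
$ns = $xml->getNamespaces(true);

// relying on the 'dc' prefix would fail for this feed
$hasDcPrefix = isset($ns['dc']);

// searching by URI tells us which prefix this document really uses
$prefix = array_search('http://purl.org/dc/elements/1.1/', $ns, true);

// children() accepts a prefix when its second argument is true
$creator = (string) $xml->channel->item->children($prefix, true)->creator;
```

In practice, once you know the URI you can pass it straight to children(); the lookup is mainly useful for checking whether a feed uses a namespace at all.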

That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).

If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works. In practice, casting the element to a string gives you the CDATA text:

$content = $item->children($ns['content']);
$article['content'] = trim((string) $content->encoded);
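Here is a small self-contained sketch of the same idea, using a made-up item rather than a real feed, so you can see what comes out of the cast:

```php
<?php
// Hypothetical <item> with its body wrapped in a CDATA section.
$rawItem = '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
         . '<content:encoded><![CDATA[ <p>Hello, <b>world</b></p> ]]></content:encoded>'
         . '</item>';

$item = new SimpleXMLElement($rawItem);

// step into the 'content' namespace, then cast the element to a string
$content = $item->children('http://purl.org/rss/1.0/modules/content/');
$html = trim((string) $content->encoded);
```

The markup inside the CDATA section comes back as plain text, so $html holds the HTML exactly as the feed supplied it.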

With that, the final code to read an RSS 2 feed looks like this:

// define the namespaces that we are interested in
$ns = array
(
        'content' => 'http://purl.org/rss/1.0/modules/content/',
        'wfw' => 'http://wellformedweb.org/CommentAPI/',
        'dc' => 'http://purl.org/dc/elements/1.1/'
);

// obtain the articles in the feeds, and construct an array of articles

$articles = array();

// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';

$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXMLElement($rawFeed);

// step 2: extract the channel metadata
// (all of these elements live inside <channel>, not at the document root)

$channel = array();
$channel['title']       = $xml->channel->title;
$channel['link']        = $xml->channel->link;
$channel['description'] = $xml->channel->description;
$channel['pubDate']     = $xml->channel->pubDate;
$channel['timestamp']   = strtotime((string) $xml->channel->pubDate);
$channel['generator']   = $xml->channel->generator;
$channel['language']    = $xml->channel->language;

// step 3: extract the articles

foreach ($xml->channel->item as $item)
{
        $article = array();
        // remember which feed this article came from
        $article['channel'] = $blog_url;
        $article['title'] = $item->title;
        $article['link'] = $item->link;
        $article['comments'] = $item->comments;
        $article['pubDate'] = $item->pubDate;
        $article['timestamp'] = strtotime((string) $item->pubDate);
        $article['description'] = trim((string) $item->description);
        $article['isPermaLink'] = $item->guid['isPermaLink'];

        // get data held in namespaces
        $content = $item->children($ns['content']);
        $dc      = $item->children($ns['dc']);
        $wfw     = $item->children($ns['wfw']);

        $article['creator'] = (string) $dc->creator;
        $article['subject'] = array();
        foreach ($dc->subject as $subject)
                $article['subject'][] = (string) $subject;

        $article['content'] = trim((string) $content->encoded);
        $article['commentRss'] = $wfw->commentRss;

        // add this article to the list
        $articles[$article['timestamp']] = $article;
}

// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
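Because $articles is keyed by timestamp, putting the articles into newest-first order for a homepage can be as simple as a krsort(). The sketch below uses a hand-made $articles array with invented timestamps and titles, standing in for what the loop above would build:

```php
<?php
// Hypothetical stand-in for the $articles array, keyed by Unix timestamp.
$articles = array(
    1168000000 => array('title' => 'Older post'),
    1168200000 => array('title' => 'Newest post'),
    1168100000 => array('title' => 'Middle post'),
);

// sort by key, highest (most recent) timestamp first
krsort($articles);

$titles = array();
foreach ($articles as $article) {
    $titles[] = $article['title'];
}
```

One caveat with this design: two articles published in the same second would share a key, and the later one would overwrite the earlier one.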
 

Don’t forget to add error handling :)
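As a starting point, here is one possible sketch of that error handling. The loadFeed() helper is a name made up for this example (it is not part of SimpleXML); it returns null instead of letting a failed download or malformed XML blow up the page.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null if the feed is unusable.
function loadFeed($rawFeed)
{
    // file_get_contents() returns false when the download fails
    if (!is_string($rawFeed) || trim($rawFeed) === '') {
        return null;
    }

    // collect libxml parse errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);

    try {
        $xml = new SimpleXMLElement($rawFeed);
    } catch (Exception $e) {
        // the constructor throws when the XML cannot be parsed
        $xml = null;
    }

    // restore the previous libxml error-handling state
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml;
}

$good   = loadFeed('<rss><channel><title>t</title></channel></rss>');
$broken = loadFeed('<rss><channel>');
$failed = loadFeed(false);
```

A caller would then check for null before touching $xml->channel, and decide whether to skip that blog or show a message.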

I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.


32 Comments

  1. vb says:
    January 8th, 2007 at 1:14 am

    Thanks for all !

    I’was looking for help about my tag all the day !

    At least, i’ve found a good paper and… it’s work !

  2. Benjamin Klaile says:
    January 8th, 2007 at 8:44 am

    Hi,
    you made a little mistake in your first code section


    $feedUrl = ‘http://blog.stuartherbert.com/php/?feed=rss2′;
    $rawFeed = file_get_contents($blog_url);

    $feedUrl should be $blog_url, else the $blog_url variable is not defined.

    Nice article though.
    Regards, Ben.

  3. Stu says:
    January 8th, 2007 at 9:09 am

    Hi Ben,
    Thanks for spotting the mistake; it should be fixed now :)
    Best regards,
    Stu

  4. nick says:
    January 8th, 2007 at 4:59 pm

    but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.

    You can avoid this altogether - getDocNamespaces() returns an associative array where the key is the namespace prefix. For instance, run the code from the example on the manual page for getNamespaces(). Just search the array values for your namespace URI if you really want to know the document’s prefix for that namespace.

    Then another good way to deal with namespaces with SimpleXML is to use xpath to get your nodes, and register the namespaces using registerXpathNamespace(). Once you register the namespace you can query for nodes in that namespace using SimpleXML->xpath().

  5. nick says:
    January 8th, 2007 at 5:22 pm

    Also you may want to cast your atomic SimpleXML element values as strings (or whatever) because as-written, the $articles sub-array elements will be of type SimpleXMLElement. Not sure that you want that later on.

    $channel['title'] = (string)$xml->channel->title;

  6. Richard@Home says:
    January 18th, 2007 at 5:10 pm

    Have you checked out MagpieRSS? ( http://magpierss.sourceforge.net/ )

  7. rob ganly says:
    February 22nd, 2007 at 5:16 pm

    hey there, nice work.

    i’m using simplexml to get a business bbc newsfeed for a site that i’m working on but for some reason it doesn’t return it by published order… in fact, i can see no logic to the order the items are returned.

    i would’ve thought it’d be natural for the rss feed to return the items in published date/time order (chronologically descending).

    i can do the ordering myself but want to know if there’s a sweet way of doing it or making the rss return it as desired.

  8. Yelena says:
    October 27th, 2007 at 12:44 am

    Hi
    Thank you for this parser. It is really good
    The only probllem I have now, it is ugly characters microsoft quotes etc.
    for example ““predators”” on my page should be “predators”. How to fix it?
    Thank you

  9. Bojan says:
    December 18th, 2007 at 5:26 pm

    Great stuff! you helped me a lot with the CDATA “children” !

  10. Gwyneth Llewelyn says:
    February 16th, 2008 at 3:27 am

    Thanks for the many precious tips. This was pretty useful!

  11. Glou says:
    February 17th, 2008 at 12:12 am

    Hello,

    I don’t understand, i try it there :http://www.spx.be/XML/rss2.php5 but it’s doesn’t show !

    Can you understand why ?

    Thanks

  12. Michael says:
    February 22nd, 2008 at 9:41 pm

    This info is very good. Thanks for your works :)

  13. Vivek says:
    February 25th, 2008 at 1:24 pm

    Hi, it’s very useful. Thanks!

  14. Vivek says:
    February 25th, 2008 at 1:27 pm

    Can you use the same code for Atom feed also? I guess not. How do we tweak it to accept atom feed?

  15. Dave says:
    February 25th, 2008 at 6:02 pm

    An excellent and very useful article, especially with regards to retrieving information from namespaces. Many thanks!

  16. Karpacz Noclegi says:
    March 16th, 2008 at 3:30 pm

    Good work! i It is very usefull. i hope that i understand everything, and Your advice will work:)

  17. » 用php5的simplexml解析各种feed - 某人的栖息地 says:
    April 1st, 2008 at 7:37 am

    [...] Using SimpleXML To Parse RSS Feeds [...]

  18. Mike says:
    May 10th, 2008 at 5:11 pm

    Thanks for the code - it’s been a great help!

    Another way to get the CDATA is to load the xml using
    $xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
    (PHP >= 5.1.0)

    Then you just get the CDATA elements without complications.
    See: http://us.php.net/manual/en/function.simplexml-load-string.php

  19. Tristan Bailey says:
    July 8th, 2008 at 4:10 am

    Nice article and good to point out the limits of the documentation rather than just compete the example without it.

  20. webborne says:
    July 10th, 2008 at 1:34 pm

    Man. You just itched my scratch, thanks for the article!

  21. MTHarris Blogs » Parsing XML with SimpleXML says:
    July 10th, 2008 at 2:12 pm

    [...] refering to grabbing content from inside different namespaces.  Stuart Herbert on php “using simpleXML to pull rss feeds” seemed to have the same problem as I - which was solved using a get children method.  [...]

  22. slloyd says:
    October 31st, 2008 at 6:16 am

    Thanks a bunch for the great article! It really helped me out. Thanks!

  23. Ian Rose says:
    December 2nd, 2008 at 11:22 pm

    Perfect. Exactly what I needed to get to my encoded CDATA feed content. Thanks a ton.

  24. Fuzzy says:
    December 5th, 2008 at 10:38 pm

    the code is great, thank you. What i would like to do now is limit the number of entries shown on my webpage. how would i do this please? What i am trying to eventually do, is once the articles are shown on the webpage, a user can click a check box next to each article, and then click a submit button which would save the article in a mysql database.

    So i will have a table similar to

    ArticleID
    Title
    Link
    Description
    Date
    Body /// for body the whole article would be stored in there, so somehow pull the data from the page in ‘link’ and store that.

    Any ideas ?

  25. Andy G. says:
    January 7th, 2009 at 3:42 am

    Thanks for your help with solving this namespace issue. Working great.

  26. Tine Müller says:
    January 21st, 2009 at 12:01 pm

    Shouldn’t it be possible to see YOUR blog on my site if I copy your code?

    After changing all ‘ and ’ to ‘ i uploaded the file but it’s blank. Should I change some of the code before it’s functioning?

  27. Tine Müller says:
    January 26th, 2009 at 11:01 am

    Why not also put the error code in your tutorial so us beginners can understand why the blog isn’t showed, please?

  28. Whoila Blog » Blog Archive » Some useful information on parsing RSS 2 feeds says:
    February 14th, 2009 at 8:54 am

    [...] http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/ [...]

  29. Work at Home says:
    April 16th, 2009 at 2:52 am

    I just recently switched (today!) from my hosted windows platform to linux. I am not versed in php at all, and of course, all of my asp broke when I moved to linux. My whole site revolves around a feed that I get from another site, I was using an asp script and xsl to publish the feed. I came upon your site in my search on Google :)

    I wanted to comment that this is filed under beginners… and my gosh… I am less than a beginner! I don’t quite understand all of this, but I am reading the explanations a few times over. I appreciate how detailed this article is.

  30. Jack says:
    May 19th, 2009 at 8:52 pm

    Nice one!!! Your post helped me a lot!!

  31. Richard Williams says:
    May 24th, 2009 at 5:23 pm

    Very good and needed article. RSS can be parsed also using biterscripting. I use biterscripting in addition to java.

    Richard

  32. Gregg says:
    June 12th, 2009 at 4:17 pm

    Incredible writeup. this will come in handy! thank you.

    for error handling be sure to do a try/catch on creating new SimpleXMLElement object

    try {
        $xmlData = @new SimpleXMLElement($rawFeed);
    } catch (Exception $e) {
        // error handling in here
    }
