I’ve recently switched my blog from b2evolution back to WordPress. The good news is both “no more spam :)” and “the admin panel works in Safari”, but on the downside I missed the multiblog feature that attracted me to b2evolution in the first place. There is WordPress MU, I suppose, but after coming across a few plugins that warned they didn’t work with WordPress MU, that option didn’t look very appealing.

Ah ha – thinks I – I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5 :) Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.

Getting an RSS 2 feed into SimpleXML is trivial:

 
$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2'; 
$rawFeed = file_get_contents($feedUrl); 
$xml = new SimpleXmlElement($rawFeed);

Extracting the information from the RSS ‘channel’ is equally trivial:

$channel['title'] = $xml->channel->title;
$channel['link']  = $xml->channel->link;

… and so on. Getting at the individual articles starts off just as easy:

foreach ($xml->channel->item as $item) 
{     
    $article = array();
    $article['title'] = $item->title;
    $article['link'] = $item->link; 
}

… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.

Some of the elements in an item sit inside different XML namespaces. The most direct way to get at them is to pass the namespace URI to the children() method of a SimpleXMLElement:

 
$dc = $item->children('http://purl.org/dc/elements/1.1/'); 
$article['creator'] = $dc->creator;
foreach ($dc->subject as $subject)
    $article['subject'][] = $subject;

That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:

// The following does NOT work!
$article['creator'] = $item->dc->creator;

… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is just shorthand, declared by an xmlns:dc attribute that normally sits on the root element of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone declaring the prefix as ‘dublinCore’ instead of ‘dc’ if they wanted to.)
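To make the point about prefixes concrete, here is a minimal sketch. The two XML fragments are made up for illustration; they differ only in the prefix they bind to the Dublin Core URI. Because children() is given the URI rather than the prefix, the same code reads both.

```php
<?php
// Two made-up fragments: same namespace URI, different prefixes.
$withDc = '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
        . '<dc:creator>Stu</dc:creator></item>';
$withDublinCore = '<item xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
                . '<dublinCore:creator>Stu</dublinCore:creator></item>';

$uri = 'http://purl.org/dc/elements/1.1/';

$creators = array();
foreach (array($withDc, $withDublinCore) as $rawXml) {
    $item = new SimpleXMLElement($rawXml);
    // children() matches on the namespace URI, so the prefix is irrelevant
    $creators[] = (string) $item->children($uri)->creator;
}
```

Both entries in $creators come out as the same string, whichever prefix the document happened to use.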

Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The simplest way to do this safely is to define an array of shortcuts yourself:

// define the namespaces that we are interested in
$ns = array
(
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'wfw' => 'http://wellformedweb.org/CommentAPI/',
    'dc' => 'http://purl.org/dc/elements/1.1/'
);

// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;

You can get a list of the namespaces like this:

$ns = $xml->getNamespaces(true);
$dc = $item->children($ns['dc']);

… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
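If you really do need to know which prefix a document uses, you can search the declared namespaces by URI instead of assuming ‘dc’. This is a sketch against a made-up feed fragment; getDocNamespaces() returns an array of prefix => URI, so array_search() gives you the prefix.

```php
<?php
// Made-up feed that declares Dublin Core under the prefix 'dublinCore'.
$rawFeed = '<rss version="2.0" xmlns:dublinCore="http://purl.org/dc/elements/1.1/">'
         . '<channel><item>'
         . '<dublinCore:creator>Stu</dublinCore:creator>'
         . '</item></channel></rss>';
$xml = new SimpleXMLElement($rawFeed);

// prefix => URI for every namespace declared anywhere in the document
$declared = $xml->getDocNamespaces(true);

// array_search() returns the matching key (the prefix), or false if the
// document never declares this namespace
$prefix = array_search('http://purl.org/dc/elements/1.1/', $declared);
```

Here $declared has no ‘dc’ key at all, which is exactly the case where `$ns['dc']` would let you down.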

That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).

If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works: you cast the element to a string.

$content = $item->children($ns['content']);
$article['content'] = trim((string) $content->encoded);
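One of the comments below points out an alternative: load the XML with the LIBXML_NOCDATA flag, which asks libxml to merge CDATA sections into ordinary text nodes. The sketch below uses a made-up item and reads the same content both ways so you can compare them.

```php
<?php
// Made-up item with its body wrapped in a CDATA section.
$rawItem = '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
         . '<content:encoded><![CDATA[ <p>Hello <b>world</b></p> ]]></content:encoded>'
         . '</item>';
$contentNs = 'http://purl.org/rss/1.0/modules/content/';

// approach 1: default parse, then cast the element to a string
$item = new SimpleXMLElement($rawItem);
$viaCast = trim((string) $item->children($contentNs)->encoded);

// approach 2: ask libxml to turn CDATA into plain text nodes while parsing
$item = simplexml_load_string($rawItem, 'SimpleXMLElement', LIBXML_NOCDATA);
$viaFlag = trim((string) $item->children($contentNs)->encoded);
```

Either way you still need children() to reach the ‘content’ namespace; the flag only changes how the CDATA section is represented in the parsed tree.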

With that, the final code to read an RSS 2 feed looks like this:

// define the namespaces that we are interested in
$ns = array
(
        'content' => 'http://purl.org/rss/1.0/modules/content/',
        'wfw' => 'http://wellformedweb.org/CommentAPI/',
        'dc' => 'http://purl.org/dc/elements/1.1/'
);

// obtain the articles in the feeds, and construct an array of articles

$articles = array();

// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';

$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXmlElement($rawFeed);

// step 2: extract the channel metadata

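// pubDate, generator and language are children of <channel>, not of <rss>;
// cast to string so we store plain values rather than SimpleXMLElement objects
$channel = array();
$channel['title']       = (string) $xml->channel->title;
$channel['link']        = (string) $xml->channel->link;
$channel['description'] = (string) $xml->channel->description;
$channel['pubDate']     = (string) $xml->channel->pubDate;
$channel['timestamp']   = strtotime((string) $xml->channel->pubDate);
$channel['generator']   = (string) $xml->channel->generator;
$channel['language']    = (string) $xml->channel->language;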

// step 3: extract the articles

foreach ($xml->channel->item as $item)
{
        $article = array();
        // remember which feed this article came from
        $article['channel'] = $channel;
        $article['title'] = (string) $item->title;
        $article['link'] = (string) $item->link;
        $article['comments'] = (string) $item->comments;
        $article['pubDate'] = (string) $item->pubDate;
        $article['timestamp'] = strtotime((string) $item->pubDate);
        $article['description'] = trim((string) $item->description);
        $article['isPermaLink'] = (string) $item->guid['isPermaLink'];

        // get data held in namespaces
        $content = $item->children($ns['content']);
        $dc      = $item->children($ns['dc']);
        $wfw     = $item->children($ns['wfw']);

        $article['creator'] = (string) $dc->creator;
        foreach ($dc->subject as $subject)
                $article['subject'][] = (string)$subject;

        $article['content'] = trim((string) $content->encoded);
        $article['commentRss'] = (string) $wfw->commentRss;

        // add this article to the list
        $articles[$article['timestamp']] = $article;
}

// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
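Because $articles is keyed by timestamp, putting the newest articles first and trimming the list for a homepage takes only a couple of calls. This is a sketch using a small stand-in array; the titles, timestamps and the limit of two are made up for illustration.

```php
<?php
// Stand-in for the $articles array built above, keyed by Unix timestamp.
$articles = array(
    1168128000 => array('title' => 'Older post'),
    1168300800 => array('title' => 'Newest post'),
    1168214400 => array('title' => 'Middle post'),
);

// sort by key, highest (most recent) timestamp first
krsort($articles);

// keep only the first two; the final 'true' preserves the timestamp keys
$latest = array_slice($articles, 0, 2, true);
```

One thing to watch: two articles published in the same second share a key, so the later one overwrites the earlier one. If you merge several feeds, a key such as the timestamp combined with the article link would avoid that.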

Don’t forget to add error handling :)
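As a starting point, here is one way the download and parse steps could be guarded. It is only a sketch: parseFeed() is a hypothetical helper name, not part of SimpleXML, and returning null on failure is just one possible convention.

```php
<?php
/**
 * Hypothetical helper: turn raw feed text into a SimpleXMLElement,
 * or return null if the text is missing, malformed, or not an RSS feed.
 */
function parseFeed($rawFeed)
{
    // file_get_contents() returns false when the download fails
    if ($rawFeed === false || trim($rawFeed) === '') {
        return null;
    }

    // collect libxml errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawFeed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false on malformed XML;
    // an RSS 2 document must also contain a <channel> element
    if ($xml === false || !isset($xml->channel)) {
        return null;
    }

    return $xml;
}

// intended use with the code above (not run here):
//   $xml = parseFeed(@file_get_contents($blog_url));
//   if ($xml === null) { /* skip this feed, or show a fallback */ }
```

Unlike `new SimpleXMLElement()`, which throws an exception on bad input, simplexml_load_string() returns false, so the check is a plain comparison.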

I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.

71 Comments

  1. vb says:
    January 8th, 2007 at 1:14 am

    Thanks for all !

    I’was looking for help about my tag all the day !

    At least, i’ve found a good paper and… it’s work !

  2. Benjamin Klaile says:
    January 8th, 2007 at 8:44 am

    Hi,
    you made a little mistake in your first code section


    $feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
    $rawFeed = file_get_contents($blog_url);

    $feedUrl should be $blog_url, else the $blog_url variable is not defined.

    Nice article though.
    Regards, Ben.

  3. Stu says:
    January 8th, 2007 at 9:09 am

    Hi Ben,
    Thanks for spotting the mistake; it should be fixed now :)
    Best regards,
    Stu

  4. nick says:
    January 8th, 2007 at 4:59 pm

    “… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.”

    You can avoid this altogether – getDocNamespaces() returns an associative array where the key is the namespace prefix. For instance, run the code from the example on the manual page for getNamespaces(). Just search the array values for your namespace URI if you really want to know the document’s prefix for that namespace.

    Then another good way to deal with namespaces with SimpleXML is to use xpath to get your nodes, and register the namespaces using registerXpathNamespace(). Once you register the namespace you can query for nodes in that namespace using SimpleXML->xpath().

  5. nick says:
    January 8th, 2007 at 5:22 pm

    Also you may want to cast your atomic SimpleXML element values as strings (or whatever) because as-written, the $articles sub-array elements will be of type SimpleXMLElement. Not sure that you want that later on.

    $channel['title'] = (string)$xml->channel->title;

  6. Richard@Home says:
    January 18th, 2007 at 5:10 pm

    Have you checked out MagpieRSS? ( http://magpierss.sourceforge.net/ )

  7. rob ganly says:
    February 22nd, 2007 at 5:16 pm

    hey there, nice work.

    i’m using simplexml to get a business bbc newsfeed for a site that i’m working on but for some reason it doesn’t return it by published order… in fact, i can see no logic to the order the items are returned.

    i would’ve thought it’d be natural for the rss feed to return the items in published date/time order (chronologically descending).

    i can do the ordering myself but want to know if there’s a sweet way of doing it or making the rss return it as desired.

  8. Yelena says:
    October 27th, 2007 at 12:44 am

    Hi
    Thank you for this parser. It is really good
    The only probllem I have now, it is ugly characters microsoft quotes etc.
    for example ““predators”” on my page should be predators. How to fix it?
    Thank you

  9. Bojan says:
    December 18th, 2007 at 5:26 pm

    Great stuff! you helped me a lot with the CDATA “children” !

  10. Gwyneth Llewelyn says:
    February 16th, 2008 at 3:27 am

    Thanks for the many precious tips. This was pretty useful!

  11. Glou says:
    February 17th, 2008 at 12:12 am

    Hello,

    I don’t understand, i try it there :http://www.spx.be/XML/rss2.php5 but it’s doesn’t show !

    Can you understand why ?

    Thanks

  12. Michael says:
    February 22nd, 2008 at 9:41 pm

    This info is very good. Thanks for your works :)

  13. Vivek says:
    February 25th, 2008 at 1:24 pm

    Hi, it’s very useful. Thanks!

  14. Vivek says:
    February 25th, 2008 at 1:27 pm

    Can you use the same code for Atom feed also? I guess not. How do we tweak it to accept atom feed?

  15. Dave says:
    February 25th, 2008 at 6:02 pm

    An excellent and very useful article, especially with regards to retrieving information from namespaces. Many thanks!

  16. Karpacz Noclegi says:
    March 16th, 2008 at 3:30 pm

    Good work! i It is very usefull. i hope that i understand everything, and Your advice will work:)

  17. » ?php5?simplexml????feed - ?????? says:
    April 1st, 2008 at 7:37 am

    [...] Using SimpleXML To Parse RSS Feeds [...]

  18. Mike says:
    May 10th, 2008 at 5:11 pm

    Thanks for the code – it’s been a great help!

    Another way to get the CDATA is to load the xml using
    $xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
    (PHP >= 5.1.0)

    Then you just get the CDATA elements without complications.
    See: http://us.php.net/manual/en/function.simplexml-load-string.php

  19. Tristan Bailey says:
    July 8th, 2008 at 4:10 am

    Nice article and good to point out the limits of the documentation rather than just compete the example without it.

  20. webborne says:
    July 10th, 2008 at 1:34 pm

    Man. You just itched my scratch, thanks for the article!

  21. MTHarris Blogs » Parsing XML with SimpleXML says:
    July 10th, 2008 at 2:12 pm

    [...] refering to grabbing content from inside different namespaces. Stuart Herbert on php “using simpleXML to pull rss feeds” seemed to have the same problem as I – which was solved using a get children method. [...]

  22. slloyd says:
    October 31st, 2008 at 6:16 am

    Thanks a bunch for the great article! It really helped me out. Thanks!

  23. Ian Rose says:
    December 2nd, 2008 at 11:22 pm

    Perfect. Exactly what I needed to get to my encoded CDATA feed content. Thanks a ton.

  24. Fuzzy says:
    December 5th, 2008 at 10:38 pm

    the code is great, thank you. What i would like to do now is limit the number of entries shown on my webpage. how would i do this please? What i am trying to eventually do, is once the articles are shown on the webpage, a user can click a check box next to each article, and then click a submit button which would save the article in a mysql database.

    So i will have a table similar to

    ArticleID
    Title
    Link
    Description
    Date
    Body /// for body the whole article would be stored in there, so somehow pull the data from the page in ‘link’ and store that.

    Any ideas ?

  25. Andy G. says:
    January 7th, 2009 at 3:42 am

    Thanks for your help with solving this namespace issue. Working great.

  26. Tine Mller says:
    January 21st, 2009 at 12:01 pm

    Shouldn’t it be possible to see YOUR blog on my site if I copy your code?

    After changing all ‘ and ’ to ' I uploaded the file but it’s blank. Should I change some of the code before it’s functioning?

  27. Tine Mller says:
    January 26th, 2009 at 11:01 am

    Why not also put the error code in your tutorial so us beginners can understand why the blog isn’t showed, please?

  28. Whoila Blog » Blog Archive » Some useful information on parsing RSS 2 feeds says:
    February 14th, 2009 at 8:54 am

    [...] http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/ [...]

  29. Work at Home says:
    April 16th, 2009 at 2:52 am

    I just recently switched (today!) from my hosted windows platform to linux. I am not versed in php at all, and of course, all of my asp broke when I moved to linux. My whole site revolves around a feed that I get from another site, I was using an asp script and xsl to publish the feed. I came upon your site in my search on Google :)

    I wanted to comment that this is filed under beginners… and my gosh… I am less than a beginner! I don’t quite understand all of this, but I am reading the explanations a few times over. I appreciate how detailed this article is.

  30. Jack says:
    May 19th, 2009 at 8:52 pm

    Nice one!!! Your post helped me a lot!!

  31. Richard Williams says:
    May 24th, 2009 at 5:23 pm

    Very good and needed article. RSS can be parsed also using biterscripting. I use biterscripting in addition to java.

    Richard

  32. Gregg says:
    June 12th, 2009 at 4:17 pm

    Incredible writeup. this will come in handy! thank you.

    for error handling be sure to do a try/catch on creating new SimpleXMLElement object

    try {
        $xmlData = @new SimpleXMLElement($rawFeed);
    } catch (Exception $e) {
        // error handling in here
    }

  33. Amy Varga says:
    July 6th, 2009 at 4:55 pm

    Very valuable information, thank you so much!

  34. swathi says:
    July 25th, 2009 at 9:29 am

    thanks a lot for this article…….very understandable

  35. Nicolas says:
    September 12th, 2009 at 10:40 am

    Your just…god !
    4 hours i’m looking for content:encoded extraction. Thanks !

  36. Feyi says:
    September 30th, 2009 at 8:46 am

    Thanks Stuart. I found your site from a thread on http://www.neowin.net/forum/index.php?showtopic=761114.
    This article is very enlightening. I’m trying to parse a namespace-based XML and it’s proving problematic.
    Thanks to this, it now problem solved.
    Thanks a lot

  37. Julian says:
    January 8th, 2010 at 1:18 pm

    Very nice!
    Thanks.

  38. Karpacz says:
    April 11th, 2010 at 9:15 am

    I used some elements of the code, but it’s hard to understand.

  40. PHP: What You Need To Know To Play With The Web | Web Design Cool says:
    April 15th, 2010 at 3:22 pm

    [...] That’s all. The only stumbling block you will encounter is CDATA blocks and namespaces in SimpleXML. Stuart Herbert has a good introduction to these two issues in this article. [...]

  43. avşa says:
    April 16th, 2010 at 6:39 am

    very useful article i will use this in my blog.
    thank you,

  52. Vashistha says:
    June 12th, 2010 at 11:36 am

    U rock.. solved half of my rss reading problems..

  53. Dan says:
    June 20th, 2010 at 6:11 pm

    Thanks this is a nice tutorial, precisely what I was looking for.

    Regards

    Dan

  54. Amit says:
    July 22nd, 2010 at 2:53 am

    Thanks a lot.

    It really helped.

    Regards,
    Amit

  55. yegle says:
    August 30th, 2010 at 6:28 am

    Let me add SEO keywords to this post :-)
    CDATA, PHP, simpleXML, LIBXML_NOCDATA

    And MANY THANKS!

  56. Jose Luis says:
    September 12th, 2010 at 2:30 pm

    Thanks Stuart, your article helped me a lot.
    Regards.

  58. Parsing RSS feeds with SimpleXML | boyhappy says:
    September 29th, 2010 at 9:13 am

    [...] To get and parse a rss feed is fairly simple in php, when you use SimpleXML. But you will run into problems when you want to reach the full content and a couple other things. They are inside different XML namespaces. I found this exelent post which explanes how to handle this: http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/ [...]

  59. Pete's food processor reviews says:
    January 14th, 2011 at 7:23 pm

    Thanks for the tip. I’ll have to read it a few times to get my head around it since I don’t come from a programming background but I’m sure I can use it on my website.

  60. Doc Stupid says:
    February 17th, 2011 at 5:53 am

    Thanks man! This saved me some time!

  61. 怎么快速祛痘 says:
    March 10th, 2011 at 5:28 am

    The only stumbling block you will encounter is CDATA

  62. Popular toys says:
    May 10th, 2011 at 11:08 am

    Great code tip, I used it on one of my own wordpress sites and it really works easy and stable.

    Regards
    Pop

  63. Aashish Aggarwal says:
    June 1st, 2011 at 8:12 am

    If you are struggling with xml namespaces, there is a great tutorial on xpath namespaces at xml reports. It walks you through it in very simple steps

    xml reports

  64. Niklas says:
    June 14th, 2011 at 11:33 am

    How can I load Atom feeds with SimpleXML? I’ve tried loading an xml document containing an atom feed but the retrieved object is empty, why is that?

  65. erick says:
    June 23rd, 2011 at 4:30 pm

    You are the best! It´s all I can say. I have spend several weeks looking for this incredible information in order to parse feeds contents which includes the “funny” dc namespace…This post entry deserves to be in the very first position on Google. It´s by far the best information about the topic. And the best thing: it works! Thanks a lot.

  66. James says:
    July 5th, 2011 at 3:02 pm

    Found this very useful, thanks.

  67. PuSH deel 3 | Kblog says:
    August 1st, 2011 at 1:17 am

    [...] of XML. I already knew them, because I had already looked at them for the VKblog importer. And I found a blog on the internet by someone who had once read in an RSS feed, and who had to [...] a number of problems to do so [...]

  68. Sophie says:
    January 27th, 2012 at 9:32 am

    It really helped, thanks a lot !

  69. Where is the Error Handling? says:
    March 25th, 2012 at 12:04 am

    I am very impressed by the code, but why stop short of error handling? I should have liked to see more.

  70. Vidal says:
    June 30th, 2012 at 7:08 pm

    Cool, I like the multiple feeds, also check out this free php snippet, a php rssparser with cache support http://rssparser.com/php-simplexml-rss-parser-with-caching-support/

  71. sajid hussain says:
    October 11th, 2012 at 9:15 am

    I was looking for this everything was done but not getting able to resolve the namespaces issue. Thanks a lot for writing this. It worked amazing for me