Using SimpleXML To Parse RSS Feeds
Posted by Stuart Herbert on January 7th, 2007 in 1 - Beginner, Examples.
I’ve recently switched my blog from b2evolution back to WordPress. The good news is both “no more spam :)” and “the admin panel works in Safari”, but on the downside I missed the multiblog feature that attracted me to b2evolution in the first place. There is WordPress MU, I suppose, but after coming across a few plugins that warned they didn’t work with WordPress MU, that option didn’t look very appealing.
Ah ha – thinks I – I can fake the multiblog by putting several different blogs on the site, and generating a homepage from the RSS feeds of the individual blogs. Should be simple enough, and it sounds like the perfect nail to hit with the SimpleXML hammer of PHP 5 :) Funnily enough, in work last week we were wondering whether you could use SimpleXML with XML namespaces (alas, we still use PHP 4 at work atm), so armed with the perfect excuse, I set to work.
Getting an RSS 2 feed into SimpleXML is trivial:
$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($feedUrl);
$xml = new SimpleXMLElement($rawFeed);
Extracting the information from the RSS ‘channel’ is equally trivial:
$channel['title'] = $xml->channel->title;
$channel['link'] = $xml->channel->link;
… and so on. Getting at the individual articles starts off just as easy:
foreach ($xml->channel->item as $item)
{
$article = array();
$article['title'] = $item->title;
$article['link'] = $item->link;
}
… but, if you’re relying on the very thin SimpleXML documentation on php.net, like me you’ll soon run into two problems.
Some of the elements in an item sit inside different XML namespaces. The only way to get at them is to use the children() method on a SimpleXMLElement:
$dc = $item->children('http://purl.org/dc/elements/1.1/');
$article['creator'] = $dc->creator;
foreach ($dc->subject as $subject)
$article['subject'][] = $subject;
That’s a bit of a mouthful. It’s a bit of a shame that I can’t do this:
// The following does NOT work!
$article['creator'] = $item->dc->creator;
… or some variation on that, but the design of XML namespaces makes that impractical. (The XML namespace is actually the URI; the ‘dc’ prefix in a tag like <dc:creator> is shorthand defined in the opening tag at the top of the XML document. Although it would look a bit odd, there’s nothing at all to stop someone defining the ‘dc’ component as ‘dublinCore’ instead if they wanted to).
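To make the prefix-versus-URI point concrete, here is a minimal sketch. The two inline documents are made up for illustration; they bind the same Dublin Core URI to different prefixes, and children() finds the creator in both because it is given the URI rather than the prefix.

```php
<?php
// Two made-up documents that bind the same namespace URI to different prefixes.
$dcUri = 'http://purl.org/dc/elements/1.1/';

$withDc = '<item xmlns:dc="' . $dcUri . '">'
    . '<dc:creator>Stuart</dc:creator></item>';
$withLongPrefix = '<item xmlns:dublinCore="' . $dcUri . '">'
    . '<dublinCore:creator>Stuart</dublinCore:creator></item>';

$creators = array();
foreach (array($withDc, $withLongPrefix) as $rawXml) {
    $item = new SimpleXMLElement($rawXml);
    // children() is given the URI, so the prefix used in the document does not matter
    $creators[] = (string) $item->children($dcUri)->creator;
}

echo implode(', ', $creators), "\n";
```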
Having to pass the full URI for a namespace into children() is not my idea of fun! It’d be much better if we could pass in a shorter string instead. The only way to safely do this is to define an array of shortcuts yourself:
// define the namespaces that we are interested in
$ns = array
(
'content' => 'http://purl.org/rss/1.0/modules/content/',
'wfw' => 'http://wellformedweb.org/CommentAPI/',
'dc' => 'http://purl.org/dc/elements/1.1/'
);
// now we can get dublin core content with a lot less typing!
// we also only have to update the code in one place if the namespace URI changes
$dc = $item->children($ns['dc']);
$article['creator'] = $dc->creator;
You can get a list of the namespaces like this:
$ns = $xml->getNamespaces(true);
$dc = $item->children($ns['dc']);
… but that only works if the XML document defines the prefix ‘dc’ for the namespace ‘http://purl.org/dc/elements/1.1/’. You’ll have to decide for yourself whether it’s a risk worth taking or not.
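If you do want to rely on the document's own prefixes, one way to reduce the risk is to check that the prefix really points at the URI you expect before trusting it. This is only a sketch: the inline feed is made up, and $usesExpectedPrefix is a hypothetical variable name.

```php
<?php
// A made-up, minimal RSS 2 document for illustration.
$rawFeed = '<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    . '<channel><item><title>Hello</title>'
    . '<dc:creator>Stuart</dc:creator></item></channel></rss>';
$xml = new SimpleXMLElement($rawFeed);

$dcUri = 'http://purl.org/dc/elements/1.1/';

// prefix => URI for every namespace used anywhere in the document
$found = $xml->getNamespaces(true);

// only trust the document's 'dc' prefix if it really points at Dublin Core
$usesExpectedPrefix = isset($found['dc']) && $found['dc'] === $dcUri;

// asking for the URI directly works whatever prefix the document chose
$item = $xml->channel->item;
$creator = (string) $item->children($dcUri)->creator;
```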
That’s namespaces tamed, but we’re not quite home yet. The actual ‘content’ part of the article sits inside a CDATA section inside a ‘content’ namespace, and how to deal with CDATA is conspicuous by its absence in the SimpleXML docs (probably because older versions of SimpleXML simply threw CDATA sections away without asking you).
If you have a look at the source code for SimpleXML, test 004 shows how basic CDATA access works.
$content = $item->children($ns['content']);
$article['content'] = (string) trim($content->encoded);
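Here is the same idea as a self-contained sketch you can run without a live feed. The inline item is made up for illustration; the cast to string is what pulls the text out of the CDATA section.

```php
<?php
// A made-up item with its body inside a CDATA section in the 'content' namespace.
$contentUri = 'http://purl.org/rss/1.0/modules/content/';
$rawItem = '<item xmlns:content="' . $contentUri . '">'
    . '<content:encoded><![CDATA[ <p>Hello <b>world</b></p> ]]></content:encoded>'
    . '</item>';

$item = new SimpleXMLElement($rawItem);
$content = $item->children($contentUri);

// casting to string returns the CDATA text; trim() removes the padding around it
$body = trim((string) $content->encoded);

echo $body, "\n";
```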
With that, the final code to read an RSS 2 feed looks like this:
// define the namespaces that we are interested in
$ns = array
(
'content' => 'http://purl.org/rss/1.0/modules/content/',
'wfw' => 'http://wellformedweb.org/CommentAPI/',
'dc' => 'http://purl.org/dc/elements/1.1/'
);
// obtain the articles in the feeds, and construct an array of articles
$articles = array();
// step 1: get the feed
$blog_url = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($blog_url);
$xml = new SimpleXmlElement($rawFeed);
// step 2: extract the channel metadata
$channel = array();
$channel['title'] = $xml->channel->title;
$channel['link'] = $xml->channel->link;
$channel['description'] = $xml->channel->description;
$channel['pubDate'] = $xml->channel->pubDate;
$channel['timestamp'] = strtotime((string) $xml->channel->pubDate);
$channel['generator'] = $xml->channel->generator;
$channel['language'] = $xml->channel->language;
// step 3: extract the articles
foreach ($xml->channel->item as $item)
{
$article = array();
$article['channel'] = $blog_url; // which feed this article came from
$article['title'] = $item->title;
$article['link'] = $item->link;
$article['comments'] = $item->comments;
$article['pubDate'] = $item->pubDate;
$article['timestamp'] = strtotime($item->pubDate);
$article['description'] = (string) trim($item->description);
$article['isPermaLink'] = $item->guid['isPermaLink'];
// get data held in namespaces
$content = $item->children($ns['content']);
$dc = $item->children($ns['dc']);
$wfw = $item->children($ns['wfw']);
$article['creator'] = (string) $dc->creator;
foreach ($dc->subject as $subject)
$article['subject'][] = (string)$subject;
$article['content'] = (string)trim($content->encoded);
$article['commentRss'] = $wfw->commentRss;
// add this article to the list
$articles[$article['timestamp']] = $article;
}
// at this point, $channel contains all the metadata about the RSS feed,
// and $articles contains an array of articles for us to repurpose
Don’t forget to add error handling :)
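As a starting point, here is one possible shape for that error handling. It is a sketch rather than the only way to do it: parseFeed() is a hypothetical helper name, and it swaps the constructor for simplexml_load_string() so that a bad feed gives you a simple null to test for.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null if the feed is unusable.
function parseFeed($rawFeed)
{
    // file_get_contents() returns false when the download fails
    if ($rawFeed === false || $rawFeed === '') {
        return null;
    }

    // collect libxml parse errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawFeed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false when the XML is not well formed
    return ($xml === false) ? null : $xml;
}

$good = parseFeed('<rss><channel><title>Demo</title></channel></rss>');
$bad = parseFeed('<rss><channel>');
```

In the full listing you would pass the result of file_get_contents() to the helper, and skip the feed when it returns null.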
I hope this example helps anyone else who needs to work with RSS 2 feeds, or who needs to know how to work with namespaces and CDATA with SimpleXML.
71 Comments
January 8th, 2007 at 1:14 am
Thanks for everything!
I was looking for help with my tag all day!
At last, I’ve found a good article and… it works!
January 8th, 2007 at 8:44 am
Hi,
you made a little mistake in your first code section
—
$feedUrl = 'http://blog.stuartherbert.com/php/?feed=rss2';
$rawFeed = file_get_contents($blog_url);
—
$feedUrl should be $blog_url, else the $blog_url variable is not defined.
Nice article though.
Regards, Ben.
January 8th, 2007 at 9:09 am
Hi Ben,
Thanks for spotting the mistake; it should be fixed now :)
Best regards,
Stu
January 8th, 2007 at 4:59 pm
You can avoid this altogether – getDocNamespaces() returns an associative array where the key is the namespace prefix. For instance, run the code from the example on the manual page for getNamespaces(). Just search the array values for your namespace URI if you really want to know the document’s prefix for that namespace.
Then another good way to deal with namespaces with SimpleXML is to use XPath to get your nodes, and register the namespaces using registerXPathNamespace(). Once you register the namespace you can query for nodes in that namespace using SimpleXMLElement::xpath().
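Both suggestions can be sketched like this. The inline feed is made up for illustration, and the 'd' prefix is an arbitrary choice of ours: the prefix you register for XPath does not have to match the one used in the document.

```php
<?php
// A made-up feed with two items, for illustration only.
$dcUri = 'http://purl.org/dc/elements/1.1/';
$rawFeed = '<rss version="2.0" xmlns:dc="' . $dcUri . '">'
    . '<channel>'
    . '<item><title>One</title><dc:creator>Stuart</dc:creator></item>'
    . '<item><title>Two</title><dc:creator>Ben</dc:creator></item>'
    . '</channel></rss>';
$xml = new SimpleXMLElement($rawFeed);

// look up which prefix the document itself uses for the Dublin Core URI
$prefix = array_search($dcUri, $xml->getDocNamespaces());

// register our own prefix for the same URI, then query with it
$xml->registerXPathNamespace('d', $dcUri);

$creators = array();
foreach ($xml->xpath('//item/d:creator') as $node) {
    $creators[] = (string) $node;
}
```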
January 8th, 2007 at 5:22 pm
Also you may want to cast your atomic SimpleXML element values as strings (or whatever) because as-written, the $articles sub-array elements will be of type SimpleXMLElement. Not sure that you want that later on.
$channel['title'] = (string)$xml->channel->title;
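A quick sketch of the difference, using a made-up one-line document: without the cast you are holding an object that still belongs to the XML tree, and with it you have a plain string.

```php
<?php
$xml = new SimpleXMLElement('<rss><channel><title>My Blog</title></channel></rss>');

$uncast = $xml->channel->title;            // still a SimpleXMLElement
$cast = (string) $xml->channel->title;     // a plain PHP string

$uncastIsObject = ($uncast instanceof SimpleXMLElement);
$castIsString = is_string($cast);
```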
January 18th, 2007 at 5:10 pm
Have you checked out MagpieRSS? ( http://magpierss.sourceforge.net/ )
February 22nd, 2007 at 5:16 pm
hey there, nice work.
i’m using simplexml to get a business bbc newsfeed for a site that i’m working on but for some reason it doesn’t return it by published order… in fact, i can see no logic to the order the items are returned.
i would’ve thought it’d be natural for the rss feed to return the items in published date/time order (chronologically descending).
i can do the ordering myself but want to know if there’s a sweet way of doing it or making the rss return it as desired.
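One way to do the ordering yourself, sketched under the assumption that each article is an array with a 'timestamp' key as in the article's final listing. The sample data and the compareNewestFirst() name are made up for illustration.

```php
<?php
// Made-up articles, deliberately out of order.
$articles = array(
    array('title' => 'Middle', 'timestamp' => strtotime('2007-02-21 09:00:00 UTC')),
    array('title' => 'Newest', 'timestamp' => strtotime('2007-02-22 17:00:00 UTC')),
    array('title' => 'Oldest', 'timestamp' => strtotime('2007-02-20 08:00:00 UTC')),
);

// Hypothetical comparison function: larger timestamps sort first.
function compareNewestFirst($a, $b)
{
    if ($a['timestamp'] == $b['timestamp']) {
        return 0;
    }
    return ($a['timestamp'] > $b['timestamp']) ? -1 : 1;
}

usort($articles, 'compareNewestFirst');

$titles = array();
foreach ($articles as $article) {
    $titles[] = $article['title'];
}
```

If the array is keyed by timestamp, as it is in the article's final listing, krsort($articles) would be a shorter alternative.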
October 27th, 2007 at 12:44 am
Hi
Thank you for this parser. It is really good
The only problem I have now is ugly characters: Microsoft quotes etc.
for example ““predators— on my page should be “predators”. How to fix it?
Thank you
December 18th, 2007 at 5:26 pm
Great stuff! you helped me a lot with the CDATA “children” !
February 16th, 2008 at 3:27 am
Thanks for the many precious tips. This was pretty useful!
February 17th, 2008 at 12:12 am
Hello,
I don’t understand, i try it there :http://www.spx.be/XML/rss2.php5 but it’s doesn’t show !
Can you understand why ?
Thanks
February 22nd, 2008 at 9:41 pm
This info is very good. Thanks for your works :)
February 25th, 2008 at 1:24 pm
Hi, it’s very useful. Thanks!
February 25th, 2008 at 1:27 pm
Can you use the same code for Atom feed also? I guess not. How do we tweak it to accept atom feed?
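The same children() technique should carry over to Atom, because Atom elements all live in the http://www.w3.org/2005/Atom namespace and are called feed and entry rather than channel and item. A minimal sketch, using a made-up inline feed rather than a real one:

```php
<?php
// A made-up Atom feed for illustration.
$atomNs = 'http://www.w3.org/2005/Atom';
$rawFeed = '<feed xmlns="' . $atomNs . '">'
    . '<title>Demo feed</title>'
    . '<entry><title>First post</title><id>urn:example:1</id>'
    . '<updated>2008-02-25T13:27:00Z</updated></entry>'
    . '<entry><title>Second post</title><id>urn:example:2</id>'
    . '<updated>2008-02-26T09:00:00Z</updated></entry>'
    . '</feed>';

$xml = new SimpleXMLElement($rawFeed);

// ask for the Atom namespace explicitly, by URI
$feed = $xml->children($atomNs);

$articles = array();
foreach ($feed->entry as $entry) {
    $articles[] = array(
        'title' => (string) $entry->title,
        'id' => (string) $entry->id,
        'timestamp' => strtotime((string) $entry->updated),
    );
}
```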
February 25th, 2008 at 6:02 pm
An excellent and very useful article, especially with regards to retrieving information from namespaces. Many thanks!
March 16th, 2008 at 3:30 pm
Good work! i It is very usefull. i hope that i understand everything, and Your advice will work:)
April 1st, 2008 at 7:37 am
[...] Using SimpleXML To Parse RSS Feeds [...]
May 10th, 2008 at 5:11 pm
Thanks for the code – it’s been a great help!
Another way to get the CDATA is to load the xml using
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);
(PHP >= 5.1.0)
Then you just get the CDATA elements without complications.
See: http://us.php.net/manual/en/function.simplexml-load-string.php
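A small sketch of that approach, using a made-up item. LIBXML_NOCDATA asks libxml to merge CDATA sections into ordinary text nodes while loading, and the text then reads back like any other element content.

```php
<?php
// A made-up item whose description sits in a CDATA section.
$rawItem = '<item><description>'
    . '<![CDATA[Some <em>marked up</em> text]]>'
    . '</description></item>';

// load with LIBXML_NOCDATA so CDATA is merged into plain text nodes
$item = simplexml_load_string($rawItem, 'SimpleXMLElement', LIBXML_NOCDATA);

$description = (string) $item->description;
```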
July 8th, 2008 at 4:10 am
Nice article, and good to point out the limits of the documentation rather than just complete the example without it.
July 10th, 2008 at 1:34 pm
Man. You just itched my scratch, thanks for the article!
July 10th, 2008 at 2:12 pm
[...] refering to grabbing content from inside different namespaces. Stuart Herbert on php “using simpleXML to pull rss feeds” seemed to have the same problem as I – which was solved using a get children method. [...]
October 31st, 2008 at 6:16 am
Thanks a bunch for the great article! It really helped me out. Thanks!
December 2nd, 2008 at 11:22 pm
Perfect. Exactly what I needed to get to my encoded CDATA feed content. Thanks a ton.
December 5th, 2008 at 10:38 pm
the code is great, thank you. What i would like to do now is limit the number of entries shown on my webpage. how would i do this please? What i am trying to eventually do, is once the articles are shown on the webpage, a user can click a check box next to each article, and then click a submit button which would save the article in a mysql database.
So i will have a table similar to
ArticleID
Title
Link
Description
Date
Body /// for body the whole article would be stored in there, so somehow pull the data from the page in ‘link’ and store that.
Any ideas ?
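For the first part of the question, limiting how many entries are shown, array_slice() is one option. This sketch assumes $articles is keyed by timestamp and already sorted newest first; the sample data and the $maxArticles name are made up for illustration.

```php
<?php
// Made-up articles keyed by timestamp, newest first.
$articles = array(
    1168128000 => array('title' => 'Newest'),
    1168041600 => array('title' => 'Middle'),
    1167955200 => array('title' => 'Older'),
    1167868800 => array('title' => 'Oldest'),
);

$maxArticles = 2;

// the fourth argument keeps the timestamp keys instead of renumbering from 0
$latest = array_slice($articles, 0, $maxArticles, true);
```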
January 7th, 2009 at 3:42 am
Thanks for your help with solving this namespace issue. Working great.
January 21st, 2009 at 12:01 pm
Shouldn’t it be possible to see YOUR blog on my site if I copy your code?
After changing all ‘ and ’ to ' I uploaded the file, but it’s blank. Should I change some of the code before it will work?
January 26th, 2009 at 11:01 am
Why not also put the error code in your tutorial so us beginners can understand why the blog isn’t showed, please?
February 14th, 2009 at 8:54 am
[...] http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/ [...]
April 16th, 2009 at 2:52 am
I just recently switched (today!) from my hosted windows platform to linux. I am not versed in php at all, and of course, all of my asp broke when I moved to linux. My whole site revolves around a feed that I get from another site, I was using an asp script and xsl to publish the feed. I came upon your site in my search on Google :)
I wanted to comment that this is filed under beginners… and my gosh… I am less than a beginner! I don’t quite understand all of this, but I am reading the explanations a few times over. I appreciate how detailed this article is.
May 19th, 2009 at 8:52 pm
Nice one!!! Your post helped me a lot!!
May 24th, 2009 at 5:23 pm
Very good and needed article. RSS can be parsed also using biterscripting. I use biterscripting in addition to java.
Richard
June 12th, 2009 at 4:17 pm
Incredible writeup. this will come in handy! thank you.
for error handling be sure to do a try/catch on creating new SimpleXMLElement object
try {
$xmlData = @new SimpleXMLElement($rawFeed);
} catch (Exception $e) {
// error handling in here
}
July 6th, 2009 at 4:55 pm
Very valuable information, thank you so much!
July 25th, 2009 at 9:29 am
thanks a lot for this article…….very understandable
September 12th, 2009 at 10:40 am
Your just…god !
4 hours i’m looking for content:encoded extraction. Thanks !
September 30th, 2009 at 8:46 am
Thanks Stuart. I found your site from a thread on http://www.neowin.net/forum/index.php?showtopic=761114.
This article is very enlightening. I’m trying to parse a namespace-based XML and it’s proving problematic.
Thanks to this, it’s now problem solved.
Thanks a lot
January 8th, 2010 at 1:18 pm
Very nice!
Thanks.
April 11th, 2010 at 9:15 am
I used some elements of the code, but it’s hard to understand.
April 15th, 2010 at 3:22 pm
[...] That’s all. The only stumbling block you will encounter is CDATA blocks and namespaces in SimpleXML. Stuart Herbert has a good introduction to these two issues in this article. [...]
April 16th, 2010 at 6:39 am
very useful article i will use this in my blog.
thank you,
June 12th, 2010 at 11:36 am
U rock.. solved half of my rss reading problems..
June 20th, 2010 at 6:11 pm
Thanks this is a nice tutorial, precisely what I was looking for.
Regards
Dan
July 22nd, 2010 at 2:53 am
Thanks a lot.
It really helped.
Regards,
Amit
August 30th, 2010 at 6:28 am
Let me add SEO keywords to this post :-)
CDATA, PHP, simpleXML, LIBXML_NOCDATA
And MANY THANKS!
September 12th, 2010 at 2:30 pm
Gracias Stuart, me ayudo bastante tu articulo.
Saludos.
September 29th, 2010 at 9:13 am
[...] To get and parse a rss feed is fairly simple in php, when you use SimpleXML. But you will run into problems when you want to reach the full content and a couple other things. They are inside different XML namespaces. I found this exelent post which explanes how to handle this: http://blog.stuartherbert.com/php/2007/01/07/using-simplexml-to-parse-rss-feeds/ [...]
January 14th, 2011 at 7:23 pm
Thanks for the tip. I’ll have to read it a few times to get my head around it since I don’t come from a programming background but I’m sure I can use it on my website.
February 17th, 2011 at 5:53 am
Thanks man! This saved me some time!
March 10th, 2011 at 5:28 am
The only stumbling block you will encounter is CDATA
May 10th, 2011 at 11:08 am
Great code tip, I used it on one of my own wordpress sites and it really works easy and stable.
Regards
Pop
June 1st, 2011 at 8:12 am
If you are struggling with xml namespaces, there is a great tutorial on xpath namespaces at xml reports. It walks you through it in very simple steps
xml reports
June 14th, 2011 at 11:33 am
How can I load Atom feeds with SimpleXML? I’ve tried loading an xml document containing an atom feed but the retrieved object is empty, why is that?
June 23rd, 2011 at 4:30 pm
You are the best! It´s all I can say. I have spend several weeks looking for this incredible information in order to parse feeds contents which includes the “funny” dc namespace…This post entry deserves to be in the very first position on Google. It´s by far the best information about the topic. And the best thing: it works! Thanks a lot.
July 5th, 2011 at 3:02 pm
Found this very useful, thanks.
August 1st, 2011 at 1:17 am
[...] van XML. Ik kende ze al, want voor de VKblog importer had ik ze al bekeken. En ik vond op internet een blog van iemand die een keer een RSS feed had ingelezen, en daarvoor een aantal problemen had moeten [...]
January 27th, 2012 at 9:32 am
It really helped, thanks a lot !
March 25th, 2012 at 12:04 am
I am very impressed by the code, but why stop short of error handling? I should have liked to see more.
June 30th, 2012 at 7:08 pm
Cool, I like the multiple feeds, also check out this free php snippet, a php rssparser with cache support http://rssparser.com/php-simplexml-rss-parser-with-caching-support/
October 11th, 2012 at 9:15 am
I was looking for this everything was done but not getting able to resolve the namespaces issue. Thanks a lot for writing this. It worked amazing for me