Getting ALL your data out of Google Reader

Update on July 3: The reader_archive and feed_archive scripts are no longer operational, since Reader (and its API) has been shut down. Thanks to everyone that tried the script and gave feedback. For more discussion, see also Hacker News.

There remain only a few days until Google Reader shuts down. Besides the emotions[1] and the practicalities of finding a replacement[2], I've also been pondering the data loss aspects. As a bit of a digital pack rat, the idea of not being able to get at a large chunk of the information that I've consumed over the past seven and a half years seems very scary. Technically most of it is public data, and just a web search away. However, the items that I've read, tagged, starred, etc. represent a curated subset of that, and I don't see an easy way of recovering those bits.

Reader has Takeout support, but it's incomplete. I've therefore built the reader_archive tool that dumps everything related to your account in Reader via the "API". This means every read item[3], every tagged item, every comment, every like, every bundle, etc. There's also a companion site at readerisdead.com that explains how to use the tool, provides pointers to the archive format and collects related tools[4].

Additionally, Reader is for better or worse the site of record for public feed content on the internet. Beyond my 545 subscriptions, there are millions of feeds whose histories are best preserved in Reader. Thankfully, ArchiveTeam has stepped up. I've also provided a feed_archive tool that lets you dump Reader's full history for feeds for your own use.[5]

I don't fault Google for providing only partial data via Takeout. Exporting all 612,599 read items in my account (and a few hundred thousand more from subscriptions, recommendations, etc.) results in almost 4 GB of data. Even if I'm in the 99th percentile for Reader users (I've got the badge to prove it), providing hundreds of megabytes of data per user would not be feasible. I'm actually happy that Takeout support happened at all, since my understanding is that it was all done during 20% time. It's certainly better than other outcomes.

Of course, I've had 3 months to work on this tool, but per Parkinson's law, it's been a bit of a scramble over the past few days to get it all together. I'm now reasonably confident that the tool is getting everything it can. The biggest missing piece is a way to browse the extracted data. I've started on reader_browser, which exposes a web UI for an archive directory. I'm also hoping to write some more selective exporters (e.g. from tagged items to Evernote for Ann's tagged recipes). Help is appreciated.

  1. I am of course saddened to see something that I spent 5 years working on get shut down. And yet, I'm excited to see renewed interest and activity in a field that had been thought fallow. Hopefully not having a disinterested incumbent will be for the best.
  2. Still a toss-up between NewsBlur and Digg Reader.
  3. Up to a limit of 300,000, imposed by Reader's backend.
  4. If these command-line tools are too unfriendly, CloudPull is a nice-looking app that backs up subscriptions, tags and starred items.
  5. Google's Feed API will continue to exist, and it's served by the same backend that served Google Reader. However it does not expose items beyond recent ones in the feed.

64 Comments

Can this tool be used on windows?
Awesome, makes me feel better that all the data is out now :)
I see a ton of 'Requested item id tag:google.com,2005:reader/item/xxxxxxx but it was not found in the result'

Like....thousands of them. Bug or feature?
@Austin: Most of those were harmless, fixed with https://github.com/mihaip/readerisdead/commit/19d3159c985b6e1eb3f06360df2921e5e0dc7a2e.
You deserve a medal :)
Wow! Thanks for your work on this! :)
Try Fireplug!

https://getfireplug.com/
Thanks for making this!

I'm trying to run this on my Debian server, and I was pleasantly surprised to see the Links browser open up for the authenticating step. Unfortunately, after I sign in to my Google account and authorize the app, I can't get to the point where I get the code, due to lack of JavaScript support.

Any way around this, or will I just have to run it locally on a box with X?
@Anonymous: You can use the --use_client_login option to instead provide your username and password via the shell.

Alternatively, the tool will tell you what URL to navigate to, you can open that in a graphical browser and then paste the code back into the terminal.
Thank you. To get it running, first at webfaction and then on my Macbook 10.6.8, with python2.7 freshly installed, I had to add
import getpass
to base/url_fetcher.py
to avoid the "getpass not defined" error. Then it ran like a charm.
Thanks for the reply. I didn't think --use_client_login would work for me because of two-factor auth on my Google account, and hadn't seen the URL due to it being pushed too high up on my terminal by the time I exited Links. Got the URL method to work.
Thank you for creating this tool. Is image data included in an "item"?
Sorry to be that guy, but how do you run this on windows?
Shakil, I used the comments on this Hacker News post to get the incantations right.

https://news.ycombinator.com/item?id=5958188

You'll have to install Python 2.7 to get things going (I think I have 2.7.3 installed). Then you run it from the command line as posted at the link.
Thanks! 1.34 million items for me. Toward the end I get an "Access token has expired" error, but it seems to continue after a while.

I also ran this twice (OCD) and got a different result each time. Timestamps seem to be the main culprit but my first try has two extra files. Any ideas?

Amazing work either way though. Especially the feed_archive.
Thanks a lot for this script! I now have a directory containing 8.77 GB of data. I think I began using Reader at the beginning of 2006.

Is there a way to look at what are the first things I read on google reader?
I must recommend that you try http://www.silverreader.com
It is 5x faster than Digg reader, Feedly or Newsblur

Try it yourself.
What always puzzled me in ALL these readers, incl. Google Reader, Netvibes, etc., is the obscure way they handle old unread posts. Google Reader secretly marked posts older than 30 days as read.

Is there any information on how NewsBlur or Digg Reader handle unread items which are, say, two years old?

Thanks!
On windows I get the error "Unable to find base.api". What is the solution?
Hello! Thanks for building this.

Unfortunately, I'm having some trouble -- I think. How can I tell if it's been successful? At the very end I get an IOError: [Errno 22] invalid mode ('w') or filename which then leads to the place where the mihaip folder is located. When I browse to that folder, I only have files which are a few kb in size.

Any idea on a fix? I installed the 2.x path of Python on my Windows 7 computer.
Finally, I can backup the Reader's data.
Thanks for making this.
I suppose it's a bit late, but I'm trying this in the terminal and keep getting the following error: No such file or directory. Any ideas?
@DavidP: Are you in the right directory in the terminal? You should "cd" to the readerisdead directory that you got when you downloaded the archive.
Thanks for the suggestion, Mihai! I cd to Downloads and ran it again and got this error: env: 'python2.7: No such file or directory' Any further suggestions? As far as I can tell I only have Python 2.6. Do I need to upgrade Python?
Awesome...Thanks so much for providing the archive script. I'm looking forward to further progress with the browser!
Mihai - I upgraded to Python 3 anyway but it didn't seem to help; still getting the same error after cd when using the zip version. By the time I get a reply it might be too late, but it would be great to find out how I could have fixed the problem for a similar situation in the future. Email me at dp1974 at g m a i l dot c o m if you have the time to spare.
running the tool on 32bit winxp with python 2.7.5 i get a MemoryError

[E 130701 01:17:45 worker:43] Exception when running worker
Traceback (most recent call last):
  File "C:\temp\python_reader\base\worker.py", line 41, in run
    response = self._worker.work(request)
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 377, in work
    result.extend(item_refs)
MemoryError
[E 130701 01:17:45 reader_archive:112] Could not load item refs from user/10846466241544354272/state/com.google/reading-list
Traceback (most recent call last):
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 475, in <module>
    main()
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 139, in main
    item_ids.update([item_ref.item_id for item_ref in item_refs])
MemoryError
@DavidP: Python 2.6 is too old, and 3 is too new. It needs to be 2.7.

@Anonymous: The program tends to use lots of memory (it may be a bug, see https://github.com/mihaip/readerisdead/issues/6), but I haven't had time to investigate. It completes for me with a 64-bit version of Python.
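As a rough illustration of the version requirement above, a script can check the interpreter before doing any work. This is only a sketch: the helper name is_supported_python is hypothetical and is not part of the readerisdead tools.

```python
import sys


def is_supported_python(version_info=None):
    """Return True only for Python 2.7.x (2.6 is too old, 3.x is too new)."""
    if version_info is None:
        # Default to the interpreter that is running this code.
        version_info = sys.version_info
    # Compare only the major and minor parts, so any 2.7.x release passes.
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    if not is_supported_python():
        print("This interpreter is %d.%d; Python 2.7 is needed."
              % tuple(sys.version_info[:2]))
```

A check like this turns confusing failures (a missing python2.7 binary, or a SyntaxError under Python 3) into a single readable message.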
I keep getting the error message

'bin' is not recognized as an internal or external command, operable program or batch file.

I downloaded Python but am still having trouble. Sorry--I am on vacation and don't have access to my Mac and am having trouble getting this to work on a PC. Thanks!
@duner: I had a typo in the instructions, it should be bin\reader_archive.bat (with a backslash). Does that work for you?
Do I need to do the same for the output directory?
@duner: Yes, Windows uses \ as the path delimiter (unlike Mac OS X and Linux, where it's /).
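The delimiter difference can be seen from Python itself, which ships one path module per convention. The snippet below is a small sketch; the file names are just the ones mentioned in this thread, and os.path picks the right flavor automatically on each platform.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # Mac OS X and Linux path rules


def windows_style(*parts):
    """Join path components the way Windows expects (backslashes)."""
    return ntpath.join(*parts)


def unix_style(*parts):
    """Join path components the way Mac OS X and Linux expect (slashes)."""
    return posixpath.join(*parts)


if __name__ == "__main__":
    print(windows_style("bin", "reader_archive.bat"))
    print(unix_style("bin", "item_lookup"))
```

Code that builds paths with os.path.join instead of hard-coding a delimiter works unchanged on both kinds of system.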
Never mind--I got it to work. Thank you so so much for your help and for making this awesome script.
I get an error because I started some of my tag names with an * and then your script tries to save a json file that includes the tag name but that's not a valid character for a filename.
Thanks Mihai, this is great!
Fixed my problem ... Changed this line ...

stream_file_name = base.paths.stream_id_to_file_name(stream_id) + '.json'

to this ...

stream_file_name = filter(str.isalnum, base.paths.stream_id_to_file_name(stream_id).encode('ascii','ignore')) + '.json'
@Ryan: Thanks for noticing that. I've incorporated something along the lines of your fix in https://github.com/mihaip/readerisdead/commit/41aa1838a962e2f550693858590fb285baea15d7. The use of the query_params to url_to_file_name is so that file names are unique. Otherwise with your code, if you have tag names with only non-ASCII characters, they will all end up with the same filename, and thus clobber each other.
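To make the clobbering concern concrete, here is a minimal sketch of one way to get file names that are both safe and unique. It is not the approach taken in the commit above (which relies on query_params); it swaps in a different technique, appending a short hash of the original name. The function name safe_file_name is hypothetical.

```python
import hashlib
import re


def safe_file_name(name, extension=".json"):
    """Build a file name that is filesystem-safe and unique per input name."""
    # Replace anything outside a conservative whitelist (this covers '*',
    # path delimiters and non-ASCII characters) with an underscore.
    readable = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    # The readable part alone can collide (two all-non-ASCII names of the
    # same length both become underscores), so add a hash of the original.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    return "%s-%s%s" % (readable, digest, extension)


if __name__ == "__main__":
    print(safe_file_name("*recipes"))
```

The trade-off is that the names are a little less tidy, but two different tags can no longer map to the same file.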
Thanks Mihai - I'll install 2.7.3 and see if I still get the error.

Anyone else still able to access Google Reader? I can, but I would have thought it would have ended by now.

Also my Takeout is telling me that I'll be able to download the files until 7 July.
Reader Archive sorted! 3.3 GB downloaded! Will definitely donate the zip to ArchiveTeam.

However, tried to run feed_archive and got this:

Traceback (most recent call last):
  File "bin/../feed_archive/feed_archive.py", line 257, in <module>
    main()
  File "bin/../feed_archive/feed_archive.py", line 66, in main
    base.paths.normalize(args.opml_file))
  File "bin/../feed_archive/feed_archive.py", line 218, in extract_feed_urls_from_opml_file
    tree = ET.parse(opml_file_path)
  File "<string>", line 62, in parse
  File "<string>", line 26, in parse
IOError: [Errno 2] No such file or directory: '/Users/dpecotic/Downloads/feeds.opml'

Any ideas why I'm getting this error anyone? Do I even need the feed archive if I've got the rest?
Thanks Mihai for creating these scripts. I still haven't managed to get the browser bit going but I may dive into the code and look around if I find the time.
Maybe I still have a little more time to run this archive. I'm getting this error:

File "reader_archive\reader_archive.py", line 389
except urllib2.HTTPError, e:
^
SyntaxError: invalid syntax

Any ideas? :P
FYI to the group - this does not work with Python 3.3.2 on Windows. Otherwise, it's awesome! :)
Yep - that was my problem with the urllib2 error. I was using Python 3.x. I'm now using 2.7 and it's working perfectly. Or rather, at least so far. 6GB so far, and still going strong. :)
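For anyone curious why Python 3 rejects that line: the comma form of except is Python 2 syntax, while the "as" form is accepted by Python 2.6+ and Python 3 alike. The sketch below compiles both spellings without running them; ValueError stands in for urllib2.HTTPError so the example has no dependencies, and the helper name compiles is hypothetical.

```python
import sys

# Python 2 spelling, as used on the line quoted in the error above.
PYTHON2_ONLY_FORM = "try:\n    pass\nexcept ValueError, e:\n    pass\n"
# Spelling accepted by both Python 2.6+ and Python 3.
PORTABLE_FORM = "try:\n    pass\nexcept ValueError as e:\n    pass\n"

# Recorded so the result can be read alongside the interpreter version.
RUNNING_VERSION = tuple(sys.version_info[:2])


def compiles(source):
    """Return True if this interpreter can parse the given source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    # Python 2.7 accepts both forms; Python 3 releases such as 3.3 reject
    # the comma form (later releases may treat it differently).
    print(RUNNING_VERSION, compiles(PYTHON2_ONLY_FORM), compiles(PORTABLE_FORM))
```

Since the tools are written against Python 2.7, running them with 2.7 (as described above) is the fix, rather than editing the source.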
I had Python 3.3, switched to 2.7, but still can't make it work on Windows 7.

No matter what I type, it gives me "SyntaxError: unexpected character after line continuation character"

Thank you all in advance for any help!
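One possible explanation, which is only a guess and is not confirmed anywhere in this thread, is that the command is being typed at Python's own >>> prompt instead of the Windows command prompt. Python treats a backslash as a line-continuation marker, so a shell-style path produces exactly this message. The sketch below reproduces that; the helper name syntax_error_message is hypothetical.

```python
def syntax_error_message(source):
    """Return the SyntaxError message for source, or None if it parses."""
    try:
        compile(source, "<typed at the Python prompt>", "exec")
    except SyntaxError as error:
        return error.msg
    return None


# A shell command from this thread, handed to Python as if it were code.
SHELL_COMMAND = r"bin\reader_archive.bat"

if __name__ == "__main__":
    print(syntax_error_message(SHELL_COMMAND))
```

If that is what is happening, leaving the Python prompt and running the command from the Windows command prompt should avoid the error.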
I'm encountering a vexing problem -- the script runs okay, but seems to just... stop, before it grabs all 120 content streams. And it seems random. First it stopped with 9 streams left, then with 5, then with 35. There's no error message or anything -- it just stops updating.

More troublingly, the folder size of the most complete version is only ~100-200 MB, whereas others I've heard from with similarly sized Reader histories clock in closer to a gigabyte.

Is there something wrong?
For reference, tried running it again just before posting my last comment. It went along at a good clip for about five minutes, loading a new feed with hundreds or thousands of items every couple of seconds, sometimes taking a minute or so to process large streams with ~250,000 items. But now, with only THREE STREAMS LEFT, it's been sitting there for 15 minutes with no indication of any activity. And I've got a solid web connection and plenty of memory left.

(I left it running all night twice, FWIW, so I don't think it's one particular stream taking longer.)

Does anyone know if the archive browser can load partial archives? I'd hate for all this downloaded data to be unusable because a glitch stopped it at 99% completion.
Ran it again. ONE. STREAM. LEFT. WHARGARBL.

It seems to save the big ones for last; most recent were 216K, 177K, 299K, and finally 647K items. Looking at task manager, Python is frozen at just over 500,200K memory usage. But that plus all other processes is less than half my available memory.

And now my lunch break is over, so Reader will likely be dead by the time I get home. Damn it.
Managed to make it work. Thanks for the app!

The link that got it done for me was https://news.ycombinator.com/item?id=5958188, which I got from "Anonymous"' June 28 comment above. So thank you too, whoever you are!

DavidP: You don't strictly need feed_archive if you're using reader_archive. feed_archive is meant to archive public feed content (I created it before I was aware of ArchiveTeam's efforts). reader_archive does all that, as well as private, user-specific data in your Reader account. As for the specific error that you were getting, it's because it can't find an OPML file at the path that you passed in. Are you sure that's where you downloaded it?
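For reference, an OPML file is just XML in which each subscription is an outline element carrying an xmlUrl attribute. The following is a minimal sketch of pulling the feed URLs out of one; it is not the code in feed_archive, and the function name and sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of an OPML subscription list.
SAMPLE_OPML = """<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head><title>Example subscriptions</title></head>
  <body>
    <outline text="Folder">
      <outline text="Example feed" type="rss"
               xmlUrl="http://example.com/feed.xml"/>
    </outline>
    <outline text="Another feed" type="rss"
             xmlUrl="http://example.org/atom.xml"/>
  </body>
</opml>
"""


def extract_feed_urls(opml_source):
    """Return feed URLs from an OPML file path or file-like object."""
    # ET.parse raises IOError (OSError on Python 3) for a missing path,
    # which matches the kind of error shown in the traceback above.
    tree = ET.parse(opml_source)
    urls = []
    # iter() walks nested outlines too, so feeds inside folders are found.
    for outline in tree.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    print(extract_feed_urls(io.BytesIO(SAMPLE_OPML.encode("utf-8"))))
```

Passing a path that does not exist fails inside ET.parse, so checking the path first (for example with os.path.exists) gives a clearer message.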
@Rhaomi: I've added more progress reporting with https://github.com/mihaip/readerisdead/commit/e727550369932747745b0c0518c7fa61550c88d9. The script doesn't leave the big ones for last (it tries to start them first actually), but since they take the longest, they usually end up being the long pole anyway.
Thanks, Mihai -- I tried running it again plugged directly into my router to avoid any WiFi issues and it seemed to do better -- it grabbed all the streams and then started fetching items at a decent pace. But then it locked up again, at 416,077 out of 1,549,057 items, with memory usage frozen at 392,100K.

I'll have to try the updated script -- hopefully it will run smoother (and I have time to complete it before they pull the plug!)
Ah bugger ... she's gone.

So, looks like it's a moot point now: I can't generate the OPML file anymore. I do have the subscriptions.xml file from Google Takeout, though. Could I use that? And if so, do I need to edit or rename it in any way?
Also, just tried the lookup:
bin/item_lookup --archive_directory=~/Downloads/reader_archive 0306277b9d275db1

Got this error: -bash: bin/item_lookup: No such file or directory
Turning subscriptions.xml into feeds.opml seemed to work but I have no idea where it saved the files to!? Do I literally have to specify where it goes in the directory, e.g., Downloads/feedarchive (after I create the folder)?
@DavidP: feed_archive also uses Reader's API, it won't work. item_lookup was added to the suite a couple of days ago, you may need to redownload the .zip to get it.
It looks like the main archive API, https://www.google.com/reader/api/0/, is now offline. Seems this tool is dead :(
Thanks a lot,

i need more help...

Google said "You can download a copy of your Google Reader data via Google Takeout until 12PM PST July 15, 2013."

Does this mean the reading history will still be kept until July 15?
Is there any way to download the data that had not finished downloading?
识 意: Unfortunately this tool cannot be used. The only thing you can do until July 15 is to use Google Takeout (https://www.google.com/takeout/).
I just wanted to say thanks again for making this tool available, Mihai. I finally got it working the evening of the 1st and was downloading at a good pace with very few errors, only about 0.2%. But at midnight (Pacific time), the errors became much more frequent, and the script eventually froze -- whether from Google pulling the plug or some problem on my end, I don't know.

I ended up saving 86% of all my items, which is a lot better than nothing. It was weird seeing the Reader archives basically disintegrate before my eyes as the script ran, bittersweet. Thanks for helping me and others save at least some of that data.
Hello!
First, thanks a lot for these tools!
Unfortunately I read your instructions too quickly and downloaded "feed_archive" instead of "reader_archive".
It looks like I can no longer download my feeds with reader_archive, since the script can't fetch JSON objects even when given a valid token.

I would really, really like some feeds which don't exist anymore on the Internet (and which were downloaded by feed_archive) to show up in my reader, as I often like to refer to some old articles.

Do you have any suggestion or workaround?
Moreover, is there any possibility I can load my feed_archive into an open source RSS reader like Tiny Tiny RSS or Owl Reader, so that I can read them?

Yann
Yeah, just wanted to say thanks for your post and these tools.
Hi, Mihai. First of all - thanks for the incredible work. I'm getting this error. Any idea what I'm doing wrong?

C:\mihaip-readerisdead-25ba2c5>c:\python27\python reader_archive\reader_archive.py --output_directory
Traceback (most recent call last):
  File "reader_archive\reader_archive.py", line 533, in <module>
    main()
  File "reader_archive\reader_archive.py", line 96, in main
    user_info = api.fetch_user_info()
  File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\api.py", line 36, in fetch_user_info
    user_info_json = self._fetch_json('user-info')
  File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\api.py", line 287, in _fetch_json
    return json.loads(response_text)
  File "c:\python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "c:\python27\lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\python27\lib\json\decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
@Matko: The tool is no longer operational, since Google Reader has been shut down.
