Getting ALL your data out of Google Reader

Update on July 3: The reader_archive and feed_archive scripts are no longer operational, since Reader (and its API) has been shut down. Thanks to everyone that tried the script and gave feedback. For more discussion, see also Hacker News.

There remain only a few days until Google Reader shuts down. Besides the emotions1 and the practicalities of finding a replacement2, I've also been pondering the data loss aspects. As a bit of a digital pack rat, the idea of not being able to get at a large chunk of the information that I've consumed over the past seven and a half years seems very scary. Technically most of it is public data, and just a web search away. However, the items that I've read, tagged, starred, etc. represent a curated subset of that, and I don't see an easy way of recovering those bits.

Reader has Takeout support, but it's incomplete. I've therefore built the reader_archive tool that dumps everything related to your account in Reader via the "API". This means every read item3, every tagged item, every comment, every like, every bundle, etc. There's also a companion site at that explains how to use the tool, provides pointers to the archive format and collects related tools4.

Additionally, Reader is for better or worse the site of record for public feed content on the internet. Beyond my 545 subscriptions, there are millions of feeds whose histories are best preserved in Reader. Thankfully, ArchiveTeam has stepped up. I've also provided a feed_archive tool that lets you dump Reader's full history for feeds for your own use.5

I don't fault Google for providing only partial data via Takeout. Exporting all 612,599 read items in my account (and a few hundred thousand more from subscriptions, recommendations, etc.) results in almost 4 GB of data. Even if I'm in the 99th percentile for Reader users (I've got the badge to prove it), providing hundreds of megabytes of data per user would not be feasible. I'm actually happy that Takeout support happened at all, since my understanding is that it was all done during 20% time. It's certainly better than other outcomes.

Of course, I've had 3 months to work on this tool, but per Parkinson's law, it's been a bit of a scramble over the past few days to get it all together. I'm now reasonably confident that the tool is getting everything it can. The biggest missing piece is a way to browse the extracted data. I've started on reader_browser, which exposes a web UI for an archive directory. I'm also hoping to write some more selective exporters (e.g. from tagged items to Evernote for Ann's tagged recipes). Help is appreciated.

  1. I am of course saddened to see something that I spent 5 years working on get shut down. And yet, I'm excited to see renewed interest and activity in a field that had been thought fallow. Hopefully not having a disinterested incumbent will be for the best.
  2. Still a toss-up between NewsBlur and Digg Reader.
  3. Up to a limit of 300,000, imposed by Reader's backend.
  4. If these command-line tools are too unfriendly, CloudPull is a nice-looking app that backs up subscriptions, tags and starred items.
  5. Google's Feed API will continue to exist, and it's served by the same backend that served Google Reader. However it does not expose items beyond recent ones in the feed.


Can this tool be used on Windows?
I see a ton of 'Requested item id,2005:reader/item/xxxxxxx but it was not found in the result'

Like....thousands of them. Bug or feature?
@Austin: Most of those were harmless, fixed with
You deserve a medal :)
Wow! Thanks for your work on this! :)
Thanks, Mihai!
Try Fireplug!
Thanks for making this!

I'm trying to run this on my Debian server, and I was pleasantly surprised to see the Links browser open up for the authenticating step. Unfortunately, after I sign in to my Google account and authorize the app, I can't get to the point where I get the code, due to lack of JavaScript support.

Any way around this, or will I just have to run it locally on a box with X?
@Anonymous: You can use the --use_client_login option to instead provide your username and password via the shell.

Alternatively, the tool will tell you what URL to navigate to, you can open that in a graphical browser and then paste the code back into the terminal.
Thank you. To get it running, first at webfaction and then on my Macbook 10.6.8, with python2.7 freshly installed, I had to add
import getpass
to base/
to avoid the "getpass not defined" error. Then it ran like a charm.
Thanks for the reply. I didn't think --use_client_login would work for me because of two-factor auth on my Google account, and hadn't seen the URL due to it being pushed too high up on my terminal by the time I exited Links. Got the URL method to work.
Thank you for creating this tool. Is image data included in an "item"?
Sorry to be that guy, but how do you run this on Windows?
Shakil, I used the comments on this Hacker News post to get the incantations right.

You'll have to install Python 2.7 to get things going (I think I have 2.7.3 installed). Then you run it from the command line as posted at the link.
Thanks! 1.34 million items for me. Toward the end I get an "Access token has expired" error, but it seems to continue after a while.

I also ran this twice (OCD) and got a different result each time. Timestamps seem to be the main culprit but my first try has two extra files. Any ideas?

Amazing work either way though. Especially the feed_archive.
Thanks a lot for this script! I now have a directory containing 8.77 GB of data. I think I began using Reader in the beginning of 2006.

Is there a way to look at the first things I read on Google Reader?
I must recommend that you try
It is 5x faster than Digg Reader, Feedly or NewsBlur

Try it yourself.
What always puzzled me in ALL these readers, including Google Reader, Netvibes, etc., is the obscure way they handle old unread posts. Google Reader secretly marked posts older than 30 days as read.

Is there any information on NewsBlur or Digg Reader how they handle unread items which are say two years old?

On Windows I get the error "Unable to find base.api". What is the solution?
Hello! Thanks for building this.

Unfortunately, I'm having some trouble -- I think. How can I tell if it's been successful? At the very end I get an IOError: [Errno 22] invalid mode ('w') or filename which then leads to the place where the mihaip folder is located. When I browse to that folder, I only have files which are a few kb in size.

Any idea on a fix? I installed the 2.x path of Python on my Windows 7 computer.
Finally, I can back up my Reader data.
Thanks for making this.
I suppose it's a bit late, but I'm trying this in the terminal and keep getting the following error: No such file or directory. Any ideas?
@DavidP: Are you in the right directory in the terminal? You should "cd" to the readerisdead directory that you got when you downloaded the archive.
Thanks for the suggestion, Mihai! I cd'd to Downloads and ran it again and got this error: env: 'python2.7: No such file or directory' Any further suggestions? As far as I can tell I only have Python 2.6. Do I need to upgrade Python?
Awesome...Thanks so much for providing the archive script. I'm looking forward to further progress with the browser!
Mihai - I upgraded to Python 3 anyway but it didn't seem to help; still getting the same error after cd when using the zip version. By the time I get a reply it might be too late, but it would be great to find out how I could have fixed the problem for a similar situation in the future. Email me at dp1974 at g m a i l dot c o m if you have the time to spare.
Running the tool on 32-bit WinXP with Python 2.7.5, I get a MemoryError

[E 130701 01:17:45 worker:43] Exception when running worker
Traceback (most recent call last):
File "C:\temp\python_reader\base\", line 41, in run
response =
File "C:\temp\python_reader\reader_archive\", line 377, in work
[E 130701 01:17:45 reader_archive:112] Could not load item refs from user/10846466241544354272/state/
Traceback (most recent call last):
File "C:\temp\python_reader\reader_archive\", line 475, in
File "C:\temp\python_reader\reader_archive\", line 139, in main
item_ids.update([item_ref.item_id for item_ref in item_refs])
@DavidP: Python 2.6 is too old, and 3 is too new. It needs to be 2.7.

@Anonymous: The program tends to use lots of memory (it may be a bug, see, but I haven't had time to investigate. It completes for me with a 64-bit version of Python.
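To make the 2.7-only requirement concrete, a script can fail fast with a version gate along these lines (illustrative sketch; the actual reader_archive scripts simply assume 2.7, and the helper name here is hypothetical):

```python
import sys

def is_supported_python(version_info):
    # The scripts need 2.7 exactly: 2.6 is too old for what they use, and
    # their Python 2-only syntax will not even parse under 3.x.
    return tuple(version_info[:2]) == (2, 7)
```

For example, calling `is_supported_python(sys.version_info)` at startup lets you print a clear message instead of a confusing SyntaxError or ImportError later.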
I keep getting the error message

'bin' is not recognized as an internal or external command, operable program or batch file.

I downloaded Python but am still having trouble. Sorry--I am on vacation and don't have access to my Mac and am having trouble getting this to work on a PC. Thanks!
@duner: I had a typo in the instructions, it should be bin\reader_archive.bat (with a backslash). Does that work for you?
Do I need to do the same for the output directory?
@duner: Yes, Windows uses \ as the path delimiter (unlike Mac OS X and Linux, where it's /)
Never mind--I got it to work. Thank you so so much for your help and for making this awesome script.
I get an error because I started some of my tag names with an * and then your script tries to save a json file that includes the tag name but that's not a valid character for a filename.
Thank Mihai, this is great!
Fixed my problem ... Changed this line ...

stream_file_name = base.paths.stream_id_to_file_name(stream_id) + '.json'

to this ...

stream_file_name = filter(str.isalnum, base.paths.stream_id_to_file_name(stream_id).encode('ascii','ignore')) + '.json'
@Ryan: Thanks for noticing that. I've incorporated something along the lines of your fix in. The use of the query_params to url_to_file_name is so that file names are unique. Otherwise with your code, if you have tag names with only non-ASCII characters, they will all end up with the same filename, and thus clobber each other.
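The clobbering concern is easy to demonstrate: stripping unsafe characters alone can map two different stream IDs to the same name. A sketch of a sanitizer that stays filesystem-safe while keeping names distinct (hypothetical helper, not the actual base.paths implementation):

```python
import hashlib
import string

_SAFE_CHARS = set(string.ascii_letters + string.digits + "-_")

def stream_id_to_file_name(stream_id):
    # Keep only filesystem-safe characters, then append a short digest of
    # the full stream ID. Two tags that sanitize to the same text (e.g.
    # names made entirely of non-ASCII characters, or of "*") still get
    # distinct file names.
    safe = "".join(c for c in stream_id if c in _SAFE_CHARS)
    digest = hashlib.sha1(stream_id.encode("utf-8")).hexdigest()[:8]
    return "%s-%s.json" % (safe, digest)
```

The digest suffix is the key design choice: it makes the mapping deterministic and collision-resistant without needing the tag name itself to survive sanitization.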
Thanks Mihai - I'll install 2.7.3 and see if I still get the error.

Anyone else still able to access Google Reader? I can, but I would have thought it would have ended by now.

Also my Takeout is telling me that I'll be able to download the files until 7 July.
Reader Archive sorted! 3.3 GB downloaded! Will definitely donate zip to ArchiveTeam.

However, tried to run archive-feed and got this:

Traceback (most recent call last):
File "bin/../feed_archive/", line 257, in
File "bin/../feed_archive/", line 66, in main
File "bin/../feed_archive/", line 218, in extract_feed_urls_from_opml_file
tree = ET.parse(opml_file_path)
File "", line 62, in parse
File "", line 26, in parse
IOError: [Errno 2] No such file or directory: '/Users/dpecotic/Downloads/feeds.opml'

Any ideas why I'm getting this error anyone? Do I even need the feed archive if I've got the rest?
Thanks Mihai for creating these scripts. I still haven't managed to get the browser bit going but I may dive into the code and look around if I find the time.
Maybe I still have a little more time to run this archive. I'm getting this error:

File "reader_archive\", line 389
except urllib2.HTTPError, e:
SyntaxError: invalid syntax

Any ideas? :P
FYI to the group - this does not work with Python 3.3.2 on Windows. Otherwise, it's awesome! :)
Yep - that was my problem with the urllib2 error. I was using Python 3.x. I'm now using 2.7 and it's working perfectly. Or rather, at least so far. 6GB so far, and still going strong. :)
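For anyone wondering why the "except urllib2.HTTPError, e:" line above is fatal on 3.x: Python 2 accepted that comma form, but Python 3 removed it, so a 2.x-only script dies with a SyntaxError before it even starts running. A minimal illustration (hypothetical helper, written in the "as" syntax that both 2.6+ and 3.x accept):

```python
def to_int(text):
    # Python 2 also allowed "except ValueError, e:"; that comma form is a
    # SyntaxError under Python 3, which is exactly the error quoted above.
    try:
        return int(text)
    except ValueError as e:
        print("could not parse %r: %s" % (text, e))
        return None
```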
I had Python 3.3, switched to 2.7, but still can't make it work on Windows 7.

No matter what I type, it gives me "SyntaxError: unexpected character after line continuation character"

Thank you all in advance for any help!
I'm encountering a vexing problem -- the script runs okay, but seems to just... stop, before it grabs all 120 content streams. And it seems random. First it stopped with 9 streams left, then with 5, then with 35. There's no error message or anything -- it just stops updating.

More troublingly, the folder size of the most complete version is only ~100-200 MB, whereas others I've heard from with similarly sized Reader histories clock in closer to a gigabyte.

Is there something wrong?
For reference, tried running it again just before posting my last comment. It went along at a good clip for about five minutes, loading a new feed with hundreds or thousands of items every couple of seconds, sometimes taking a minute or so to process large streams with ~250,000 items. But now, with only THREE STREAMS LEFT, it's been sitting there for 15 minutes with no indication of any activity. And I've got a solid web connection and plenty of memory left.

(I left it running all night twice, FWIW, so I don't think it's one particular stream taking longer.)

Does anyone know if the archive browser can load partial archives? I'd hate for all this downloaded data to be unusable because a glitch stopped it at 99% completion.

It seems to save the big ones for last; most recent were 216K, 177K, 299K, and finally 647K items. Looking at task manager, Python is frozen at just over 500,200K memory usage. But that plus all other processes is less than half my available memory.

And now my lunch break is over, so Reader will likely be dead by the time I get home. Damn it.
Managed to make it work. Thanks for the app!

The link that got it done for me was, which I got from "Anonymous"'s June 28 comment above. So thank you too, whoever you are!

@DavidP: You don't strictly need feed_archive if you're using reader_archive. feed_archive is meant to archive public feed content (I created it before I was aware of ArchiveTeam's efforts). reader_archive does all that, as well as private, user-specific data in your Reader account. As for the specific error that you were getting, it's because it can't find an OPML file at the path that you passed in. Are you sure that's where you downloaded it?
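For reference, the OPML step that failed in the traceback above boils down to ElementTree parsing; a self-contained sketch of pulling feed URLs out of an OPML document (hypothetical helper, not the actual feed_archive code):

```python
import xml.etree.ElementTree as ET

def extract_feed_urls(opml_text):
    # Feeds in an OPML file (including Takeout's subscriptions.xml) are
    # <outline> elements carrying the feed URL in an xmlUrl attribute;
    # folder outlines have no xmlUrl and are skipped.
    root = ET.fromstring(opml_text)
    return [node.attrib["xmlUrl"] for node in root.iter("outline")
            if "xmlUrl" in node.attrib]
```

The IOError in the traceback comes one step earlier, from opening the file path itself, which is why checking that the path exists (or passing an absolute path) resolves it.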
@Rhaomi: I've added more progress reporting with The script doesn't leave the big ones for last (it tries to start them first actually), but since they take the longest, they usually end up being the long pole anyway.
Thanks, Mihai -- I tried running it again plugged directly into my router to avoid any WiFi issues and it seemed to do better -- it grabbed all the streams and then started fetching items at a decent pace. But then it locked up again, at 416,077 out of 1,549,057 items, with memory usage frozen at 392,100K.

I'll have to try the updated script -- hopefully it will run smoother (and I have time to complete it before they pull the plug!)
Ah bugger ... she's gone.

So, looks like a moot point now: can't generate the OPML file now. I do have the subscriptions.xml file from Google Takeout. Could I use that? And if so, do I need to edit or rename it in any way?
Also, just tried the lookup:
bin/item_lookup --archive_directory=~/Downloads/reader_archive 0306277b9d275db1

Got this error: -bash: bin/item_lookup: No such file or directory
Turning subscription.xml into feeds.opml seemed to work, but I have no idea where it saved the files to! Do I literally have to specify where it goes in the directory, e.g., Downloads/feedarchive (after I create the folder)?
@DavidP: feed_archive also uses Reader's API, it won't work. item_lookup was added to the suite a couple of days ago, you may need to redownload the .zip to get it.
It looks like the main archive API,, is now offline. Seems this tool is dead :(
Thanks a lot,

i need more help...

Google said " You can download a copy of your Google Reader data via Google Takeout until 12PM PST July 15, 2013."

Does this mean the history data will still be kept until July 15?
Is there any way to download the data that hadn't finished downloading?
识 意: Unfortunately this tool cannot be used. The only thing you can do until July 15 is to use Google Takeout (
I just wanted to say thanks again for making this tool available, Mihai. I finally got it working the evening of the 1st and was downloading at a good pace with very few errors, only about 0.2%. But at midnight (Pacific time), the errors became much more frequent, and the script eventually froze -- whether from Google pulling the plug or some problem on my end, I don't know.

I ended up saving 86% of all my items, which is a lot better than nothing. It was weird seeing the Reader archives basically disintegrate before my eyes as the script ran, bittersweet. Thanks for helping me and others save at least some of that data.
Hello !
First thanks a lot for these tools !
Unfortunately I read your instructions too quickly and downloaded "feed_archive" instead of "reader_archive".
Looks like I can't download my feeds with reader_archive anymore, as the script can't pull JSON objects even when given a valid token.

I really, really would like some feeds which don't exist anymore on the Internet (and which were downloaded by feed_archive) to show up in my reader, as I often like to refer to old articles.

Do you have any suggestion or workaround?
Moreover, is there any possibility I can load my feed_archive into an open source RSS Reader like Tiny Tiny RSS Reader or Owl Reader, so that I can read them ?

Yeah, just wanted to say thanks for your post and these tools.
Hi, Mihai. First of all - thanks for the incredible work. I'm getting this error. Any ideas what I'm doing wrong?

C:\mihaip-readerisdead-25ba2c5>c:\python27\python reader_archive\ --output_directory
Traceback (most recent call last):
File "reader_archive\", line 533, in
File "reader_archive\", line 96, in main
user_info = api.fetch_user_info()
File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\", line 36, in
user_info_json = self._fetch_json('user-info')
File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\", line 287, in
return json.loads(response_text)
File "c:\python27\lib\json\", line 338, in loads
return _default_decoder.decode(s)
File "c:\python27\lib\json\", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "c:\python27\lib\json\", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
@Matko: The tool is no longer operational, since Google Reader has been shut down.
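For context, the ValueError in the traceback above is what json.loads raises when the server answers with something other than JSON -- such as an HTML error page, now that the API is gone. A defensive decoding sketch (hypothetical helper, not the actual base code):

```python
import json

def decode_api_response(response_text):
    # json.loads raises ValueError on non-JSON input (in Python 3, the
    # ValueError subclass JSONDecodeError); re-raise with a peek at what
    # the server actually returned, so the failure is self-explanatory.
    try:
        return json.loads(response_text)
    except ValueError:
        raise ValueError("expected JSON, got: %r" % response_text[:80])
```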
