Getting ALL your data out of Google Reader
Update on July 3: The reader_archive and feed_archive scripts are no longer operational, since Reader (and its API) has been shut down. Thanks to everyone who tried the script and gave feedback. For more discussion, see also Hacker News.
There remain only a few days until Google Reader shuts down. Besides the emotions[1] and the practicalities of finding a replacement[2], I've also been pondering the data loss aspects. As a bit of a digital pack rat, the idea of not being able to get at a large chunk of the information that I've consumed over the past seven and a half years seems very scary. Technically most of it is public data, and just a web search away. However, the items that I've read, tagged, starred, etc. represent a curated subset of that, and I don't see an easy way of recovering those bits.
Reader has Takeout support, but it's incomplete. I've therefore built the reader_archive tool that dumps everything related to your account in Reader via the "API". This means every read item[3], every tagged item, every comment, every like, every bundle, etc. There's also a companion site at readerisdead.com that explains how to use the tool, provides pointers to the archive format and collects related tools[4].
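For the curious, the bulk of what reader_archive does boils down to paginated fetches against Reader's unofficial JSON API, following continuation tokens until each stream is exhausted (item bodies are then fetched separately in chunks). Here is a rough sketch of that loop for a single stream, written from memory of the unofficial API documentation rather than lifted from the tool; the endpoint, parameter names and GoogleLogin header are assumptions meant to show the shape of the approach, not the tool's actual code.

    # Sketch only: paginate the item IDs of one Reader stream. The endpoint
    # and parameter names follow the unofficial API docs; the real tool adds
    # auth plumbing, retries, worker pools and chunked content fetches.
    import json
    import urllib
    import urllib2

    API_BASE = 'https://www.google.com/reader/api/0'

    def fetch_item_ids(stream_id, auth_token, count=1000):
        # Yields every item ID in the stream, following continuation tokens.
        continuation = None
        while True:
            params = {'s': stream_id, 'n': count, 'output': 'json'}
            if continuation:
                params['c'] = continuation
            url = '%s/stream/items/ids?%s' % (API_BASE, urllib.urlencode(params))
            request = urllib2.Request(
                url, headers={'Authorization': 'GoogleLogin auth=%s' % auth_token})
            data = json.loads(urllib2.urlopen(request).read())
            for item_ref in data.get('itemRefs', []):
                yield item_ref['id']
            continuation = data.get('continuation')
            if not continuation:
                break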
Additionally, Reader is for better or worse the site of record for public feed content on the internet. Beyond my 545 subscriptions, there are millions of feeds whose histories are best preserved in Reader. Thankfully, ArchiveTeam has stepped up. I've also provided a feed_archive tool that lets you dump Reader's full history for feeds, for your own use.[5]
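The same continuation-token pattern works for public feeds, which is roughly what feed_archive automates. Again, this is a hedged sketch with assumed endpoint and field names rather than the tool's actual code, and auth handling is omitted.

    # Sketch only: page through everything Reader has stored for a public feed.
    import json
    import urllib
    import urllib2

    API_BASE = 'https://www.google.com/reader/api/0'

    def fetch_feed_history(feed_url, count=250):
        # Yields every stored item for the feed, newest first.
        stream_id = urllib.quote('feed/' + feed_url, safe='')
        continuation = None
        while True:
            params = {'n': count, 'output': 'json'}
            if continuation:
                params['c'] = continuation
            url = '%s/stream/contents/%s?%s' % (
                API_BASE, stream_id, urllib.urlencode(params))
            data = json.loads(urllib2.urlopen(url).read())
            for item in data.get('items', []):
                yield item
            continuation = data.get('continuation')
            if not continuation:
                break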
I don't fault Google for providing only partial data via Takeout. Exporting all 612,599 read items in my account (and a few hundred thousand more from subscriptions, recommendations, etc.) results in almost 4 GB of data. Even if I'm in the 99th percentile for Reader users (I've got the badge to prove it), providing hundreds of megabytes of data per user would not be feasible. I'm actually happy that Takeout support happened at all, since my understanding is that it was all during 20% time. It's certainly better than other outcomes.
Of course, I've had 3 months to work on this tool, but per Parkinson's law, it's been a bit of a scramble over the past few days to get it all together. I'm now reasonably confident that the tool is getting everything it can. The biggest missing piece is a way to browse the extracted data. I've started on reader_browser, which exposes a web UI for an archive directory. I'm also hoping to write some more selective exporters (e.g. from tagged items to Evernote for Ann's tagged recipes). Help is appreciated.
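To give a flavor of what such an exporter might look like, here is a sketch that walks an archive's per-stream JSON files looking for a tag stream and prints each item's title and link. The directory layout and field names used here (streams/, stream_id, items, alternate) are assumptions made purely for illustration; the real archive format is documented on readerisdead.com.

    # Hypothetical exporter sketch; the streams/ layout and JSON field names
    # are illustrative assumptions, not the actual archive format.
    import json
    import os

    def find_tagged_items(archive_directory, tag_name):
        streams_directory = os.path.join(archive_directory, 'streams')
        for file_name in os.listdir(streams_directory):
            with open(os.path.join(streams_directory, file_name)) as f:
                stream = json.load(f)
            # Tag streams have IDs of the form user/<user id>/label/<tag name>.
            if not stream.get('stream_id', '').endswith('/label/' + tag_name):
                continue
            for item in stream.get('items', []):
                link = (item.get('alternate') or [{}])[0].get('href', '')
                yield item.get('title', ''), link

    if __name__ == '__main__':
        for title, link in find_tagged_items(
                os.path.expanduser('~/reader_archive'), 'recipes'):
            print '%s\t%s' % (title, link)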
1. I am of course saddened to see something that I spent 5 years working on get shut down. And yet, I'm excited to see renewed interest and activity in a field that had been thought fallow. Hopefully not having a disinterested incumbent will be for the best.
2. Still a toss-up between NewsBlur and Digg Reader.
3. Up to a limit of 300,000, imposed by Reader's backend.
4. If these command-line tools are too unfriendly, CloudPull is a nice-looking app that backs up subscriptions, tags and starred items.
5. Google's Feed API will continue to exist, and it's served by the same backend that served Google Reader. However, it does not expose items beyond recent ones in the feed; a rough sketch of querying it follows below.
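To make that last footnote concrete, querying the Feed API looks roughly like this. It's a sketch against the v1 JSON "load" endpoint from memory, and the point is the limitation itself: num only reaches back over the most recent entries, not the years of history that Reader holds.

    # Sketch: fetch recent entries for a feed via the Google Feed API (v1).
    # Only recent entries come back, unlike Reader's full stored history.
    import json
    import urllib
    import urllib2

    def fetch_recent_entries(feed_url, num=10):
        query = urllib.urlencode({'v': '1.0', 'q': feed_url, 'num': num})
        url = 'https://ajax.googleapis.com/ajax/services/feed/load?' + query
        response = json.loads(urllib2.urlopen(url).read())
        return response['responseData']['feed']['entries']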
63 Comments
Like....thousands of them. Bug or feature?
https://getfireplug.com/
I'm trying to run this on my Debian server, and I was pleasantly surprised to see the Links browser open up for the authenticating step. Unfortunately, after I sign in to my Google account and authorize the app, I can't get to the point where I get the code, due to lack of JavaScript support.
Any way around this, or will I just have to run it locally on a box with X?
Alternatively, the tool will tell you which URL to navigate to; you can open that in a graphical browser and then paste the code back into the terminal.
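In other words, the fallback boils down to a prompt-and-paste pattern along these lines (an illustrative sketch, not the tool's actual code):

    def prompt_for_auth_code(auth_url):
        # Show the URL so it can be opened in any browser, even on another
        # machine, then read the authorization code back from the terminal.
        print 'Open this URL in a browser, sign in, and authorize the app:'
        print '  ' + auth_url
        return raw_input('Paste the code shown by Google here: ').strip()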
I had to add "import getpass" to base/url_fetcher.py to avoid the "getpass not defined" error. Then it ran like a charm.
https://news.ycombinator.com/item?id=5958188
You'll have to install Python 2.7 to get things going (I think I have 2.7.3 installed). Then you run it from the command line as posted at the link.
I also ran this twice (OCD) and got a different result each time. Timestamps seem to be the main culprit but my first try has two extra files. Any ideas?
Amazing work either way though. Especially the feed_archive.
Is there a way to look at the first things I read on Google Reader?
It is 5x faster than Digg Reader, Feedly or NewsBlur.
Try it yourself.
Is there any information on NewsBlur or Digg Reader how they handle unread items which are say two years old?
Thanks!
Unfortunately, I'm having some trouble -- I think. How can I tell if it's been successful? At the very end I get an IOError: [Errno 22] invalid mode ('w') or filename, which then leads to the place where the mihaip folder is located. When I browse to that folder, I only have files which are a few KB in size.
Any idea on a fix? I installed the 2.x branch of Python on my Windows 7 computer.
Thanks for making this.
[E 130701 01:17:45 worker:43] Exception when running worker
Traceback (most recent call last):
  File "C:\temp\python_reader\base\worker.py", line 41, in run
    response = self._worker.work(request)
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 377, in work
    result.extend(item_refs)
MemoryError
[E 130701 01:17:45 reader_archive:112] Could not load item refs from user/10846466241544354272/state/com.google/reading-list
Traceback (most recent call last):
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 475, in <module>
    main()
  File "C:\temp\python_reader\reader_archive\reader_archive.py", line 139, in main
    item_ids.update([item_ref.item_id for item_ref in item_refs])
MemoryError
@Anonymous: The program tends to use lots of memory (it may be a bug, see https://github.com/mihaip/readerisdead/issues/6), but I haven't had time to investigate. It completes for me with a 64-bit version of Python.
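If you're not sure which build you're running, this quick check (run with the same python.exe you use for the script) will tell you; a 32-bit process tops out at roughly 2 GB of address space, which makes these MemoryErrors much more likely.

    # Reports whether the running interpreter is a 32-bit or 64-bit build.
    import struct
    import sys

    print '%d-bit Python %s' % (struct.calcsize('P') * 8, sys.version.split()[0])
    print 'Is 64-bit:', sys.maxsize > 2**32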
'bin' is not recognized as an internal or external command, operable program or batch file.
I downloaded Python but am still having trouble. Sorry -- I am on vacation and don't have access to my Mac, and am having trouble getting this to work on a PC. Thanks!
I changed this line:
stream_file_name = base.paths.stream_id_to_file_name(stream_id) + '.json'
to this:
stream_file_name = filter(str.isalnum, base.paths.stream_id_to_file_name(stream_id).encode('ascii','ignore')) + '.json'
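Stripping everything non-alphanumeric can make two different stream IDs collapse to the same file name, though. A gentler (untested) variant that only replaces the characters Windows actually rejects would be something like:

    import re

    def windows_safe_file_name(name):
        # \ / : * ? " < > | are the characters Windows forbids in file names.
        return re.sub(r'[\\/:*?"<>|]', '_', name)

    # The line in question would then become something like:
    #   stream_file_name = windows_safe_file_name(
    #       base.paths.stream_id_to_file_name(stream_id)) + '.json'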
Anyone else still able to access Google Reader? I can, but I would have thought it would have ended by now.
Also my Takeout is telling me that I'll be able to download the files until 7 July.
However, I tried to run feed_archive and got this:
Traceback (most recent call last):
  File "bin/../feed_archive/feed_archive.py", line 257, in <module>
    main()
  File "bin/../feed_archive/feed_archive.py", line 66, in main
    base.paths.normalize(args.opml_file))
  File "", line 62, in parse
  File "", line 26, in parse
IOError: [Errno 2] No such file or directory: '/Users/dpecotic/Downloads/feeds.opml'
Any ideas why I'm getting this error? Do I even need the feed archive if I've got the rest?
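For what it's worth, that IOError just means Python found no OPML file at the path you passed in. Takeout's subscriptions.xml is itself an OPML file, so pointing feed_archive at it (with a correct path) should be equivalent; all the feed URLs live in xmlUrl attributes, along the lines of this sketch (illustrative, not feed_archive's actual code):

    # Sketch: pull feed URLs out of an OPML file, such as Takeout's
    # subscriptions.xml. Each subscription is an outline element whose
    # xmlUrl attribute holds the feed URL.
    import xml.etree.ElementTree as ET

    def extract_feed_urls(opml_file_path):
        tree = ET.parse(opml_file_path)
        return [outline.attrib['xmlUrl']
                for outline in tree.iter('outline')
                if 'xmlUrl' in outline.attrib]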
File "reader_archive\reader_archive.py", line 389
except urllib2.HTTPError, e:
^
SyntaxError: invalid syntax
Any ideas? :P
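That SyntaxError is the tell-tale sign of running the scripts under Python 3: "except urllib2.HTTPError, e:" is Python 2-only syntax (Python 3 requires "except urllib2.HTTPError as e:"). The scripts target Python 2.7, so the fix is to invoke them with a 2.x interpreter rather than to edit them. A guard like this, run by the same interpreter, makes the requirement explicit:

    import sys

    # The readerisdead scripts use Python 2-only constructs, so bail out
    # early with a clear message when a Python 3 interpreter is used.
    if sys.version_info[0] != 2:
        sys.exit('Please run this with Python 2.7, not %s' % sys.version.split()[0])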
No matter what I type, it gives me "SyntaxError: unexpected character after line continuation character"
Thank you all in advance for any help!
More troublingly, the folder size of the most complete version is only ~100-200 MB, whereas others I've heard from with similarly sized Reader histories clock in closer to a gigabyte.
Is there something wrong?
(I left it running all night twice, FWIW, so I don't think it's one particular stream taking longer.)
Does anyone know if the archive browser can load partial archives? I'd hate for all this downloaded data to be unusable because a glitch stopped it at 99% completion.
It seems to save the big ones for last; most recent were 216K, 177K, 299K, and finally 647K items. Looking at task manager, Python is frozen at just over 500,200K memory usage. But that plus all other processes is less than half my available memory.
And now my lunch break is over, so Reader will likely be dead by the time I get home. Damn it.
The link that got it done for me was https://news.ycombinator.com/item?id=5958188, which I got from "Anonymous"' June 28 comment above. So thank you too, whoever you are!
I'll have to try the updated script -- hopefully it will run smoother (and I have time to complete it before they pull the plug!)
So, it looks like a moot point now: I can't generate the OPML file anymore. I do have the subscriptions.xml file from Google Takeout, though. Could I use that? And if so, do I need to edit or rename it in any way?
bin/item_lookup --archive_directory=~/Downloads/reader_archive 0306277b9d275db1
Got this error: -bash: bin/item_lookup: No such file or directory
I need more help...
Google said " You can download a copy of your Google Reader data via Google Takeout until 12PM PST July 15, 2013."
Is this mean the history date still keep well until july 15?
Is there any way to download the data that had not finished download?
I ended up saving 86% of all my items, which is a lot better than nothing. It was weird seeing the Reader archives basically disintegrate before my eyes as the script ran, bittersweet. Thanks for helping me and others save at least some of that data.
First, thanks a lot for these tools!
Unfortunately I read your instructions too quickly and ran "feed_archive" instead of "reader_archive".
It looks like I can no longer download my feeds with reader_archive, since the script can't pull the JSON objects even when given a valid token.
I would really like some feeds that no longer exist on the internet (but which were downloaded by feed_archive) to show up in my reader, as I often refer back to old articles.
Do you have any suggestion or workaround?
Also, is there any way I can load my feed_archive data into an open-source RSS reader like Tiny Tiny RSS or Owl Reader, so that I can read them?
Yann
C:\mihaip-readerisdead-25ba2c5>c:\python27\python reader_archive\reader_archive.py --output_directory
Traceback (most recent call last):
  File "reader_archive\reader_archive.py", line 533, in <module>
    main()
  File "reader_archive\reader_archive.py", line 96, in main
    user_info = api.fetch_user_info()
  File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\api.py", line 36, in fetch_user_info
    user_info_json = self._fetch_json('user-info')
  File "C:\mihaip-readerisdead-25ba2c5\reader_archive\base\api.py", line 287, in _fetch_json
    return json.loads(response_text)
  File "c:\python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "c:\python27\lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\python27\lib\json\decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded