Finding Messages Explicitly Marked as Spam in Gmail #

tl;dr: Search Gmail for “is:spam -label:^os” to find messages that you manually marked as spam (as opposed to ones that Gmail automatically marked for you).

Gmail recently had a bug where some emails were accidentally moved to the trash or marked as spam. Google “encouraged” users that might have been affected to check their trash and spam folders for any messages that didn't belong. Since I get a lot of spam (one of the perks of having the same email address since 1996), I didn't relish the thought of going through thousands of messages to see if any of them were mislabeled¹.

I figured that Gmail must keep track of which messages were explicitly marked as spam by the user versus one that it automatically classifies (though I get a lot of spam, almost all of it is caught by Gmail's filters). Gmail (like Google Reader) keeps track of per-message state via internal system labels. For example, others have discovered that Gmail's Smart Labels are represented as ^smartlabel_type labels while Superstars uses names like ^ss_sy. Indeed, if you try to use a caret in a label name, Gmail says that it is not allowed.

It therefore seemed like a reasonable assumption that there was a system label that would tell us how a message came to be marked as spam. The problem was to figure out what it was called.

Thinking back to Reader (where all label operations went through an edit-tag HTTP API call, which listed the labels to added or removed), I figured I would see what the request was when marking a message as spam. Unfortunately, it looked like Gmail's requests were of slightly higher abstraction level, where marking a message as spam would send a request with an act=sp parameter (while marking as read uses act=rd, and so on).

I then figured I should look at HTTP response when loading the spam folder. There appeared to be a bunch of system label names associated with each message. One that I explicitly marked as spam had the labels:

"^a", "^ad_1391126400000", "^all", "^bsm"," ^clu_group", "^clu_unim", "^cob-processed-gmr", "^cob_pevent", "^oc_group", "^os_group", "^s", "^smartlabel_group", "^u"

Meanwhile, another that had been automatically marked as spam used:

"^ad_1391126400000", "^all"," ^bsm", "^clu_notification", "^cob-processed-gmr", "^oc_notification", "^os", "^os_notification", "^s", "^smartlabel_notification", "^u”

^s was present on all of them, and indeed doing a search for label:^s shows all spam messages (and the UI rewrites the search to in:spam). Others could also be puzzled out based on name, for example ^u is for unread messages. The more mysterious ones like ^cob_pevent I figured I could ignore².

After looking at a bunch of messages, both automatically and manually marked as spam, ^os stood out. It only seemed to be present on messages that Gmail itself had decided were spam. Doing the search is:spam -label:^os seemed to show only messages that I had marked as spam. Indeed, each of the messages in the result displayed the header: "Why is this message in Spam? You clicked 'Report spam' for this message." Thus I was able to go through the much shorter list and see if any where mistakenly marked (they weren't).

Seeing the plethora of labels that were present on all messages, I got curious what other internal labels there were. Between examining HTTP responses, looking through Gmail's JavaScript for strings that start with ^ and a simple dictionary attack for two-letter names, here's some others that I've found (those that are marked as “unknown” are ones that match some messages in my account, but with no apparent pattern):

  • ^a: archived conversations
  • ^b: chat transcripts (equivalent to is:chat, presumably the “b” is for “Buzz”, Google Talk's codename)
  • ^f: sent messages (equivalent to is:sent)
  • ^g: muted conversations (equivalent to is:muted, the “g” is most likely for “ignore”)
  • ^i: inbox (equivalent to in:inbox)
  • ^k: trashed messages (equivalent to in:trash, unclear why “k” is the abbreviation)
  • ^o: unknown
  • ^p: messages that were marked as phishing attempts
  • ^r: drafts (equivalent to is:draft)
  • ^s: spam (equivalent to is:spam)
  • ^t: starred messages (equivalent to is:starred, the “t” is most likely for “to do”)
  • ^u: unread messages (equivalent to is:unread)
  • ^ac: Google Buzz messages (equivalent to is:buzz)
  • ^act: Google Buzz messages (unclear how it's different from ^ac)
  • ^af: unknown
  • ^bc: unknown subset of chat transcripts
  • ^p_cc: another unknown subset of chat transcripts
  • ^fs: unknown
  • ^ia: unknown
  • ^ii: unknown
  • ^im: unknown
  • ^iim: unknown
  • ^mf: unknown
  • ^np: unknown
  • ^ns: unknown
  • ^bsm: unknown
  • ^op: messages that were automatically marked as phishing attempts
  • ^os: messages that were automatically marked as spam
  • ^vm: Google Voice voicemails (equivalent to is:voicemail)
  • ^pop: unknown, seems to match some (very old messages) that I imported via POP
  • ^ss_sy, ^ss_so, ^ss_sr, ^ss_sp, ^ss_sb, ^ss_sg, ^ss_cr, ^ss_co, ^ss_cy, ^ss_cg, ^ss_cb, ^ss_cp: Superstar stars
  • ^sl_root, ^smartlabel_promo, _receipt, _travel, _event, _group, _newsletter, _notification, _personal, _social, _receipt and _finance: Smart Labels
  • ^io_im: important messages (equivalent to is:important)
  • ^io_imc1 through ^io_imc5, ^io_lr: unknown, possibly more degrees of importance (“Info Overload” was the project that resulted in the importance filtering)
  • ^clu_unim: unknown, possibly unimportant messages
  • ^unsub and ^hunsub: messages where an unsubscribe link has been detected (when marking one as spam, the “In addition to marking this message as spam, you can unsubscribe...” dialog appears). ^unsub seems to be for messages where there's an unsubscribe link you have to click while ^hunsub is for ones where Gmail offers to unsubscribe on your behalf.
  • ^cff: sender is in a Google+ circle (equivalent to has:circle)
  • ^sps: unknown (no matches in my account, but it was referenced in the JavaScript next to ^p, if I had to guess I would say it's something related to spear phishing)
  • ^p_esnotif: Google+ notifications ("es" presumably being "Emerald Sea", Google+'s code name)
  1. Of course, in deciding to automate this task, I doomed myself to spend more time that I would have if I'd just gone through the messages by hand.
  2. It's somewhat interesting to see how features that were developed later (like Smart Labels — ^smartlabel_group) use longer system label names than ones of medium age (like Superstars — ^ss_sy) which are in turn longer than the original system labels (^u for unread, etc.). Bytes 10 years ago were clearly more precious.

14 Comments

"...a simple dictionary attack." wow. intimidating. love it!
also, kudos for the honest, self-aware xkcd reference. it is what we engineers do, for better or worse!
Very good post!
Since you're so skilled on Gmail label topic... you know a method to filter for messages without no label? I searched without success for months...

Ciao.
@Anonymous: Not sure if it's exactly what you mean, but there are has:userlabels and has:nouserlabels search operators that let you find messages with any or without any user labels (see https://support.google.com/mail/answer/7190 for the full list of operators).
Thanks!
Ivano
If I don't remember incorrectly ^o is for outbox. We added it for offline Gmail.
@Erik: That sounds plausible, except I have a lot of messages with the ^o label (25K), most of which I didn't send.
I have several hits for ^sps, most of them are newsletters in French, so I'm not sure what would be the rule here. :)
^pop also finds a lot of messages in my account, including very recent ones. The sent messages with this label are those I sent through desktop outlook. Not sure about the ones I received and have the label...
Isn't ^a just "archived"?
Great post!

I'm making a spreadsheet: https://docs.google.com/spreadsheets/d/1BS8yazyPcfqbWMG2jQb8HvPvCvQsDNQ3tsp_pBh5P6Q/edit#gid=0
I hope you don't mind using your discoveries this way?
I think I've discoverd some more labels' meanings, I'll try to find some new labels too.
Katarzyna: Thanks for pointing out that ^a is archived, that makes total sense (I had assumed that the archived view would be implemented as a "label:^all -label:^i" search, but I can see how doing the search every time would be expensive). Thanks for gathering the spreadsheet, it looks like a great start.
This doesn't seem to be working for me. I've looked at the HTML response for a message in spam and it definitely doesn't have "^os" (but does have "^s"). If I search for "in:spam -label:^os", it doesn't come up. The only message that comes up is one that I explicitly marked as spam. The one I'm interested in was marked as spam by google because "You previously marked messages from nobody@example.com as spam". The problem is that somehow a bunch of messages got manually marked as spam, and then subsequent messages from these senders got marked as spam. Now I've got about a month's worth of messages to go through and I can't find a filter that works. Somehow gmail is treating these messages as having the "^os" label even though it doesn't look like they do. It would be great if there were a label that was specific to this type of spam filtering. It seems like it must exist, how else would gmail know to display the reasoning on each of these emails?

Post a Comment