Welcome to P2PNET.net - The original daily p2p and digital news site. Always First!

AOL data release debacle

p2pnet.net News:- AOL’s public release of well over half-a-million search records comprises one of the Net’s worst privacy violations ever.

Data have been online for about 10 days but the appalling phk-up escaped notice until this weekend.

Details of the search histories, gathered between March and May this year, were revealed in what AOL spokesman Andrew Weinstein describes as an "innocent-enough attempt to reach out to the academic community with new research tools".

But, "This was a screw up, and we’re angry and upset about it," Weinstein admits in a statement.

"Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this," he says. "It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again."

AOL "must have missed the uproar over the DOJ’s demand for ‘anonymized’ search data last year that caused all sorts of pain for Microsoft and Google," observes TechCrunch, going on:

"The data includes [sic] all searches from those users for a three month period this year, as well as whether they clicked on a result, what that result was and where it appeared on the result page. It’s a 439 MB compressed download, expanded to just over 2 gigs."

It was "only a matter of time before someone put up a simple web interface to the 20 million search queries published by AOL last week," says Michael Arrington in a TechCrunch update.

He’s talking about Danny, who on item 135 posted, "Here’s something you guys might like. I whipped this up to help those of you who don’t feel like grepping your way through 2 gigs of files. it’s a searchable mySQL database of these searches (most of them, anyway, I’m not done indexing yet) with all redundancies removed, searchable by categories. Hopefully this should make for a few hours of fun."

"Although a legal expert told the news agency that the incident did not violate AOL’s privacy policy as the data did not include personally identifiable information, bloggers have pointed out that users often search for their own names," says e-consultancy, adding, "At least one mirror site, which is still live at the time of writing, was set up before the data’s removal, according to TechCrunch. ‘Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with ‘buy ecstasy’ and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen,’ said TechCrunch’s Michael Arrington. ‘The possibilities are endless’."

http://www.gregsadetsky.com/aol-data/ has a mirror, and a link to AOL’s original U500k_README.txt file, which we’ve reproduced in full below.

500k User Session Collection

----------------------------

This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.

Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.

The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.

AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.

Each line in the data represents one of two types of events:

1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.

In the first case (query only) there is data in only the first three columns/fields — namely AnonID, Query, and QueryTime (see above).

In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.

CAVEAT EMPTOR — SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

Basic Collection Statistics

Dates:

01 March, 2006 - 31 May, 2006

Normalized queries:

36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID’s

Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search", The First International Conference on Scalable Information Systems, Hong Kong, June 2006.
Copyright (2006) AOL
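
For anyone reading the files directly rather than through a web interface, the five-field record layout the README describes can be sketched in Python as below. This is a minimal sketch, not AOL's own tooling: the README does not name a field delimiter, so the tab separator is an assumption, and the `LogEvent` and `parse_line` names are hypothetical. The sample rows in the usage lines are invented for illustration and are not taken from the data.

```python
from typing import NamedTuple, Optional


class LogEvent(NamedTuple):
    """One line of the collection: AnonID, Query, QueryTime, ItemRank, ClickURL."""
    anon_id: str
    query: str
    query_time: str            # kept as text; the README does not give the timestamp format
    item_rank: Optional[int]   # None for a query-only event
    click_url: Optional[str]   # None for a query-only event

    @property
    def is_click_through(self) -> bool:
        # Per the README, only click-through lines have data in all five columns.
        return self.item_rank is not None


def parse_line(line: str, sep: str = "\t") -> LogEvent:
    """Parse one record. The tab separator is an assumption, not stated in the README."""
    fields = line.rstrip("\r\n").split(sep)
    fields += [""] * (5 - len(fields))      # pad query-only lines out to five columns
    anon_id, query, query_time, rank, url = fields[:5]
    return LogEvent(
        anon_id=anon_id,
        query=query,
        query_time=query_time,
        item_rank=int(rank) if rank.strip() else None,
        click_url=url.strip() or None,
    )


# Invented sample rows, one of each event type described in the README.
query_only = parse_line("1\tsample query\t2006-03-01 12:00:00")
click = parse_line("1\tsample query\t2006-03-01 12:00:05\t2\thttp://example.com")
print(query_only.is_click_through, click.is_click_through)
```

Because the data is sorted by anonymous user ID, grouping consecutive events by `anon_id` would rebuild each user's full search history, which is the privacy problem the bloggers quoted above are describing.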

Also See:
TechCrunch - AOL Proudly Releases Massive Amounts of Private Data, August 6, 2006
update - AOL Data: First Web Interface Up, August 8, 2006
e-consultancy - AOL admits ‘screw up’ over user privacy, August 8, 2006

