AOL released a bunch of search data for research purposes. Obviously this upset a lot of people. (You can read about that elsewhere.) I figure it’s a great chance to mess with a large data set and to check out some search data.
I’m assuming that you have a copy of AOL-data.tgz
$ tar xzf AOL-data.tgz
$ cd AOL-user-ct-collection
$ gunzip *gz
$ mysqladmin create aol
$ mysql aol
mysql> create table d (
anon_id int(8),
query varchar(522),
query_time datetime,
item_rank int(3),
click_url varchar(255)
) engine=myisam;
mysql> load data local
infile 'user-ct-test-collection-01.txt' into table d
Repeat the data load for other 9 files.
I received a lot of warnings during the import. show warnings limit 100 showed the header row contained invalid data. Other rows were missing page_rank and click_url values which can be null.
There should be 10 bad header rows. One from each file.
mysql> delete from d where
query='Query' and click_url='ClickUrl';
Query OK, 10 rows affected (1 min 17.05 sec)
I forgot to set MySQL’s sql_mode to STRICT_ALL_TABLES so MySQL annoyingly uses implicit default values when it encounters NULL instead of putting NULL in a nullable column. Because of this a bunch of item_ranks are 0 and click_urls are empty strings.
mysql> update d
set item_rank = null, click_url = null
where item_rank = 0 and click_url = '';
The database works but is really slow and needs indexes and tuning. Indexing the “query” column took 32 minutes. I think there may be better optimizing techniques for LIKE and REGEXP queries. Time to RTFM.
There are 36.3 million rows in the database.
mysql> select count(*) from d;
+----------+
| count(*) |
+----------+
| 36389567 |
+----------+
1 row in set (0.00 sec)
There are 657,427 unique “anonymous” user ids
mysql> select count(distinct(anon_id)) from d;
+--------------------------+
| count(distinct(anon_id)) |
+--------------------------+
| 657427 |
+--------------------------+
1 row in set (2 min 0.60 sec)
Interestingly it looks like almost 6% of the queries were users typing addresses into the search box instead of the browser’s address bar.
mysql> select count(*) / (select count(*) from d) * 100 as percent from d where query like 'www.%.com';
+---------+
| percent |
+---------+
| 5.9706 |
+---------+
1 row in set (4.19 sec)
August 11, 2006 at 3:38 pm |
great advice: got it in easily
August 15, 2006 at 3:52 am |
One, word says it all.
Wow.
August 28, 2006 at 11:09 am |
You can use:
load data local infile ‘user-ct-test-collection-01.txt’ into table d ignore 1 lines (anon_id, query, query_time, item_rank, click_url)
Should supress errors.
Have you found any other useful information from the dataset?
October 11, 2008 at 6:12 pm |
good information
I think database aol larger of 50 million