Loading AOL Search Data into MySQL

AOL released a bunch of search data for research purposes. Obviously this upset a lot of people. (You can read about that elsewhere.) I figure it’s a great chance to mess with a large data set and to check out some search data.

I’m assuming that you have a copy of AOL-data.tgz
$ tar xzf AOL-data.tgz
$ cd AOL-user-ct-collection
$ gunzip *gz
$ mysqladmin create aol
$ mysql aol

mysql> create table d (
  anon_id int(8),
  query varchar(522),
  query_time datetime,
  item_rank int(3),
  click_url varchar(255)
) engine=myisam;

mysql> load data local
  infile 'user-ct-test-collection-01.txt' into table d

Repeat the data load for other 9 files.

I received a lot of warnings during the import. show warnings limit 100 showed the header row contained invalid data. Other rows were missing page_rank and click_url values which can be null.

There should be 10 bad header rows. One from each file.

mysql> delete from d where
  query='Query' and click_url='ClickUrl';
Query OK, 10 rows affected (1 min 17.05 sec)

I forgot to set MySQL’s sql_mode to STRICT_ALL_TABLES so MySQL annoyingly uses implicit default values when it encounters NULL instead of putting NULL in a nullable column. Because of this a bunch of item_ranks are 0 and click_urls are empty strings.

mysql> update d
  set item_rank = null, click_url = null
  where item_rank = 0 and click_url = '';

The database works but is really slow and needs indexes and tuning. Indexing the “query” column took 32 minutes. I think there may be better optimizing techniques for LIKE and REGEXP queries. Time to RTFM.

There are 36.3 million rows in the database.

mysql> select count(*) from d;
+----------+
| count(*) |
+----------+
| 36389567 |
+----------+
1 row in set (0.00 sec)

There are 657,427 unique “anonymous” user ids

mysql> select count(distinct(anon_id)) from d;
+--------------------------+
| count(distinct(anon_id)) |
+--------------------------+
| 657427 |
+--------------------------+
1 row in set (2 min 0.60 sec)

Interestingly it looks like almost 6% of the queries were users typing addresses into the search box instead of the browser’s address bar.

mysql> select count(*) / (select count(*) from d) * 100 as percent from d where query like 'www.%.com';
+---------+
| percent |
+---------+
| 5.9706 |
+---------+
1 row in set (4.19 sec)

Advertisements
Posted in aol, mysql, search
4 comments on “Loading AOL Search Data into MySQL
  1. Sascha says:

    great advice: got it in easily

  2. Ray says:

    One, word says it all.

    Wow.

  3. Richard says:

    mysql>load data local
    infile ‘user-ct-test-collection-01.txt’ into table d

    You can use:

    load data local infile ‘user-ct-test-collection-01.txt’ into table d ignore 1 lines (anon_id, query, query_time, item_rank, click_url)

    Should supress errors.

    Have you found any other useful information from the dataset?

  4. برامج says:

    good information

    I think database aol larger of 50 million

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: