Exploring the Twitter Stream
Lately, I've gotten interested in the Twitter stream. Last weekend, I hooked up a desktop to download the stream for about ten hours on Friday. I did two main things. First, I took a look at the data and what it contains. Then, I applied TF-IDF to my personal tweets to see whether I could use my own tweet stream to find new, related tweets.

DATA SIZE
747,349 tweets in ~10.5 hours for a total raw size of 925M.

GEOLOCATION DATA
1,723 geotagged tweets out of 747,349 total tweets (0.23%).

URLS
151,858 tweets with a URL out of 747,349 total tweets (20.3%).

By the way, an easy way to unshorten URLs is to call Python's urllib2.urlopen(url).geturl(). Works like a charm, though I'm not sure if it fetches the entire page. I have a sneaking suspicion that it does.
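A minimal sketch of the same idea in Python 3, where urllib2 became urllib.request. To address the worry about downloading the whole page, it sends a HEAD request, which asks only for headers. The function name unshorten and the opener parameter are assumptions added for illustration, not something from the original experiment. Depending on the Python version, redirected hops may be reissued as GET, and some shorteners reject HEAD outright, so treat this as a starting point.

```python
import urllib.request


def unshorten(url, opener=urllib.request.urlopen, timeout=10):
    """Return the URL that `url` finally resolves to after redirects.

    `opener` is a hypothetical hook (not part of the original post). It
    defaults to urllib.request.urlopen and exists to make testing easy.
    """
    # HEAD asks the server for headers only, so no page body is requested
    # on the first hop. Some servers reject HEAD; a plain GET is the
    # alternative in that case.
    request = urllib.request.Request(url, method="HEAD")
    with opener(request, timeout=timeout) as response:
        # geturl() reports the address of the resource actually retrieved,
        # i.e. the URL after any redirects were followed.
        return response.geturl()
```

Calling unshorten on a shortened link should return the long form, provided the shortener answers HEAD requests.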

POPULAR 2-WORD PAIRS
Phrase        Count
on my         2981
to go         3037
need to       3074
live on       3086
will be       3111
i can         3112
have to       3165
are you       3179
but i         3263
is the        3311
paramoreinpoland paramoreinpoland  3315
in a          3344
want to       3534
i know        3546
and i         3570
is a          3914
at the        4153
to get        4223
i was         4246
for a         4328
have a        4340
i think       4400
if you        4559
i dont        4897
i am          4905
going to      5024
i just        5041
i have        5249
to be         5404
i love        6108
to the        6407
of the        7754
for the       7945
on the        8325
in the        10890

POPULAR 3-WORD PHRASES
Phrase                Count
i wish i              534
haiti to 90999        548
i feel like           549
i need a              552
to be a               596
new blog post         618
i favorited a         637
favorited a youtube   637
going to be           640
meu resultado foi     647
e meu resultado       647
acabo de completar    649
i just took           651
vote too ➔            673
video chat with       693
other people at       693
cant wait to          707
check it out          717
joined a video        720
a video chat          720
i think i             724
just joined a         729
to go to              741
one of the            764
a lot of              786
for the ff            821
i dont know           836
i have to             851
im going to           875
i have a              912
pants on the          966
i need to             1018
on the ground         1059
i want to             1066
aka aka aka           1147
a youtube video       1180
i love you            1216
thanks for the        1635
paramoreinpoland paramoreinpoland paramoreinpoland  2745

POPULAR 4-WORD PHRASES
Phrase                     Count
5 out of 5                 202
a youtube video 5          202
what do you think          216
to donate 10 to            219
90999 to donate 10         225
the people of haiti        225
to the red cross           229
i rated a youtube          234
rated a youtube video      234
thank you for the          236
acabo de completar qual    237
out of 5 stars             240
i just took what           253
cant wait to see           258
a twibbon to your          266
to your avatar now         266
twibbon to your avatar     266
add a twibbon to           267
nowplaying nowplaying nowplaying nowplaying  273
to 90999 to donate         273
on my way to               274
check this video out       284
haiti to 90999 to          287
i uploaded a youtube       302
uploaded a youtube video   303
text haiti to 90999        366
a shorty award in          423
for a shorty award         431
thanks for the ff          533
favorited a youtube video  637
i favorited a youtube      637
e meu resultado foi        647
a video chat with          691
just joined a video        720
joined a video chat        720
pants on the ground        873
aka aka aka aka            1049
paramoreinpoland paramoreinpoland paramoreinpoland paramoreinpoland  2182
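
Phrase counts like the ones above can be produced with a sliding window over each tweet's tokens. The sketch below is one way to do it, not necessarily how these numbers were generated. The tokenizer is an assumption: it lowercases, drops apostrophes (which would explain forms like "cant" and "dont"), and keeps runs of letters and digits.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase, drop apostrophes, keep word characters."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"\w+", text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tweets, n):
    """Count n-word phrases across an iterable of tweet texts."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts
```

Calling count_ngrams(tweets, 2).most_common(35) would give a list comparable to the 2-word table. Phrases never span two tweets, because the window is applied to each tweet separately.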

WORDS PER TWEET
Number of Words  Count
0                33504
1                30815
2                39865
3                35435
4                36727
5                38331
6                38839
7                38550
8                37510
9                35866
10               34127
11               32280
12               31443
13               28307
14               26469
15               25015
16               23747
17               23304
18               22619
19               21676
20               20552
21               19002
22               17018
23               14555
24               12241
25               10194
26               7494
27               5094
28               3213
29               1761
30               951
31               455
32               205
33               97
34               32
35               19
36               18
37               8
38               8
39               1
51               1
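
A histogram like this takes only a few lines. The sketch below assumes a tokenizer that recognizes only ASCII letters and digits. That choice is a guess, not something stated in the post, and it matters: under it, a tweet with no such characters lands in the 0 bucket.

```python
import re
from collections import Counter


def words_per_tweet(tweets):
    """Map word count -> number of tweets that have that many words."""
    histogram = Counter()
    for tweet in tweets:
        # Assumed tokenizer: runs of ASCII letters and digits only.
        words = re.findall(r"[a-z0-9]+", tweet.lower())
        histogram[len(words)] += 1
    return histogram
```

Printing sorted(words_per_tweet(tweets).items()) gives the rows in the same order as the table.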

POPULAR HASH TAGS
Hash                 Count
#wwfm                240
#helphaiti           249
#jerseyshore         251
#bbb10               266
#twibbon             268
#deleteyouraccount   270
#quote               274
#tweetmyjobs         282
#bbb                 295
#epicpetwars         300
#news                315
#masen               315
#1                   332
#twitterpelis        376
#shoutout            377
#follow              381
#venezuela           403
#cbb7                428
#fail                461
#sega                473
#endondeestas        527
#patdinlatinamerica  546
#ifyoucheatonme      563
#tcot                673
#omgfacts            712
#iwouldhatetobeyou   744
#supportdannymcfly   859
#supportdannyjones   862
#fb                  1052
#jobs                1141
#followfriday        1801
#aka                 1882
#waystoannoypeople   2072
#haiti               2148
#nowplaying          3479
#paramoreinpoland    3915
#ff                  11354
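
Hashtag counts can be sketched with a regular expression and a Counter. The pattern and the case folding below are assumptions. In particular, this version counts a tag once per occurrence, so a tweet that repeats #ff adds to the total more than once.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def count_hashtags(tweets):
    """Count hashtag occurrences across tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts
```

count_hashtags(tweets).most_common(37) would then list the top tags, most frequent first.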

TF-IDF
The first step in finding related tweets was simply to determine which topics I'm interested in. I initially applied TF-IDF to my own tweet stream to pick out its most important keywords. This turned out to work fairly well. I got keywords such as Hadoop, Pig, AsterData, etc. I also got a few names of people that I had retweeted. In addition, I got some stop words (interesting, bit, etc.). I removed some of the stop words, filtered out the names, and the resulting list was fairly decent.
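
To make the keyword step concrete, here is a minimal TF-IDF sketch. It is built on assumptions the post does not spell out: the personal tweets are treated as one document, a sample of the downloaded stream serves as the background corpus, and the IDF is smoothed so that unseen terms do not divide by zero. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(my_tokens, background_docs, top_k=10):
    """Rank the terms in my_tokens by TF-IDF against background_docs.

    my_tokens: list of tokens from the personal tweets (one document).
    background_docs: list of token lists, one per background tweet.
    Returns up to top_k (term, score) pairs, highest score first.
    """
    term_freq = Counter(my_tokens)
    n_docs = len(background_docs)

    # Document frequency: in how many background tweets each term appears.
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(doc))

    scores = {}
    for term, count in term_freq.items():
        tf = count / len(my_tokens)
        # Smoothed IDF (an assumption): rare terms score high, common low.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

A word that is frequent in the personal tweets can still rank well even when it is common in the background, which is consistent with a few stop words slipping into the keyword list.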

I then iterated over every tweet in the Twitter stream and computed the cosine similarity between the tweet and my feed. This failed miserably. The top tweets for me were one-word tweets such as "PIg___", "data", etc. Unsurprisingly, cosine similarity just doesn't hold up well for small documents such as tweets: a one-word tweet that hits a single keyword isn't diluted by any other terms, so it outranks longer tweets.
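
The short-tweet effect is easy to reproduce. The sketch below computes cosine similarity over sparse term-weight dictionaries. The keyword weights and the two example tweets are made up for illustration and are not taken from the collected data.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical keyword profile, standing in for the TF-IDF output.
profile = {"hadoop": 3.0, "pig": 2.0, "data": 2.0}

# Two tweets that each match the profile on exactly one word, "data".
short_tweet = Counter("data".split())
long_tweet = Counter("loading some data into the new warehouse tonight".split())

short_score = cosine_similarity(short_tweet, profile)
long_score = cosine_similarity(long_tweet, profile)
```

Both tweets share the same single keyword with the profile, yet short_score comes out higher than long_score, because every extra non-matching word increases the tweet's vector length and shrinks the ratio.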

This is as far as I've gotten, since I had to head back into work, but I have a few more tricks to try. Ideas are always welcome.