Exploring the Twitter Stream
Lately, I've gotten interested in the Twitter stream. Last weekend, I hooked up a desktop to download the stream for about ten hours on Friday. I did two main things with it. First, I took a look at the data and what it contains. Then I applied TF-IDF to my personal tweets to see whether I could use my own tweet stream to find new tweets of interest.
DATA SIZE
747,349 tweets in ~10.5 hours, for a total raw size of 925 MB.
GEOLOCATION DATA
1,723 geo tagged tweets out of 747,349 total tweets (0.23%).
URLS
151,858 tweets with a URL out of 747,349 total tweets (20.3%).
By the way, an easy way to unshorten URLs is Python's urllib2.urlopen(url).geturl() call. It works like a charm, though I'm not sure whether it fetches the entire page. I have a sneaking suspicion that it does.
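If fetching the page body is a concern, one option is to send a HEAD request, which asks the server for headers only. The sketch below uses Python 3's urllib.request (the successor to urllib2). The function name `unshorten` and the `timeout` parameter are my own choices for illustration, and it assumes the link shortener answers HEAD requests, which not every server does.

```python
import urllib.request


def unshorten(url, timeout=10):
    """Return the final URL after following redirects.

    Sends a HEAD request so the server is asked for headers only.
    Assumption: the server supports HEAD; some do not.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # geturl() reports the URL of the response actually retrieved,
        # i.e. the address reached after any redirects were followed.
        return response.geturl()
```

Calling `unshorten` on a short link returns the destination address as a string. A network error or an HTTP error status raises an exception, so a real crawl over 150,000 links would want a try/except around each call.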
POPULAR 2-WORD PHRASES
| Phrase | Count |
| on my | 2981 |
| to go | 3037 |
| need to | 3074 |
| live on | 3086 |
| will be | 3111 |
| i can | 3112 |
| have to | 3165 |
| are you | 3179 |
| but i | 3263 |
| is the | 3311 |
| paramoreinpoland paramoreinpoland | 3315 |
| in a | 3344 |
| want to | 3534 |
| i know | 3546 |
| and i | 3570 |
| is a | 3914 |
| at the | 4153 |
| to get | 4223 |
| i was | 4246 |
| for a | 4328 |
| have a | 4340 |
| i think | 4400 |
| if you | 4559 |
| i dont | 4897 |
| i am | 4905 |
| going to | 5024 |
| i just | 5041 |
| i have | 5249 |
| to be | 5404 |
| i love | 6108 |
| to the | 6407 |
| of the | 7754 |
| for the | 7945 |
| on the | 8325 |
| in the | 10890 |
POPULAR 3-WORD PHRASES
| Phrase | Count |
| i wish i | 534 |
| haiti to 90999 | 548 |
| i feel like | 549 |
| i need a | 552 |
| to be a | 596 |
| new blog post | 618 |
| i favorited a | 637 |
| favorited a youtube | 637 |
| going to be | 640 |
| meu resultado foi | 647 |
| e meu resultado | 647 |
| acabo de completar | 649 |
| i just took | 651 |
| vote too ➔ | 673 |
| video chat with | 693 |
| other people at | 693 |
| cant wait to | 707 |
| check it out | 717 |
| joined a video | 720 |
| a video chat | 720 |
| i think i | 724 |
| just joined a | 729 |
| to go to | 741 |
| one of the | 764 |
| a lot of | 786 |
| for the ff | 821 |
| i dont know | 836 |
| i have to | 851 |
| im going to | 875 |
| i have a | 912 |
| pants on the | 966 |
| i need to | 1018 |
| on the ground | 1059 |
| i want to | 1066 |
| aka aka aka | 1147 |
| a youtube video | 1180 |
| i love you | 1216 |
| thanks for the | 1635 |
| paramoreinpoland paramoreinpoland paramoreinpoland | 2745 |
POPULAR 4-WORD PHRASES
| Phrase | Count |
| 5 out of 5 | 202 |
| a youtube video 5 | 202 |
| what do you think | 216 |
| to donate 10 to | 219 |
| 90999 to donate 10 | 225 |
| the people of haiti | 225 |
| to the red cross | 229 |
| i rated a youtube | 234 |
| rated a youtube video | 234 |
| thank you for the | 236 |
| acabo de completar qual | 237 |
| out of 5 stars | 240 |
| i just took what | 253 |
| cant wait to see | 258 |
| a twibbon to your | 266 |
| to your avatar now | 266 |
| twibbon to your avatar | 266 |
| add a twibbon to | 267 |
| nowplaying nowplaying nowplaying nowplaying | 273 |
| to 90999 to donate | 273 |
| on my way to | 274 |
| check this video out | 284 |
| haiti to 90999 to | 287 |
| i uploaded a youtube | 302 |
| uploaded a youtube video | 303 |
| text haiti to 90999 | 366 |
| a shorty award in | 423 |
| for a shorty award | 431 |
| thanks for the ff | 533 |
| favorited a youtube video | 637 |
| i favorited a youtube | 637 |
| e meu resultado foi | 647 |
| a video chat with | 691 |
| just joined a video | 720 |
| joined a video chat | 720 |
| pants on the ground | 873 |
| aka aka aka aka | 1049 |
| paramoreinpoland paramoreinpoland paramoreinpoland paramoreinpoland | 2182 |
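Phrase counts like the ones in the tables above can be produced with a sliding window over each tweet's words. This is a minimal sketch, not the script that generated the tables. The `tokenize` rule (lowercase, drop apostrophes, keep runs of letters and digits) is an assumption chosen because it yields tokens like "dont" and "cant", and `ngram_counts` is a name made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed normalization: lowercase, delete apostrophes, then keep
    # runs of ASCII letters and digits. This also strips "#" and "@".
    return re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))


def ngram_counts(tweets, n):
    """Count every run of n consecutive words across all tweets."""
    counts = Counter()
    for text in tweets:
        words = tokenize(text)
        # Slide a window of width n; tweets shorter than n add nothing.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


sample = [
    "I can't wait to see you",
    "Can't wait to go home",
    "#ff thanks for the #ff",
]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

`bigrams.most_common(20)` then gives the top phrases with their counts. N-grams never span two tweets here, since each tweet is windowed on its own.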
WORDS PER TWEET
| Number of Words | Count |
| 0 | 33504 |
| 1 | 30815 |
| 2 | 39865 |
| 3 | 35435 |
| 4 | 36727 |
| 5 | 38331 |
| 6 | 38839 |
| 7 | 38550 |
| 8 | 37510 |
| 9 | 35866 |
| 10 | 34127 |
| 11 | 32280 |
| 12 | 31443 |
| 13 | 28307 |
| 14 | 26469 |
| 15 | 25015 |
| 16 | 23747 |
| 17 | 23304 |
| 18 | 22619 |
| 19 | 21676 |
| 20 | 20552 |
| 21 | 19002 |
| 22 | 17018 |
| 23 | 14555 |
| 24 | 12241 |
| 25 | 10194 |
| 26 | 7494 |
| 27 | 5094 |
| 28 | 3213 |
| 29 | 1761 |
| 30 | 951 |
| 31 | 455 |
| 32 | 205 |
| 33 | 97 |
| 34 | 32 |
| 35 | 19 |
| 36 | 18 |
| 37 | 8 |
| 38 | 8 |
| 39 | 1 |
| 51 | 1 |
POPULAR HASHTAGS
| Hashtag | Count |
| #wwfm | 240 |
| #helphaiti | 249 |
| #jerseyshore | 251 |
| #bbb10 | 266 |
| #twibbon | 268 |
| #deleteyouraccount | 270 |
| #quote | 274 |
| #tweetmyjobs | 282 |
| #bbb | 295 |
| #epicpetwars | 300 |
| #news | 315 |
| #masen | 315 |
| #1 | 332 |
| #twitterpelis | 376 |
| #shoutout | 377 |
| #follow | 381 |
| #venezuela | 403 |
| #cbb7 | 428 |
| #fail | 461 |
| #sega | 473 |
| #endondeestas | 527 |
| #patdinlatinamerica | 546 |
| #ifyoucheatonme | 563 |
| #tcot | 673 |
| #omgfacts | 712 |
| #iwouldhatetobeyou | 744 |
| #supportdannymcfly | 859 |
| #supportdannyjones | 862 |
| #fb | 1052 |
| #jobs | 1141 |
| #followfriday | 1801 |
| #aka | 1882 |
| #waystoannoypeople | 2072 |
| #haiti | 2148 |
| #nowplaying | 3479 |
| #paramoreinpoland | 3915 |
| #ff | 11354 |
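A hashtag tally like the table above can be sketched with a regular expression and a counter. The pattern below treats a hashtag as "#" followed by word characters, which admits tags such as #1, and it lowercases tags before counting. Both choices are assumptions, as is the name `hashtag_counts`; the original extraction rules are not given in the post.

```python
import re
from collections import Counter

# Assumed definition of a hashtag: "#" followed by letters, digits or "_".
HASHTAG = re.compile(r"#\w+")


def hashtag_counts(tweets):
    """Count hashtag occurrences, ignoring case."""
    counts = Counter()
    for text in tweets:
        # Every occurrence counts, so a tag repeated inside one tweet
        # is counted more than once.
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


tags = hashtag_counts(["#FF @someone #ff", "help #Haiti now", "no tags here"])
```

To count tweets per tag rather than occurrences, wrap the generator in `set(...)` before passing it to `update`.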
TF-IDF
The first step in finding related tweets was simply to determine which topics I was interested in. I applied TF-IDF to my tweet stream to find the most important keywords, and this turned out to work fairly well. I got keywords such as Hadoop, Pig, and AsterData, along with a few names of people I had retweeted. I also got some stop words (interesting, bit, etc.). I removed some of the stop words and filtered out the names, and the resulting list was fairly decent.
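For readers who want to try the keyword step, here is one way it could be sketched. It treats my own tweets as a single document for term frequency and uses a background set of tweets for document frequency. The smoothed IDF formula, the tokenizer, and the name `top_keywords` are assumptions for illustration; the post does not say which TF-IDF variant was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed normalization: lowercase, drop apostrophes, keep letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))


def top_keywords(my_tweets, background_tweets, k=10):
    """Rank the words in my_tweets by TF-IDF against a background corpus."""
    # Term frequency: how often each word appears across all of my tweets.
    tf = Counter(word for text in my_tweets for word in tokenize(text))
    # Document frequency: how many background tweets contain each word.
    df = Counter()
    for text in background_tweets:
        df.update(set(tokenize(text)))
    n_docs = len(background_tweets)
    # Smoothed IDF: adding 1 avoids division by zero for unseen words.
    scores = {
        word: count * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


mine = ["hadoop and pig", "pig on hadoop", "the hadoop job"]
background = ["the cat and the dog", "and the end", "on the way", "the job"]
keywords = top_keywords(mine, background, k=2)
```

Words that appear in nearly every background tweet, such as "the" in this toy data, get an IDF near zero and sink to the bottom, while words that are frequent in my tweets but rare in the stream rise to the top.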
I then iterated over every tweet in the Twitter stream and computed the cosine similarity between the tweet and my feed. This failed miserably. The top tweets for me were one-word tweets such as "PIg___", "data", etc. Unsurprisingly, cosine similarity just doesn't hold up well for small documents such as tweets.
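The failure is easy to reproduce on toy data. Cosine similarity divides by the length of each vector, so a one-word tweet that hits a single keyword puts all of its weight on the match, while a longer tweet spreads its weight over many unrelated words. The sketch below uses made-up weights and raw word counts; `profile`, `one_word`, and `longer` are hypothetical examples, not data from the stream.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-weight vectors (dicts)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical keyword weights standing in for a TF-IDF profile.
profile = {"hadoop": 3.0, "pig": 2.0, "data": 4.0}
one_word = Counter("data".split())
longer = Counter("loading data into hadoop with pig tonight at the office".split())

score_one_word = cosine_similarity(profile, one_word)
score_longer = cosine_similarity(profile, longer)
```

Here the one-word tweet scores about 0.74 and the longer tweet about 0.53, even though the longer one mentions all three keywords. A minimum tweet length, or a score that does not normalize by tweet length, would be possible things to try next.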
This is as far as I've gotten, since I had to head back into work, but I have a few more tricks to try. Ideas are always welcome.
