Predicting Eurovision 2015 from Twitter data…


…and the winner is…. SWEDEN!!!  (maybe)


For an explanation of how the figure was created, see the wall of text below…

Eurovision actively encourages viewers to tweet about songs. Hashtags are prominently displayed during broadcasts, and it is easy to see that there is a lot of buzz about Eurovision on Twitter, which is a great platform for this kind of event. I want to see how well one can predict the final result of Eurovision by following which songs create the most traffic on Twitter.

After downloading the Twitter data by querying for Eurovision hashtags during the semi-final broadcasts, we can first look at how the use of the different hashtags varies with time during a semi-final.


One can actually see the order of the songs! Also noticeable are the peak when the voting started (at 1.5 hours) and the peak when the results were announced (around 2 hours). The reason behind the sharp peak of #NED at the beginning is unclear to me. I recommend clicking on the figure to enlarge it so you can actually see something.


A similar result can be seen for the 2nd semi-final. Interestingly, one can already see that Sweden is faring much better and creating a lot more excitement than the other entries (for instance during the voting, but there is even a slight bump at the beginning).

After this, I separate the tweets by their country of origin and see which hashtag got the most affection from the users in that country. I then assume that the number of tweets a song receives is proportional to its popularity, and award points according to the Eurovision point system. Below is an example for Germany in semi-final 2. Colors for the countries are the same as in the figure above.

So, Sweden got the most attention from German Twitter users, and so I award it 12 points. Israel gets 10, Norway 8, Slovenia 7 and so on. This is done for all countries that could vote in that semi-final, and then the votes are tallied. This gives us our first prediction: the number of points that each country received in the semi-finals (note that although the semi-finals are finished, it is not known how many points each country received; this will only be known after the final).
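The point-awarding step can be sketched in a few lines of Python. This is only an illustration and not the Mathematica script used for the post: the tweet counts are invented, the function names are hypothetical, and breaking ties alphabetically is my assumption.

```python
from collections import Counter

# Eurovision scale: 12 and 10 for the top two, then 8 down to 1.
EUROVISION_POINTS = (12, 10, 8, 7, 6, 5, 4, 3, 2, 1)


def award_points(tweet_counts):
    """Turn one voting country's {hashtag: tweet count} into {hashtag: points}.

    Hashtags are ranked by tweet count; ties are broken alphabetically
    (an assumption, the post does not say how ties were handled).
    Only the ten most tweeted hashtags receive points.
    """
    ranked = sorted(tweet_counts, key=lambda tag: (-tweet_counts[tag], tag))
    return dict(zip(ranked, EUROVISION_POINTS))


def tally(counts_by_voter):
    """Sum the points awarded by every voting country."""
    totals = Counter()
    for tweet_counts in counts_by_voter.values():
        totals.update(award_points(tweet_counts))
    return dict(totals)


# Invented counts, chosen only to reproduce the ordering in the German example.
german_counts = {"#swe": 120, "#isr": 95, "#nor": 80, "#slo": 60}
print(award_points(german_counts))
```

With these made-up counts, Sweden gets 12 points, Israel 10, Norway 8 and Slovenia 7, matching the ordering described above.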


Actually, we do have some handle on how well the countries did: only the top 10 countries from each semi-final qualified! In bold I have marked the countries which actually qualified for the final, and the dashed line represents the “cut-off” at position 10. In both cases, 9 out of 10 estimates are correct! Also, the estimates which are not correct are at position 10, right at the edge. This gives some confidence that there is at least some correlation between these two quantities.

Finally, we want to estimate the final score. For each voting country I combine the results from the two semi-finals by noting what fraction of the tweets from that country each song received in its semi-final. Using German Twitter users again as an example: in the second semi-final the most popular hashtag was #swe, which received 11.% of all tweets made by German users, while in the first semi-final it was #bel, which took 8.4% of all German tweets. In this case, Sweden gets 12 points from Germany and Belgium gets 10. The same procedure is done for all countries, the results are summed, and the first figure of the post is produced.
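Here is a rough Python sketch of how the two semi-finals could be merged for a single voting country. The counts are invented and the function names are hypothetical. It assumes that every hashtag appears in only one semi-final, so that the shares can be ranked together directly.

```python
EUROVISION_POINTS = (12, 10, 8, 7, 6, 5, 4, 3, 2, 1)


def tweet_shares(tweet_counts):
    """Convert {hashtag: tweet count} into {hashtag: fraction of all tweets}."""
    total = sum(tweet_counts.values())
    if total == 0:
        raise ValueError("no tweets from this country in this semi-final")
    return {tag: n / total for tag, n in tweet_counts.items()}


def combined_points(semi1_counts, semi2_counts):
    """Rank songs from both semi-finals by their share of one country's tweets.

    Shares are computed separately within each semi-final, then ranked
    together and mapped onto the Eurovision point scale.
    """
    shares = {**tweet_shares(semi1_counts), **tweet_shares(semi2_counts)}
    ranked = sorted(shares, key=lambda tag: (-shares[tag], tag))
    return dict(zip(ranked, EUROVISION_POINTS))


# Invented counts for one voting country, purely for illustration.
semi1 = {"#bel": 40, "#rus": 35, "#est": 25}
semi2 = {"#swe": 50, "#isr": 30, "#nor": 20}
print(combined_points(semi1, semi2))
```

Working with fractions rather than raw counts keeps a semi-final that happened to attract more tweets overall from dominating the combined ranking.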


A few caveats are in order.

Firstly, we obviously have no information about the countries which do not take part in the semi-finals. To predict the final number of points I have removed 7/27 of the votes from the final result (i.e. assuming that the 7 countries about which we have no information will each get the mean number of votes).

Secondly, the implicit assumption is that the number of tweets is representative of the number of votes that a country will receive. Even assuming that Twitter users are a fair representation of the voting population, most countries use a 50-50 system in which half of the votes are contributed by a jury.

Thirdly, the countries of origin of the tweets are determined from the location that users have provided to Twitter. This location was cross-matched against the names of countries (in English and in the native language) and a list of major cities. This can create some noise, and it definitely destroyed a lot of signal, as many users do not give their location in a format which I recognized (e.g. non-Latin script or a small town). Twitter officially supports geo-locating around a latitude/longitude, which would resolve a large part of this problem, but (as I discovered after a lot of frustration) that feature is broken in the querying mode at the moment.

Given all this, I will be very interested to see how good the prediction is, both in the semi-finals and in the final. It is encouraging that 9 out of 10 countries were correctly predicted to advance from each semi-final to the final. Have a great Eurovision night on May 23!
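To give a flavour of the location cross-matching, here is a minimal Python sketch. The lookup table is a tiny hypothetical stand-in for the real lists of country names and major cities, and the word-by-word matching is my simplification, not necessarily what the original script does.

```python
import re

# Hypothetical miniature lookup table; a real one would cover all countries,
# their native-language names and a list of major cities.
PLACE_TO_COUNTRY = {
    "germany": "Germany", "deutschland": "Germany", "berlin": "Germany",
    "sweden": "Sweden", "sverige": "Sweden", "stockholm": "Sweden",
}


def country_from_location(location):
    """Guess a country from the free-text location in a user profile.

    Returns None when nothing matches, which is how signal is lost for
    small towns, other scripts or joke locations. Multi-word names are
    not handled in this sketch.
    """
    if not location:
        return None
    # Split into runs of letters, ignoring punctuation and digits.
    for word in re.findall(r"[^\W\d_]+", location.lower()):
        if word in PLACE_TO_COUNTRY:
            return PLACE_TO_COUNTRY[word]
    return None


print(country_from_location("Berlin, Deutschland"))
print(country_from_location("somewhere over the rainbow"))
```

The first call resolves to Germany, while the second returns None and the tweet would simply be dropped.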

Script used to reduce the data (Wolfram Mathematica; not user-friendly and not usable without the data, given only as an example)

Slightly higher-quality versions of the figures



Changing world of The Big Bang Theory show



In the figure above we can see the frequency of words mentioned in different seasons of The Big Bang Theory. These are “unique” mentions, in the sense that they count only the number of episodes in which a word appears, not how many times it is mentioned in total (e.g. if the name “Penny” is mentioned 10 times in one episode, it is still counted as one mention). All of the lines have been normalized with respect to season 1. One can clearly see a transition in season 4, before which the male protagonists are mainly bachelors and after which they become more successful with members of the opposite gender. Apart from there being more female characters, the show is also more focused on dating, while the traditional occupations of the male protagonists, research and comic book reading, seem to suffer.

(see also the interesting discussion that has developed on reddit)
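The “unique mention” counting can be sketched in Python as follows. The input format (a list of season number and transcript text pairs) and the function names are my assumptions, and the toy transcripts are invented.

```python
import re


def episodes_mentioning(episodes, word):
    """Count, per season, the episodes in which `word` appears at least once.

    `episodes` is an iterable of (season, transcript_text) pairs.
    """
    counts = {}
    target = word.lower()
    for season, text in episodes:
        counts.setdefault(season, 0)
        # A set, so repeated mentions within one episode count only once.
        words = set(re.findall(r"[a-z']+", text.lower()))
        if target in words:
            counts[season] += 1
    return counts


def normalize_to_season_one(counts):
    """Divide every season's count by the season 1 count."""
    base = counts.get(1, 0)
    if base == 0:
        raise ValueError("word never appears in season 1, cannot normalize")
    return {season: n / base for season, n in sorted(counts.items())}


# Toy transcripts, invented for illustration only.
toy_episodes = [
    (1, "Penny, Penny, Penny!"),
    (1, "Hello Leonard."),
    (2, "Penny and Amy."),
    (2, "Penny again."),
]
print(normalize_to_season_one(episodes_mentioning(toy_episodes, "Penny")))
```

In the toy data “Penny” appears in one season 1 episode and two season 2 episodes, so the normalized values are 1.0 and 2.0.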


Our daily Vox Charta continued… which topics to discuss and how to get a lot of votes

Common wisdom in astronomy circles is that Vox Charta gives a biased view of the astronomy community, skewed towards extragalactic topics. Let’s see how much truth there is in that statement.



Papers that contain keywords connected with galaxies and cosmology do indeed seem to be upvoted more often than papers connected with other fields. The dashed line is the 1:1 correspondence, on which the points would lie if votes were shared out in proportion to the number of papers. Points above the line are more upvoted (they have a larger share of Vox Charta votes than one would expect from their numbers), while points below the line are underrepresented on Vox Charta. For instance, we see that papers with stellar keywords received less than half of the votes received by the galaxy papers.
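The comparison behind the 1:1 line boils down to two fractions per topic: its share of papers and its share of votes. A small Python sketch with invented numbers, assuming for simplicity that each paper is assigned to exactly one topic (the function name is hypothetical):

```python
from collections import defaultdict


def topic_shares(papers):
    """Return {topic: (share of papers, share of votes)}.

    `papers` is an iterable of (topic, votes) pairs. A topic whose vote
    share exceeds its paper share lies above the 1:1 line.
    """
    n_papers = defaultdict(int)
    n_votes = defaultdict(int)
    for topic, votes in papers:
        n_papers[topic] += 1
        n_votes[topic] += votes
    total_papers = sum(n_papers.values())
    total_votes = sum(n_votes.values())
    if total_votes == 0:
        raise ValueError("no votes at all, shares are undefined")
    return {
        topic: (n_papers[topic] / total_papers, n_votes[topic] / total_votes)
        for topic in n_papers
    }


# Invented example: equal numbers of papers, very unequal votes.
toy_papers = [("galaxy", 6), ("galaxy", 2), ("stellar", 1), ("stellar", 1)]
print(topic_shares(toy_papers))
```

In this toy case both topics have half of the papers, but galaxy papers collect 80% of the votes and so would sit well above the dashed line.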



A different way to convey very similar information is shown in the figure above, which shows cumulative distribution functions. Lines close to the top of the figure denote a low number of votes (a large number of papers receiving few votes), while galaxy and cosmology papers clearly receive more votes across the board. 50% of the papers containing galaxy or cosmology keywords have at least one vote. We can also see that almost all of the most upvoted papers (25+ votes) concern galaxy and cosmology topics.


OK, so if your life goal is for your papers to have many Vox Charta votes, you had better work on extragalactic topics. It also seems beneficial to have many authors on your papers, as seen in the figure above, which shows the correlation between the number of votes and the number of authors on a paper. I have dashed the area where there are more than 10 papers per point. Beyond that, there are only very few papers in each bin, so any statistical statements are pretty weak.


It also seems to be good to write a longer abstract, hopefully because the authors have a lot of smart things to say. As before, the dashed area shows where there are more than 10 papers per point. There seems to be an increase up to around 250 words (the abstract limit for many papers), after which the trend stabilizes and possibly declines.

So, summarizing our conclusions from the first post and this one: to get a lot of votes, work on extragalactic topics, submit your paper so that it is at the top of the astro-ph list (competition is lowest on Tuesday), get a lot of co-authors and write long abstracts (possibly also do good science, but this is only based on anecdotal evidence).

Our Daily Vox Charta

Over the last few years Vox Charta has become one of the more prominent tools in every astronomer’s arsenal. For those who might be unfamiliar with Vox Charta and arXiv, very briefly: on the Vox Charta website, members of the participating academic institutions can “upvote” or “downvote” papers that have appeared on arXiv. The idea is that people upvote papers that they found interesting and want to talk about at the next discussion session in their department. Everybody can see how many votes a paper has received, so one can easily see which papers are “hottest”, i.e. which have spurred the most interest in the astro community. Let’s see how the number of votes on Vox Charta in 2014 correlates with some other parameters!

Above we see that the position of a paper on the arXiv list correlates strongly with the number of votes above position 20 (the lines show a poor broken power-law fit to the data, done with the “eyeballing” method). Below position 20 the trend seems to stabilize. The scatter increases at very high position numbers simply because there are very few days on which 60+ papers are published. Interestingly, the first position does not also mean the largest number of votes. It is important to note that a significant number of papers that end up first on the list were not actually the first ones submitted after the deadline; they were usually submitted a day or so earlier, and I assume that some problem caused them to be published with a delay through moderator action.


Different days of the week spur different numbers of votes. The day with the most activity seems to be Wednesday and the slowest day is Monday. It also seems that astrophysicists like to upvote papers more in the middle of the week. Even though there is some difference, it is only at about the 20% level.


This difference is largely driven by the number of papers that are published each day. Papers published on Tuesday seem to have the lowest number of votes, and Tuesday also seems to be the only significant outlier.


The distribution of votes is highly non-uniform. In the plot above, we show the cumulative distribution of the votes that papers receive. For instance, one can see that almost 40% of papers receive no votes, and around 80% of papers receive 5 or fewer. Having 10 votes already puts a paper in the top 10%, while about 18 votes are needed to break into the top 5%.
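For reference, an empirical cumulative distribution like this one takes only a few lines of Python to compute. The vote counts below are invented and the function names are hypothetical.

```python
def fraction_with_at_most(votes, k):
    """Fraction of papers that received k votes or fewer."""
    if not votes:
        raise ValueError("need at least one paper")
    return sum(1 for v in votes if v <= k) / len(votes)


def cumulative_distribution(votes):
    """List of (k, fraction of papers with at most k votes) for k = 0..max."""
    return [(k, fraction_with_at_most(votes, k)) for k in range(max(votes) + 1)]


# Invented vote counts for ten papers.
toy_votes = [0, 0, 0, 1, 2, 3, 5, 7, 10, 18]
print(fraction_with_at_most(toy_votes, 0))
print(fraction_with_at_most(toy_votes, 5))
```

Reading off the value at k = 0 gives the fraction of papers with no votes, and the value at k = 5 gives the fraction with 5 or fewer, which is how the percentages quoted above are obtained from the real data.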



OK, so if one wants to be at the top of the arXiv list and (perhaps) have a better chance of getting more votes, how quickly should the paper be submitted?

We show three lines for different speeds of filling up. In blue, results are shown for the 10% of days which reached 20 submitted papers the quickest. In orange the mean is shown, and in green we show results for the slowest 10% of days.

On average, submitting within around 20 seconds after the deadline will secure one of the first five positions. After the initial rush is over, in about 1 minute, things slow down considerably.


OK, so you want to be first on the list. How quick do you have to be to succeed in that mission? The data show that in order to have a 50% probability of success, the paper has to be registered by arXiv in the first second, and this has no strong dependence on the day of the week on which the paper is published. This does not take into account the aforementioned effect that, even if you submit first, you might not get first place because of moderator action.


Being in the top 5 is somewhat easier and shows a stronger day dependence. As one can see above, submitting within the first 20 seconds should place the paper in the first 5 positions. Competition is much weaker for Monday and Tuesday submissions than for other days of the week.