…and the winner is… SWEDEN!!! (maybe)
For an explanation of how the figure was created, see the wall of text below…
Eurovision actively encourages viewers to tweet about songs. Hashtags are prominently displayed during broadcasts, and one can easily see that there is a lot of buzz about Eurovision on Twitter, which is a great platform for this kind of event. I want to see how well one can predict the final result of Eurovision by following which songs create more traffic on Twitter.
After downloading the Twitter data by querying for Eurovision hashtags during the semi-final broadcasts, we can first look at how the use of different hashtags varies over time during the first semi-final.
One can actually observe the order of the songs! Also noticeable are the peak at 1.5 hours, when the voting started, and the peak at around 2 hours, when the results were announced. The reason behind the sharp peak of #NED at the beginning is unclear to me. I recommend clicking on the figure to enlarge it so you can actually see something.
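For readers who want to reproduce this kind of time series, here is a minimal Python sketch of the binning step. It is not the script I used (that one is in Mathematica); the function name `hashtag_counts_per_bin`, the 5-minute bin width and the input format of (timestamp in seconds, hashtag) pairs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def hashtag_counts_per_bin(tweets, start, bin_minutes=5):
    """Count tweets per hashtag in fixed-width time bins.

    tweets      -- iterable of (timestamp_in_seconds, hashtag) pairs (assumed format)
    start       -- timestamp of the start of the broadcast, in seconds
    bin_minutes -- bin width in minutes (5 is an arbitrary choice)

    Returns a dict mapping lower-cased hashtag -> Counter {bin_index: tweet_count}.
    """
    bin_seconds = bin_minutes * 60
    series = defaultdict(Counter)
    for timestamp, hashtag in tweets:
        if timestamp < start:
            continue  # ignore tweets sent before the broadcast started
        bin_index = int((timestamp - start) // bin_seconds)
        series[hashtag.lower()][bin_index] += 1
    return series


# Toy data, invented for illustration only.
toy_tweets = [(0, "#NED"), (10, "#ned"), (400, "#SWE"), (5400, "#swe")]
toy_series = hashtag_counts_per_bin(toy_tweets, start=0)
```

Plotting each hashtag's counter against the bin index (converted to hours) gives one curve per song, which is the kind of figure discussed above.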
A similar result can be seen for the second semi-final. Interestingly, one can already see that Sweden is faring much better and creating a lot more excitement than the other entries (for instance during voting, but there is even a slight bump at the beginning).
After this, I separate the tweets by their country of origin and see which hashtag got the most affection from users in that country. I then assume that the number of tweets each song receives is proportional to its popularity and award points according to the Eurovision point system. Below is an example for Germany in semi-final 2. Colors for countries are the same as in the figure above.
So, Sweden got the most attention from German Twitter users, and so I award it 12 points. Israel gets 10, Norway 8, Slovenia 7 and so on. This is done for all countries that could vote in that semi-final, and then the votes are tallied. This gives us our first prediction: the number of points that each country received in the semi-finals (note that although the semi-finals are finished, it is not known how many points the countries received; this will only be known after the final).
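The point-awarding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `award_points` and `tally` are mine, ties are broken alphabetically (an arbitrary choice), and the tweet counts in the example are invented; only the resulting order for Germany matches the example above.

```python
from collections import Counter

# Standard Eurovision scale: the top ten entries get these points, in order.
EUROVISION_POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]


def award_points(tweet_counts):
    """Turn one voting country's tweet counts into Eurovision points.

    tweet_counts -- dict mapping song hashtag -> number of tweets from that country.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    ranked = sorted(tweet_counts, key=lambda tag: (-tweet_counts[tag], tag))
    return dict(zip(ranked, EUROVISION_POINTS))


def tally(counts_by_voter):
    """Sum the points awarded by every voting country."""
    totals = Counter()
    for counts in counts_by_voter.values():
        totals.update(award_points(counts))
    return totals


# Invented counts that reproduce the order described for Germany.
germany = {"swe": 50, "isr": 40, "nor": 30, "slo": 20}
print(award_points(germany))  # {'swe': 12, 'isr': 10, 'nor': 8, 'slo': 7}
```

Calling `tally` on a dictionary with one such entry per voting country gives the predicted semi-final scoreboard.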
Actually, we have some handle on how well the countries did: only the top 10 countries from each semi-final qualified! In bold I have marked the countries which actually qualified for the final, and the dashed line represents the “cut-off” at position 10. In both cases, 9 out of 10 estimates are correct! Also, the estimates which are not correct are at position 10, right at the edge. This gives confidence that there is at least some correlation between tweet counts and votes.
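This sanity check is just a set comparison between the predicted top 10 and the actual qualifiers. A minimal sketch, assuming a dict of predicted point totals; the function name `qualifier_hits` and the toy numbers are made up for illustration.

```python
def qualifier_hits(predicted_totals, actual_qualifiers, cutoff=10):
    """Compare the predicted top `cutoff` entries with the real qualifiers.

    predicted_totals  -- dict mapping country -> predicted points
    actual_qualifiers -- iterable of countries that really qualified
    Returns (number of correct predictions, set of wrongly predicted countries).
    """
    ranked = sorted(predicted_totals, key=lambda c: (-predicted_totals[c], c))
    predicted = set(ranked[:cutoff])
    actual = set(actual_qualifiers)
    return len(predicted & actual), predicted - actual


# Toy example with a cut-off of 2 instead of 10 (invented numbers).
hits, misses = qualifier_hits({"a": 30, "b": 20, "c": 10}, ["a", "c"], cutoff=2)
```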
Finally, we want to estimate the final score. For each voting country I combine the results from the two semi-finals by noting what fraction of tweets each song received in its semi-final. Using German Twitter users again as an example, the most popular hashtag in the second semi-final was #swe, which received 11.% of all tweets made by German users, while in the first semi-final it was #bel, which took 8.4% of all German tweets. In this case, Sweden gets 12 points from Germany and Belgium gets 10. The same procedure is repeated for all countries, and the results are summed to produce the first figure of the post.
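Here is a rough Python sketch of this combination step for a single voting country. The function names and the optional `finalists` filter are assumptions of the sketch, as is the alphabetical tie-break, and the counts in the example are invented (they only reproduce the Sweden-before-Belgium order from the text). It relies on each song appearing in only one semi-final.

```python
EUROVISION_POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]


def tweet_fractions(counts):
    """Convert tweet counts from one semi-final into fractions of that semi-final's tweets."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def final_points_from_voter(semi1_counts, semi2_counts, finalists=None):
    """Rank songs from both semi-finals by tweet fraction and award points.

    finalists -- optional set of hashtags that reached the final; when given,
                 only those songs are ranked (an assumption of this sketch).
    """
    # Each song competes in exactly one semi-final, so the keys do not collide.
    fractions = {**tweet_fractions(semi1_counts), **tweet_fractions(semi2_counts)}
    if finalists is not None:
        fractions = {t: f for t, f in fractions.items() if t in finalists}
    ranked = sorted(fractions, key=lambda t: (-fractions[t], t))
    return dict(zip(ranked, EUROVISION_POINTS))


# Invented counts for one voting country.
semi1 = {"bel": 8, "rest_of_semi1": 92}
semi2 = {"swe": 12, "rest_of_semi2": 88}
points = final_points_from_voter(semi1, semi2, finalists={"swe", "bel"})
```

Comparing fractions rather than raw counts keeps the two semi-finals on an equal footing even if one broadcast attracted more tweets overall.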
A few caveats are in order.
First, we obviously have no information about the countries which do not take part in the semi-finals. To predict the final number of points, I have removed 7/27 of the votes from the final result (i.e. assuming that each of the 7 countries about which we have no information will get the mean number of votes). Secondly, the implicit assumption is that the number of tweets is representative of the number of votes that a country will receive. Even assuming that Twitter users are a fair representation of the voting population, most countries use a 50-50 system in which half of the votes are contributed by a jury. Thirdly, the countries of origin of tweets are determined from the location that users have provided to Twitter. This location was cross-matched against the names of countries (in English and in the native language) and a list of major cities. This can potentially create some noise, and it definitely destroyed a lot of signal, as many users do not give their location in a format which I recognized (e.g. non-Latin script or a small town). Twitter officially supports geo-locating around a latitude/longitude, which would resolve a large part of this problem, but (as I discovered after a lot of frustration) that feature is broken in the querying mode at the moment.
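To give an idea of what the location cross-matching looks like, here is a minimal Python sketch. It is not my actual matching code: the `COUNTRY_ALIASES` table is a tiny hypothetical sample (the real list of countries and major cities is much longer), and the function name `country_from_location` is made up.

```python
import re

# Hypothetical, heavily truncated alias table: country code -> country names
# (English and native) plus a few major cities.
COUNTRY_ALIASES = {
    "DE": ["germany", "deutschland", "berlin", "hamburg", "münchen", "munich"],
    "SE": ["sweden", "sverige", "stockholm", "göteborg", "gothenburg"],
}


def country_from_location(location, aliases=COUNTRY_ALIASES):
    """Guess a country code from a user's free-text profile location.

    Returns None when nothing matches, which is what happens for
    non-Latin scripts or small towns missing from the alias table.
    """
    text = location.casefold()
    for code, names in aliases.items():
        for name in names:
            # Whole-word match, so that short names do not match inside other words.
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                return code
    return None


print(country_from_location("Berlin, Germany"))  # DE
print(country_from_location("Москва"))           # None
```

Every tweet for which this returns `None` is lost to the analysis, which is the loss of signal mentioned above.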
Given all this, I will be very interested to see how good the prediction is, both for the semi-finals and for the final. It is encouraging to see that 9 out of 10 countries were correctly selected to advance from each semi-final to the final. Have a great Eurovision night on May 23!
Script used to reduce the data (Wolfram Mathematica; not user-friendly and not usable without the data, given as an example)
Slightly higher-quality versions of the figures