Predicting Eurovision 2017 from Twitter data…

…and the winner is…. Portugal!!! (maybe)


This is the 2017 version of the prediction of Eurovision results from Twitter data. I have explained the systematics in quite some detail in the post describing the results from 2015, which you can find here (and the 2016 results are here). Very briefly, I measured how many tweets have been sent about each song from each country. From this, I estimated the number of votes that each country would give to each other country. For example, if Germans tweet the most about the Polish song, I assume that Germany will give Poland 12 points. Notice that this is very different from simply collecting all the tweets and measuring which song was tweeted about the most; that would be heavily biased toward countries that use Twitter the most, so these measurements are normalized per country. Even though this is a very crude estimate and the possible caveats are numerous, the 2015 winner was correctly predicted and overall the prediction was quite good (see here for a comparison of the prediction with the actual results). For 2016 the prediction was that Russia was going to win, but they finished second (they did win the popular vote though, which is closer to what this method is actually measuring). Note that this year Italy is a strong favorite, and it is not included in this analysis as it did not take part in the semi-finals.
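To make the scoring step concrete, here is a minimal Python sketch of the idea (my actual data-reduction scripts are in Wolfram Mathematica; the data structure, country codes and numbers below are just illustrative assumptions): for each voting country the songs are ranked by how often that country tweeted about them, and points are awarded along the Eurovision 12, 10, 8, 7, … 1 scale.

```python
from collections import defaultdict

# Eurovision scale: a voting country's top song gets 12 points, then 10, 8, 7, ... 1.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def predict_points(tweet_counts):
    """tweet_counts: dict mapping (voting_country, song_country) -> number of tweets.
    Returns the total predicted points per song_country."""
    per_voter = defaultdict(dict)
    for (voter, song), n in tweet_counts.items():
        if voter != song:                 # countries cannot vote for their own song
            per_voter[voter][song] = n

    totals = defaultdict(int)
    for voter, counts in per_voter.items():
        # Rank songs by how much this particular country tweeted about them;
        # this is the per-country normalization described above.
        ranked = sorted(counts, key=counts.get, reverse=True)
        for song, pts in zip(ranked, POINTS):
            totals[song] += pts
    return dict(totals)

# Toy example with made-up numbers: Germany tweets most about Poland -> 12 points.
example = {("DEU", "POL"): 530, ("DEU", "SWE"): 410, ("DEU", "PRT"): 390,
           ("FRA", "PRT"): 620, ("FRA", "POL"): 150, ("FRA", "SWE"): 90}
print(predict_points(example))
```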

Below you can find some other interesting plots. First, I am showing the time dependence of tweets during the semi-finals. Notice how you can see precisely when each country is performing. You can even see when the breaks in the programme are, as well as the beginning of the voting (around 1.6 hours after the start of the programme) and the announcement of the results (the bump at 2 hours). Notice the very strong performance of Portugal in the first semi-final, with tweets continuing throughout the whole show. Montenegro also attracted many tweets, thanks to their extravagant singer and performance.

Even though the semi-finals have finished, only the countries that advance to the final are known, not their semi-final scores. Below I show the prediction for the number of points in the semi-finals (you can compare it once Eurovision is finished and these results are made public). In the first semi-final the algorithm correctly predicts 9 out of the 10 countries that went through to the final (although it fails spectacularly for Montenegro! A good reminder that the number of tweets is not actually the same as the number of points!). For the second semi-final we also seem to be doing OK, with 8 out of 10 countries correctly predicted and no catastrophic failures.

Predicting Eurovision 2016 from Twitter data…

…and the winner is…. Russia!!! (maybe)

[Figure FinalSitePlus200: predicted points in the 2016 final; click to enlarge]


For an explanation of how the figure was created, see the text below…

This is the 2016 version of the Eurovision prediction. I have explained the systematics in quite some detail in last year's post, which you can find here. Very briefly, I measured how many tweets have been sent about each song from each country. From this, I estimated the number of votes that each country would give to each other country. For example, if Germans tweet the most about the Polish song, I assume that Germany will give Poland 12 points. Notice that this is very different from simply collecting all the tweets and measuring which song was tweeted about the most; that would be heavily biased toward countries that use Twitter the most, so these measurements are normalized per country. Even though this is a very crude estimate and the possible caveats are numerous, last year's winner was correctly predicted and overall the prediction was quite good (see here for a comparison of the prediction with the actual results). If you are confused by the number of points, it is much larger than in previous years because of the change to the voting rules (http://www.eurovision.tv/page/voting).

Below I am showing some other interesting plots. First we can see the time dependence of tweets during the semi-finals. Notice how you can see precisely when each country is performing. You can even see when the breaks in the programme are, as well as the beginning of the voting (around 1.6 hours after the start of the programme) and the announcement of the results (the bump at 2 hours).

[Figure SF1Plus100: tweet activity during semi-final 1]

[Figure SF2Plus100: tweet activity during semi-final 2]

Even though the semi-finals have finished, only the countries that advance to the final are known, not their semi-final scores. Below I show the prediction for the number of points in the semi-finals (you can compare it once Eurovision is finished and these results are made public). Colors are the same for the same countries as above. Because there are so many countries, some colors unfortunately have to repeat, but the country is always clearly stated below. In the first semi-final the algorithm correctly predicts 9 out of the 10 countries that went through to the final (although it fails spectacularly for Estonia, but shh… just a reminder that this is only an estimate). For the second semi-final we seem to be doing better, again predicting 9 out of 10 countries, but without catastrophic failures. This already gives us some confidence in the results.

[Figure PSF1Plus200: predicted points in semi-final 1]

[Figure PSF2Plus200: predicted points in semi-final 2]

Finally, here I show a prediction derived from a slightly different dataset. In the plot shown at the top, I combined tweets that use the hashtag of a country (e.g. #POL) with tweets that mention the name of the country in English (e.g. Poland) together with the hashtag #EUROVISION. This could potentially bias against certain countries (e.g. Russia, which gets a lot of support from nearby, predominantly Russian-speaking countries). On the other hand, using only the hashtag of a country reduces our dataset by roughly 50%. For comparison, the plot derived using only the data collected via country hashtags is shown below. As you can see, the result differs in the details, but the basic trends are the same. The results in this plot seem to strengthen the first position of Russia.

[Figure FinalSite200: predicted points in the 2016 final, country-hashtag dataset only]
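For reference, here is a rough Python sketch of how a single tweet is attributed to a song in the two datasets compared here (the hashtags and country names listed are a tiny illustrative subset of the full mapping, not the actual lookup table used):

```python
import re

# Illustrative subset of the mapping; the real analysis covered all participating countries.
HASHTAG_TO_COUNTRY = {"#pol": "Poland", "#rus": "Russia", "#swe": "Sweden"}
NAME_TO_COUNTRY = {"poland": "Poland", "russia": "Russia", "sweden": "Sweden"}

def song_country(tweet, hashtags_only=False):
    """Return the country whose song a tweet refers to, or None if no match."""
    text = tweet.lower()
    for tag, country in HASHTAG_TO_COUNTRY.items():
        if tag in text:
            return country
    if not hashtags_only and "#eurovision" in text:
        # Wider dataset: also accept the English country name next to #EUROVISION.
        for name, country in NAME_TO_COUNTRY.items():
            if re.search(r"\b" + name + r"\b", text):
                return country
    return None

print(song_country("Great staging by #POL tonight!"))                 # Poland
print(song_country("Loving the song from Sweden #eurovision"))        # Sweden
print(song_country("Loving the song from Sweden #eurovision", True))  # None
```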

Refugees/Migrants along the West Balkan route

[Figure: combined daily arrivals along the West Balkan route; click to enlarge]


As Germany prepares to register its millionth refugee, I discovered that UNHCR maintains a great website visualizing the flow of refugees/migrants throughout Europe (http://data.unhcr.org/mediterranean/regional.php), along with graphs showing the number of arrivals in each country. I combined the data for all of the countries along the West Balkan route, which enables us to track time delays as people arrive from one country to the next.

A seemingly distant event, such as the shipping strike in Greece, sends reverberations along the whole route, creating a short dip in arrivals and then a spike as the bottleneck is released. Another significant event was the closing of the border by Hungary, which immediately resulted in the movement of refugees/migrants through Slovenia, which had not been part of the route until that point. Movement can be traced much better in the later months (e.g. notice how well traced the last peak is, starting in Macedonia in early December and ending in Austria just a few days later). I assume this is because the methods of accepting and transporting refugees/migrants, and the general level of organization, are much higher now than at the beginning of the fall.

Data sources:
Croatia: Croatian Police
Slovenia: Slovenian Police
Hungary: Hungarian Police
Greece, Macedonia, Serbia, Austria: UNHCR
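As a rough illustration of how such a delay can be quantified, the sketch below (Python, with made-up toy series rather than the actual police/UNHCR numbers) shifts the arrival curve of a downstream country against an upstream one and picks the lag with the highest correlation:

```python
import numpy as np

def best_lag(upstream, downstream, max_lag=15):
    """Find the shift (in days) of `downstream` that correlates best with `upstream`.
    Both arguments are equally long 1-D arrays of daily arrival counts."""
    upstream, downstream = np.asarray(upstream, float), np.asarray(downstream, float)
    scores = {}
    for lag in range(0, max_lag + 1):
        a = upstream[: len(upstream) - lag]
        b = downstream[lag:]
        scores[lag] = np.corrcoef(a, b)[0, 1]
    return max(scores, key=scores.get), scores

# Toy data: the downstream country sees the same peak three days later, plus noise.
rng = np.random.default_rng(0)
greece = 2000 + 1500 * np.exp(-0.5 * ((np.arange(60) - 30) / 4.0) ** 2)
austria = np.roll(greece, 3) + rng.normal(0, 50, 60)
lag, _ = best_lag(greece, austria)
print(f"Estimated delay: {lag} days")   # expected to be close to 3
```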


Global warming, yet another perspective

Temperature extremes in Europe


Just a few days ago, the highest temperature ever recorded in Germany was measured, as Europe swelters this summer. The figure above shows us once again that the world is warming up. In equilibrium one would expect the dates of temperature minima and maxima to be randomly distributed, i.e. that for approximately half of the countries the temperature extreme set most recently was the maximum, and for the other half it was the minimum.

Well, for Europe that is certainly not so. Out of the 31 countries that have dated temperature extremes (https://en.wikipedia.org/wiki/List_of_weather_records), 26 set their temperature maximum most recently and only 5 their temperature minimum. Also, for only one of these countries was the temperature minimum set in the last 10 years (Italy), while the temperature maximum was set in the last 10 years for 13(+1) countries (Austria, Belarus, Cyprus, Czech Republic, Finland, Germany, Hungary, (Italy, but before its minimum), Latvia, Macedonia, Serbia, Slovakia, Slovenia and Ukraine).
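To put a number on how unlikely this split is under the "random" expectation, here is a quick sketch: if each country were equally likely to have set its maximum or its minimum more recently, the chance of seeing 26 or more maxima out of 31 comes out at roughly one in ten thousand.

```python
from math import comb

n, k = 31, 26   # 26 of the 31 countries set their record maximum more recently

# Under the null hypothesis each country is a fair coin flip (max vs. min last),
# so the count of "maximum last" countries follows a binomial distribution with p = 0.5.
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"P(at least {k} of {n} maxima last) = {p_value:.2e}")   # roughly 1e-4
```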


Twitch chat and viewers during the last day of The International 5 

[Figure TI5PlotSmaller: Twitch viewers and chat messages during the last day of The International 5]


This Saturday the largest event in e-sports history took place. The International 5, a Dota 2 tournament, had a prize pool of 18.6 million dollars (for comparison, a single Wimbledon tournament has a prize pool of around 16 million). Last year (when the prize pool was around 10 million dollars) more than 20 million people tuned in to watch some part of the tournament. One of the channels for viewing the final day of the tournament was twitch.tv. In the figure above we can see the evolution of the number of viewers and the number of chat messages sent during the last day on the main English-speaking channel.

We can nicely see the build-up of viewers and the drop after each game ends (the ends of games are shown with dashed lines). We see a prolonged dip in the number of viewers while the teams take a break (around 14 to 15 PST) and then the final drop after the finals end. Twitch chat reacts most strongly to the end of the games that the American team (EG) wins (for instance, notice that there is no spike during “event 6”, after the loss). Individual spikes are often associated with good plays in the game, but also with whatever interesting is happening on screen. For instance, notice the strong spike during the final stages of the event, during the DJ set which was poorly received; even though people are leaving, Twitch chat actually intensifies.
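For the curious, the chat-message curve in such a figure can be produced by simple time binning; a minimal pandas sketch with made-up timestamps standing in for the real chat log (the timing and shape of the toy data are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real chat log: random message timestamps over two hours,
# denser around an exciting moment one hour in.
rng = np.random.default_rng(1)
base = pd.Timestamp("2015-08-08 12:00")
offsets = np.concatenate([rng.uniform(0, 7200, 20000),       # background chatter
                          rng.normal(3600, 120, 5000)])      # spike around a big play
chat = pd.DataFrame({"timestamp": base + pd.to_timedelta(offsets, unit="s")})

# Messages per minute: index by timestamp and resample, as for the curve in the figure.
per_minute = chat.set_index("timestamp").resample("1min").size()
print(per_minute.idxmax(), per_minute.max())   # the busiest minute sits near 13:00
```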

Some users tweet more than others… much more

[Figure TwitterUsersAna: share of all tweets generated by the most active users]

 

I have recently compiled a database with some interesting Twitter stats (you can also access the raw data here). This is one result that I found really intriguing and that reminded me of this classic video showing economic inequality in America; the Twitter landscape is very uneven, with a small number of users generating a huge fraction of the tweets. In the figure above we can see that just 1% of users generate 60% of all tweets, while even the top 0.1% of users alone are responsible for around 19% of all tweets. You can access the script used to make this plot here (Wolfram Mathematica).
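The numbers in the figure boil down to a cumulative sum over users sorted by activity. The original plot script is in Wolfram Mathematica; below is an equivalent Python sketch with a made-up heavy-tailed activity distribution standing in for the real data:

```python
import numpy as np

def top_share(tweets_per_user, fraction):
    """Fraction of all tweets produced by the most active `fraction` of users."""
    counts = np.sort(np.asarray(tweets_per_user))[::-1]        # most active first
    n_top = max(1, int(round(fraction * len(counts))))
    return counts[:n_top].sum() / counts.sum()

# Toy heavy-tailed activity distribution (Pareto-like), standing in for the real database.
rng = np.random.default_rng(2)
tweets_per_user = rng.pareto(1.1, 1_000_000) + 1

for f in (0.001, 0.01, 0.1):
    print(f"top {f:.1%} of users -> {top_share(tweets_per_user, f):.0%} of tweets")
```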

Prediction = Great Success!

…and the winner is…. SWEDEN!!!  (actually)



On Saturday morning I posted this analysis, which tried to predict the winner of Eurovision from the Twitter activity during the semi-finals. Its prediction was that Sweden was going to win. That part was right. In the figure below we can see how well the prediction did for all of the contestants. The size of each point is proportional to the number of points the country won, and the color denotes how wrong the prediction was.

[Figure FinalPredicting: predicted vs. actual positions in the 2015 final]

In general I under-predicted the number of points for the best countries and over-estimated the number of points for countries further down the table. The point for Cyprus is not shown as it is quite far off (at 4.8). But altogether I am amazed how well the prediction worked given the simplicity of the assumptions. For 4 countries the estimate was exactly correct (from random sampling one would expect 0.5 countries to be correct), for 7 the position was either correct or off by only one (random sampling would produce only 2 such hits), and for 13 the estimate was within 3 positions of the correct one (random sampling would produce 6.5). Below are the equivalent figures for both semi-finals. For semi-final 2 the estimate is almost uncannily accurate!

[Figures FinalSemi1, FinalSemi2: predicted vs. actual results for the two semi-finals]
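For transparency, here is one way such a random baseline can be estimated with a quick Monte Carlo simulation in Python. The number of predicted countries and of final positions are assumptions of this toy setup, and the exact expectations depend on how the "random" guessing is modelled, so this is a sketch rather than a reproduction of the numbers quoted above.

```python
import numpy as np

def random_baseline(n_predicted=20, n_positions=27, trials=100_000, seed=3):
    """Expected numbers of exact, within-1 and within-3 position matches when the
    predicted ordering carries no information at all."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(3)
    for _ in range(trials):
        # Draw random true final positions for the predicted countries and compare
        # them with their predicted positions 0 .. n_predicted - 1.
        true_pos = rng.permutation(n_positions)[:n_predicted]
        diff = np.abs(true_pos - np.arange(n_predicted))
        hits += [np.sum(diff == 0), np.sum(diff <= 1), np.sum(diff <= 3)]
    return hits / trials

exact, within1, within3 = random_baseline()
print(f"exact: {exact:.1f}, within 1: {within1:.1f}, within 3: {within3:.1f}")
```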

 

Predicting Eurovision 2015 from Twitter data…

 

…and the winner is…. SWEDEN!!!  (maybe)

[Figure FinalSite100: predicted points in the 2015 final]

For an explanation of how the figure was created, see the wall of text below…


Eurovision actively encourages viewers to tweet about the songs. Hashtags are prominently displayed during the broadcasts and one can easily see that there is a lot of buzz about Eurovision on Twitter, which is a great platform for this kind of event. I want to see how well one can predict the final result of Eurovision by following which songs create the most traffic on Twitter.

After downloading the Twitter data, querying for Eurovision hashtags during the semi-final broadcasts, we can first observe the temporal variation of the different hashtags during each Eurovision semi-final.
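These time-dependence curves are produced by binning tweets per hashtag in short time intervals. A minimal pandas sketch of that step, assuming the downloaded tweets have already been flattened into a table with a time and a matched hashtag (the column names and toy rows are my assumptions):

```python
import pandas as pd

def hashtag_timelines(tweets, freq="2min"):
    """Count tweets per hashtag in fixed time bins.
    `tweets` is a DataFrame with a datetime column 'time' and a column 'hashtag'."""
    return (tweets.set_index("time")
                  .groupby([pd.Grouper(freq=freq), "hashtag"])
                  .size()
                  .unstack(fill_value=0))

# Tiny inline example standing in for the full downloaded dataset.
tweets = pd.DataFrame({
    "time": pd.to_datetime(["2015-05-19 21:01", "2015-05-19 21:02",
                            "2015-05-19 21:03", "2015-05-19 21:07"]),
    "hashtag": ["#ned", "#ned", "#swe", "#swe"],
})
print(hashtag_timelines(tweets))
# Each column can then be plotted against time to produce a figure like the one below.
```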

[Figure SF1100: tweet activity per hashtag during semi-final 1]

One can actually observe the order of the songs! Also noticeable are the peak (at 1.5 hours) when the voting started and the peak when the results are announced (around 2 hours). The reason behind the sharp peak of #NED at the beginning is unclear to me. I recommend clicking on the figure to enlarge it so you can actually see something.

[Figure SF2100: tweet activity per hashtag during semi-final 2]

A similar result can be seen for the second semi-final. Interestingly, one can already see that Sweden is faring much better and creating a lot more excitement than the other entries (for instance during the voting, but there is even a slight bump at the beginning).

After this I separate the tweets by their country of origin and see which hashtag got the most attention from the users of each country. I then assume that the number of tweets the different songs receive is proportional to their popularity and award them points along the Eurovision point system. Below is an example for Germany in semi-final 2. Colors for countries are the same as in the figure above.

[Figure ExampleGermany100: tweets by German users about each song in semi-final 2]

So, Sweden got the most attention from German Twitter users and I award it 12 points. Israel gets 10, Norway 8, Slovenia 7 and so on. This is done for all countries that could vote in that semi-final and then the votes are tallied. This gives us our first prediction: the number of points that each country received in the semi-finals (note that although the semi-finals are finished, it is not known how many points the countries received; this will only be known after the final finishes).

[Figures PSF1100, PSF2100: predicted points in semi-final 1 and semi-final 2]

Actually, we do have some handle on how well the countries did: only the top 10 countries from each semi-final have qualified! In bold I denote the countries which actually qualified for the final, and the dashed line represents the “cut-off” at position 10. In both cases, 9 out of 10 estimates are correct! Moreover, the estimates which are not correct are actually at position 10, right at the edge. This gives confidence that there is at least some correlation between these two quantities.

Finally, we want to estimate the final score. For each country I combine the results from the two semi-finals. This is done by noting what fraction of tweets each song received in its semi-final. Using German Twitter users again as an example, in the second semi-final the most popular hashtag was #swe, which received 11.% of all tweets made by German users, while in the first semi-final it was #bel, which took 8.4% of all German tweets. In this case, Sweden gets 12 points from Germany and Belgium gets 10. The same procedure is done for all countries, the results are summed, and the first figure of the post is produced.

[Figure FinalSite100: predicted points in the 2015 final (same as the figure at the top of the post)]
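Here is a small Python sketch of this combination step (the real scripts are in Wolfram Mathematica; the fractions below just mirror the German example from the text and are otherwise illustrative):

```python
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def combined_vote(frac_sf1, frac_sf2):
    """Merge one voting country's tweet fractions from both semi-finals and
    award Eurovision points to the songs with the largest combined fractions."""
    fractions = {**frac_sf1, **frac_sf2}           # each song appears in only one semi-final
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    return dict(zip(ranked, POINTS))

# German users, as in the example above: #swe led semi-final 2, #bel led semi-final 1.
germany_sf1 = {"BEL": 0.084, "EST": 0.05}          # fractions of all German tweets (illustrative)
germany_sf2 = {"SWE": 0.11, "ISR": 0.06}
print(combined_vote(germany_sf1, germany_sf2))     # SWE -> 12, BEL -> 10, ...
```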

A few words of caveat are in order.

Obviously we have no information about the countries which did not take part in the semi-finals. To predict the final number of points I have removed 7/27 of the votes from the final result (i.e. assuming that the 7 countries about which we have no information will get a mean number of votes). Secondly, the implicit assumption is that the number of tweets is representative of the number of votes that a country will receive. Even assuming that Twitter users are a fair representation of the voting population, most countries use a 50-50 system in which half of the votes are contributed by a jury. Thirdly, the countries of origin of the tweets are determined from the location that users have provided to Twitter. This location was then cross-matched against the names of countries (in English and in the native language) and a list of major cities. This can potentially also create some noise and has definitely destroyed a lot of signal, as many users do not give their location in a format which I recognized (e.g. non-Latin script or a small town). Twitter officially supports geolocated queries around a latitude/longitude, which would resolve a large part of this problem, but (after a lot of frustration) I discovered that this feature is currently broken in the querying mode.
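The location matching described above is essentially string matching against a lookup table. A simplified Python sketch (the table below is a tiny illustrative subset, not the actual list of country names and cities that was used):

```python
# Illustrative subset of the lookup table: country names (English and native)
# and major cities, all mapped to a country.
LOCATION_TABLE = {
    "germany": "Germany", "deutschland": "Germany", "berlin": "Germany",
    "poland": "Poland", "polska": "Poland", "warszawa": "Poland",
    "sweden": "Sweden", "sverige": "Sweden", "stockholm": "Sweden",
}

def user_country(profile_location):
    """Guess a user's country from the free-text location field, or return None."""
    if not profile_location:
        return None
    text = profile_location.lower()
    for key, country in LOCATION_TABLE.items():
        if key in text:
            return country
    return None            # unrecognized format (non-Latin script, small town, ...)

print(user_country("Berlin, Germany"))   # Germany
print(user_country("somewhere nice"))    # None
```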

Given all this, I will be very interested to see how good the prediction is, both in the semi-finals and in the final. It is encouraging that 9/10 countries were successfully predicted to advance from the semi-finals to the final. Have a great Eurovision night on May 23!

Script used to reduce the data (Wolfram Mathematica; not user-friendly and not usable without the data, given only as an example)

Slightly higher-quality versions of the figures

 

 

Changing world of The Big Bang Theory show


[Figure BigBangTheoryPlot: word frequencies across seasons of The Big Bang Theory]

 

In the figure above we can see the frequency of words mentioned in different seasons of The Big Bang Theory. These are “unique” mentions, in the sense that they only count in how many episodes a word appeared (at most once per episode) and not how many mentions there were in total (e.g. if the name “Penny” is mentioned 10 times in one episode, it is still counted as one mention). All of the lines have been normalized with respect to season 1. One can clearly see the transition in season 4, before which the male protagonists are mainly bachelors and after which they become more successful with members of the opposite gender. Apart from there being more female characters in the show, the show is also more focused on dating, while the traditional occupations of the male protagonists, research and comic book reading, seem to suffer.

(see also the interesting discussion that developed on reddit)
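For completeness, a small Python sketch of how such “unique” per-episode mentions can be counted and normalized to season 1; the transcript data structure and the word list are assumptions for illustration only:

```python
import re

def unique_mentions(transcripts, words):
    """For each word, count in how many episodes of each season it appears at least
    once, normalized to season 1. `transcripts` maps season -> list of episode texts."""
    counts = {w: {s: sum(1 for ep in eps if re.search(r"\b" + w + r"\b", ep, re.I))
                  for s, eps in transcripts.items()}
              for w in words}
    return {w: {s: c / max(counts[w][1], 1) for s, c in per_season.items()}
            for w, per_season in counts.items()}

# Tiny made-up example; the real input would be the full episode transcripts.
transcripts = {1: ["Penny knocks.", "Sheldon reads a comic book.", "Penny again."],
               4: ["Penny and Amy go on a date.", "A date, not a comic book."]}
print(unique_mentions(transcripts, ["penny", "comic", "date"]))
```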