Predicting Eurovision 2016 from Twitter data…

…and the winner is…. Russia!!! (maybe)



For an explanation of how the figure was created, see the text below…

This is the 2016 version of the Eurovision prediction. I explained the methodology in quite detailed fashion in last year's post, which you can find here. Very shortly, I measured how many tweets were sent about each song from each country. From this, I estimated the number of votes that each country would give to every other country. For example, if Germans tweet the most about the Polish song, I assume that Germany will give Poland 12 points. Notice that this is very different from simply collecting all the tweets and measuring which song was tweeted about the most – that would be heavily biased toward countries that use Twitter the most; these measurements are normalized per country.

Even though this is a very crude estimate and the possible caveats are numerous, last year's winner was correctly predicted and overall the prediction was quite good (see here for a comparison of the prediction with the actual results). If you are confused by the number of points: the totals are much larger than in previous years because of the change to the voting rules.
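The per-country tallying and point awarding described above can be sketched as follows. This is a minimal illustration with made-up tweet counts, not the actual script (which was written in Wolfram Mathematica):

```python
from collections import Counter

# Eurovision-style points awarded to a voting country's top-10 songs
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def award_points(tweets_by_country):
    """tweets_by_country maps a voting country to a Counter of
    (song country -> number of tweets from that voting country).
    Normalizing per voting country means each country hands out the
    same fixed point ladder, regardless of its total tweet volume."""
    totals = Counter()
    for voter, counts in tweets_by_country.items():
        ranked = [c for c, _ in counts.most_common() if c != voter]  # no self-votes
        for song, pts in zip(ranked, POINTS):
            totals[song] += pts
    return totals

# Hypothetical example: Germany tweets most about Poland, so Poland gets 12
tallies = award_points({
    "Germany": Counter({"Poland": 40, "Sweden": 25, "Russia": 10}),
    "France":  Counter({"Russia": 30, "Poland": 12}),
})
```

With these toy counts Poland collects 12 points from Germany and 10 from France, so the skew toward heavy-Twitter countries is removed: each voting country contributes the same total number of points.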

Below I am showing some other interesting plots. First, we can see the time dependence of tweets during the semi-finals. Notice how you can see precisely when each country is performing. You can even see when the breaks in the programme are, as well as the beginning of the voting (around 1.6 hours after the start of the programme) and the announcement of the results (the bump at 2 hours).



Even though the semi-finals have finished, only the countries advancing to the final are known, not their semi-final scores. Below I show the predicted number of points in the semi-finals (you can compare it after Eurovision is finished and these results are made public). Colors are the same for the same countries as above. Because there are so many countries, unfortunately some colors have to repeat, but the country is always clearly stated below. In the first semi-final the algorithm correctly predicts 9 out of 10 countries that advanced to the final (although it fails spectacularly for Estonia, but shh… just a reminder that this is only an estimate). For the second semi-final we seem to do better, again predicting 9 out of 10 countries, but without catastrophic failures. This already gives us confidence in our results.



Finally, here I show a prediction derived from a slightly different dataset. In the plot shown at the top, I combined tweets that use the hashtag of a country (e.g. #POL) with tweets that mention the name of the country in English (e.g. Poland) together with the hashtag #EUROVISION. This could potentially bias against certain countries (e.g. Russia, which gets a lot of support from nearby, predominantly Russian-speaking countries). On the other hand, using only the hashtag of a country reduces our dataset by roughly 50%. For comparison, the plot derived using only data collected via country hashtags is shown below. As you can see, the result differs in details, but the basic trends are the same. Results from this plot seem to strengthen the first position of Russia.


Refugees/Migrants along the West Balkan route



As Germany prepares to register its millionth refugee, I discovered that UNHCR maintains a great website visualizing the flow of refugees/migrants throughout Europe, along with graphs showing the number of arrivals in each country. I combined data for all of the countries along the West Balkan route, which enables us to track time delays as people arrive from one country to another.

A seemingly distant event, such as a shipping strike in Greece, sends reverberations along the whole route, creating a short dip in arrivals and then a spike as the bottleneck is released. Another significant event was the closing of the border by Hungary, which immediately resulted in the movement of refugees/migrants through Slovenia, which had not been part of the route until that point. Movement can be traced much better in the later months (e.g. notice how well traced is the last peak, which starts in Macedonia in early December and ends in Austria just a few days later). I assume that this is because the methods of accepting and transporting refugees/migrants, and general organization levels, are much higher now than at the beginning of the fall.
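One simple way to quantify such a delay between two countries (my own illustrative sketch, not necessarily how the figure was read off) is to slide one daily-arrival series against the other and keep the shift with the best overlap, i.e. a basic cross-correlation:

```python
# Estimate the delay (in days) between two countries' daily-arrival
# series by finding the shift that maximizes their overlap.

def best_lag(series_a, series_b, max_lag=14):
    """Return the non-negative lag at which series_b best matches series_a."""
    def score(lag):
        pairs = [(a, series_b[i + lag]) for i, a in enumerate(series_a)
                 if 0 <= i + lag < len(series_b)]
        return sum(a * b for a, b in pairs)
    return max(range(0, max_lag + 1), key=score)

# Synthetic example: the same arrival peak appears 3 days later downstream
macedonia = [0, 0, 5, 90, 40, 5, 0, 0, 0, 0]
austria   = [0, 0, 0, 0, 0, 5, 90, 40, 5, 0]
```

On these synthetic series, `best_lag(macedonia, austria)` recovers the 3-day travel time between the two ends of the toy route.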

Data sources:
Croatia: Croatian Police
Slovenia: Slovenian Police
Hungary: Hungarian Police
Greece, Macedonia, Serbia, Austria: UNHCR

What is the area that the Hubble Space Telescope has covered?

Today I got intrigued by the following tweet.

[Screenshot of a tweet by Knud Jahnke (@KnudJahnke) on Twitter]

Hm… So what is the area that Hubble has actually covered in its long life? Nothing that a little bit of data cannot solve!

To find out, I queried the Hubble database with 140000 randomly selected positions on the celestial sphere and searched whether the center of a Hubble image, taken by one of its large field-of-view cameras, lies within a radius of 1.6 arcmin. If we find a match in the database, we proclaim that area “covered”, i.e. Hubble has observed it. We are not interested in multiple observations, dithering pointings, etc., so multiple observations of the same area count as one observation. The radius of 1.6 arcmin is chosen so that the search area corresponds to the 170×170 arcsec field of view that the Hubble cameras (WFPC, WFPC2, WFC3, ACS) have or have had. For instance, ACS has a 202×202 arcsec field of view, while WFPC2 had 164×164 arcsec. But, given that we do not know whether the observers perhaps used smaller fields of view (they could, for instance, have used the near-infrared channel on WFC3; information about the mode of observation is not so readily available), it is better to be a bit conservative in our estimate.

Out of 140000 positions on the sky, 182 produced a match, i.e. Hubble has covered around 182/140000 of the sky with imaging, or roughly 50 square degrees (for comparison, the Moon takes up about 0.2 square degrees on the sky).
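The sampling and the final arithmetic can be sketched as below. The real coverage test was a query against the Hubble archive; here I only show how to draw points uniformly on the sphere (naively drawing RA and Dec uniformly would overweight the poles) and how the matched fraction converts to an area:

```python
import math
import random

# Total area of the celestial sphere in square degrees (~41253 deg^2)
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2

def random_sky_position(rng=random):
    """Uniform point on the celestial sphere: RA uniform in [0, 360),
    Dec drawn uniformly in sin(Dec) so area elements are equally likely."""
    ra = rng.uniform(0, 360)
    dec = math.degrees(math.asin(rng.uniform(-1, 1)))
    return ra, dec

# With 182 matches out of 140000 trials, the covered fraction times the
# full sky gives the estimated imaged area.
area = 182 / 140000 * FULL_SKY_DEG2  # about 54 square degrees
```

The binomial uncertainty on 182 matches is roughly √182 ≈ 13, so the estimate is good to about 7%.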

Some of the found positions. This is with a bigger search radius than in the text (5 arcmin). The only overdensity I recognize is the COSMOS field at RA = 10 and Dec = 2 degrees (a very small cluster of points just a little to the left of the dashed line at that position).


Pokerstars rakeback in 2016

Rakeback, Pokerstars 2016

Explanation of the new system:
Table of levels:
Code to generate plot (not user-friendly, Wolfram Mathematica):

There are several reasons why I did not attempt to make comparisons with the old (still current in December 2015) system. Basically, it is too complicated to make a fair comparison given the complexity of the old system and the differences between the old system and the new one. The major issues are:

1. Should one assume that the player will exchange FPPs for cash or for more optimal tournament buy-ins? What is the optimal way to exchange? If a player is exchanging for cash, which rate should one apply if the player advances a level while playing?
2. How to count Milestone awards? These are earned over multiple months, so it is unclear how to add them together and make a fair comparison on a month-to-month basis…

Given these complications, and possibly others, it is not clear to me how to make a fair comparison.

I hope that this will still be useful for getting the most out of your rakeback!

Global warming, yet another perspective

Temperature extremes in Europe

Just a few days ago, the highest temperature ever was recorded in Germany, as Europe swelters this summer. The figure above shows us once again that the world is warming up. In equilibrium one would expect the dates of temperature minima and maxima to be randomly distributed, i.e. that for approximately half of the countries the temperature extreme that happened last was the maximum, and for half of them it was the minimum.

Well, for Europe that is certainly not so. Out of the 31 countries with recorded dates of temperature extremes, 26 experienced their temperature maximum last and for only 5 it was the temperature minimum. Also, for only one of these countries was the temperature minimum set in the last 10 years (Italy), while the temperature maximum was set in the last 10 years for 13(+1) countries (Austria, Belarus, Cyprus, Czech Republic, Finland, Germany, Hungary, (Italy, but before its minimum), Latvia, Macedonia, Serbia, Slovakia, Slovenia and Ukraine).
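How unlikely is such an imbalance under the equilibrium expectation? A quick binomial check (my own addition, not part of the original post): if each country were equally likely to have its maximum or minimum come last, the chance of 26 or more maxima out of 31 is

```python
from math import comb

# Probability of 26 or more "maximum came last" outcomes out of 31
# countries if each outcome were an independent fair coin flip.
p = sum(comb(31, k) for k in range(26, 32)) / 2 ** 31
# p is about 1e-4, so the observed imbalance is very unlikely to be chance
```

(The country outcomes are not truly independent since neighbors share weather, so this is only a rough bound, but the signal is far from a coin flip.)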

Twitch chat and viewers during the last day of The International 5 


This Saturday the largest event in e-sports history took place. The International 5, a Dota 2 tournament, had a prize pool of 18.6 million dollars (for comparison, a single Wimbledon tournament has a prize pool of around 16 million). Last year (when the prize pool was around 10 million dollars) more than 20 million people tuned in to watch some part of the tournament. One of the channels for viewing the final day of the tournament was Twitch. In the figure above we can see the evolution of the number of viewers and the number of chat messages sent during the last day on the main English-speaking channel.

We can nicely see the build-up of viewers and the drop after each game ends (the ends of games are shown with dashed lines). We see a prolonged dip in the number of viewers while the teams take a break (around 14 to 15 PST) and then the final drop after the finals end. Twitch chat reacts most strongly to the ends of games that the American team (EG) wins (for instance, notice that there is no spike during “event 6”, after the loss). Individual spikes are often associated with good plays in the game, but also with whatever of interest is happening on screen. For instance, notice the strong spike during the final stages of the event, during the DJ set, which was poorly received; even though people are leaving, Twitch chat actually intensifies.

Some users tweet more than others… much more



I have recently compiled a database with some interesting Twitter stats (you can also access the raw data here). This is one result which I found really intriguing and which reminded me of this classic video showing economic inequality in America; the Twitter landscape is very uneven, with a small number of users generating a huge fraction of all tweets. In the figure above we can see that just 1% of users generate 60% of all tweets, while the top 0.1% of users alone are responsible for around 19% of all tweets. You can access the script which was used to make this plot here (Wolfram Mathematica).
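The underlying computation is simple: sort users by tweet count and read off the share contributed by the top percentile. A minimal sketch with a toy population (the real numbers above come from the compiled database):

```python
# Share of all tweets produced by the top `frac` of users, given a
# list of per-user tweet counts.

def top_share(counts, frac):
    """Fraction of total tweets contributed by the top `frac` of users."""
    counts = sorted(counts, reverse=True)
    n_top = max(1, int(len(counts) * frac))
    return sum(counts[:n_top]) / sum(counts)

# Skewed toy population: one prolific user among many quiet ones
counts = [1000] + [1] * 99
```

Here `top_share(counts, 0.01)` is about 0.91, i.e. the single most active user out of 100 produces over 90% of the toy traffic; for a perfectly equal population it would return exactly `frac`.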

Prediction = Great Success!

…and the winner is…. SWEDEN!!!  (actually)

On Saturday morning I posted this analysis, which tried to predict the winner of Eurovision from Twitter activity during the semi-finals. Its prediction was that Sweden was going to win. That part was right. In the figure below we can see how well the prediction did for all of the contestants. The size of a point is proportional to the number of points a country won, and the color denotes by how much the prediction was wrong.


In general I under-predicted the number of points for the best countries and over-estimated the number of points for countries further back. The point for Cyprus is not shown as it is quite far off (at 4.8). But all together, I am amazed how well the prediction worked given the simplicity of the assumptions. For 4 countries the estimate was exactly correct (from random sampling one would expect 0.5 countries to be correct), for 7 the position was either correct or off by only one (random sampling would produce only 2 such hits), and for 13 the estimate was within 3 positions of the correct one (random sampling would produce 6.5). Below are also the equivalent figures for both semi-finals. For semi-final 2 the estimate is almost amazingly correct!



Predicting Eurovision 2015 from Twitter data…


…and the winner is…. SWEDEN!!!  (maybe)


For an explanation of how the figure was created, see the wall of text below…

Eurovision actively encourages viewers to tweet about the songs. Hashtags are prominently displayed during broadcasts, and one can easily see that there is a lot of buzz about Eurovision on Twitter, which is a great platform for this kind of event. I want to see how well one can predict the final result of Eurovision by following which songs create more traffic on Twitter.

After downloading the Twitter data, querying for Eurovision hashtags during the semi-final broadcasts, we can first observe the temporal variation of the different hashtags during the first Eurovision semi-final.


One can actually observe the order of the songs! Also noticeable are the peak (at 1.5 hours) when the voting started and the peak when the results are announced (around 2 h). The reason behind the sharp peak of #NED at the beginning is unclear to me. I recommend clicking on the figure to enlarge it so you can actually see something.
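The kind of binned time series behind this figure can be sketched as follows (toy timestamps for illustration; the real input is the stream of tweets collected during the broadcast):

```python
from collections import Counter

# Bin tweet timestamps (hours since broadcast start) per hashtag, to
# build the per-hashtag time series shown in the figure.

def bin_hashtags(tweets, bin_width=0.1):
    """tweets: iterable of (hours_since_start, hashtag) pairs.
    Returns {hashtag: Counter mapping bin index -> tweet count}."""
    series = {}
    for t, tag in tweets:
        series.setdefault(tag, Counter())[int(t / bin_width)] += 1
    return series

# Toy data: an early #NED burst and a #SWE burst around the voting
tweets = [(0.02, "#NED"), (0.03, "#NED"), (1.52, "#SWE"), (1.55, "#SWE")]
series = bin_hashtags(tweets)
```

Plotting each hashtag's counter against the bin index then reproduces the stacked traces, with song order, breaks, and the voting peak visible as bumps in the respective series.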


A similar result can be seen for the 2nd semi-final. Interestingly, one can already see that Sweden is faring much better and creating a lot more excitement than the other entries (for instance during the voting, but there is even a slight bump at the beginning).

After this I separate the tweets by their country of origin and see which hashtag got the most affection from the users in each country. I then assume that the number of tweets which the different songs receive is proportional to their popularity, and award them points following the Eurovision point system. Below is an example for Germany in semi-final 2. Colors for the countries are the same as in the figure above.

So, Sweden got the most attention from German Twitter users, and so I award them 12 points. Israel gets 10, Norway 8, Slovenia 7, and so on. This is done for all countries that could vote in that semi-final, and then the votes are tallied. This gives us our first prediction: the number of points that each country received in the semi-finals (note that although the semi-finals are finished, it is not known how many points the countries received; this will only be known after the final finishes).


Actually, we do have some handle on how well the countries did. Only the top 10 countries from each semi-final have qualified! In bold I denote the countries which actually qualified for the final, and the dashed line represents the “cut-off” at position 10. In both cases, 9 out of 10 estimates are correct! Moreover, the estimates which are not correct are actually at position 10, right at the edge. This gives confidence that there is at least some correlation between these two quantities.

Finally, we want to estimate the final score. For each country I combine the results from the two semi-finals. This is done by noting what fraction of tweets each country received in its semi-final. Using German Twitter users again as an example: in the second semi-final the most popular hashtag was #swe, which received 11% of all tweets made by German users, while in the first semi-final it was #bel, which took 8.4% of all German tweets. In this case, Sweden gets 12 points from Germany and Belgium gets 10. The same procedure is done for all countries, the results are summed, and the first figure of the post is produced.
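The merging step above can be sketched like this. For each voting country, songs from both semi-finals are pooled and ranked by the fraction of that country's tweets they received in their own semi-final; the point ladder is then walked down the combined ranking (toy fractions for illustration):

```python
# Combine two semi-finals for one voting country: pool the songs,
# rank them by tweet fraction, and hand out Eurovision points.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def combined_points(frac_semi1, frac_semi2):
    """frac_semi1/2 map hashtag -> fraction of this country's tweets
    in the respective semi-final. Returns hashtag -> points awarded."""
    merged = {**frac_semi1, **frac_semi2}
    ranked = sorted(merged, key=merged.get, reverse=True)
    return dict(zip(ranked, POINTS))

# German fractions from the example: #swe at 11% in semi 2, #bel at 8.4% in semi 1
pts = combined_points({"bel": 0.084}, {"swe": 0.11})
```

With these inputs Sweden gets Germany's 12 points and Belgium the 10, matching the worked example; repeating this for every voting country and summing gives the final-score prediction.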


A few words of caveat are in order.

Obviously, we have no information about the countries which do not take part in the semi-finals. To predict the final number of points I have removed 7/27 of the votes from the final result (i.e. assuming that the 7 countries about which we have no information will give a mean number of votes). Secondly, the implicit assumption is that the number of tweets is representative of the number of votes a country will receive. Even assuming that Twitter users are a fair representation of the voting population, most countries use a 50-50 system in which half of the votes are contributed by a jury. Thirdly, the countries of origin of the tweets are determined from the location that users have provided to Twitter. This location was cross-matched against names of countries (in English and in the native language) and a list of major cities. This potentially creates some noise and definitely destroyed a lot of signal, as many users do not give their location in a format I recognized (e.g. non-Latin script or a small town). Twitter officially supports geo-locating by latitude/longitude, which would resolve a large part of this problem, but (after a lot of frustration I discovered that) this feature is broken in the querying mode at the moment.
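The location cross-match can be sketched as a simple substring lookup. The table here is a tiny illustrative stub; the real list contained country names in English and the native language plus major cities:

```python
# Toy version of mapping a free-text Twitter profile location to a
# country. Real profiles are messy ("Berlin, Deutschland", "warsaw!!"),
# hence the lowercase substring matching.

LOOKUP = {
    "germany": "Germany", "deutschland": "Germany", "berlin": "Germany",
    "poland": "Poland", "polska": "Poland", "warsaw": "Poland",
}

def country_of(location):
    """Return the matched country, or None if the location is unrecognized."""
    loc = location.lower()
    for key, country in LOOKUP.items():
        if key in loc:
            return country
    return None
```

Anything that matches no entry (non-Latin script, small towns, jokes) is dropped, which is exactly the signal loss mentioned above.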

Given all this, I will be very interested to see how good the prediction is, both in the semi-finals and in the final. It is encouraging to see that 9/10 countries were successfully predicted to advance from the semi-finals to the final. Have a great Eurovision night on May 23!

Script used to reduce the data (Wolfram Mathematica, not user-friendly, and not usable without the data; given as an example)

Slightly higher quality versions of the figures



The changing world of The Big Bang Theory show



In the figure above we can see the frequency of words mentioned in different seasons of The Big Bang Theory. These are “unique” mentions, in the sense that they count only the number of episodes in which a word appeared, not the total number of mentions (e.g. if the name “Penny” is mentioned 10 times in one episode, it still counts as one mention). All of the lines have been normalized with respect to season 1. One can clearly see the transition in season 4, before which the male protagonists are mainly bachelors and after which they become more successful with members of the opposite gender. Apart from there being more female characters, the show is also more focused on dating, while the traditional occupations of the male protagonists, research and comic book reading, seem to suffer.
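The "unique mention" counting can be sketched as below, with toy episode texts standing in for the real transcripts:

```python
# Count "unique" mentions per season: the number of episodes in which a
# word appears at least once, then normalize to season 1.

def unique_mentions(seasons, word):
    """seasons: {season number: [episode text, ...]}.
    Returns {season: number of episodes containing the word}."""
    word = word.lower()
    return {s: sum(word in ep.lower() for ep in eps)
            for s, eps in seasons.items()}

def normalized(seasons, word):
    """Unique mentions relative to season 1."""
    raw = unique_mentions(seasons, word)
    base = raw[1] or 1  # guard against a word absent from season 1
    return {s: c / base for s, c in raw.items()}

# Toy transcripts: "Penny" appears in 2 of 3 episodes of each season,
# regardless of how often it is repeated within an episode
seasons = {
    1: ["Penny knocks.", "Sheldon talks.", "Penny Penny Penny"],
    4: ["Penny dates.", "Penny again.", "comics"],
}
```

Note that the triple repetition in one season-1 episode still counts as a single mention, which is exactly the "unique" counting described above.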

(see also the interesting discussion that developed on reddit)


Our daily Vox Charta continued… which topics to discuss and how to get a lot of votes

Common wisdom in astronomy circles is that Vox Charta represents the biased view of an astronomy community focused towards extragalactic topics. Let’s see how much truth there is in that statement.



Papers that contain keywords connected with galaxies and cosmology do indeed seem to be upvoted more often than papers connected with other fields. The dashed line is the 1:1 correspondence, on which we would expect the points to lie. Points above it are upvoted more (have a larger share of Vox Charta votes than one would expect from their numbers), while points below the line are underrepresented on Vox Charta. For instance, we see that papers with stellar keywords received less than half of the votes received by the galaxy papers.



A different way to convey very similar information is shown in the figure above, using cumulative distribution functions. Lines close to the top of the figure denote low numbers of votes (a large number of papers receiving few votes); galaxy and cosmology papers are obviously receiving larger numbers of votes across the board. 50% of the papers containing galaxy or cosmology keywords have at least one vote. We can also see that almost all of the most upvoted papers (25+ votes) concern galaxy and cosmology topics.


Ok, so if your life goal is for your papers to have many Vox Charta votes, you had better work on extragalactic topics. It also seems beneficial to have many authors on your papers, as seen in the figure above, which shows the correlation between the number of votes and the number of authors on a paper. I have dashed the area where there are more than 10 papers per point. Beyond that, there are only very few papers in each bin, so any statistical statements are pretty weak.


It also seems good to write a longer abstract, hopefully because the authors have a lot of smart things to say. As before, dashing shows the area with more than 10 papers per point. There seems to be an increase up to around 250 words (the abstract limit for many journals), after which the trend stabilizes and possibly declines.

So, summarizing the conclusions from the first post and this one: to get a lot of votes, work on extragalactic topics, submit your paper so it is on top of the astro-ph list (competition is lowest on Tuesday), get a lot of co-authors, and write a long abstract (possibly also do good science, but this is based only on anecdotal evidence).

Our Daily Vox Charta

Vox Charta has over the last few years become one of the more prominent tools in every astronomer’s arsenal. For those who might be unfamiliar with concepts like Vox Charta and arXiv: very shortly, on the Vox Charta website members of participating academic institutions can “upvote” or “downvote” papers that have appeared on the Internet (arXiv). The idea is that people upvote papers they found interesting and want to talk about at the next discussion session in their department. Everybody can see how many votes a paper has received, so one can easily see which papers are “hottest”, i.e. which have spurred the most interest in the astro community. Let’s see how the number of votes on Vox Charta in 2014 correlates with some other parameters!

Above we see that the position of the paper on the arXiv list strongly correlates with the number of votes above position 20 (lines show a rough broken power-law fit to the data, done by eyeballing). Below position 20 the trend seems to stabilize. Scatter increases at very high numbers simply because there are very few days when 60+ papers are published. Interestingly, the first position does not also mean the largest number of votes. It is important to note that a significant number of papers tend to be first on the list without actually being the first ones submitted after the deadline; they were usually submitted a day or so earlier, and I assume some problem caused them to be published with a delay through moderator action.


Different days of the week spur different numbers of votes. The day with the most activity seems to be Wednesday and the slowest day is Monday. It also seems that astrophysicists like to upvote papers more in the middle of the week. Even though there is some difference, it is only at about the 20% level.


This difference is largely driven by the number of papers that are published each day. Papers published on Tuesday seem to have the lowest number of votes, and Tuesday also seems to be the only significant outlier.


The distribution of votes is highly non-uniform. In the plot above, we show the cumulative distribution of votes that papers receive. One can see, for instance, that almost 40% of papers receive no votes, and around 80% of papers receive 5 or fewer. Having 10 votes already puts a paper in the top 10%, while about 18 votes are needed to break into the top 5%.
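Reading off such percentile thresholds from a list of per-paper vote counts can be sketched as follows (the vote distribution here is a made-up toy, not the 2014 data):

```python
# From a list of per-paper vote counts, find the smallest vote count
# that places a paper in the top `frac` of all papers.

def top_threshold(votes, frac):
    """Vote count of the paper at the `frac` quantile from the top."""
    ranked = sorted(votes, reverse=True)
    return ranked[max(0, int(len(ranked) * frac) - 1)]

# Toy distribution: most papers get few votes, a few get many
votes = [0] * 40 + [1] * 20 + [3] * 20 + [6] * 10 + [12] * 8 + [25] * 2
```

On this toy sample, `top_threshold(votes, 0.10)` returns 12 and `top_threshold(votes, 0.02)` returns 25, i.e. the same kind of "votes needed to break the top X%" statement made above.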



Ok, so if one wants to be at the top of the arXiv list and (perhaps) have a better chance of getting more votes, how quickly should the paper be submitted?

We show three lines representing different speeds of filling up. In blue, results are shown for the 10% of days which reached 20 submitted papers the quickest. In orange the mean is shown, and in green we show results for the slowest 10% of days.

On average, submitting within around 20 seconds after the deadline will secure one of the first five positions. After the initial rush is over, in about 1 minute, things slow down considerably.


Ok, so you want to be first on the list. How quick do you have to be to succeed in that mission? The data show that in order to have a 50% probability of success, the paper has to be registered by arXiv within the first second, and this has no strong dependence on the day of the week when the paper is published. This does not take into account the aforementioned effect that even if you submit first you might not get first place, because of moderator action.


Being in the top 5 is somewhat easier and shows a stronger day dependence. As one can see above, submitting within the first 20 seconds should place the paper in the first 5 positions. Competition is much weaker for Monday and Tuesday submissions than for other days of the week.