Some thoughts about Strength Tests
Hi Everyone,
Since I have been creating strength tests for more than ten years now, those have also been ten years of personal uncertainty, knowing that whenever I reference strength results against ELO I am providing technically incorrect references. The ELO system was created to rate game performance: it stores wins, draws and losses within a pool of results and uses a predictor derived from those stored results to estimate the likelihood of which opponent wins or loses should they meet in a match. New match games are then added to this data pool in order to grow and improve the prediction formula. This is specifically explained in the Wikipedia article:

https://en.wikipedia.org/wiki/Elo_rating_system

Here is a summary of some key Wikipedia points:

"Elo ratings are comparative only and are valid only within the rating pool in which they were calculated, rather than being an absolute measure of a player's strength. Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player's true skill as the mean of that player's performance random variable. A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill. Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, they are assumed to have performed at a higher level than their opponent for that game. Conversely, if the player loses, they are assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.
To simplify computation even further, Elo proposed a straightforward method of estimating the variables in his model (i.e., the true skill of each player). One could calculate relatively easily from tables how many games players would be expected to win based on comparisons of their ratings to those of their opponents. The ratings of a player who won more games than expected would be adjusted upward, while those of a player who won fewer than expected would be adjusted downward. Moreover, that adjustment was to be in linear proportion to the number of wins by which the player had exceeded or fallen short of their expected number.

From a modern perspective, Elo's simplifying assumptions are not necessary because computing power is inexpensive and widely available. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the Elo system has proven to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what their next officially published rating will be, which helps promote a perception that the ratings are fair.

The phrase "Elo rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage may be confusing or misleading because Elo's general ideas have been adopted by many organizations, including the USCF (before FIDE), many other national chess federations, the short-lived Professional Chess Association (PCA), and online chess servers including the Internet Chess Club (ICC), Free Internet Chess Server (FICS), Lichess, Chess.com, and Yahoo! Games. Each organization has a unique implementation, and none of them follows Elo's original suggestions precisely. Instead one may refer to the organization granting the rating. For example: "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742."
The Elo ratings of these various organizations are not always directly comparable, since Elo ratings measure the results within a closed pool of players rather than absolute skill."

Everything written in the above summary is in my opinion correct, except for one key assumption which is nowadays completely outdated:

"A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill."

This may have been true 50-70 years ago, when computing power did not exist, but nowadays this statement is incorrect. Several chess websites provide strength analysis, and ever since the very beginnings of chess programs, move strength has been calculated and accepted as a method to analyze the strength of the moves played in a game. Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of, for example, Wilhelm Steinitz or Garry Kasparov, plug it into the master computer program and get the strength results, overall and year by year, for each and every player, and accurately compare the evolving improvement of chess knowledge throughout history for players and computer programs. You will someday be able to track it, similar to what Eric (Tibono) recently did with his King Performance testing of the Easy and Fun levels: you just replace the chart axis with the year, and you could replace Fun with Human and Easy with Computer, for example. You will probably also be able to distinguish between learned theory (brain memory) and calculation (brain calculation) and rate both.

The EDOChess website (http://www.edochess.ca/index.html) similarly acknowledges that using the name ELO would be conflicting, which is why he uses EDO; the ELO system did not exist in the years preceding 1970. He also does a masterful job of developing his EDO ratings year by year for historical players.
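As background for the discussion, the rating mechanics summarized in the Wikipedia quote above can be sketched in a few lines of Python. This is only the textbook logistic form with an illustrative K-factor of 20; real federations use their own expectancy tables and K-factor rules:

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=20):
    """A's new rating after one game (score_a is 1, 0.5 or 0).

    The adjustment is linear in the gap between the actual and the
    expected score, exactly as the quoted passage describes.
    """
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))
```

For example, two equally rated players have an expected score of 0.5 each, so a win moves the winner up by half the K-factor. Note that only the result enters the update; the moves themselves never do, which is the point being debated here.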
I think he uses, in part, the Bayesian ELO system to get his ratings. He also acknowledges an obvious problem with most of today's ELO rating systems:

"Arpad Elo put ratings on the map when he introduced his rating system first in the United States in 1960 and then internationally in 1970. There were, of course, earlier rating systems but Elo was the first to attempt to put them on a sound statistical footing. Richard Eales says in Chess: The History of a Game (1985) regarding Murray's definitive 1913 volume, A History of Chess that "The very excellence of his work has had a dampening effect on the subject," since historians felt that Murray had had the last word. The same could be said of Elo and his contribution to rating theory and practice. However, Elo, like Murray, is not perfect and there are many reasons for exploring improvements. The most obvious at present is the steady inflation in international Elo ratings [though this should probably not be blamed on Elo, as the inflation started after Elo stopped being in charge of F.I.D.E.'s ratings (added Jan. 2010)]. Another is that the requirements of a rating system for updating current players' ratings on a day-to-day basis are different from those of a rating system for players in some historical epoch. Retroactive rating is a different enterprise than the updating of current ratings. In fact, when Elo attempted to calculate ratings of players in history, he did not use the Elo rating system at all! Instead, he applied an iterative method to tournament and match results over five-year periods to get what are essentially performance ratings for each period and then smoothed the resulting ratings over time. This procedure and its results are summarized in his book, The Rating of Chessplayers Past and Present (1978), though neither the actual method of calculation nor full results are laid out in detail.
We get only a list of peak ratings of 475 players and a series of graphs indicating the ratings over time of a few of the strongest players, done by fitting a smooth curve to the annual ratings of players with results over a long enough period. When it came to initializing the rating of modern players, Elo collected results of international events over the period 1966-1969 and applied a similar iterative method. Only then could the updating system everyone knows come into effect - there had to be a set of ratings to start from. Iterative methods rate a pool of players simultaneously, rather than adjusting each individual player's rating sequentially, after each event or rating period. The idea is to find the set of ratings for which the observed results are collectively most likely. But they are basically applicable only to static comparisons, giving the most accurate assessment possible of relative strengths at a given time. Elo's idea in his historical rating attempt was to smooth these static annual ratings over time. While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman, a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad. By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time. There are pros and cons to all of these." 
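The "iterative method" the quoted passage describes, rating a whole pool of players simultaneously so that the observed results are collectively most likely, can be illustrated with a toy sketch. This is my own minimal illustration, not Elo's or EDOChess's actual procedure; the convergence scheme, learning rate and player data are all made up for demonstration:

```python
def elo_expected(ra, rb):
    # Standard logistic expectation used by Elo-style systems
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def rate_pool(results, iterations=2000, lr=8.0):
    """Rate a closed pool of players simultaneously.

    results: list of (player_a, player_b, score_for_a), score 1, 0.5 or 0.
    All ratings start at 1500 and are nudged together, pass after pass,
    until each player's expected total score matches the observed total,
    i.e. a static 'most likely' set of relative ratings.
    """
    ratings = {}
    for a, b, _ in results:
        ratings.setdefault(a, 1500.0)
        ratings.setdefault(b, 1500.0)
    for _ in range(iterations):
        surplus = {p: 0.0 for p in ratings}   # actual minus expected score
        for a, b, s in results:
            e = elo_expected(ratings[a], ratings[b])
            surplus[a] += s - e
            surplus[b] += (1 - s) - (1 - e)
        for p in ratings:
            ratings[p] += lr * surplus[p]
        # Ratings are only meaningful relative to the pool,
        # so re-center the pool mean at 1500 each pass.
        mean = sum(ratings.values()) / len(ratings)
        for p in ratings:
            ratings[p] += 1500.0 - mean
    return ratings
```

With made-up data where player A scores 2.5/3 against B, the converged gap between them is about 280 points (the logistic inverse of an expected score of 5/6), which matches what a static performance-rating calculation would give.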
Therefore, in summary, I am ever more convinced that the future can only be along the lines of calculating strength within a pool of games, as this would be the only way to accurately calculate and chart the strength of chess and its progression throughout history. I believe my tests are a small step into this future.

Performance results such as ELO or EDO will also always be relevant, as there is a difference between performance and strength. For example, De La Bourdonnais may have performed at a level of 2600 against his opponents, but the strength of play in the early 1800s may well have been lower than it is today. That is the difference between strength and performance.

In summary, similar to EDOChess, who rates his results with EDO, and since my strength tests are not performance based, I have changed all references from ELO to STR, which is short for STRENGTH or SPACIOUSMIND TESTS RATING, whichever you prefer. My STR ratings formula under Renaissance was of course created to approximate ELO, in order to have a fun comparison. The exact same calculations will be used for all future tests, and the test results may well differ throughout the history of the games; however, the constant will always remain the same, which is the distance between the Master and its subjects, game by game and on average. In my tests I have changed ELO Sneak to STR Sneak, which you will see in future updates that I will provide for download.

I know this is a lot of reading, but I would love to hear from you, to see if my assumptions about the future are sound or if you disagree or see a different future.

Best regards, Nick

ps... I typed this in English as the subject would be too hard for me to do in German.

Edited by spacious_mind (01.06.2024, 20:28)
The following 3 users thanked spacious_mind for this useful post:
RE: Some thoughts about Strength Tests
I like the Elo system a lot; as far as I can judge, the ratings and rankings produced by the Elo system are pretty fair and very useful for making predictions.

"Everything written in the above summary is in my opinion correct, except for one key assumption which is nowadays completely outdated: 'A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill.'"

Why not? Let's have a try! Thanks to your work (and Eric's) it is possible to see the results and do the comparison to reach a final judgement. I do believe that it is also possible to use a large position test (with at least 200 positions and lots of chess computers) in order to get quite reliable ELO ratings, with Elo-Stat by Frank Schubert. But that's another story.

"Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of for example Wilhelm Steinitz or Garry Kasparov, plug it into the master computer program and get the strength results, overall and year by year for each and every player, and accurately compare the evolving improvements of chess knowledge throughout history for players and computer programs."

Might it be possible, some day in the future, to run this over our Schachcomputer.info tournament list...? You need a starting point in "historical epoch elo" for initializing, just as a starting point was needed in modern ELO calculations (the 1966-69 initialization). Where is the difference at the start, other than the smaller database?

"While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman, a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad. By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time. There are pros and cons to all of these."

Many years ago there was an article in CSS about Jeff Sonas; his contribution was not convincing to me. Is the Sonas rating formula better than Elo? No, in my opinion. As far as I know, John Nunn criticized a change Jeff Sonas made concerning the prediction of game results ("Nunn on the K-factor: show me the proof", 2009). As for Mark Glickman: as far as I know his Glicko systems (contrary to the Elo system) are not totally stable; they could be flawed when data is intentionally manipulated, though of course that is not the case here. https://www.reddit.com/r/TheSilphRoa...jor_flaw_in_a/ Is there a Glicko program that can use PGN data for calculating ratings, or are only Excel sheets provided?

"... as there is a difference between performance and strength. For example, De La Bourdonnais may have performed at a level of 2600 against his opponents. But the strength of play in the early 1800s may well have been lower than it is today. This is the difference between strength and performance."

The Elo system only measures performance, nothing else. So in my opinion that is the drawback of your system of using the moves of a game: you are regarding not the performance but only the strength of a person, which is not the aim of the Elo system. To say it more drastically: the strength of a person or computer is totally irrelevant for calculating ELO; only performance matters! Nothing other than 1-0, 1/2-1/2, 0-1, simple as that!

"In summary, similar to EDOChess, who rates his results with EDO, and since my strength tests are not performance based, I have changed all references from ELO to STR, which is short for STRENGTH or SPACIOUSMIND TESTS RATING. Whichever you prefer. ... In my tests I have changed ELO Sneak to STR Sneak, which you will see in future updates that I will provide for download."

You shouldn't call it Elo... STR... or maybe StrElo.

"My STR ratings formula under Renaissance was of course created to approximate ELO, in order to have a fun comparison. The exact same calculations will be used for all future tests, and the test results may well differ throughout the history of the games; however, the constant will always remain the same, which is the distance between the Master and its subjects, game by game and on average."

The main thing after all: it's fun!

Regards Hans-Jürgen
The following 4 users thanked CC 7 for this useful post:
RE: Some thoughts about Strength Tests
"... Several chess websites provide strength analysis and ever since the very beginnings of chess programs, move strength has been calculated and accepted as a method to analyze the strength of moves played in a game. Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of for example Wilhelm Steinitz or Garry Kasparov, plug it into the master computer program and get the strength results, overall and year by year for each and every player, and accurately compare the evolving improvements of chess knowledge throughout history for players and computer programs."

Maybe this future is very near, please have a look at: https://www.chessmonitor.com/? Definitely worth a try (there is a free version), and you can even make suggestions to improve the features of this program. Has anyone already tested ChessMonitor for this purpose?

Regards Hans-Jürgen
The following 2 users thanked CC 7 for this useful post:
Egbert (15.06.2024), spacious_mind (15.06.2024)
RE: Some thoughts about Strength Tests
"Maybe this future is very near, please have a look at: https://www.chessmonitor.com/? Definitely worth a try (there is a free version), and you can even make suggestions to improve the features of this program. Has anyone already tested ChessMonitor for this purpose? Regards Hans-Jürgen"

You got me very happy and excited for a moment there. But unfortunately, when you look at it more closely, it seems to be performance based again: it collects all games played and rates the performance ELOs against the opponents' ELOs. Unless I am missing something when looking at the website?

Best regards Nick
RE: Some thoughts about Strength Tests
Hi Hans-Jürgen
"You got me very happy and excited for a moment there. But unfortunately, when you look at it more closely, it seems to be performance based again: it collects all games played and rates the performance ELOs against the opponents' ELOs. Unless I am missing something when looking at the website? Best regards Nick"

Right, at the moment it is performance based. You could make a suggestion and ask whether it would be possible to create a tool that is not performance based but built according to your needs. Why not ask for such a new tool? What do you (and they) think about it?

Regards Hans-Jürgen
RE: Some thoughts about Strength Tests
Let's think about what would be needed or considered:

1) It is unlikely that today's hardware would be quick enough to go deep enough to identify the best moves. For this to be interesting, the game analysis needs to be fairly quick, otherwise you have problems with impatience. So do you take, for example, 1 minute to evaluate a complete game today, and how good is that?

2) In order to do this, I would want a clear distinction between theory strength and brain-power strength. An opening, for example, is theory. So a rule could be put in place where you have the database and each move is matched against it: if x number of previous games had played that move, then it falls under theory.

3) Strength begins where theory ends. Since each game in the database has its year attached, that is also possible for historical games: you match the year played against the current year and previous years.

4) Endgame strength... you could, for example, create a rule that the endgame begins by counting empty squares, i.e. if there are 49 empty squares, the endgame begins. Make that a standard, as an example.

So, in summary, if something like this could be created, then you would have a theory rating, a middle game strength rating, an endgame strength rating, and a combined rating.

Just some thoughts, regards Nick
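The empty-square rule in point 4) is simple enough to sketch. Here is a minimal illustration that classifies a position from its FEN string using the proposed threshold (the 49-square cutoff is just the example figure from the post, i.e. the endgame would begin once 15 or fewer pieces remain):

```python
def game_phase(fen, endgame_empty_squares=49):
    """Classify a position as 'endgame' or 'middlegame' by the proposed
    rule: the endgame begins once the board has at least the given
    number of empty squares (49 is the example threshold from the post).
    """
    placement = fen.split()[0]                 # piece-placement field of the FEN
    pieces = sum(c.isalpha() for c in placement)  # every letter is one piece
    empty = 64 - pieces
    return "endgame" if empty >= endgame_empty_squares else "middlegame"
```

By this rule the starting position (32 empty squares) is a middlegame, while a king-and-pawn ending (61 empty squares) is an endgame. A piece-count or material-count threshold would be an obvious alternative standard; the point is only that any fixed rule makes the phase boundary reproducible across tests.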