Schachcomputer.info Community

01.06.2024, 14:31
spacious_mind
Some thoughts about Strength Tests

Hi Everyone,

Since I have been creating strength tests for more than ten years now, those ten-plus years have also carried a personal uncertainty: whenever I reference strength results against ELO, I know I am providing technically incorrect references. The ELO system was created to rate game performance by storing wins, draws and losses within a pool of results and using a predictor derived from that pool to estimate the likelihood of one opponent beating the other should they meet in a match. New match games are then added to this data pool in order to grow and improve the prediction formula.

This is explained specifically in the Wikipedia article linked below.

https://en.wikipedia.org/wiki/Elo_rating_system

Here is a summary of some key Wikipedia points:

"Elo ratings are comparative only and are valid only within the rating pool in which they were calculated, rather than being an absolute measure of a player's strength.

Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player's true skill as the mean of that player's performance random variable.

A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill. Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, they are assumed to have performed at a higher level than their opponent for that game. Conversely, if the player loses, they are assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.

To simplify computation even further, Elo proposed a straightforward method of estimating the variables in his model (i.e., the true skill of each player). One could calculate relatively easily from tables how many games players would be expected to win based on comparisons of their ratings to those of their opponents. The ratings of a player who won more games than expected would be adjusted upward, while those of a player who won fewer than expected would be adjusted downward. Moreover, that adjustment was to be in linear proportion to the number of wins by which the player had exceeded or fallen short of their expected number.

From a modern perspective, Elo's simplifying assumptions are not necessary because computing power is inexpensive and widely available. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the Elo system has proven to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what their next officially published rating will be, which helps promote a perception that the ratings are fair.

The phrase "Elo rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage may be confusing or misleading because Elo's general ideas have been adopted by many organizations, including the USCF (before FIDE), many other national chess federations, the short-lived Professional Chess Association (PCA), and online chess servers including the Internet Chess Club (ICC), Free Internet Chess Server (FICS), Lichess, Chess.com, and Yahoo! Games. Each organization has a unique implementation, and none of them follows Elo's original suggestions precisely.

Instead one may refer to the organization granting the rating. For example: "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." The Elo ratings of these various organizations are not always directly comparable, since Elo ratings measure the results within a closed pool of players rather than absolute skill."
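The update rule the quote describes, an adjustment in linear proportion to the difference between actual and expected score, can be sketched as follows. The K-factor of 20 and the ratings are my own illustrative assumptions, not tied to any federation's rules.

```python
# Sketch of the classic Elo expected-score and update formulas described
# above. K-factor and ratings are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """Adjust the rating in linear proportion to (actual - expected)."""
    return rating + k * (actual - expected)

# Example: a 2400-rated player beats a 2500-rated player in a single game.
e = expected_score(2400, 2500)     # roughly 0.36
print(update(2400, e, 1.0))        # roughly 2412.8
```

This is the "pocket calculator" simplicity the quote praises: one table lookup (or one formula) for the expectation, one multiplication for the adjustment.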


Everything written in the above summary is, in my opinion, correct, except for one key assumption, which is nowadays completely outdated:

"A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill."

This may have been true 50-70 years ago, when the necessary computing power did not exist, but nowadays the statement is incorrect. Several chess websites provide strength analysis, and ever since the earliest chess programs, move strength has been calculated and accepted as a way to analyze the strength of the moves played in a game.
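The kind of move-strength analysis meant here is usually reported as average centipawn loss: how much evaluation each played move gives up compared to the engine's preferred move. A minimal sketch, assuming the evaluation pairs have already been obtained from an engine; the numbers below are invented for illustration.

```python
# Hedged sketch of move-strength ("centipawn loss") analysis. The
# evaluation numbers are invented; in practice each pair would come from
# a chess engine (evaluation of the engine's best move vs. evaluation of
# the move actually played), both from the mover's point of view.

def average_centipawn_loss(evals):
    """evals: list of (best_eval, played_eval) pairs in centipawns.
    A lower average loss indicates stronger play."""
    losses = [max(0, best - played) for best, played in evals]
    return sum(losses) / len(losses)

# Hypothetical five-move game: three moves match the engine's choice,
# two lose 50 and 40 centipawns respectively.
game = [(30, 30), (25, 25), (40, -10), (15, 15), (60, 20)]
print(average_centipawn_loss(game))   # 18.0
```

Unlike an ELO result, this number is derived from the moves themselves rather than from wins and losses, which is exactly the point being argued.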

Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of, for example, Wilhelm Steinitz or Garry Kasparov, feed it into a master computer program, and get strength results, overall and year by year, for each and every player, and so accurately compare the evolving improvement of chess knowledge throughout history, for players and computer programs alike. You will someday be able to track it much as Eric (Tibono) recently did with his King Performance testing of the Easy and Fun levels: put the year on the graph's axis, and you could replace Fun with Human and Easy with Computer, for example. You will probably also be able to distinguish between learned theory (brain memory) and calculation (brain calculation) and rate both.

The EDOChess website (http://www.edochess.ca/index.html) similarly acknowledges that using the name ELO would be conflicting; its author uses EDO instead, since the Elo system did not exist in the years before 1970. He also does a masterful job of developing his EDO ratings year by year for historical players. I think he uses, in part, a Bayesian ELO approach to derive his ratings.

He also acknowledges an obvious problem with most of today's ELO rating systems:

"Arpad Elo put ratings on the map when he introduced his rating system first in the United States in 1960 and then internationally in 1970. There were, of course, earlier rating systems but Elo was the first to attempt to put them on a sound statistical footing. Richard Eales says in Chess: The History of a Game (1985) regarding Murray's definitive 1913 volume, A History of Chess that "The very excellence of his work has had a dampening effect on the subject," since historians felt that Murray had had the last word. The same could be said of Elo and his contribution to rating theory and practice. However, Elo, like Murray, is not perfect and there are many reasons for exploring improvements. The most obvious at present is the steady inflation in international Elo ratings [though this should probably not be blamed on Elo, as the inflation started after Elo stopped being in charge of F.I.D.E.'s ratings (added Jan. 2010)]. Another is that the requirements of a rating system for updating current players' ratings on a day-to-day basis are different from those of a rating system for players in some historical epoch. Retroactive rating is a different enterprise than the updating of current ratings.

In fact, when Elo attempted to calculate ratings of players in history, he did not use the Elo rating system at all! Instead, he applied an iterative method to tournament and match results over five-year periods to get what are essentially performance ratings for each period and then smoothed the resulting ratings over time. This procedure and its results are summarized in his book, The Rating of Chessplayers Past and Present (1978), though neither the actual method of calculation nor full results are laid out in detail. We get only a list of peak ratings of 475 players and a series of graphs indicating the ratings over time of a few of the strongest players, done by fitting a smooth curve to the annual ratings of players with results over a long enough period. When it came to initializing the rating of modern players, Elo collected results of international events over the period 1966-1969 and applied a similar iterative method. Only then could the updating system everyone knows come into effect - there had to be a set of ratings to start from.

Iterative methods rate a pool of players simultaneously, rather than adjusting each individual player's rating sequentially, after each event or rating period. The idea is to find the set of ratings for which the observed results are collectively most likely. But they are basically applicable only to static comparisons, giving the most accurate assessment possible of relative strengths at a given time. Elo's idea in his historical rating attempt was to smooth these static annual ratings over time.

While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman, a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad. By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time.

There are pros and cons to all of these."
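The simultaneous, iterative rating idea described in the quote can be sketched roughly as follows. The pool of games, the 1500 starting rating, the step size and the iteration count are my own illustrative assumptions; real systems (Edo, Bayesian Elo, Chessmetrics) use far more careful statistics.

```python
# Rough sketch of simultaneous ("iterative") rating of a whole pool, as
# opposed to sequential updating after each event. All numbers here are
# illustrative assumptions, not any published system's parameters.

def expected(r_a: float, r_b: float) -> float:
    """Expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def iterate_ratings(games, iterations=300, step=10.0):
    """games: list of (player_a, player_b, score_for_a), score in {1, 0.5, 0}.
    Repeatedly nudge every rating until each player's expected total
    matches their actual total across the whole pool at once."""
    players = {p for a, b, _ in games for p in (a, b)}
    ratings = {p: 1500.0 for p in players}
    for _ in range(iterations):
        for p in players:
            actual = sum(s if a == p else 1 - s
                         for a, b, s in games if p in (a, b))
            exp = sum(expected(ratings[p], ratings[b if a == p else a])
                      for a, b, s in games if p in (a, b))
            ratings[p] += step * (actual - exp)
    return ratings

# Tiny invented pool: A beats B twice, B beats C twice, A and C draw once.
pool = [("A", "B", 1), ("A", "B", 1), ("B", "C", 1), ("B", "C", 1), ("A", "C", 0.5)]
r = iterate_ratings(pool)
print(sorted(r, key=r.get, reverse=True))   # ['A', 'B', 'C']
```

The fixed point is the set of ratings under which the observed results are collectively most likely, which is the static, whole-pool assessment the quote contrasts with day-to-day sequential updating.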


Therefore, in summary, I am ever more convinced that the future can only lie along the lines of calculating strength within a pool of games, as this would be the only way to accurately calculate and chart the strength of chess, and its progression, throughout history. I believe my tests are a small step toward this future.

Performance results such as ELO or EDO will also always be relevant, as there is a difference between performance and strength. For example, De La Bourdonnais may have performed at a level of 2600 against his opponents, but the strength of play in the early 1800s was almost certainly lower than it is today. That is the difference between strength and performance.

In summary, similar to EDOChess, who labels his results EDO: since my strength tests are not performance based, I have changed all references from ELO to STR, short for STRENGTH, or SPACIOUSMIND TESTS RATING, whichever you prefer.

My STR rating formula under Renaissance was of course created to approximate ELO, in order to allow a fun comparison. The exact same calculation will be used for all future tests. The test results may well differ throughout the history of the games; however, the constant will always remain the same: the distance between the Master and its subjects, game by game and on average.

In my tests I have changed ELO Sneak to STR Sneak, which you will see in the future updates that I provide for download.

I know this is a lot of reading, but I would love to hear from you, to see whether my assumptions about the future are sound, or whether you disagree or see a different future.

Best regards,
Nick

ps... I typed this in English, as the subject would be too hard for me to do in German

Last edited by spacious_mind (01.06.2024 at 20:28)
 

