Comments/Ratings for a Single Item
H.G.M. wrote: ... After an unequal trade, any Chess game becomes a game between different armies. ...
And thus I am convinced that I have to include this aspect in the detailed evaluation function of SMIRF's successor.
... This can still be done in a reasonably realistic mix of pieces, e.g. replacing Q and C on one side by A, and on the other side by Q and A by C, so that you play 3C vs 3A, and then give additional Knight odds to the Chancellors. ...
And exactly that would create the problem I have tried to demonstrate. The three Chancellors could not possibly be covered, which prevents them from risking their own existence by entering squares already influenced by the opponent's side.
H.G.M. wrote: ... Both
imbalances are large enough to cause 80-90% win percentages, so that just
a few games should make it obvious which value is very wrong.
Hard to see. You will expect White to lose because of insufficient material, and I will expect White to lose because of the disadvantage of the lonely big pieces. The task will then be to find out the true reason for that.
I will try to create two arrays where each side thinks it has the advantage.
| And exactly that would create the problem I have tried to
| demonstrate. The three Chancellors could not possibly be covered,
| which prevents them from risking their own existence by entering
| squares already influenced by the opponent's side.

You make it sound like it is a disadvantage to have a stronger piece, because it cannot go to squares attacked by the weaker piece. To a certain extent this is true, if the difference in capabilities is not very large. Then you might be better off ignoring the difference in some cases, as respecting the difference would actually deteriorate the value of the stronger piece to the point where it was weaker than the weak piece. (For this reason I set the B and N values in my 1980 Chess program Usurpator to exactly the same value.) But if the difference between the pieces is large, then the fact that the stronger one can be interdicted by the weaker one is simply an integral part of its piece value.

And IMO this is not the reason the 4A-9N example is so biased. The problem there is that the pieces of one side are all worth more than TWICE those of the other. Rooks against Knights would not have the same problem, as they could still engage in R vs 2N trades, capturing a singly defended Knight, in a normal exchange on a single square. But 3-vs-1 trades are almost impossible to enforce, and require very special tactics.

It is easy enough to verify by playtesting that playing CCC vs AAA (as substitutes for the normal super-pieces) will simply produce 3 times the score excess of playing a normal setup with a C deleted on one side and an A on the other. The A side will still have only a single A to harass every C. Most squares on enemy territory will be covered by R, B, N or P anyway, in addition to A, so the C could not go there anyway. And it is not true that anything defended by A would be immune to capture by C, as A + anything > C (and even 2A + anything > 2C). So defending with A will not exempt the opponent from defending as many times as there are attackers, by using A as defenders. And if there was one other piece amongst the defenders, the C had no chance anyway. The effect you point out does not occur nearly as easily as you think.

And, as you can see, only 5 of my different armies had duplicated super-pieces. All the other armies were just what you would get if you traded the mentioned pieces, thus detecting whether such a trade would enhance or deteriorate your winning chances.
Reinhard, if I understand you correctly, what you basically want to introduce in the evaluation is terms of the type w_ij * N_i * N_j, where N_i is the number of pieces of type i of one side, N_j is the number of pieces of type j of the opponent, and w_ij is a tunable weight. So that, if type i = A and type j = N, a negative w_ij would describe a reduction of the value of each Archbishop by the presence of the enemy Knights, through the interdiction effect.

Such a term would for instance provide an incentive for the QA side to trade A in a QA vs ABNN position, as his A is suppressed in value by the presence of the enemy N (and B), while the opponent's A would not be similarly suppressed by our Q. On the contrary, our Q value would be suppressed by the opponent's A as well, so trading A also benefits him there.

I guess it should be easy enough to measure if terms of this form have significant values, by playing Q-BNN imbalances in the presence of 0, 1 and 2 Archbishops, and deducing from the score whose Archbishops are worth more (i.e. add more winning probability). And similarly for 0, 1, 2 Chancellors each, or extra Queens. And then the same thing with a Q-RR imbalance, to measure the effect of Rooks on the value of A, C or Q. In fact, every second-order term can be measured this way. Not only for cross products between own and enemy pieces, but also cooperative effects between own pieces of equal or different type. With 7 piece types for each side (14 in total) there would be 14*13/2 = 91 terms of this type possible.
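As an illustration of such cross terms (my own sketch, not code from SMIRF or Joker80; all base values and weights below are hypothetical placeholders):

    # Hypothetical second-order material evaluation: base values plus
    # pairwise cross terms w_ij * N_i * N_j between own and enemy piece types.
    BASE = {'Q': 950, 'C': 900, 'A': 875, 'R': 500, 'B': 350, 'N': 300, 'P': 100}

    # Tunable cross-term weights in centipawns per piece pair; a negative
    # weight for ('A', 'N') models enemy Knights suppressing our Archbishop.
    W = {('A', 'N'): -8, ('Q', 'A'): -5}  # illustrative values only

    def material_eval(own, enemy):
        """First-order material difference plus second-order cross terms.
        own / enemy map piece letters to counts, e.g. {'Q': 1, 'N': 2}."""
        score = sum(BASE[p] * n for p, n in own.items())
        score -= sum(BASE[p] * n for p, n in enemy.items())
        for (i, j), w in W.items():
            score += w * own.get(i, 0) * enemy.get(j, 0)  # their j suppresses our i
            score -= w * enemy.get(i, 0) * own.get(j, 0)  # our j suppresses their i
        return score

    # Example: QA vs ABNN, the imbalance discussed above.
    print(material_eval({'Q': 1, 'A': 1}, {'A': 1, 'B': 1, 'N': 2}))

Measuring each w_ij from the scores of matches with 0, 1 and 2 of the relevant pieces, as proposed above, would then amount to fitting these 91 weights.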
'I never found any effect of the time control on the scores I measure for some material imbalance. Within statistical error, the combinations I tried produced the same score at 40/15', 40/20', 40/30', 40/40', 40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I did not consider it worth doing just to prove that it was a waste of time...'
_________

The additional time I normally give to playtesting games to improve the move quality is partially wasted because I can only control the time per move instead of the number of plies completed using most chess variant programs. This usually results in the time expiring while the program is working on an incomplete ply. It then prematurely spits out a move representative of an incomplete tour of the moves available within that ply, cut off at a random fraction of that ply. Since there is always more than one move (often a few to several) under evaluation as being the best possible move [otherwise, the chosen move would have already been executed], this means that any move on this 'list of top candidates' is equally likely to be randomly executed.

Here are two typical scenarios that should cover what usually happens:

A. If the list of top candidates in an 11-ply search consists of 6 moves where the list of top candidates in a 10-ply search consists of 7 moves, then only 1 discovered-to-be-less-than-the-best move has been successfully excluded and cannot be executed. Of course, an 11-ply search completion may typically require an estimated 8-10 times as much time as the search completions for all previous plies (1-ply through 10-ply) added together.

OR

B. If the list of top candidates in an 11-ply search consists of 7 moves [moreover, the exact same 7 moves] as in the preceding 10-ply search, then there is no benefit at all in expending 8-10 times as much time.
______________________________________________________________

The reason I endure this brutal waiting game is not for the purely masochistic experience, but because the additional time has a tangible chance (although no guarantee) of yielding a better move on every occasion. Over the numerous moves within a typical game, it can realistically be expected to yield better moves on dozens of occasions.

We usually playtest for purposes at opposite extremes of the spectrum, yet I regard our efforts as complementary toward building a complete picture of the material values of pieces. You use 'asymmetrical playtesting' with unequal armies at fast time controls, then collect and analyze statistics ... to determine a range, with a margin of error, for individual material piece values. I remain amazed (although I believe you) that you actually obtain any meaningful results at all via games that are played so quickly that the AI players do not have 'enough time to think', while playing games so complex that every computer (and person) needs time to think in order to play with minimal competence. Can you explain to me, in a way I can understand, how and why you are able to successfully obtain valuable results using this method? The quality of your results was utterly surprising to me. I apologize for totally doubting you when you introduced your results and mentioned how you obtained them.

I use 'symmetrical playtesting' with identical armies at very slow time controls to obtain the best moves realistically possible from an evaluation function, thereby giving me a winner (that is by some margin more likely than not deserving) ...
... to determine which of two sets of material piece values is probably (yet not certainly) better. Nonetheless, as more games are likewise played, if they present a clear pattern, then the results become more likely to be reliable, decisive and indicative of the true state of affairs.

The chance of flipping a coin once and it landing 'heads' is equal to it landing 'tails'. However, the chance of flipping a coin 7 times and it landing 'heads' all 7 times in a row is 1/128. Now, replace the concepts 'heads' and 'tails' with 'victory' and 'defeat'. I presume you follow my point. The results of only a modest number of well-played games can establish their significance beyond chance, to the satisfaction of reasonable probability for a rational human mind. [Most of us, including me, do not need any better than 95%-99% confidence to become convinced that there is a real correlation at work, even though that is far short of an absolute 100% mathematical proof.]

In my experience, I have found that using any less than 10 minutes per move will cause at least one instance within a game where an AI player makes a move that is obvious to me (and correctly assessed as truly being) a poor move. Whenever this occurs, it renders my playtesting results tainted and useless for my purposes. Sometimes this occurs during a game played at 30 minutes per move. However, it rarely occurs during a game played at 90 minutes per move.

For my purposes, it is critically important above all other considerations that the winner of these time-consuming games be correctly determined 'most of the time', since 'all of the time' is impossible to assure. I must do everything within my power to get as far from 50% toward 100% reliability in correctly determining the winner. Hence, I am compelled to play test games at nearly the longest survivable time per move, to minimize the chance that any move played during a game is an obviously poor move that could have changed the destiny of the game, thereby causing the player that should have won to become the loser instead. In fact, I feel as if I have no choice under the circumstances.
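For reference, the coin-flip arithmetic appealed to here is easy to reproduce (a minimal sketch in Python; the 7-flip, 10-game and 4-game figures are the ones used in this thread):

    from math import comb

    def prob_all_same_side(n):
        """Chance that n fair 50/50 games all go to one given side."""
        return 0.5 ** n

    def prob_at_least(n, k):
        """Chance of k or more wins in n games if each game were a coin flip."""
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    print(prob_all_same_side(7))    # 1/128, the 7-flips example
    print(prob_all_same_side(10))   # 1/1024, the 10-in-a-row example below
    print(1 - prob_at_least(4, 4))  # 15/16, the 4-game example below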
Harm, I am thinking of a simpler formula, because it seems easier to find an approximation than to weight a lot of parameters in the face of a lot of other unhandled strange effects. My lower-dimensional approach therefore looks like: f(s := sum of the unbalanced big pieces' values, n := number of unbalanced big pieces, v := value of the biggest opponent piece). So I intend to calculate the presumed value reduction e.g. as: (s - v*n)/constant

P.S.: Maybe it will make sense to limit v from below by s/(2*n), to prevent a too-large reduction, e.g. when no big opponent piece is present at all.

P.P.S.: I have had some more thoughts on this question. Let w := sum of the n biggest opponent pieces, limited from below by s/2. Then the formula should be: (s - w)/constant

P.P.P.S.: My experiments suggest that the constant is about 2.0.

P^4.S.: I have implemented this 'Elephantiasis Reduction' (as I will name it) in a new private SMIRF version, and it is working well. My constant is currently 8/5. I found that it is good to include one more piece in the calculation than the number lacking compensation, because that bottom piece pair could be of switched size and thus would reduce the reduction. Non-existent opponent pieces are replaced by a Knight's value within the calculation. I noticed a speed-up of SMIRF when searching for mating combinations (in normal play). I also noticed that SMIRF now makes sacrifices that make penalties of the introduced kind vanish.
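A minimal sketch of the reduction as finally described (my own reconstruction from the formulas above, not actual SMIRF code; the piece values are placeholders):

    KNIGHT = 300  # placeholder scale; SMIRF's internal values differ

    def elephantiasis_reduction(big_pieces, opponent_pieces, constant=8/5):
        """Penalty for big pieces the opponent cannot match in size:
        (s - w) / constant, with w limited from below by s/2 and missing
        opponent pieces counted as Knights, per the description above."""
        s = sum(big_pieces)
        n = len(big_pieces)
        opp = sorted(opponent_pieces, reverse=True)[:n]
        opp += [KNIGHT] * (n - len(opp))  # pad with Knight values
        w = max(sum(opp), s / 2)          # lower limit on w caps the reduction
        return max(s - w, 0) / constant

    # Example: three Chancellors against three Archbishops (values in cP).
    print(elephantiasis_reduction([900, 900, 900], [875, 875, 875]))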
Derek Nalls:
| The additional time I normally give to playtesting games to improve
| the move quality is partially wasted because I can only control the
| time per move instead of the number of plies completed using most
| chess variant programs.
Well, on Fairy-Max you won't have that problem, as it always finishes an
iteration once it decides to start it. But although Fairy-Max might be
stronger than most other variant-playing AIs you use, it is not stronger
than SMIRF, so using it for 10x8 CVs would still be a waste of time.
Joker80 tries to minimize the time wastage you point out by attempting
to start iterations only when it has time to finish them. It cannot always
accurately guess the required time, though, so unlike Fairy-Max it has
built in some emergency breaks. If they are triggered, you would have an
incomplete iteration. Basically, the mechanism stops searching new moves
in the root once it gets into 'overtime', provided there already is a move
with a score similar to that of the previous iteration. In practice, these
unexpectedly long iterations mainly occur when the previously best move
runs into trouble that so far was just beyond the horizon. As the tree for
that move will then look completely different from before, it takes a long
time to search (no useful information in the hash), and the score will
have a huge drop. It then continues searching new moves even in overtime
in a desperate attempt to find one that avoids the disaster. Usually this
is time well spent: even if there is no guarantee that it finds the best
move of the new iteration when it aborts early, it at least has found a
move significantly better than the one found in the previous iteration.
Of course both Joker80 and Fairy-Max support the WinBoard 'sd' command,
allowing you to limit the depth to a certain number of plies, although I
never use that. I don't like to fix the ply depth, as it makes the engine
play like an idiot in the end-game.
| Can you explain to me in a way I can understand how and why
| you are able to successfully obtain valuable results using this
| method?
Well, to start with, Joker80 at 1 sec per move still reaches a depth of
8-9 ply in the middle-game, and would probably still beat most Humans at
that level. My experience is that, if I immediately see an obvious error,
it is usually because the engine makes a strategic mistake, not a tactical
one. And such strategic mistakes are awfully persistent, as they are a
result of faulty evaluation, not search. If it makes them at 8 ply, it is
very likely to make that same error at 20 ply, as even 20 ply is usually
not enough to bring the resolution of the strategic feature within the
horizon.
That being said, I really think that an important reason I can afford fast
games is a statistical one: by playing so many games I can be reasonably
sure that I get a representative number of gross errors in my sample, and
they more or less cancel each other out on the average. Suppose at a
certain level of play 2% of the games contain a gross error that turns a
totally won position into a loss. If I play 10 games, there is a 20% chance
that one game contains such an error (affecting my result by 10%), and only
~2% probability of two such errors (which then in half the cases would
cancel, but in other cases would put the result off by 20%).
If, OTOH, I would play 1000 faster games, with an increased 'blunder
rate' of 5% because of the lower quality, I would expect 50 blunders. But
the probability that they were all made by the same side would be
negligible. In most cases the imbalance would be around sqrt(50) ~ 7. That
would impact the 1000-game result by only 0.7%. So virtually all results
would be off, but only by about 0.7%, so I don't care too much.
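As an illustration (not from any engine), this blunder-cancellation argument can be checked with a quick Monte Carlo simulation, using the 5% blunder rate and 1000 games from the text:

    import random

    def expected_score_shift(games=1000, blunder_rate=0.05, trials=2000):
        """Average absolute shift of a match score caused by game-deciding
        blunders that are equally likely to favor either side."""
        total = 0.0
        for _ in range(trials):
            shift = sum(random.choice((-1, 1))
                        for _ in range(games)
                        if random.random() < blunder_rate)
            total += abs(shift) / games
        return total / trials

    print(expected_score_shift())  # ~0.006, the sqrt(50)/1000 scale claimed above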
Another way of visualizing this would be to imagine the game state-space
as a 2-dimensional plane, with two evaluation terms determining the x- and
y-coordinate. Suppose these terms can both run from -5 to +5 (so the state
space is a square), and the game is won if we end in the unit circle (x^2 +
y^2 < 1), but that we don't know that. Now suppose we want to know how
large the probability of winning is if we start within the square with
corners (0,0) and (1,1) (say this is the possible range of the evaluation
terms when we possess a certain combination of pieces). This should be the
area of a quarter circle, PI/4, divided by the area of the square (1), so
PI/4 = 79%.
We try to determine this empirically by randomly picking points in the
square (by setting up the piece combination in some shuffled
configuration), and let the engines play the game. The engines know that
getting closer to or farther away from (0,0) is associated with changing the
game result, and are programmed to maximize or minimize this distance to
the origin. If they both play perfectly, they should by definition succeed
in doing this. They don't care about the 'polar angle' of the game
state, so the point representing the game state will make a random walk on
a circle around the origin. When the game ends, it will still be in the
same region (inside or outside the unit circle), and games starting in the
won region will all be won.
Now with imperfect play, the engines will not conserve the distance to the
origin, but their tug of war will sometimes change it in favor of one or
the other (i.e. towards the origin, or away from it). If the engines are
still equally strong, by definition on the average this distance will not
change. But its probability distribution will now spread out over a ring
with finite width during the game. This might lead to won positions close
to the boundary (the unit circle) now ending up outside it, in the lost
region. But if the ring of final game states is narrow (width << 1), there
will be a comparable number of initial game states that diffuse from within
the unit circle to the outside, as in the other direction.
In other words, the game score as a function of the initial evaluation
terms is no longer an absolute all or nothing, but the circle is radially
smeared out a little, making a smooth transition from 100% to 0% in a
narrow band centered on the original circle.
This will hardly affect the averaging, and in particular, making the ring
wider by decreasing playing accuracy will initially hardly have any
effect. Only when play gets so wildly inaccurate that the final positions
(where win/loss is determined) diverge so far from the initial point that
they could cross the entire circle will you start to see effects on the
score. In the extreme case where the radial diffusion is so fast that you
could end up anywhere in the 10x10 square when the game finishes, the
result score will only be PI/100 = 3%.
So it all depends on how much the imperfections in the play spread out the
initial positions in the game-state space. If this is only small compared
to the measures of the won and lost areas, the result will be almost
independent of it.
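A small simulation of this picture (my own sketch of the model just described, with assumed noise levels; for simplicity the noise is applied to both coordinates rather than only radially, and the board edges reflect):

    import random

    def reflect(z, lo=-5.0, hi=5.0):
        """Fold a coordinate back into [lo, hi] (reflecting board edges)."""
        span = hi - lo
        t = (z - lo) % (2 * span)
        return lo + span - abs(t - span)

    def win_rate(noise, games=20000):
        """Games start uniformly in the square (0,0)-(1,1) and are 'won'
        if the final state lies inside the unit circle; imperfect play is
        modeled as Gaussian diffusion of the game state during the game."""
        wins = 0
        for _ in range(games):
            x, y = random.random(), random.random()
            x = reflect(x + random.gauss(0.0, noise))
            y = reflect(y + random.gauss(0.0, noise))
            wins += x * x + y * y < 1.0
        return wins / games

    for noise in (0.0, 0.1, 1.0, 10.0):
        print(noise, win_rate(noise))
    # noise 0 gives ~pi/4 = 79%; small noise hardly moves the score, and only
    # huge noise drives it toward the whole-board ratio pi/100 = 3%.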
Before Scharnagl sent me three special versions of SMIRF MS-174c compiled with the CRC material values of Scharnagl, Muller & Nalls, I began playtesting something else that interested me using SMIRF MS-174b-O. I am concerned that the material value of the rook (especially compared to the queen) amongst CRC pieces in the Muller model is too low:

Muller model: rook 55.88, queen 111.76. This means that 2 rooks exactly equal 1 queen in material value.

Scharnagl model: rook 55.71, queen 91.20. This means that 2 rooks have a material value (111.42) 22.17% greater than 1 queen.

Nalls model: rook 59.43, queen 103.05. This means that 2 rooks have a material value (118.86) 15.34% greater than 1 queen.

Essentially, the Scharnagl & Nalls models agree in predicting victories in a CRC game for the player missing 1 queen yet possessing 2 rooks. By contrast, the Muller model predicts draws (or an approximately equal number of victories and defeats) for either player. I put this extraordinary claim to the test by playing 2 games at 10 minutes per move on an appropriately altered Embassy Chess setup, with the missing-1-queen player and the missing-2-rooks player each having a turn at white and black. The missing-2-rooks player lost both games and was always behind. They were not even long games, at 40-60 moves.

Muller: I think you need to moderately raise the material value of your rook in CRC. It is out of its proper relation with the other material values within the set.
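For reference, the percentages quoted above follow directly from the listed values:

    # Check the quoted 2R-vs-Q surpluses for the three models.
    models = {'Muller': (55.88, 111.76),
              'Scharnagl': (55.71, 91.20),
              'Nalls': (59.43, 103.05)}
    for name, (rook, queen) in models.items():
        surplus = (2 * rook / queen - 1) * 100
        print(f'{name}: 2R = {2 * rook:.2f}, {surplus:+.2f}% relative to Q')
    # Muller: +0.00%, Scharnagl: +22.17%, Nalls: +15.34%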
To Derek: I am aware that the empirical Rook value I get is suspiciously low. OTOH, it is an OPENING value, and Rooks get their value in the game only late. Furthermore, this is only the BASE VALUE of the Rook; most pieces have a value that depends on the position on the board where the piece actually is, or where you can quickly get it (in an opening situation, where the opponent is not yet able to interdict your moves, because his pieces are in inactive places as well). But Rooks only increase their value on open files, and initially no open files are to be seen.

In a practical game, by the time you get to trade 2 Rooks for a Queen, there usually are open files. So by that time, the value of the Q vs 2R trade will have gone up by two times the open-file bonus. You hardly have the possibility of trading it before there are open files. So it stands to reason that you might as well use the higher value during the entire game. In 8x8 Chess, the Larry Kaufman piece values include the rule that a Rook should be devalued by 1/8 Pawn for each Pawn on the board over five. In the case of 8 Pawns that is a really large penalty of 37.5 cP for having no open files. If I add that to my opening value, the late middle-game / end-game value of the Rook gets to 512, which sounds a lot more reasonable.

There are two different issues here:
1) the winning chances of a Q vs 2R material imbalance game;
2) how to interpret that result as a piece value.

All I said above has no bearing on (1): if we both play a Q-2R match from the opening, it is a serious problem if we don't get the same result. But you have played only 2 games. Statistically, 2 games mean NOTHING. I don't even look at results before I have at least 100 games, because before that they are about as likely to be the reverse of what they will eventually be as not. The standard deviation of the result of a single Gothic Chess game is ~0.45 (it would be 0.5 point if there were no draws possible, and in Gothic Chess the draw percentage is low). This error goes down as the square root of the number of games. In the case of 2 games this is 45%/sqrt(2) = 32%. The Pawn-odds advantage is only 12%. So this standard error corresponds to 2.66 Pawns. That is 1.33 Pawns per Rook. So with this test you could not possibly see if my value is off by 25, 50 or 75. If you find a discrepancy, it is enormously more likely that the result of your 2-game match is off from the true win probability.

Play 100 games, and the error in the observed score is reasonably certain (68% of the cases) to be below 4.5% ~ 1/3 Pawn, so 16 cP per Rook. Only then can you see with reasonable confidence if your observations differ from mine.
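The error arithmetic used here is easy to reproduce (a sketch; the 0.45 per-game standard deviation and the 12% Pawn-odds score are the figures from this post):

    import math

    def score_error(games, sd_per_game=0.45):
        """Standard error of a match score fraction after n games."""
        return sd_per_game / math.sqrt(games)

    def error_in_pawns(games, pawn_odds_score=0.12):
        """The same error expressed in Pawns via the Pawn-odds advantage."""
        return score_error(games) / pawn_odds_score

    for n in (2, 100):
        print(n, f'{score_error(n):.1%}', f'{error_in_pawns(n):.2f} Pawns')
    # 2 games:   ~32% score error, ~2.7 Pawns (1.33 Pawns per Rook)
    # 100 games: ~4.5% score error, ~0.38 Pawns (~16-19 cP per Rook)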
'You hardly have the possibility of trading it before there are open files. So it stands to reason that you might as well use the higher value during the entire game.'

Well, I understand and accept your reasons for leaving your lower rook value in CRC as is. It is interesting that you thoroughly understand and accept the reasons of others for using a higher rook value in CRC as well. Ultimately, is not the higher rook value in CRC more practical and useful to the game, by your own logic?
_____________________________

'... if we both play a Q-2R match from the opening, it is a serious problem if we don't get the same result. But you have played only 2 games. Statistically, 2 games mean NOTHING.'

I never falsely claimed or implied that only 2 games at 10 minutes per move mean everything, or even mean a great deal (enough to satisfy probability overwhelmingly). However, they mean significantly more than nothing. I cannot accept your opinion, based upon a purely statistical viewpoint, since it comes at the exclusion of another applicable mathematical viewpoint. They definitely mean something ... although exactly how much is not easily known or quantified (measured) mathematically.
__________________________________________________

'I don't even look at results before I have at least 100 games, because before that they are about as likely to be the reverse of what they will eventually be as not.'

Statistically, when dealing with speed chess games populated exclusively with virtually random moves ... YES, I can understand and agree with you requiring a minimum of 100 games. However, what you are doing is at the opposite extreme from what I am doing via my playtesting method. Surely you would agree that IF I conducted only 2 games with perfect play for both players, those results would mean EVERYTHING. Unfortunately, with state-of-the-art computer hardware and chess variant programs (such as SMIRF), this is currently impossible and will remain impossible for centuries or millennia. Nonetheless, games played at 100 minutes per move (for example) have a much greater probability of correctly determining which player has a definite, significant advantage than games played at 10 seconds per move (for example). Even though these 'deep games' are of nowhere near 600 times better quality than these 'shallow games', as one might naively expect (due to a non-linear correlation), they are far from random events (to which statistical methods would then be fully applicable). Instead, they occupy a middle ground between perfect-play games and totally random games. [In my studied opinion, the example 'middle-ground games' are more similar and closer to perfect-play games than to totally random games.] To date, much is unknown to combinatorial game theory about the nature of these 'middle-ground games'.

Remember the analogy to coin flips that I gave you? Well, in fact, the playtest games I usually run go far above and beyond such random events in their probable significance per event. If the SMIRF program running at 90 minutes per move cast all of its moves randomly and without any intelligence at all (as a perfect woodpusher), only then would my 'coin flip' analogy be fully applicable. Therefore, when I estimate that it would require 6 games (for example) for me to determine, IF a player with a given set of piece values loses EVERY game, that there is only a 63/64 chance that the result is meaningful (instead of random bad luck), I am being conservative in the extreme.
The true figure is almost surely higher than a 63/64 chance. By the way, if you doubt that SMIRF's level of play is intelligent and non-random, then play a CRC variant of your choice against it at 90 minutes per move. After you lose repeatedly, you may not be able to credit yourself with being intelligent either (although you should) ... if you insist upon holding to an impractically high standard for defining the word.
______

'If you find a discrepancy, it is enormously more likely that the result of your 2-game match is off from the true win probability.'

For a 2-game match ... I agree. However, this may not be true for a 4-game, 6-game or 8-game match, and surely is not true to the extremes you imagine. Everything is critically dependent upon the specifications of the match. The number of games played (of course), the playing strength or quality of the program used, the speed of the computer, and the time or ply depth per move are the most important factors.
_________________________________________________________

'Play 100 games, and the error in the observed score is reasonably certain (68% of the cases) to be below 4.5% ~ 1/3 Pawn, so 16 cP per Rook. Only then can you see with reasonable confidence if your observations differ from mine.'

It would require an estimated 20 years for me to generate 100 games with the quality (and time controls) I am accustomed to and somewhat satisfied with. Unfortunately, it is not that important to me just to get you to pay attention to the results for the benefit of only your piece values model. As a practical concern to you, everyone else who is working to refine quality piece values models in FRC and CRC will likely have surpassed your achievements by then, IF you refuse to learn anything from the results of others who use different yet valid and meaningful methods of playtesting and mathematical analysis than yours.
Derek Nalls:
| They definitely mean something ... although exactly how much is not
| easily known or quantified (measured) mathematically.

Of course that is easily quantified. The entire mathematical field of statistics is designed to precisely quantify such things, through confidence levels and uncertainty intervals. The only thing you proved with reasonable confidence (say 95%) is that two Rooks are not 1.66 Pawns weaker than a Queen. So if Q=950, then R > 392. Well, no one claimed anything different. What we want to see is if Q-RR scores 50% (R=475) or 62% (R=525). That difference just can't be seen with two games. Play 100. There is no shortcut. Even perfect play doesn't help. We do have perfect play for all 6-men positions. Can you derive piece values from that, even end-game piece values???

| Statistically, when dealing with speed chess games populated
| exclusively with virtually random moves ... YES, I can understand and
| agree with you requiring a minimum of 100 games. However, what you
| are doing is at the opposite extreme from what I am doing via my
| playtesting method.

Where do you get this nonsense? This is approximately master-level play. Fact is that results from playing opening-type positions (with 35 pieces or more) are a stochastic quantity at any level of play we are likely to see in the next few million years. And even if they weren't, so that you could answer the question 'who wins' through a 35-men tablebase, you would still have to make some average over all positions (weighted by relevance) with a certain material composition to extract piece values. And if you would do that by sampling, the result would again be a stochastic quantity. And if you would do it by exhaustive enumeration, you would have no idea which weights to use. And if you are sampling a stochastic quantity, the error will be AT LEAST as large as the statistical error. Errors from other sources could add to that. But if you have two games, you will have at least 32% error in the result percentage. It doesn't matter if you play at an hour per move, a week per move, a year per move, 100 years per move. The error will remain >= 32%. So if you want to play 100 years per move, fine. But you will still need 100 games.

| Nonetheless, games played at 100 minutes per move (for example) have
| a much greater probability of correctly determining which player has
| a definite, significant advantage than games played at 10 seconds per
| move (for example).

Why do I get the suspicion that you are just making up this nonsense? Can you show me even one example where you have shown that a certain material advantage would be more than 3-sigma different for games at 100 min/move than for games at 1 sec/move? Show us the games, then. Be aware that this would require at least 100 games at each time control. That seems to make it a safe guess that you did not do that for 100 min/move. On the other hand, instead of just making things up, I have actually done such tests, not with 100 games per TC, but with 432, and for the faster ones even with 1728 games per TC. And there was no difference beyond the expected and unavoidable statistical fluctuations corresponding to those numbers of games, between playing 15 sec or 5 minutes. The advantage that a player has in terms of winning probability is the same at any TC I ever tried, and can thus be determined equally reliably with games of any duration (provided you have the same number of games). If you think it would be different for extremely long TC, show us statistically sound proof.
I might comment on the rest of your long posting later, but have to go now...
'Of course that is easily quantified. The entire mathematical field of statistics is designed to precisely quantify such things, through confidence levels and uncertainty intervals.'

No, it is not easily quantified. Some things of numerical importance as well as geometric importance that we try to understand or prove in the study of chess variants are NOT covered or addressed by statistics. I wish our field of interest was that simple (relatively speaking) and approachable, but it is far more complicated and interdisciplinary. All you talk about is statistics. Is this because statistics is all you know well?
___________

'That difference just can't be seen with two games. Play 100. There is no shortcut.'

I agree. Not with only 2 games. However ... with only 4 games, IF they were ALL victories or defeats for the player using a given piece values model, I could tell you with confidence that there is at least a 15/16 chance that the given piece values model is stronger or weaker, respectively, than the piece values model used by its opponent. [Otherwise, the results are inconclusive and useless.] Furthermore, based upon the average number of moves per game required for victory or defeat, compared to the established average number of moves in a long, close game, I could probably correctly estimate whether one model was a little or a lot stronger or weaker, respectively, than the other model. Thus, I will not play 100 games, because there is no pressing, rational need to reduce the 'chance of random good-bad luck' to the ridiculously low value of 'the inverse of (base 2 to exponent 100)'. Is there anything about the odds associated with 'flipping a coin' that is beyond your ability to understand? This is a fundamental mathematical concept applicable without reservation to symmetrical playtesting. In any case, it is a legitimate 'shortcut' that I can and will use freely.
________________

'Even perfect play doesn't help. We do have perfect play for all 6-men positions.'

I meant perfect play throughout an entire game of a CRC variant involving 40 pieces initially. That is why I used the word 'impossible' with reference to state-of-the-art computer technology.
_______________________________________________________

'This is approximately master-level play.'

Well, if you are getting master-level play from Joker80 with speed chess games, then I am surely getting a superior level of play from SMIRF with much longer times and deeper plies per move. You see, I used the term 'virtually random moves' appropriately, in a comparative context based upon my experience.
_____________________________________________

'It doesn't matter if you play at an hour per move, a week per move, a year per move, 100 years per move. The error will remain >= 32%. So if you want to play 100 years per move, fine. But you will still need 100 games.'

Of course it matters a lot. If the program is well-written, then the longer it runs per move, the more plies it completes per move and, consequently, the better the moves it makes. Hence, the entire game played will progressively approach the ideal of perfect play ... even though this finite goal is impossible to attain. Incisive, intelligent, resourceful moves must NOT be confused with or dismissed as purely random moves.
Although I could humbly limit myself to applying only statistical methods, I am totally justified, in this case, in more aggressively using the 'probability associated with N coin flips ALL landing the same way' as an incomplete, minimum value, before even taking the playing strength of SMIRF at extremely long time controls into account to estimate a complete, maximum value.
______________________________________________________________

'The advantage that a player has in terms of winning probability is the same at any TC I ever tried, and can thus be determined equally reliably with games of any duration.'

You are obviously lacking completely in the prerequisite patience and determination to have EVER consistently used long enough time controls to see any benefit whatsoever in doing so. If you had ever done so, then you would realize (as everyone else who has done so realizes) that the quality of the moves improves, and even if the winning probability has not changed much numerically in your experience, the figure you obtain is more reliable. [I cannot prove to you that this 'invisible' benefit exists statistically. Instead, it is an important concept that you need to understand on its own terms. This is essential to what most playtesters do, with the notable exception of you. If you want to understand what I do and why, then you must come to grips with this reality.]
CRC piece values tournament
http://www.symmetryperfect.com/pass/
Just push the 'download now' button.

Game #1: Scharnagl vs. Muller
10 minutes per move, SMIRF MS-174c
Result: inconclusive. Draw after 87 moves by black. Perpetual check declared.
This discussion is pointless. In dealing with a stochastic quantity, if your statistics are no good, your observations are no good, and any conclusions based on them utterly meaningless. Nothing of what you say here has any reality value; it is just your own fantasy. First you should have results; then it becomes possible to talk about what they mean. You have no result. Get statistically meaningful test results. If your method can't produce them, or you don't feel it important enough to make your method produce them, don't bother us with your cr*p instead. Two sets of piece values as different as day and night, and the only thing you can come up with is that their comparison is 'inconclusive'. Are you sure that you could conclusively rule out that a Queen is worth 7, or a Rook 8, by your method of 'playtesting'? Talk about pathetic: even the two games you played are the same. Oh man, does your test setup s*ck! If you cannot even decide simple issues like this, what makes you think you have anything meaningful to say about piece values at all?
Once upon a time I had a friend in a country far, far away, who had obtained a coin from the bank. I was sure this coin was counterfeit, as it had a far larger probability of producing tails. I even PROVED it to him: I threw the coin twice, and both times tails came up. But do you think the fool believed me? No, he DIDN'T! He had the AUDACITY to claim there was nothing wrong with the coin, because he had tossed it a thousand times, and 523 times heads had come up!

While it was clear to everyone that he was cheating: he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane, before I threw it out of the window, BOTH times! And, mind you, both times it landed tails! And it was not just an ordinary plane, like a Boeing 747. No sir, it was a ROCKET plane! And still this foolish friend of mine insisted that his measly 10-feet throws made him more confident that the coin was OK than my IRONCLAD PROOF with the rocket plane. Ridiculous! Anyone knows that you can't test a coin by only tossing it 10 feet. If you do that, it might land on any side, rather than the side it always lands on. He might as well have flipped a coin!

No wonder they sent him to this far, far away country: no one would want to live in the same country as such an idiot. He even went as far as to buy an ICECREAM for that coin, and even ENJOYED eating it! Scandalous! I can tell you, he ain't my friend anymore! Using coins that always land on one side as if they were real money.

For more fairy tales and bed-time stories, read Derek's postings on piece values... :-) :-) :-)
Two suggestions for settling debates such as these. First, distributed computing to provide as much data as possible. Second, Bayesian statistical methods to provide statistical bounds on results.
Jianying Ji:
| Two suggestions for settling debates such as these. First, distributed
| computing to provide as much data as possible. Second, Bayesian
| statistical methods to provide statistical bounds on results.

Agreed: one first needs to generate data. Without data there isn't even a debate, and everything is just idle talk. What bounds would you expect from a two-game dataset? And what if those two games were actually the same?

But the problem is that the proverbial fool can always ask more than anyone can answer. If, by recruiting all PCs in the world, we could generate 100,000 games at an hour per move, an hour per move would of course not be 'good enough'. It would at least have to be a week per move. Or, if that were possible, 100 years per move. And even 100 years per move would of course be no good, because the computers would still not be able to search into the end-game, as they would search only 12 ply deeper than with 1 hour per move. So what's the point? Not only is this an end-of-the-rainbow-type endeavor; even if you got there, and generated the perfect data, where it is 100% sure and proven for each position what the outcome under perfect play is, what then? For simple end-games we are already in a position to reach perfect play, through retrograde analysis (tablebases). So why not start there, to show that such data is of any use whatsoever, in this case for generating end-game piece values? If you have the EGTB for KQKAN and KAKBN, how would you extract a piece value for A from it?
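For what it's worth, the Bayesian bounds Jianying Ji suggests are simple to compute for match results (a sketch assuming a uniform Beta(1,1) prior on the win probability and ignoring draws; the game counts are illustrative):

    # Bayesian credible interval for a win probability: with a Beta(1,1)
    # prior, w wins and l losses give a Beta(w+1, l+1) posterior.
    from scipy.stats import beta

    def credible_interval(wins, losses, mass=0.95):
        return beta(wins + 1, losses + 1).interval(mass)

    print(credible_interval(2, 0))    # a 2-0 match: roughly (0.29, 0.99)
    print(credible_interval(60, 40))  # a 60-40 match: roughly (0.50, 0.69)

The 2-0 interval spans nearly the whole range, which is the Bayesian restatement of the point that two games cannot separate the competing piece value sets.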
'This discussion is pointless.'

On this one occasion, I agree with you. However, I cannot just let you get away with some of your most outrageous remarks to date. So, unfortunately, this discussion is not yet over.
____________________________________________

'First you should have results; then it becomes possible to talk about what they mean. You have no result.'

Of course I have a result! The result is obviously the game itself, as a win, loss or draw, for the purposes of comparing the playing strengths of two players using different sets of CRC piece values. The result is NOT statistical in nature. Instead, the result is probabilistic in nature. I have thoroughly explained this purpose and method to you. I understand it. Reinhard Scharnagl understands it. You do not understand it. I can accept that. However, instead of admitting that you do not understand it, you claim there is nothing to understand.
______________________________________

'Two sets of piece values as different as day and night, and the only thing you can come up with is that their comparison is 'inconclusive'.'

Yes. Draws make it impossible to determine which of two sets of piece values is stronger or weaker. However, by increasing the time (and plies) per move, smaller differences in playing strength can sometimes be revealed with 'conclusive' results. I will attempt the next pair of Scharnagl vs. Muller and Muller vs. Scharnagl games at 30 minutes per move. Knowing how much you appreciate my efforts on your behalf motivates me.
___________________________________________________

'Talk about pathetic: even the two games you played are the same.'

Only one game was played. The logs you saw were produced by the Scharnagl (standard) version of SMIRF for the white player and the Muller (special) version of SMIRF for the black player. The game is handled in this manner to prevent time from expiring without computation occurring.
___________________________________________________

'... does your test setup s*ck!'

What, now you hate Embassy Chess too? Take up this issue with Kevin Hill.
I really am completely lost, so I won't comment until I can see what the debate is about.
H.G.M. wrote: '... he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane ...'
Understanding your example as an argument against Derek Nalls' testing method, I wonder why your chess engines always think for the full given timeframe. It would be much more impressive if your engine always decided immediately. ;-)
I am still convinced that longer thinking times have an influence on the quality of the resulting moves.
Since I had to endure one of your long bedtime stories (to be sure), you are going to have to endure one of mine. Yet unlike yours [too incoherent to merit a reply], mine carries an important point. Consider it a test of your common sense. Here is a scenario ...

01. It is the year 2500 AD.
02. Androids exist.
03. Androids cannot tell lies.
04. Androids can cheat, though.
05. Androids are extremely intelligent in technical matters.
06. Your best friend is an android.
07. It tells you that it won the lottery.
08. You verify that it won the lottery.
09. It tells you that it purchased only one lottery ticket.
10. You verify that it purchased only one lottery ticket.
11. The chance of winning the lottery with only one ticket is 1 out of 100 million.
12. It tells you that it cheated to win the lottery by hacking into its computer system immediately after the winning numbers were announced, purchasing one winning ticket and back-dating the time of the purchase.
____________________________________________

You have only two choices as to what to believe happened:

A. The android actually won the lottery by cheating.
OR
B. The android actually won the lottery by good luck. The android was mistaken in thinking it successfully cheated.
______________________________________________________

The chance of 'A' being true is 99,999,999 out of 100,000,000.
The chance of 'B' being true is 1 out of 100,000,000.
________________________________________________

I would place my bet upon 'A' being true, because I do not believe such unlikely coincidences actually occur. You would place your bet upon 'B' being true, because you do not believe such unlikely coincidences have any statistical significance whatsoever.
_________________________________________

I make this assessment of your judgment fairly, because you think it is a meaningless result if a player with one set of CRC piece values wins against its opponent 10 times in a row, even though the chance of that being 'random good luck' is indisputably only 1 out of 1024. By the way ... base 2 to exponent 100 equals 1,267,650,600,228,229,401,496,703,205,376. Can you see how ridiculous your demand of 100 games is?
Is this story meant to illustrate that you have no clue as to how to calculate statistical significance? Or perhaps that you don't know what it is at all? The observation of a single such winning event rules out the null hypothesis that the lottery was fair (i.e. that the probability for this to happen was 0.000,000,01) with a confidence of 99.999,999%. Be careful, though: this only describes the case where the winning android was somehow special, or singled out in advance. If the other participants in the lottery were 100 million other cheating androids, it would not be remarkable in any way that one of them won. The null hypothesis that the lottery was fair predicted a 100% probability for that.

But, unfortunately for you, it doesn't work for lotteries with only 2 tickets. Then you can rule out the null hypothesis that the lottery was fair (and hence the probability 0.5) with a confidence of only 50%. And 50% confidence means that in 50% of the cases your conclusion is correct, and in the other 50% of the cases not. In other words, a confidence level of 50% is a completely blind, uninformed random guess.
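The contrast drawn here amounts to comparing the probability the null hypothesis assigns to the observed outcome (a trivial sketch, with the numbers from these posts):

    # Confidence with which one observation rejects a 'fair' null hypothesis:
    # 1 minus the probability the null hypothesis assigns to that outcome.
    def confidence(p_null):
        return 1.0 - p_null

    print(confidence(1e-8))       # one-ticket lottery win: 99.999999%
    print(confidence(0.5))        # a single 50/50 game: 50%, a blind guess
    print(confidence(0.5 ** 10))  # ten straight wins: ~99.9%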
Reinhard Scharnagl:
| I am still convinced that longer thinking times have an
| influence on the quality of the resulting moves.

Yes, so what? Why do you think that is a relevant remark? The better moves won't help you at all if the opponent also plays better moves. The result will be the same. And the rare cases where it is not cancel each other out on average. So for the umpteenth time: NO ONE DENIES THAT LONGER THINKING TIME PRODUCES SOMEWHAT BETTER MOVES. THE ISSUE IS THAT IF BOTH SIDES PLAY WITH LONGER TC, THEIR WINNING PROBABILITIES WON'T CHANGE.

And don't bother to tell us that you are also convinced that the winning probabilities will change, without showing us proof. Because no one is interested in unfounded opinions, not even if they are yours.
'Is this story meant to illustrate that you have no clue as to how to calculate statistical significance?'

No. This story is meant to illustrate that you have no clue as to how to calculate probabilistic significance ... and it worked perfectly.
________________________________________________________

There you go again. Missing the point entirely and ranting about probabilities not being proper statistics.