Piece Values
H. G. Muller wrote on Sat, May 3, 2008 06:42 PM UTC:
Derek Nalls:
| Given enough years (working with only one server), this quantity of 
| well-played games may eventually become adequate.

I never found any effect of the time control on the scores I measure for
some material imbalance. Within statistical error, the combinations I
tried produced the same score at 40/15', 40/20', 40/30', 40/40',
40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I
did not consider it worth doing just to prove that it was a waste of
time...

The way I see it, piece-values are a quantitative measure for the amount
of control that a piece contributes to steering the game tree in the
direction of the desired evaluation. He who has more control, can
systematically force the PV in the direction of better and better
evaluation (for him). This is a strictly local property of the tree. The
only advantage of deeper searches is that you average out this control
(which highly fluctuates on a ply-by-ply basis) over more plies. But in
playing the game, you average over all plies anyway.

Reinhard Scharnagl wrote on Sat, May 3, 2008 06:43 PM UTC:
H.G.M. wrote: ... After an unequal trade, any Chess game becomes a game between different armies. ...

And thus I am convinced, that I have to include this aspect into SMIRF's successor's detail evaluation function.

... This can still be done in a reasonably realistic mix of pieces, e.g. replacing Q and C on one side by A, and on the other side Q and A by C, so that you play 3C vs 3A, and then give additional Knight odds to the Chancellors. ...

And by that this would create just the problem I have tried to demonstrate. The three Chancellors could impossibly be covered, thus disabling their potential to risk their own existence by entering squares already influenced by the opponent's side.

Reinhard Scharnagl wrote on Sat, May 3, 2008 08:06 PM UTC:
H.G.M. wrote: ... Both imbalances are large enough to cause 80-90% win percentages, so that just a few games should make it obvious which value is very wrong.

Hard to see. You will wait for White to lose because of insufficient material, and I will await a loss of White because of the lonely-big-pieces disadvantage. The task will then be to find out the true reason for it.

I will try to create two arrays in which each side thinks it has the advantage.

H. G. Muller wrote on Sat, May 3, 2008 08:18 PM UTC:
| And by that this would create just the problem I have tried to 
| demonstrate. The three Chancellors could impossibly be covered, 
| thus disabling their potential to risk their own existence by 
| entering squares already influenced by the opponent's side.

You make it sound like it is a disadvantage to have a stronger piece,
because it cannot go on squares attacked by the weaker piece. To a certain
extent this is true, if the difference in capabilities is not very large.
Then you might be better off ignoring the difference in some cases, as
respecting the difference would actually deteriorate the value of the
stronger piece to the point where it was weaker than the weak piece. (For
this reason I set the B and N value in my 1980 Chess program Usurpator to
exactly the same value.) But if the difference between the pieces is
large, then the fact that the stronger one can be interdicted by the
weaker one is simply an integral part of its piece value.

And IMO this is not the reason the 4A-9N example is so biased. The problem
there is that the pieces of one side are all worth more than TWICE that of
the other. Rooks against Knights would not have the same problem, as they
could still engage in R vs 2N trades, capturing a singly defended Knight,
in a normal exchange on a single square. But 3 vs 1 trades are almost
impossible to enforce, and require very special tactics.

It is easy enough to verify by playtesting that playing CCC vs AAA (as
substitutes for the normal super-pieces) will simply produce 3 times the
score excess of playing a normal setup with on one side a C deleted, and
at the other an A. The A side will still have only a single A to harass
every C. Most squares on enemy territory will be covered by R, B, N or P
anyway, in addition to A, so the C could not go there anyway. And it is
not true that anything defended by A would be immune to capture by C, as
A+anything > C (and even 2A+anything > 2C). So defending by A will not
exempt the opponent from defending as many times as there is attack, by
using A as defenders. And if there was one other piece amongst the
defenders, the C had no chance anyway. 

The effect you point out does not nearly occur as easily as you think.
And, as you can see, only 5 of my different armies did have duplicated
superpieces. All the other armies were just what you would get if you
traded the mentioned pieces, thus detecting if such a trade would enhance
or deteriorate your winning chances or not.

H. G. Muller wrote on Sat, May 3, 2008 09:31 PM UTC:
Reinhard, if I understand you correctly, what you basically want to
introduce in the evaluation are terms of the type w_ij*N_i*N_j, where N_i
is the number of pieces of type i of one side, and N_j is the number of
pieces of type j of the opponent, and w_ij is a tunable weight.

So that, if type i = A and type j = N, a negative w_ij would describe a
reduction of the value of each Archbishop by the presence of the enemy
Knights, through the interdiction effect. Such a term would for instance
provide an incentive to trade A in a QA vs ABNN for the QA side, as his A
is suppressed in value by the presence of the enemy N (and B), while the
opponent's A would not be similarly suppressed by our Q. On the contrary,
our Q value would be suppressed by the opponent's A as well, so
trading A also benefits him there.

I guess it should be easy enough to measure if terms of this form have
significant values, by playing Q-BNN imbalances in the presence of 0, 1
and 2 Archbishops, and deducing from the score whose Archbishops are worth
more (i.e. add more winning probability). And similarly for 0, 1, 2
Chancellors each, or extra Queens. And then the same thing with a Q-RR
imbalance, to measure the effect of Rooks on the value of A, C or Q.

In fact, every second-order term can be measured this way. Not only for
cross products between own and enemy pieces, but also cooperative effects
between own pieces of equal or different type. With 7 piece types for each
side (14 in total) there would be 14*13/2 = 91 terms of this type possible.
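
A minimal sketch of what such second-order terms could look like in code. The base values and the weights in W are made-up placeholders (nothing here was measured), and the function names are hypothetical; only the form w_ij*N_i*N_j and the count of 91 pair terms come from the post above.

```python
from itertools import product

# Hypothetical base values in centipawns; placeholders, not measured values.
BASE = {"N": 300, "B": 325, "A": 850, "Q": 950}

# Hypothetical cross-term weights w_ij: change in the value of each own piece
# of type i per enemy piece of type j. Negative = interdiction penalty.
W = {("A", "N"): -10, ("A", "B"): -10, ("Q", "A"): -15}

def side_score(own, enemy):
    """First-order material plus cross terms w_ij * N_i * N_j."""
    first = sum(BASE[p] * n for p, n in own.items())
    second = sum(W.get((i, j), 0) * ni * nj
                 for (i, ni), (j, nj) in product(own.items(), enemy.items()))
    return first + second

def evaluate(white, black):
    """Material evaluation from White's point of view."""
    return side_score(white, black) - side_score(black, white)

def count_pair_terms(piece_types_per_side=7):
    """Number of distinct pairs among all piece types of both sides."""
    k = 2 * piece_types_per_side
    return k * (k - 1) // 2

# QA vs ABNN before and after trading the Archbishops.
before = evaluate({"Q": 1, "A": 1}, {"A": 1, "B": 1, "N": 2})
after = evaluate({"Q": 1}, {"B": 1, "N": 2})
print(before, after, count_pair_terms())
```

With these placeholder weights the QA side gains from the A trade, which is the incentive described above; whether real weights behave this way is exactly what the proposed playtests would have to show.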

Derek Nalls wrote on Sun, May 4, 2008 06:38 AM UTC:
'I never found any effect of the time control on the scores I measure for
some material imbalance. Within statistical error, the combinations I
tried produced the same score at 40/15', 40/20', 40/30', 40/40',
40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I
did not consider it worth doing just to prove that it was a waste of
time...'
_________

The additional time I normally give to playtesting games to improve the
move quality is partially wasted because I can only control the time per
move instead of the number of plies completed using most chess variant
programs.  This usually results in the time expiring while it is working
on an incomplete ply.  Then, it prematurely spits out a move
representative of an incomplete tour of the moves available within that
ply at a random fraction of that ply.  Since there is always more than one
move (often, a few-several) under evaluation as being the best possible
move [Otherwise, the chosen move would have already been executed.], this
means that any move on this 'list of top candidates' is equally likely
to be randomly executed.

Here are two typical scenarios that should cover what usually happens:

A.  If the list of top candidates in an 11-ply search consists of 6 moves
where the list of top candidates in a 10-ply search consists of 7 moves,
then only 1 discovered-to-be-less-than-the-best move has been successfully
excluded and cannot be executed.  

Of course, an 11-ply search completion may typically require est. 8-10
times as much time as the search completions for all previous plies (1-ply
thru 10-ply) up until then added together.

OR

B.  If the list of top candidates in an 11-ply search consists of 7 moves
[Moreover, the exact same 7 moves.] just as the preceding 10-ply search, 
then there is no benefit at all in expending 8-10 times as much time.
______________________________________________________________

The reason I endure this brutal waiting game is not for purely masochistic
experience but because the additional time has a tangible chance (although
no guarantee) of yielding a better move with every occasion.  Throughout
the numerous moves within a typical game, it can be realistically expected
to yield better moves on dozens of occasions.

We usually playtest for purposes at opposite extremes of the spectrum 
yet I regard our efforts as complementary toward building a complete 
picture involving material values of pieces.

You use 'asymmetrical playtesting' with unequal armies on fast time 
controls, collect and analyze statistics ... to determine a range, with a
margin of error, for individual material piece values.

I remain amazed (although I believe you) that you actually obtain any 
meaningful results at all via games that are played so quickly that the AI
players do not have 'enough time to think' while playing games so complex
that every computer (and person) needs time to think to play with minimal
competence.  Can you explain to me in a way I can understand how and why
you are able to successfully obtain valuable results using this method? 
The quality of your results was utterly surprising to me.  I apologize for
totally doubting you when you introduced your results and mentioned how you
obtained them.

I use 'symmetrical playtesting' with identical armies on very slow time
controls to obtain the best moves realistically possible from an
evaluation function thereby giving me a winner (that is by some margin
more likely than not deserving) ... to determine which of two sets of
material piece values is probably (yet not certainly) better. 
Nonetheless, as more games are likewise played ...  If they present a
clear pattern, then the results become more probable to be reliable, 
decisive and indicative of the true state of affairs.

The chances of flipping a coin once and it landing 'heads' are equal to
it landing 'tails'.  However, the chances of flipping a coin 7 times and
it landing 'heads' all 7 times in a row are 1/128.  Now, replace the
concepts 'heads' and 'tails' with 'victory' and 'defeat'.  I
presume you follow my point.

The results of only a modest number of well-played games can definitely
establish their significance beyond chance and to the satisfaction of 
reasonable probability for a rational human mind.  [Most of us, including
me, do not need any better than a 95%-99% success to become convinced that
there is a real correlation at work even though such is far short of an
absolute 100% mathematical proof.]
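
The coin-flip arithmetic can be written out as a small sketch. It assumes the simplest possible model: independent games, no draws, and an equal chance for either side in every game. The function name is hypothetical.

```python
from fractions import Fraction

def sweep_probability(games, p_win=Fraction(1, 2)):
    """Chance that a given side wins all `games` when each game is an
    independent flip with win probability p_win and draws are ignored."""
    return p_win ** games

for n in (1, 6, 7):
    p = sweep_probability(n)
    print(f"{n} games: sweep by pure chance = {p}, complement = {1 - p}")
```

Under these assumptions a 7-game sweep has probability 1/128, matching the figure above. The model says nothing about how well the games were played; it only describes pure chance.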

In my experience, I have found that using any less than 10 minutes per
move will cause at least one instance within a game when an AI player
makes a move that is obvious to me (and correctly assessed as truly being)
a poor move.  Whenever this occurs, it renders my playtesting results 
tainted and useless for my purposes.  Sometimes this occurs during a 
game played at 30 minutes per move.  However, this rarely occurs during 
a game played at 90 minutes per move.

For my purposes, it is critically important above all other considerations
that the winner of these time-consuming games be correctly determined 
'most of the time' since 'all of the time' is impossible to assure.
I must do everything within my power to get as far from 50% toward 100%
reliability in correctly determining the winner.  Hence, I am compelled to
play test games at nearly the longest survivable time per move to minimize
the chances that any move played during a game will be an obviously poor 
move that could have changed the destiny of the game thereby causing 
the player that should have won to become the loser, instead.  In fact, 
I feel as if I have no choice under the circumstances.

Reinhard Scharnagl wrote on Sun, May 4, 2008 07:09 AM UTC:
Harm, I think of a more simple formula, because it seems to be easier to
find out an approximation than to weight a lot of parameters facing a lot
of other unhandled strange effects. Therefore my less dimensional approach
is looking like: f(s := sum of unbalanced big pieces' values,  n :=
number of unbalanced big pieces, v := value of biggest opponents' piece).

So I intend to calculate the presumed value reduction e.g. as:

(s - v*n)/constant

P.S.: maybe it will make sense to down limit v by s/(2*n) to prevent a too big reduction, e.g. when no big opponents' piece would be present at all.  

P.P.S.: There have been some more thoughts of mine on this question. Let w := sum of n biggest opponent pieces, limited by s/2. Then the formula should be:

(s - w)/constant

P.P.P.S.: My experiments suggest, that the constant is about 2.0

P^4.S.: I have implemented this 'Elephantiasis-Reduction' (as I will name it) in a new private SMIRF version and it is working well. My constant is currently 8/5. I found out, that it is good to calculate one more piece than being without value compensation, because that bottom piece pair could be of switched size and thus would reduce the reduction. Non existing opponent pieces will be replaced by a Knight piece value within the calculation. I noticed a speeding up of SMIRF when searching for mating combinations (by normal play). I also noticed that SMIRF is making sacrifices, incorporating vanishing such penalties of the introduced kind.
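
A rough sketch of the (s - w)/constant formula as it reads from the postscripts above. This is not SMIRF code. Several points are assumptions: the Knight value is a placeholder, "limited by s/2" is read as a lower bound on w (following the first P.S.), a negative result is clamped to zero, and the "one more piece" refinement is left out.

```python
KNIGHT = 300  # hypothetical Knight value in centipawns (placeholder)

def elephantiasis_reduction(big_pieces, opponent_pieces, constant=8 / 5):
    """Sketch of (s - w) / constant.

    big_pieces:      values of one side's unbalanced big pieces
    opponent_pieces: values of the opponent's pieces
    """
    n = len(big_pieces)
    s = sum(big_pieces)
    strongest = sorted(opponent_pieces, reverse=True)[:n]
    # Missing opponent pieces are replaced by a Knight value.
    strongest += [KNIGHT] * (n - len(strongest))
    # Assumed reading of "limited by s/2": w is bounded from below by s/2,
    # which caps the reduction at s / (2 * constant).
    w = max(sum(strongest), s / 2)
    # Assumption: no reduction when the opponent's pieces are at least as big.
    return max(s - w, 0) / constant

print(elephantiasis_reduction([800, 800], [500, 400, 300]))
```

The default constant of 8/5 is the value reported above; passing constant=2.0 gives the earlier estimate.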

H. G. Muller wrote on Sun, May 4, 2008 08:57 AM UTC:
Derek Nalls:
| The additional time I normally give to playtesting games to improve
| the move quality is partially wasted because I can only control the
| time per move instead of the number of plies completed using most
| chess variant programs.

Well, on Fairy-Max you won't have that problem, as it always finishes an
iteration once it decides to start it. But although Fairy-Max might be
stronger than most other variant-playing AIs you use, it is not stronger
than SMIRF, so using it for 10x8 CVs would still be a waste of time.

Joker80 tries to minimize the time wastage you point out by attempting
only to start iterations when it has time to finish them. It cannot always
accurately guess the required time, though, so unlike Fairy-Max it has
built in some emergency brakes. If they are triggered, you would have an
incomplete iteration. Basically, the mechanism works by no longer
searching new moves in the root, once it gets in 'overtime', if there
already is a move with a similar score as on the previous iteration.

In practice, these unexpectedly long iterations mainly occur when the
previously best move runs into trouble that so far was just beyond the
horizon. As the tree for that move will then look completely different
from before, it takes a long time to search (no useful information in the
hash), and the score will have a huge drop. It then continues searching
new moves even in overtime in a desperate attempt to find one that avoids
the disaster. Usually this is time well spent: even if there is no
guarantee it finds the best move of the new iteration, if it aborts it
early, it at least has found a move that was significantly better than
that found in the previous iteration.

Of course both Joker80 and Fairy-Max support the WinBoard 'sd' command,
allowing you to limit the depth to a certain number of plies, although I
never use that. I don't like to fix the ply depth, as it makes the engine
play like an idiot in the end-game.

| Can you explain to me in a way I can understand how and why
| you are able to successfully obtain valuable results using this
| method?

Well, to start with, Joker80 at 1 sec per move still reaches a depth of
8-9 ply in the middle-game, and would probably still beat most Humans at
that level. My experience is that, if I immediately see an obvious error,
it is usually because the engine makes a strategic mistake, not a tactical
one. And such strategic mistakes are awfully persistent, as they are a
result of faulty evaluation, not search. If it makes them at 8 ply, it is
very likely to make that same error at 20 ply, as even 20 ply is usually
not enough to get the resolution of the strategical feature within the
horizon.

That being said, I really think that an important reason I can afford fast
games is a statistical one: by playing so many games I can be reasonably
sure that I get a representative number of gross errors in my sample, and
they more or less cancel each other out on the average. Suppose at a
certain level of play 2% of the games contains a gross error that turns a
totally won position into a loss. If I play 10 games, there is a 20%
probability that one game contains such an error (affecting my result by
10%), and only ~2% probability on two such errors (that then in half the
cases would cancel, but in other cases would put the result off by 20%).

If, OTOH, I would play 1000 faster games, with an increased 'blunder
rate' of 5% because of the lower quality, I would expect 50 blunders. But
the probability that they were all made by the same side would be
negligible. In most cases the imbalance would be around sqrt(50) ~ 7. That
would impact the 1000-game result by only 0.7%. So virtually all results
would be off, but only by about 0.7%, so I don't care too much.

Another way of visualizing this would be to imagine the game state-space
as a 2-dimensional plane, with two evaluation terms determining the x- and
y-coordinate. Suppose these terms can both run from -5 to +5 (so the state
space is a square), and the game is won if we end in the unit circle
(x^2 + y^2 < 1), but that we don't know that. Now suppose we want to know
how large the probability of winning is if we start within the square with
corners (0,0) and (1,1) (say this is the possible range of the evaluation
terms when we possess a certain combination of pieces). This should be the
area of a quarter circle, PI/4, divided by the area of the square (1), so
PI/4 = 79%.

We try to determine this empirically by randomly picking points in the
square (by setting up the piece combination in some shuffled
configuration), and let the engines play the game. The engines know that
getting closer to or farther away from (0,0) is associated with changing
the game result, and are programmed to maximize or minimize this distance
to the origin. If they both play perfectly, they should by definition
succeed in doing this. They don't care about the 'polar angle' of the
game state, so the point representing the game state will make a random
walk on a circle around the origin. When the game ends, it will still be
in the same region (inside or outside the unit circle), and games starting
in the won region will all be won.

Now with imperfect play, the engines will not conserve the distance to the
origin, but their tug of war will sometimes change it in favor of one or
the other (i.e. towards the origin, or away from it). If the engines are
still equally strong, by definition on the average this distance will not
change. But its probability distribution will now spread out over a ring
with finite width during the game. This might lead to won positions close
to the boundary (the unit circle) now ending up outside it, in the lost
region. But if the ring of final game states is narrow (width << 1), there
will be a comparable number of initial game states that diffuse from
within the unit circle to the outside, as in the other direction.

In other words, the game score as a function of the initial evaluation
terms is no longer an absolute all or nothing, but the circle is radially
smeared out a little, making a smooth transition from 100% to 0% in a
narrow band centered on the original circle. This will hardly affect the
averaging, and in particular, making the ring wider by decreasing playing
accuracy will initially hardly have any effect. Only when play gets so
wildly inaccurate that the final positions (where win/loss is determined)
diverge so far from the initial point that it could cross the entire
circle, you will start to see effects on the score. In the extreme case
where the radial diffusion is so fast that you could end up anywhere in
the 10x10 square when the game finishes, the result score will only be
PI/100 = 3%.

So it all depends on how much the imperfections in the play spread out the
initial positions in the game-state space. If this is only small compared
to the measures of the won and lost areas, the result will be almost
independent of it.
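
Both arguments above can be checked numerically with a small sketch. The function names are hypothetical, and modelling imperfect play as Gaussian noise on the final distance to the origin is an assumption made here for illustration; the post only says that the distance spreads out over a ring.

```python
import math
import random

def blunder_impact(games, blunder_rate):
    """Typical score shift when blunders hit both sides at random:
    the net imbalance is about sqrt(number of blunders), expressed
    here as a fraction of all games played."""
    blunders = games * blunder_rate
    return math.sqrt(blunders) / games

def win_fraction(radial_noise, samples=20000, seed=1):
    """Start uniformly in the square (0,0)-(1,1); the game is won if the
    final distance to the origin is below 1. Imperfect play is modelled
    (an assumption) as Gaussian noise added to that distance."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        r = math.hypot(rng.random(), rng.random())
        r += rng.gauss(0.0, radial_noise)
        wins += r < 1.0
    return wins / samples

print(f"1000 games, 5% blunders: shift of about {blunder_impact(1000, 0.05):.2%}")
for noise in (0.0, 0.05, 0.2):
    print(f"radial noise {noise}: win fraction {win_fraction(noise):.3f}")
print(f"quarter circle: {math.pi / 4:.3f}, fully diffused: {math.pi / 100:.3f}")
```

For small noise the simulated win fraction should stay close to PI/4, which is the point of the argument: a narrow smearing of the boundary barely changes the average.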

Derek Nalls wrote on Sun, May 11, 2008 10:05 PM UTC:
Before Scharnagl sent me three special versions of SMIRF MS-174c compiled
with the CRC material values of Scharnagl, Muller & Nalls, I began
playtesting something else that interested me using SMIRF MS-174b-O.

I am concerned that the material value of the rook (especially compared to
the queen) amongst CRC pieces in the Muller model is too low:

rook  55.88
queen  111.76

This means that 2 rooks exactly equal 1 queen in material value.

According to the Scharnagl model:

rook  55.71
queen  91.20

This means that 2 rooks have a material value (111.42) 22.17% greater than
1 queen.

According to the Nalls model:

rook  59.43
queen  103.05

This means that 2 rooks have a material value (118.86) 15.34% greater than
1 queen.

Essentially the Scharnagl & Nalls models are in agreement in predicting
victories in a CRC game for the player missing 1 queen yet possessing 2
rooks.  By contrast, the Muller model predicts draws (or appr. equal
number of victories and defeats) in a CRC game for either player.
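
The comparison can be reproduced with a few lines. The values are the ones listed above; the dictionary layout and function name are just one hypothetical way to organize them.

```python
# Rook and queen values as quoted above for the three CRC models.
MODELS = {
    "Muller":    {"rook": 55.88, "queen": 111.76},
    "Scharnagl": {"rook": 55.71, "queen": 91.20},
    "Nalls":     {"rook": 59.43, "queen": 103.05},
}

def two_rooks_excess(model):
    """Percentage by which 2 rooks exceed 1 queen in the given model."""
    v = MODELS[model]
    return (2 * v["rook"] / v["queen"] - 1) * 100

for name in MODELS:
    print(f"{name:10s} 2R vs Q: {two_rooks_excess(name):+.2f}%")
```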

I put this extraordinary claim to the test by playing 2 games at 10
minutes per move on an appropriately altered Embassy Chess setup with the
missing-1-queen player and the missing-2-rooks player each having a turn
at white and black.

The missing-2-rooks player lost both games and was always behind.  They
were not even long games at 40-60 moves.

Muller:

I think you need to moderately raise the material value of your rook in
CRC.  It is out of its proper relation with the other material values
within the set.

H. G. Muller wrote on Mon, May 12, 2008 05:57 AM UTC:
To Derek:

I am aware that the empirical Rook value I get is suspiciously low. OTOH,
it is an OPENING value, and Rooks get their value in the game only late.
Furthermore, this only is the BASE VALUE of the Rook; most pieces have a
value that depends on the position on the board where it actually is, or
where you can quickly get it (in an opening situation, where the opponent
is not yet able to interdict your moves, because his pieces are in
inactive places as well). But Rooks only increase their value on open
files, and initially no open files are to be seen. In a practical game, by
the time you get to trade a Queen for 2 Rooks, there usually are open
files. So by that time, the value of the Q vs 2R trade will have gone up
by two times the open-file bonus. You hardly have the possibility of
trading it before there are open files. So it stands to reason that you
might as well use the higher value during the entire game.

In 8x8 Chess, the Larry Kaufman piece values include the rule that a Rook
should be devalued by 1/8 Pawn for each Pawn on the board over
five. In the case of 8 Pawns that is a really large penalty of 37.5cP for
having no open files. If I add that to my opening value, the late
middle-game / end-game value of the Rook gets to 512, which sounds a lot
more reasonable.

There are two different issues here:
1) The winning chances of a Q vs 2R material imbalance game
2) How to interpret that result as a piece value

All I say above has no bearing on (1): if we both play a Q-2R match from
the opening, it is a serious problem if we don't get the same result. But
you have played only 2 games. Statistically, 2 games mean NOTHING. I don't
even look at results before I have at least 100 games, because before they
are about as likely to be the reverse from what they will eventually be,
as not. The standard deviation of the result of a single Gothic Chess game
is ~0.45 (it would be 0.5 point if there were no draws possible, and in
Gothic Chess the draw percentage is low). This error goes down as the
square root of the number of games. In the case of 2 games this is
45%/sqrt(2) = 32%. The Pawn-odds advantage is only 12%. So this standard
error corresponds to 2.66 Pawns. That is 1.33 Pawns per Rook. So with this
test you could not possibly see if my value is off by 25, 50 or 75. If you
find a discrepancy, it is enormously more likely that the result of your
2-game match is off from the true win probability.

Play 100 games, and the error in the observed score is reasonably certain
(68% of the cases) to be below 4.5% ~1/3 Pawn, so 16 cP per Rook. Only then
you can see with reasonable confidence if your observations differ from
mine.
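
The error estimates above follow from one formula, sketched here. The two constants are the figures quoted in this post (per-game standard deviation ~0.45, Pawn odds worth ~12% in score); the function names are hypothetical.

```python
import math

SIGMA_PER_GAME = 0.45   # per-game standard deviation quoted above
PAWN_ODDS_SCORE = 0.12  # excess score per Pawn of advantage quoted above

def score_error(games):
    """One-sigma error in the observed score fraction after `games` games."""
    return SIGMA_PER_GAME / math.sqrt(games)

def error_in_pawns(games):
    """The same error, converted to Pawn units via the Pawn-odds score."""
    return score_error(games) / PAWN_ODDS_SCORE

for n in (2, 100, 432):
    print(f"{n:4d} games: +/-{score_error(n):.1%} score, "
          f"+/-{error_in_pawns(n):.2f} Pawn")
```

Because the error shrinks only with the square root of the number of games, halving it takes four times as many games.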

Derek Nalls wrote on Mon, May 12, 2008 07:06 PM UTC:
'You hardly have the possibility of trading it before there are open
files. So it stands to reason that you might as well use the higher value
during the entire game.'

Well, I understand and accept your reasons for leaving your lower rook 
value in CRC as is.  It is interesting that you thoroughly understand and
accept the reasons of others for using a higher rook value in CRC as
well.  Ultimately, is not the higher rook value in CRC more practical and useful to the game by your own logic?
_____________________________

'... if we both play a Q-2R match from the opening, it is a serious
problem if we don't get the same result. But you have played only 2
games. Statistically, 2 games mean NOTHING.'

I never falsely claimed or implied that only 2 games at 10 minutes per 
move mean everything or even mean a great deal (to satisfy probability
overwhelmingly).  However, they mean significantly more than nothing.  
I cannot accept your opinion, based upon a purely statistical viewpoint,
since it is to the exclusion of another applicable mathematical viewpoint.  
They definitely mean something ... although exactly how much is not 
easily known or quantified (measured) mathematically.
__________________________________________________

'I don't even look at results before I have at least 100 games, because
before they are about as likely to be the reverse from what they will 
eventually be, as not.'

Statistically, when dealing with speed chess games populated 
exclusively with virtually random moves ... YES, I can understand and 
agree with you requiring a minimum of 100 games.  However, what you 
are doing is at the opposite extreme from what I am doing via my 
playtesting method.

Surely you would agree that IF I conducted only 2 games with perfect 
play for both players that those results would mean EVERYTHING.  
Unfortunately, with state-of-the-art computer hardware and chess variant 
programs (such as SMIRF), this is currently impossible and will remain 
impossible for centuries-millennia.  Nonetheless, games played at 100 
minutes per move (for example) have a much greater probability of 
correctly determining which player has a definite, significant advantage 
than games played at 10 seconds per move (for example).

Even though these 'deep games' are of nowhere near 600 times better
quality than these 'shallow games' as one might naively expect
(due to a non-linear correlation), they are far from random events 
(to which statistical methods would then be fully applicable).  
Instead, they occupy a middleground between perfect play games and 
totally random games.  [In my studied opinion, the example 
'middleground games' are more similar to and closer to perfect play 
games than totally random games.]  To date, much is unknown to
combinatorial game theory about the nature of these 'middleground 
games'.

Remember the analogy to coin flips that I gave you?  Well, in fact, 
the playtest games I usually run go far above and beyond such random 
events in their probable significance per event.

If the SMIRF program running at 90 minutes per move cast all of its 
moves randomly and without any intelligence at all (as a perfect 
woodpusher), only then would my 'coin flip' analogy be fully applicable.
Therefore, when I estimate that it would require 6 games (for example) 
for me to determine, IF a player with a given set of piece values loses 
EVERY game, that there is only a 63/64 chance that the result is
meaningful (instead of random bad luck), I am being conservative to the
extreme.  The true figure is almost surely higher than a 63/64 chance.

By the way, if you doubt that SMIRF's level of play is intelligent and
non-random, then play a CRC variant of your choice against it at 90 
minutes per move.  After you lose repeatedly, you may not be able to 
credit yourself with being intelligent either (although you should) ... 
if you insist upon holding an impractically high standard to define the 
word.
______

'If you find a discrepancy, it is enormously more likely that the result
of your 2-game match is off from its true win probability.'

For a 2-game match ... I agree.  However, this may not be true for a 
4-game, 6-game or 8-game match and surely is not true to the extremes 
you imagine.  Everything is critically dependent upon the specifications 
of the match.  The number of games played (of course), the playing 
strength or quality of the program used, the speed of the computer and 
the time or ply depth per move are the most important factors.
_________________________________________________________

'Play 100 games, and the error in the observed score is reasonably
certain (68% of the cases) to be below 4.5% ~1/3 Pawn, so 16 cP per Rook.
Only then you can see with reasonable confidence if your observations
differ from mine.'

It would require est. 20 years for me to generate 100 games with the 
quality (and time controls) I am accustomed to and somewhat satisfied 
with.  Unfortunately, it is not that important to me just to get you to
pay attention to the results for the benefit of only your piece values
model.  As a practical concern to you, everyone else who is working to
refine quality piece values models in FRC and CRC will have likely
surpassed your achievements by then IF you refuse to learn anything from
the results of others who use different yet valid and meaningful methods
for playtesting and mathematical analysis than you.

H. G. Muller wrote on Mon, May 12, 2008 10:12 PM UTC:
Derek Nalls:
| They definitely mean something ... although exactly how much is not 
| easily known or quantified (measured) mathematically.
Of course that is easily quantified. The entire mathematical field of
statistics is designed to precisely quantify such things, through
confidence levels and uncertainty intervals. The only thing you proved
with reasonable confidence (say 95%) is that two Rooks are not 1.66 Pawn
weaker than a Queen. So if Q=950, then R > 392. Well, no one claimed
anything different. What we want to see is if Q-RR scores 50% (R=475) or
62% (R=525). That difference just can't be seen with two games. Play 100.
There is no shortcut. Even perfect play doesn't help. We do have perfect
play for all 6-men positions. Can you derive piece values from that, even
end-game piece values???

| Statistically, when dealing with speed chess games populated 
| exclusively with virtually random moves ... YES, I can understand and 
| agree with you requiring a minimum of 100 games.  However, what you 
| are doing is at the opposite extreme from what I am doing via my 
| playtesting method.
Where do you get this nonsense? This is approximately master-level play.
Fact is that results from playing opening-type positions (with 35 pieces
or more) are a stochastic quantity at any level of play we are likely to see
the next few million years. And even if they weren't, so that you could
answer the question 'who wins' through a 35-men tablebase, you would
still have to make some average over all positions (weighted by relevance)
with a certain material composition to extract piece values. And if you
would do that by sampling, the result would again be a stochastic quantity.
And if you would do it by exhaustive enumeration, you would have no idea
which weights to use.
And if you are sampling a stochastic quantity, the error will be AT LEAST
as large as the statistical error. Errors from other sources could add to
that. But if you have two games, you will have at least 32% error in the
result percentage. Doesn't matter if you play at an hour per move, a week
per move, a year per move, 100 years per move. The error will remain >=
32%. So if you want to play 100 years per move, fine. But you will still
need 100 games.
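
The error figures quoted here follow from the usual square-root law for independent games. A minimal Python sketch, assuming a per-game standard deviation of about 0.45 score points; that value is an assumption back-derived from the 4.5% quoted for 100 games (it depends on the draw rate, which the thread does not state), and the function name is illustrative:

```python
import math

# Assumed per-game standard deviation of the score (win=1, draw=0.5, loss=0).
# 0.45 is inferred from the "4.5% after 100 games" figure, not measured here.
PER_GAME_SD = 0.45

def score_standard_error(games, per_game_sd=PER_GAME_SD):
    """Standard error of the mean score fraction after `games` independent games."""
    return per_game_sd / math.sqrt(games)

for n in (2, 100, 432, 1728):
    print(f"{n:5d} games: +/- {100 * score_standard_error(n):.1f}%")
```

Under that assumption, 2 games give roughly 32% and 100 games give 4.5%. The time per move does not appear in the formula; it could only enter indirectly, through the draw rate.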

| Nonetheless, games played at 100 minutes per move (for example) have 
| a much greater probability of correctly determining which player has 
| a definite, significant advantage than games played at 10 seconds per 
| move (for example).
Why do I get the suspicion that you are just making up this nonsense? Can
you show me even one example where you have shown that a certain material
advantage would be more than 3-sigma different for games at 100 min / move
than for games at 1 sec/move? Show us the games, then. Be aware that this
would require at least 100 games at each time control. That seems to make
it a safe guess that you did not do that for 100 min/move.
On the other hand, instead of just making things up, I have actually
done such tests, not with 100 games per TC, but with 432, and for the
faster even with 1728 games per TC. And there was no difference beyond the
expected and unavoidable statistical fluctuations corresponding to those
numbers of games, between playing 15 sec or 5 minutes. 
The advantage that a player has in terms of winning probability is the
same at any TC I ever tried, and can thus equally reliably be determined
with games of any duration. (Provided you have the same number of games.)
If you think it would be different for extremely long TC, show us
statistically sound proof.
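
A 3-sigma comparison between two time controls can be sketched as below. This is only an illustration: the two score fractions are hypothetical placeholders (the thread gives game counts but not the measured scores), and the per-game standard deviation of 0.45 is an assumption, not a value from the tests described.

```python
import math

def sigma_difference(score_a, games_a, score_b, games_b, per_game_sd=0.45):
    """Number of combined standard errors separating two observed score fractions.

    per_game_sd is an assumed spread of a single game's score (win=1, draw=0.5, loss=0).
    """
    se_a = per_game_sd / math.sqrt(games_a)
    se_b = per_game_sd / math.sqrt(games_b)
    return (score_a - score_b) / math.hypot(se_a, se_b)

# Hypothetical scores, for illustration only:
# 62% over 1728 fast games versus 60% over 432 slower games.
z = sigma_difference(0.62, 1728, 0.60, 432)
print(f"difference = {z:.2f} sigma")
print("significant at 3 sigma" if abs(z) > 3 else "within statistical noise")
```

With these placeholder numbers the two results differ by less than one sigma, so they would count as the same within the noise.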

I might comment on the rest of your long posting later, but have to go
now...

Derek Nalls wrote on Tue, May 13, 2008 02:39 AM UTC:
'Of course, that is easily quantified. The entire mathematical field of
statistics is designed to precisely quantify such things, through
confidence levels and uncertainty intervals.'

No, it is not easily quantified.  Some things of numerical importance
as well as geometric importance that we try to understand or prove 
in the study of chess variants are NOT covered or addressed by statistics.
I wish our field of interest was that simple (relatively speaking) and
approachable but it is far more complicated and interdisciplinary.  
All you talk about is statistics.  Is this because statistics is all you
know well?
___________

'That difference just can't be seen with two games. Play 100.
There is no shortcut.'

I agree.  Not with only 2 games.  

However ...

With only 4 games, IF they were ALL victories or defeats for the player 
using a given piece values model, I could tell you with confidence 
that there is at least a 15/16 chance the given piece values model is 
stronger or weaker, respectively, than the piece values model used by 
its opponent.  [Otherwise, the results are inconclusive and useless.]

Furthermore, based upon the average number of moves per game 
required for victory or defeat compared to the established average 
number of moves in a long, close game, I could probably, correctly 
estimate whether one model was a little or a lot stronger or weaker, 
respectively, than the other model.  Thus, I will not play 100 games 
because there is no pressing, rational need to reduce the 'chance of 
random good-bad luck' to the ridiculously-low value of 
'the inverse of (base 2 to exponent 100)'.

Is there anything about the odds associated with 'flipping a coin'
that is beyond your ability to understand?  This is a fundamental 
mathematical concept applicable without reservation to symmetrical 
playtesting.  In any case, it is a legitimate 'shortcut' that I can and
will use freely.
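
The coin-flip arithmetic can be checked directly. A small sketch, assuming every game is an independent fair coin flip with no draws (an idealization; the function name is illustrative):

```python
from fractions import Fraction

def streak_probability(games, either_side=False):
    """Chance that `games` fair coin flips all land the same way.

    either_side=False: a side named in advance wins every game.
    either_side=True: a clean sweep by either side counts.
    """
    p = Fraction(1, 2) ** games
    return 2 * p if either_side else p

for n in (2, 4, 10):
    print(n, streak_probability(n), streak_probability(n, either_side=True))
```

Under these assumptions a 4-0 sweep by a side chosen beforehand has probability 1/16 when the two models are equally strong, and a sweep by either side has probability 1/8.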
________________

'Even perfect play doesn't help. We do have perfect play for all 6-men 
positions.'

I meant perfect play throughout an entire game of a CRC variant 
involving 40 pieces initially.  That is why I used the word 'impossible'
with reference to state-of-the-art computer technology.
_______________________________________________________

'This is approximately master-level play.'

Well, if you are getting master-level play from Joker80 with speed
chess games, then I am surely getting a superior level of play from 
SMIRF with much longer times and deeper plies per move.  You see,
I used the term 'virtually random moves' appropriately in a 
comparative context based upon my experience.
_____________________________________________

'Doesn't matter if you play at an hour per move, a week per move, 
a year per move, 100 years per move. The error will remain >=32%. 
So if you want to play 100 years per move, fine. But you will still
need 100 games.'

Of course, it matters a lot.  If the program is well-written, then the 
longer it runs per move, the more plies it completes per move
and consequently, the better the moves it makes.  Hence,
the entire game played will progressively approach the ideal of 
perfect play ... even though this finite goal is impossible to attain.
Incisive, intelligent, resourceful moves must NOT be confused with 
or dismissed as purely random moves.  Although I could humbly limit 
myself to applying only statistical methods, I am totally justified,
in this case, in more aggressively using the 'probabilities associated 
with N coin flips ALL with the same result' as an incomplete, minimum 
value before even taking the playing strength of SMIRF at extremely-long 
time controls into account to estimate a complete, maximum value.
______________________________________________________________

'The advantage that a player has in terms of winning probability is the
same at any TC I ever tried, and can thus equally reliably be determined
with games of any duration.'

You are obviously lacking completely in the prerequisite patience and 
determination to have EVER consistently used long enough time controls 
to see any benefit whatsoever in doing so.  If you had ever done so, 
then you would realize (as everyone else who has done so realizes) 
that the quality of the moves improves and even if the winning probability
has not changed much numerically in your experience, the figure you 
obtain is more reliable.  

[I cannot prove to you that this 'invisible' benefit exists
statistically. Instead, it is an important concept that you need to
understand in its own terms.  This is essential to what most playtesters do, with the notable exception of you.  If you want to understand what I do and why, then you must come to grips with this reality.]

Derek Nalls wrote on Tue, May 13, 2008 03:38 AM UTC:
CRC piece values tournament
http://www.symmetryperfect.com/pass/

Just push the 'download now' button.

Game #1
Scharnagl vs. Muller
10 minutes per move
SMIRF MS-174c

Result- inconclusive.
Draw after 87 moves by black.
Perpetual check declared.

H. G. Muller wrote on Tue, May 13, 2008 07:17 AM UTC:
This discussion is pointless. In dealing with a stochastic quantity, if
your statistics are no good, your observations are no good, and any
conclusions based on them utterly meaningless. Nothing of what you say
here has any reality value, it is just your own fantasies. First you
should have results, then it becomes possible to talk about what they
mean. You have no result. Get statistically meaningful test results. If
your method can't produce them, or you don't feel it important enough to
make your method produce them, don't bother us with your cr*p instead.

Two sets of piece values as different as day and knight, and the only
thing you can come up with is that their comparison is 'inconclusive'.
Are you sure that you could conclusively rule out that a Queen is worth 7,
or a Rook 8, by your method of 'playtesting'? Talk about pathetic: even
the two games you played are the same. Oh man, does your test setup s*ck!
If you cannot even decide simple issues like this, what makes you think
you have anything meaningful to say about piece values at all?

H. G. Muller wrote on Tue, May 13, 2008 10:59 AM UTC:
Once upon a time I had a friend in a country far, far away, who had
obtained a coin from the bank. I was sure this coin was counterfeit, as it
had a far larger probability of producing tails. I even PROVED it to him: I
threw the coin twice, and both times tails came up. But do you think the
fool believed me? No, he DIDN'T! 

He had the AUDACITY to claim there was nothing wrong with the coin,
because he had tossed it a thousand times, and 523 times heads had come up!
While it was clear to everyone that he was cheating: he threw the coin only
10 feet up into the air, on each try. While I brought my coin up to 30,000
feet in an airplane, before I threw it out of the window, BOTH times! And,
mind you, both times it landed tails! And it was not just an ordinary
plane, like a Boeing 747. No sir, it was a ROCKET plane!

And still this foolish friend of mine insisted that his measly 10 feet
throws made him more confident that the coin was OK than my IRONCLAD PROOF
with the rocket plane. Ridiculous! Anyone knows that you can't test a
coin by only tossing it 10 feet. If you do that, it might land on any
side, rather than the side it always lands on. He might as well have
flipped a coin! No wonder they sent him to this far, far away country: no
one would want to live in the same country as such an idiot. He even went
as far as to buy an ICECREAM for that coin, and even ENJOYED eating that!
Scandalous! I can tell you, he ain't my friend anymore! Using coins that
always land on one side as if it were real money.

For more fairy tales and bed-time stories, read Derek's postings on piece
values...
:-) :-) :-)

Jianying Ji wrote on Tue, May 13, 2008 12:59 PM UTC:
Two suggestions for settling debates such as these. First, distributed
computing to provide as much data as possible. And Bayesian statistical
methods to provide statistical bounds on results.
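
One way such a Bayesian bound might look, as a sketch only: it assumes a uniform prior on the win probability p, ignores draws, and treats games as independent. The function name is illustrative, and none of this is taken from the engines or tests discussed in the thread.

```python
from math import comb

def prob_stronger(wins, losses):
    """Posterior probability that p > 0.5, given a uniform prior and no draws.

    The posterior is Beta(wins + 1, losses + 1). For integer parameters its
    CDF equals a binomial tail sum, so no external library is needed.
    """
    n = wins + losses + 1
    cdf_at_half = sum(comb(n, j) for j in range(wins + 1, n + 1)) / 2 ** n
    return 1 - cdf_at_half

for wins, losses in ((1, 1), (4, 0), (55, 45)):
    print(wins, losses, prob_stronger(wins, losses))
```

A split result such as 1-1 leaves the posterior at exactly 0.5, while a larger sample with the same score fraction narrows it. The prior is a modelling choice, so the numbers are indicative rather than definitive.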

H. G. Muller wrote on Tue, May 13, 2008 01:58 PM UTC:
Jianying Ji:
| Two suggestions for settling debates such as these. First, distributed
| computing to provide as much data as possible. And Bayesian statistical
| methods to provide statistical bounds on results.

Agreed: one first needs to generate data. Without data, there isn't even
a debate, and everything is just idle talk. What bounds would you expect
from a two-game dataset? And what if these two games were actually the
same?

But the problem is that the proverbial fool can always ask more than
anyone can answer. If, by recruiting all PCs in the World, we could
generate 100,000 games at an hour per move, an hour per move will of
course not be 'good enough'. It will at least have to be a week per
move. Or, if that is possible, 100 years per move.

And even 100 years per move are of course no good, because the computers
will still not be able to search into the end-game, as they will search
only 12 ply deeper than with 1 hour per move. So what's the point?

Not only is this an end-of-the-rainbow-type endeavor, even if you would get
there, and generate the perfect data, where it is 100% sure and proven for
each position what the outcome under perfect play is, what then? Because
for simple end-games we are already in a position to reach perfect play,
through retrograde analysis (tablebases).

So why not start there, to show that such data is of any use whatsoever,
in this case for generating end-game piece values? If you have the EGTB
for KQKAN, and KAKBN, how would you extract a piece value for A from it?

Derek Nalls wrote on Tue, May 13, 2008 03:08 PM UTC:
'This discussion is pointless.'

On this one occasion, I agree with you.

However, I cannot just let you get away with some of your most 
outrageous remarks to date.

So, unfortunately, this discussion is not yet over.
____________________________________________

'First you should have results, 
then it becomes possible to talk about what they mean. 
You have no result.'

Of course, I have a result!

The result is obviously the game itself as a win, loss or draw
for the purposes of comparing the playing strengths of two
players using different sets of CRC piece values.

The result is NOT statistical in nature.
Instead, the result is probabilistic in nature.

I have thoroughly explained this purpose and method to you.
I understand it.
Reinhard Scharnagl understands it.
You do not understand it.
I can accept that.
However, instead of admitting that you do not understand it,
you claim there is nothing to understand.
______________________________________

'Two sets of piece values as different as day and night, and the only
thing you can come up with is that their comparison is
'inconclusive'.'

Yes.  Draws make it impossible to determine which of two sets of
piece values is stronger or weaker.  However, by increasing the
time (and plies) per move, smaller differences in playing strength 
can sometimes be revealed with 'conclusive' results.

I will attempt the next pair of Scharnagl vs. Muller and Muller vs.
Scharnagl games at 30 minutes per move.  Knowing how much
you appreciate my efforts on your behalf motivates me.
___________________________________________________

'Talk about pathetic: even the two games you played are the same.'

Only one game was played.

The logs you saw were produced by the Scharnagl (standard) version
of SMIRF for the white player and the Muller (special) version of SMIRF
for the black player.  The game is handled in this manner to prevent 
time from being expired without computation occurring.
___________________________________________________

'... does your test setup s*ck!'

What, now you hate Embassy Chess too?
Take up this issue with Kevin Hill.

Jianying Ji wrote on Tue, May 13, 2008 03:28 PM UTC:
I really am completely lost, so I won't comment until I can see what the
debate is about.

Reinhard Scharnagl wrote on Tue, May 13, 2008 03:50 PM UTC:
H.G.M. wrote: '... he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane ...'

Understanding your example as an argument against Derek Nalls' testing method, I wonder why your chess engines always are thinking using the full given timeframe. It would be much more impressive, if your engine would decide always immediately. ;-)

I am still convinced, that longer thinking times would have an influence on the quality of the resulting moves.

Derek Nalls wrote on Tue, May 13, 2008 04:18 PM UTC:
Since I had to endure one of your long bedtime stories (to be sure),
you are going to have to endure one of mine.  Yet unlike yours
[too incoherent to merit a reply], mine carries an important point:

Consider it a test of your common sense-

Here is a scenario ...

01.  It is the year 2500 AD.

02.  Androids exist.

03.  Androids cannot tell lies.

04.  Androids can cheat, though.

05.  Androids are extremely intelligent in technical matters.

06.  Your best friend is an android.

07.  It tells you that it won the lottery.

08.  You verify that it won the lottery.

09.  It tells you that it purchased only one lottery ticket.

10.  You verify that it purchased only one lottery ticket.

11.  The chance of winning the lottery with only one ticket is 1 out of
100 million.

12.  It tells you that it cheated to win the lottery by hacking into its
computer system immediately after the winning numbers were announced,
purchasing one winning ticket and back-dating the time of the purchase.
____________________________________________

You have only two choices as to what to believe happened-

A.  The android actually won the lottery by cheating.

OR

B.  The android actually won the lottery by good luck.
The android was mistaken in thinking it successfully cheated.
______________________________________________________

The chance of 'A' being true is 99,999,999 out of 100,000,000.
The chance of 'B' being true is 1 out of 100,000,000.
________________________________________________

I would place my bet upon 'A' being true
because I do not believe such unlikely coincidences
will actually occur.

You would place your bet upon 'B' being true
because you do not believe such unlikely coincidences
have any statistical significance whatsoever.
_________________________________________

I make this assessment of your judgment ability fairly because you think
it is a meaningless result if a player with one set of CRC piece values
wins against its opponent 10-times-in-a-row even as the chance of it being
'random good luck' is indisputably only 1 out of 1024.

By the way ...

base 2 to exponent 100 equals 1,267,650,600,228,229,401,496,703,205,376.

Can you see how ridiculous your demand of 100 games is?

H. G. Muller wrote on Tue, May 13, 2008 04:57 PM UTC:
Is this story meant to illustrate that you have no clue as to how to
calculate statistical significance? Or perhaps that you don't know what
it is at all?

The observation of a single tails event rules out the null hypothesis that
the lottery was fair (i.e. that the probability for this to happen was
0.000,000,01) with a confidence of 99.999,999%.

Be careful, though, that this only describes the case where the winning
android was somehow special or singled out in advance. If the other
participants to the lottery were 100 million other cheating androids, it
would not be remarkable in any way that one of them won. The null
hypothesis that the lottery was fair predicted a 100% probability for
that.

But, unfortunately for you, it doesn't work for lotteries with only 2
tickets. Then you can rule out the null hypothesis that the lottery was fair
(and hence the probability 0.5) with a confidence of 50%. And 50%
confidence means that in 50% of the cases your conclusion is correct, and
in the other 50% of the cases not. In other words, a confidence level of
50% is a completely blind, uninformed random guess.

H. G. Muller wrote on Tue, May 13, 2008 05:06 PM UTC:
Reinhard Scharnagl:
| I am still convinced, that longer thinking times would have an 
| influence on the quality of the resulting moves.

Yes, so what? Why do you think that is a relevant remark? The better moves
won't help you at all, if the opponent also does better moves. The result
will be the same. And the rare cases it is not, on the average cancel each
other.

So for the umpteenth time:
NO ONE DENIES THAT LONGER THINKING TIME PRODUCES SOMEWHAT BETTER MOVES.
THE ISSUE IS THAT IF BOTH SIDES PLAY WITH LONGER TC, THEIR WINNING
PROBABILITIES WON'T CHANGE.

And don't bother to tell us that you are also convinced that the
winning probabilities will change, without showing us proof. Because no
one is interested in unfounded opinions, not even if they are yours.

Derek Nalls wrote on Tue, May 13, 2008 05:27 PM UTC:
'Is this story meant to illustrate that you have no clue as to how to
calculate statistical significance?'

No.

This story is meant to illustrate that you have no clue as to how to
calculate probabilistic significance ... and it worked perfectly.
________________________________________________________

There you go again.  Missing the point entirely and ranting about
probabilities not being proper statistics.
