Comments by HGMuller
Once upon a time I had a friend in a country far, far away, who had obtained a coin from the bank. I was sure this coin was counterfeit, as it had a far larger probability of producing tails. I even PROVED it to him: I threw the coin twice, and both times tails came up. But do you think the fool believed me? No, he DIDN'T! He had the AUDACITY to claim there was nothing wrong with the coin, because he had tossed it a thousand times, and 523 times heads had come up! While it was clear to everyone that he was cheating: he threw the coin only 10 feet up into the air, on each try. While I brought my coin up to 30,000 feet in an airplane, before I threw it out of the window, BOTH times! And, mind you, both times it landed tails! And it was not just an ordinary plane, like a Boeing 747. No sir, it was a ROCKET plane! And still this foolish friend of mine insisted that his measly 10-foot throws made him more confident that the coin was OK than my IRONCLAD PROOF with the rocket plane. Ridiculous! Everyone knows that you can't test a coin by only tossing it 10 feet. If you do that, it might land on any side, rather than the side it always lands on. He might as well have flipped a coin! No wonder they sent him to this far, far away country: no one would want to live in the same country as such an idiot. He even went as far as to buy an ICE CREAM for that coin, and even ENJOYED eating it! Scandalous! I can tell you, he ain't my friend anymore! Using coins that always land on one side as if they were real money. For more fairy tales and bed-time stories, read Derek's postings on piece values... :-) :-) :-)
Jianying Ji: | Two suggestion for settling debates such as these. First distributed | computing to provide as much data as possible. And bayesian statistical | methods to provide statistical bounds on results. Agreed: one first needs to generate data. Without data, there isn't even a debate, and everything is just idle talk. What bounds would you expect from a two-game dataset? And what if these two games were actually the same? But the problem is that the proverbial fool can always ask more than anyone can answer. If, by recruiting all PCs in the world, we could generate 100,000 games at an hour per move, an hour per move will of course not be 'good enough'. It will at least have to be a week per move. Or, if that is possible, 100 years per move. And even 100 years per move is of course no good, because the computers will still not be able to search into the end-game, as they will search only 12 ply deeper than with 1 hour per move. So what's the point? Not only is this an end-of-the-rainbow-type endeavor; even if you would get there, and generate the perfect data, where it is 100% sure and proven for each position what the outcome under perfect play is, what then? Because for simple end-games we are already in a position to reach perfect play, through retrograde analysis (tablebases). So why not start there, to show that such data is of any use whatsoever, in this case for generating end-game piece values? If you have the EGTB for KQKAN, and KAKBN, how would you extract a piece value for A from it?
Is this story meant to illustrate that you have no clue as to how to calculate statistical significance? Or perhaps that you don't know what it is at all? The observation of a single tails event rules out the null hypothesis that the lottery was fair (i.e. that the probability for this to happen was 0.000,000,01) with a confidence of 99.999,999%. Be careful, though, that this only describes the case where the winning android was somehow special or singled out in advance. If the other participants in the lottery were 100 million other cheating androids, it would not be remarkable in any way that one of them won. The null hypothesis that the lottery was fair predicted a 100% probability for that. But, unfortunately for you, it doesn't work for lotteries with only 2 tickets. Then you can rule out the null hypothesis that the lottery was fair (and hence the probability 0.5) with a confidence of 50%. And 50% confidence means that in 50% of the cases your conclusion is correct, and in the other 50% of the cases not. In other words, a confidence level of 50% is a completely blind, uninformed random guess.
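To make the arithmetic behind the coin story concrete, here is a small Python sketch (my own illustration, not part of the original discussion) that computes the two-sided binomial p-value for both 'experiments': two tails in two tosses of a fair coin gives p = 0.5, i.e. no evidence at all, while 523 heads in 1000 tosses produces a p-value well above any significance threshold, so the friend's coin looks perfectly fair.

```python
from math import comb

def two_sided_p(n, k, p=0.5):
    """Two-sided binomial p-value: probability, under a fair coin,
    of a result at least as far from the mean as k heads in n tosses."""
    dev = abs(k - n * p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1) if abs(i - n * p) >= dev)

# The 'rocket plane' experiment: 2 tails (0 heads) in 2 tosses.
print(two_sided_p(2, 0))      # 0.5 -- no better than a blind guess

# The friend's test: 523 heads in 1000 tosses -- the p-value is far
# too large to reject the hypothesis that the coin is fair.
print(round(two_sided_p(1000, 523), 3))
```

The two-toss 'proof' thus carries exactly as much information as a single coin flip, which is the whole point of the story.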
Reinhard Scharnagl: | I am still convinced, that longer thinking times would have an | influence on the quality of the resulting moves. Yes, so what? Why do you think that is a relevant remark? The better moves won't help you at all if the opponent also plays better moves. The result will be the same. And the rare cases where it is not, on average cancel each other. So for the umpteenth time: NO ONE DENIES THAT LONGER THINKING TIME PRODUCES SOMEWHAT BETTER MOVES. THE ISSUE IS THAT IF BOTH SIDES PLAY WITH LONGER TC, THEIR WINNING PROBABILITIES WON'T CHANGE. And don't bother to tell us that you are also convinced that the winning probabilities will change, without showing us proof. Because no one is interested in unfounded opinions, not even if they are yours.
Reinhard, that is not relevant. It will happen on average as often for the other side. It is in the nature of Chess. Every game that is won, is won by an error that might not have been made on longer thinking, as the initial position is not a won position for either side. But most games are won by either side, and if they are allowed to think longer, most games are still won by either side. What is so hard to understand about the statement 'the win probability (score fraction, if you allow for draws) obtained from a given quiet, but complex (many pieces) position between equal opponents does not depend on time control' that it prompts people to come up with irrelevancies? Why do you think that saying anything at all that does not mention an observed probability would have any bearing on this statement whatsoever? I don't think the ever more hollow-sounding self-declared superiority of Derek needs much comment. He obviously knows zilch about probability theory and statistics. Shouting that he does won't make it so, and won't fool anyone.
This discussion is too silly for words anyway. Because even if it were true that the winning probability for a given material imbalance would be different at 1 hour per move than it would be at 10 sec/move, it would merely mean that piece values are different for different quality players. And although that is unprecedented, that revelation in itself would not make the piece values at 1 hour per move of any use, as that is a time control that no one wants to play anyway. So the whole endeavor is doomed from the start: by testing at 1 hour per move, either you measure the same piece values as you would at 10 sec/move, and wasted 99.7% of your time, or you find different values, and then you have wrong values, which cannot be used at any time control you would actually want to play...
My small live tourney has led to a proliferation of WinBoard-compatible Knightmate engines. We now have: JokerKM ( http://home.hccnet.nl/h.g.muller/jokerKM.exe) CCCP-Knightmate ( http://www.marittima.pl/cccp) Fairy-Max ( http://home.hccnet.nl/h.g.muller/dwnldpage.html, do not forget to download the accompanying fmax.ini with game definitions!) Dabbaba ( http://homepages.tesco.net/henry.ablett/jims.html) JokerKM is the strongest, CCCP and Fairy-Max are both about 400 Elo points weaker. Dabbaba is a rebuild of an old DOS engine from the 90s, and is some 300 Elo points behind that.
Rich Hutnik: | Anyone think this might be a sound approach? Well, not me! Science is not a democracy. We don't interview people in the street to determine if a neutron is heavier than a proton, or what the 100th decimal of the number pi is. At best, you could use this method to determine the CV rating of the interviewed people. But even if a million people would think that piece A is worth more than piece B, and none the other way around, that doesn't make it so. The only thing that counts is if A makes you win more often than B would. If it doesn't, then it is of lower value. No matter what people say, or how many say it.
Note there are now many free computer programs that can play the 10x8 variants with the Capablanca piece set. Many use the WinBoard protocol to communicate their moves, so they can be made to play each other automatically under the WinBoard GUI. Pages with many links to downloadable engines can be found at http://home.hccnet.nl/h.g.muller/10x8.html and [at another site.] The results of a recent tournament of the WB-compatible engines at long time control (55 min + 5 sec/move), where each engine had to play each other engine 10 times, over 5 different opening setups (Carrera, Bird, Capablanca, and Embassy), led to the following ranking:

Rank Name             Elo    +    -  games score  oppo. draws
  1  Joker80 n        2432   96   83   70   80%   2110    0%
  2  TJchess10x8      2346   83   76   70   72%   2122    4%
  3  Smirf 1.73h      2304   80   75   70   68%   2128    4%
  4  Smirf Donation   2165   73   73   70   53%   2148    9%
  5  [other software]
  6  Fairy-Max 4.8 v  2027   72   77   70   34%   2168   11%
  7  BigLion80 4apr   1945   76   84   70   26%   2179    7%
  8  ArcBishop80 1.00 1822   86  103   70   15%   2197    4%

Except for Smirf 1.73h, all the engines are available for free download, from their various sources. In addition, there exist several programs with incompatible interfaces, such as ChessV and Zillions of Games. Their level of play is not thoroughly tested, as the incompatibility of their interfaces makes it impossible to play them against each other without the assistance of a Human operator, which again makes it difficult to conduct the hundreds of games necessary for reliable rating determination. Compared to the ranking above, Zillions would rank at the very bottom. [The above has been edited to remove a name and site reference. It is the policy of cv.org to avoid mention of that particular name and site to remove any threat of lawsuits. Sorry to have to do that, but we must protect ourselves. - J. Joyce]
I am sorry to have put your site in jeopardy, I was not aware that giving a link to a site as a source of information could make you subject to a lawsuit. But why did you delete the reference to poor Michel's program? My own engines are mentioned on the unspeakable website as well, on the very page of which you deleted the link. I even gave permission to its owner to host them there for download, should I no longer want to host them myself. Does that mean I will in the future also not be allowed to mention any of my own engines here??? Would it at least be allowed to mention the performance rating of the [other software]? Anyway, people interested in the complete result of the WinBoard General 10x8 Championship 2008, can find it on my own website, on the page: http://home.hccnet.nl/h.g.muller/BotG08G/finalstanding.html
OK, fair enough. But 10x8 variants are rapidly growing more popular with engine programmers, and I intend to contribute to that process through organizing the 'Battle of the Goths' tournament series, and publishing rating lists. I might want to share important developments in that area here, so it would be useful to know which engines can be mentioned, and which not. Is the problem caused by the 'G-word', and should I avoid any reference to engines that contain the G-word as part of their name? So far there are only two of these, but there are likely to be many more in the future, as people tend to name their engines after the variant they are playing.
Would it be OK then, if I just circumscribe the [other software] in my tournament as 'a version of the well known open-source program TSCP, adapted to play some 10x8 variants', and call it 'TSCP-derivative' for short? Or is it too risky to mention the names of popular Chess engines like TSCP even in their normal Chess version (or Capablanca version), once someone has created a derivative of them that is capable of playing the unspeakable variant?
I am currently engaged in a massive test effort to understand such short-range leapers. It is slow going, though: there are many possible combinations of moves, especially if you drop the requirement for 8-fold symmetry. And I need at least 400 games to get an acceptable accuracy for the empirical piece value of a certain piece type. Even then, the statistical (random) error in the piece values is about 0.1 Pawn, if I test them in pairs (to double the effect of any value difference).
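For what it's worth, the arithmetic behind that 0.1 Pawn figure can be sketched as follows. This is my own back-of-the-envelope reconstruction, assuming roughly 7% score per half Pawn of material (a conversion mentioned in the Bishop example elsewhere in these comments):

```python
from math import sqrt

def score_sigma(games, p=0.5):
    """Standard error of a measured score fraction over `games`
    independent games (simple binomial model, draws ignored)."""
    return sqrt(p * (1 - p) / games)

# Assumed conversion factor, taken from the Bishop example later in
# these comments: half a Pawn of material ~ 7% score, so ~14% per Pawn.
SCORE_PER_PAWN = 0.14

sigma = score_sigma(400)              # 0.025 score fraction
pawn_error = sigma / SCORE_PER_PAWN   # ~0.18 Pawn for a single piece...
paired = pawn_error / 2               # ...halved by testing pieces in pairs
print(round(sigma, 3), round(paired, 2))   # 0.025 0.09
```

With those assumptions, 400 games and paired testing indeed land at roughly a 0.1 Pawn statistical error.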
Your estimate seems reasonable, from what I have learned so far. 8-fold-symmetric SR compound leapers with N moves seem to have a value close to (30+5/8*N)*N, in centiPawn. That would evaluate to 640 for the Squirrel. And I expect the Squirrel to be one of the stronger such compounds, with this number of moves, because of the 'front' of 5 contiguous forward moves.
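A trivial sketch of that empirical formula, assuming the Squirrel is the 16-move Knight+Alfil+Dabbaba compound (8 + 4 + 4 target squares):

```python
def leaper_value(n_moves):
    """Empirical value, in centiPawn, of an 8-fold-symmetric short-range
    compound leaper with n_moves moves: (30 + 5/8 * N) * N."""
    return (30 + 5 / 8 * n_moves) * n_moves

print(leaper_value(8))    # 280.0 -- in the neighborhood of a Knight
print(leaper_value(16))   # 640.0 -- the Squirrel estimate from the text
```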
First about the potential bug:
I am afraid that I need more information to figure out what exactly was
the problem. This is not a plain move-generator bug; when I feed the game
to my version of Joker80 here (which is presumably the same as the one you
are using), it accepts the move without complaints. It would be
inconceivable anyway that a move-generator bug in such a common move would
not have manifested itself in the many hundreds of games I had it play
against other engines.
OTOH, Human vs. engine play is virtually untested. Did you at any point of
the game use 'undo' (through the WinBoard 'retract move')? It might be
that the undo is not correctly implemented, and I would not notice it in
engine-engine play. In fact it is very likely to be broken after setting up
a position, as I implemented it by resetting to the opening position and
replaying all moves from there. But this won't work after loading a FEN
(a feature I added only later). This is indeed something I should fix, but
the current work-around would be not to use 'undo'.
To make sure what happened, I would have to see the winboard.debug file
(which records all communication between engine and GUI, including a lot
of debug output from the engine itself). Unfortunately this file is not
made by default. You would have to start WinBoard with the command-line
option /debug, or press + + after starting WinBoard. And
then immediately rename the winboard.debug to something else if a bug
manifests itself, to prevent it from being overwritten when you run
WinBoard again.
Joker80 also makes a log file 'jokerlog.txt', but this also is
overwritten each time you re-run it. If you didn't run Joker80 since the
bug, it might help if you sent me that file. Otherwise, I am afraid that
there is little I can do at the moment; we would have to wait until the
problem occurs again, and examine the recorded debug information.
About the piece values:
I could make a Joker80 version that reads the piece base values from a
file 'joker.ini' at startup. Then you could change them to anything you
want to test, without the need to re-compile. Would that satisfy your
needs?
Note that currently Joker80 is not really able to play CRC, as it only
supports normal castling.
OK, I replaced the joker80.exe on my website by one with adjustable piece values. (If you run it from the command line, it should say version 1.1.14 (h).) I also tried to fix the bug in undo (which I discovered was disabled altogether in the previous version), and although it seemed to work, it might remain a weak spot. (I foresee problems if the game contained a promotion, for instance, as it might not remember the correct promotion piece on replay.) So try to avoid using the undo.

I decided to make the piece values adjustable through a command-line option, rather than from a file, to avoid problems if you want to run two different sets of piece values (where you would then have to keep the files separate somehow). The way it works now is that for the engine name (which WinBoard asks for in the startup dialog, or which you can put in the winboard.ini file to appear among the selectable engines there), you should write: joker80.exe P85=300=350=475=875=900=950 The whole thing should be put between double quotes, so that WinBoard knows the P... is an option to the engine, and not to WinBoard. The numerical values are those of P, N, B, R, A, C and Q, respectively, in centiPawn. You can replace them by any value you like. If you don't give the P argument, it uses the default values. If you give a P argument with not enough values, the engine exits.

Note that these are base values, for the positionally average piece. For N and B this would be on c3, in the presence (for B) of ~6 own Pawns, half of them on the color of the Bishop. A Bishop pair further gets a 40cP bonus. For the Rook it is the value for one in the absence of (half-)open files. The Pawn value will be heavily modified by positional effects (centralization, support by own Pawns, blocking by enemy Pawns), which on average will be positive.

Note that you can play two different versions against each other automatically. The first engine plays white, in two-machines mode.
(You won't be able to recognize them from their name...)
One small refinement: if the command-line argument was used to modify the piece values, Joker80 will give its own name to WinBoard as 'Joker80.xp', instead of 'Joker80.np', so that it becomes easier to figure out which engine was winning (e.g. from the PGN file). Note also that at very long time control you might want to enlarge the hash table; the default is 128MB, but if you invoke Joker80 as 'joker80.exe 22 P100=300=....' it will use 256MB (and with 23 instead of 22 it will use 512MB, etc.)
What characterizes Chess variants:
1) move one piece at a time to an empty cell, or
2) capture an enemy piece by moving into its cell
3) win by capture of a royal piece
4) many different piece types
5) a large fraction of the pieces are pawns
6) pawns are weak pieces which move irreversibly, and promote to a stronger piece when advanced enough.
Some of these rules can be violated, but only if all other characteristics are very close to a very common variant.
Well, even FIDE Chess violates the defining characteristics, by the non-Chess-like moves of castling and e.p. capture. But, like I stated, violation of some of the rules does not immediately disqualify a game as a CV. Extinction Chess doesn't have a royal piece, but in all other respects it is identical to FIDE Chess. So it is clearly a CV. But I would not call checkers or draughts CVs. In the interpretation that the chips are pawns (they do promote...), the capture mode and piece variety is too different from common variants to qualify. I do not consider Ultima / Baroque a Chess variant. It does have piece variety, and even a royal piece, but the capture modes are too alien; only the King has a Chess-like capture, most pieces don't. I see no problem with Jacks and Witches. The majority of the pieces are normal Chess pieces. OK, so some Witch moves violate the one-at-a-time rule, like castling does. No problem, as even within this game this is an exception. IMO the array is not relevant as a distinctive trait of variants. You could call them sub-variants at best. Near Chess is simply FIDE Chess. The opening position of Near Chess occurs even in the game tree of FIDE Chess. In that respect FRC is more different from FIDE Chess than Near Chess is: there at least the opening position can be unreachable from the FIDE opening.
Well, to get an impression of what you can expect: in my first versions of Joker80 I still used the Larry Kaufman piece values of 8x8 Chess. So the Bishop was half a Pawn too low, nearly equal to the Knight (as with more than 5 Pawns, Kaufman has a Knight worth more than a lone Bishop, neutralizing a large part of the pair bonus). Now unlike a Rook, a Bishop is very easy to trade for a Knight, as both get into play early. Making the trade usually wrecks the opponent's pawn structure by creating a doubled Pawn, giving enough compensation to make it attractive. So in almost all games Joker played with two Knights against two Bishops after 12 moves or so. Fixing that did increase the playing strength by ~100 Elo points. So where the old version would score 50%, the improved version would score 57%.

Now a similarly bad value for the Rook would manifest itself much less readily: the Rooks get into play late, there is no nearly equal piece for which a 1:1 trade changes sign, and you would need 1:3 trades (R vs B+2P) or 2:2 trades (R+P for N+N), which are much more difficult to set up. So I would expect that being half a Pawn off on the Rook value would only reduce your score by about 3%, rather than 7% as with the Bishop. After playing 100 games, the score differs by more than 3% from the true win probability more often than not. So you would need at least 400 games to show with minimal confidence that there was a difference.

Beware that the results of the games are stochastic quantities. Replay the game at the same time control, and the game Joker80 plays will be different. And often the result will be different. This is true at 1 sec per move, but it is equally true at 1 year per move. The games that will be played are just a sample from the myriads of games Joker80 could play with non-zero probability.
And with fewer than 400 games, the difference between the actually measured score percentage and the probability you want to determine will in most cases be larger than the effect of the piece values, if they are not extremely wrong (e.g. setting Q < B).
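As an illustration of those game counts (my own sketch, using a simple binomial noise model and ignoring draws), here is how many games it takes before a 3% score effect, like the half-Pawn-off-on-the-Rook case, stands out from the noise:

```python
def games_needed(effect, sigmas=2.0, p=0.5):
    """Games after which binomial noise (sigma = sqrt(p(1-p)/n)) drops
    below effect/sigmas, i.e. the effect is `sigmas` standard errors."""
    return p * (1 - p) * (sigmas / effect) ** 2

# A 3% score effect:
print(round(games_needed(0.03, sigmas=1)))   # ~278 games: barely one sigma
print(round(games_needed(0.03, sigmas=2)))   # ~1111 games for solid confidence
```

This is consistent with the point above: at 100 games the noise (5% sigma) swamps a 3% effect, and even 400 games gives only marginal confidence.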
It looks OK to me. One caveat: the normalization (e.g. Pawn = 100) is not completely arbitrary, as the engine weighs material against positional terms, and doubling all piece values would effectively scale down the importance of passers and King safety. In addition, the engine also uses some heavily rounded 'quick' piece values internally, where B=N=3, R=5, A=C=8 and Q=9, to make a rough guess whether certain branches stand any chance to recoup the material it gave earlier in the branch. So in certain situations, when it is behind 800 cP, it won't consider capturing a Rook, because it expects that to be worth about 500 cP, and thus falls 300 cP below the target. Such a large deficit would be beyond the safety margin for pruning the move. But if the piece values were scaled up such that the 800 merely represented being a Bishop behind, this obviously would be an unjustified pruning. The safety margin is large enough to allow some leeway here, but don't overdo it. It would be safest to keep the value of Q close to 950.

I am indeed skeptical of the possibility of doing enough games to measure the difference you want to see in the total score percentage. But perhaps some sound conclusions could be drawn by not merely looking at the result, but at the actual games, and singling out the Q vs 2R trades. (Or actually any Rook versus other material trade before the end-game. Rooks capturing Pawns to prevent their promotion probably should not count, though.) These could then be used to separately extract the probability for such a trade for the two sets of piece values, and determine the winning probability for each of the piece values once such a trade has occurred. By filtering the raw data this way, we get rid of the stochastic noise produced by the (majority of) games where the event we want to determine the effect of would not have occurred.
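The 'quick value' pruning test described above could be sketched like this. This is a hypothetical reconstruction: the function name, the 200 cP margin and the exact comparison are my own inventions for illustration, not Joker80's actual code.

```python
# Rough internal 'quick' values, as described in the text:
QUICK_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'A': 8, 'C': 8, 'Q': 9}

SAFETY_MARGIN = 200   # centiPawn -- a made-up figure for illustration

def worth_searching(deficit_cp, victim):
    """Pre-search test: can capturing `victim` (valued by its quick
    value, in centiPawn) plausibly recoup a material deficit of
    `deficit_cp` centiPawn, within the safety margin?"""
    return QUICK_VALUE[victim] * 100 >= deficit_cp - SAFETY_MARGIN

# The example from the text: 800 cP behind, a Rook capture (~500 cP)
# falls 300 cP short of the target, beyond the margin -- pruned.
print(worth_searching(800, 'R'))   # False
print(worth_searching(800, 'Q'))   # True
```

It also shows why inflated piece values break the scheme: if 800 cP merely meant 'a Bishop behind', the prune would be unjustified.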
Well, I share that concern. But note that the low Rook value was not only based on the result of Q-2R asymmetric testing. I also played R-BP and NN-RP, which ended unexpectedly badly for the Rook, and this sets the value of the Rook compared to that of the minor pieces. While the value of the Queen was independently tested against that of the minor pieces by playing Q-BNN. The low difference between R and B does make sense to me now, as the wider board should upgrade the Bishop a lot more than the Rook. The Bishop gets extra forward moves, and forward moves are worth a lot more than lateral moves. I have seen that in testing cylindrical pieces (indicated by *), where the periodic boundary condition w.r.t. the side edges effectively simulates an infinitely wide board. In a context of normal Chess pieces, B* = B+P, while R* = R + 0.25P. OTOH, Q* = Q+2P. So it doesn't surprise me that on wider boards R loses compared to Q and B.

I can think of several systematic errors that lead to unrealistically poor performance of the Rook in asymmetric playtesting from an opening position. One is that Capablanca Chess is a very violent game, where the three super-pieces are often involved in inflicting an early checkmate (or nearly so, where the opponent has to sacrifice so much material to prevent the mate that he is lost anyway). The Rooks initially offer not much defense against that. But your chances for such an early victory would be strongly reduced if you were missing a super-piece. So perhaps two Rooks would do better against Q after A and C are traded. This explanation would do nothing to explain the poor Rook performance in R vs B, but perhaps it is B that is strong (it is also strong compared to N). The problem then would be not so much a low R value, but a high Q value, due to cooperativity between superpieces.
So perhaps the observed scores should not be entirely interpreted as high base values for Q, C and A, but might be partly due to super-piece pair bonuses similar to that for the Bishop pair. Which I would then (mistakenly) include in the base value, as the other super-pieces are always present in my test positions.

Another possible source of error is that the engine plays a strategy that is not well suited for playing 2R vs Q. Joker80's evaluation does not place a lot of importance on keeping all its pieces defended. In general this might be a winning strategy, giving the engine more freedom in using its pieces in daring attacks. But 2R vs Q might be a case where this backfires, and where you can only manifest the superiority of your Rook force by very careful and meticulous, nearly allergic defense of your troops, slowly but surely pushing them forward. This is not really the style of Joker's play. So it would be interesting to do the asymmetric playtesting for Q vs 2R with other engines as well. But TJchess10x8 only became available long after I started my piece-value project; TSCP-G does not allow setting up positions (although now I know a work-around for that: forcing initial moves with both ArchBishops to capture all pieces to delete, and then retreating them before letting the engine play). And Smirf initially could not play automatically at all, and when I finally made a WB adapter for it so that it could, fast games by it were decided more by timing issues than by play quality (many losses on time with scores like +12!). And Fairy-Max is really a bit too simplistic for this, not knowing the concept of a Bishop pair or passed pawns, besides being a slower searcher.
[I deleted this post, because I accidentally posted it in the wrong discussion.]
Is there any special reason you want to keep the Pawn value equal in all trial versions, rather than, say, the total value of the army, or the value of the Queen? Especially in the Scharnagl settings it makes almost every piece rather light compared to the quick guesses used for pruning. Note that there are so many positional modifiers on the value of a pawn (not only determined by its own position, but also by its relation to other friendly and enemy pawns) that I am not sure what the base value really means. Even if I say that it represents the value of a Pawn on g2, the evaluation points lost on deleting a pawn on g2 will depend on whether there are pawns on the e- and i-files, and how far they are advanced, and on the presence of pawns on the f- and h-files (which might become backward or isolated), and of course on whether losing the pawn would create a passer for the opponent. If I were you, I would normalize all models to Q=950, but then replace the pawn value everywhere by 85 (I think the standard value used in Joker is even 75). I don't think you could say then that you deviate from the model, as the models do not really specify which type of Pawn they use as a standard. My value refers to the g2 pawn in an opening setup. Perhaps Reinhard's value refers to an 'average' pawn, in a typical pawn chain occurring in the early middle game, or a Pawn on d4/e4 (which is the most likely to be traded).

As to the B-pair: tricky question. The way you did it now would make the first Bishop to be traded have the value the model prescribes, but would make the second much lighter. If you subtract half the bonus, then on average they would be what the model prescribes. The value is indeed hard-wired in Joker, but if you really want, I could make it adjustable through an 8th parameter.
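The proposed normalization (rescale a model so Q=950, then override the Pawn with a fixed 85) is mechanical enough to sketch. The example model numbers below are made up for illustration, not Scharnagl's actual values:

```python
def normalize(values, q_target=950, pawn_override=85):
    """Rescale a piece-value model so that Q equals q_target, then
    replace the (model-dependent) Pawn value by a fixed one."""
    scale = q_target / values['Q']
    out = {piece: round(v * scale) for piece, v in values.items()}
    out['P'] = pawn_override
    return out

# A made-up example model in centiPawn (P, N, B, R, A, C, Q):
model = {'P': 100, 'N': 300, 'B': 350, 'R': 500, 'A': 875, 'C': 900, 'Q': 1000}
print(normalize(model))
```

Since the models leave the 'standard Pawn' unspecified anyway, fixing it by hand arguably does not deviate from the model at all.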
Well, I do not really play CVs myself, but I love to watch games played by my engines, especially blitz games. And from this I learned that Knightmate is a CV that definitely works. It is just different enough from FIDE Chess to make it interesting, but familiar enough that you immediately can grasp it. Great game! Similarly for the 10x8 Capablanca variants. They are very interesting because of the Archbishop, which tends to be very active.