Friday, July 17, 2009

Does Bayes' Theorem Have Any Special Epistemological Significance?

Bayesian epistemologists seem to think that Bayes' theorem (BT) has some special epistemological significance. Let's assume that BT provides us with a synchronic constraint on the coherence of one's degrees of belief (it tells us that, whatever our degrees of belief in H, E, H given E, and E given H are at time t, they have to be related so that Pr(H|E)t=(Pr(H)tPr(E|H)t)/Pr(E)t) and that synchronic coherence is a necessary but not sufficient condition for epistemic rationality. So far nothing epistemologically special about BT--every other theorem or axiom of probability theory also provides us with such a synchronic constraint.

Supposedly, however, BT does more than just that--it also tells us how to "conditionalize on new evidence". What I don't understand is how it is supposed to do so. As far as I can see, the theorem only tells us that the conditional probability of H given E, Pr(H|E)t, is equal to (Pr(H)tPr(E|H)t)/Pr(E)t, but this is only the old synchronic constraint again. It is only if we assume that, in observing that E, our degree of belief in H (Pr(H)t+1) becomes identical to our previous degree of belief in H given E (Pr(H|E)t) (i.e. if we assume that Pr(E)t+1=1 and Pr(H|E)t+1=Pr(H|E)t) that we can use BT to find out what that degree of belief was equal to. But then, if this is the case, BT in and of itself does not tell us what our degree of belief in H should be after the evidence is in. It only tells us what our degree of belief in H given E, Pr(H|E)t, had to be before the new evidence was in.
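
To put the point in code: here is a minimal sketch (with made-up numbers) that runs the two roles side by side--Bayes' theorem as a synchronic identity among the time-t credences, and strict conditionalization as a separate diachronic rule that sets Pr(H)t+1 equal to Pr(H|E)t. Nothing in the first function entails the second; it has to be assumed.

```python
# Toy degrees of belief at time t (made-up numbers).
prior_H = 0.3       # Pr(H)t
likelihood = 0.8    # Pr(E|H)t
prob_E = 0.5        # Pr(E)t

def bayes_identity(prior_H, likelihood, prob_E):
    """Synchronic constraint: Pr(H|E)t as fixed by the other time-t credences."""
    return prior_H * likelihood / prob_E

def conditionalize(old_conditional_credence):
    """Diachronic rule (strict conditionalization): Pr(H)t+1 := Pr(H|E)t.
    This is a further assumption, not a consequence of Bayes' theorem itself."""
    return old_conditional_credence

p_H_given_E_at_t = bayes_identity(prior_H, likelihood, prob_E)
p_H_at_t_plus_1 = conditionalize(p_H_given_E_at_t)
print(round(p_H_given_E_at_t, 2), round(p_H_at_t_plus_1, 2))  # 0.48 0.48
```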

Can someone please show me the error of my ways? Why do Bayesian epistemologists assume that BT plays any role different from that of the other axioms of probability? In what sense is it providing us with anything other than a synchronic constraint on our degrees of belief?

34 comments:

  1. I suggest we create a philosophical school called "al-Khwarizmian epistemology".

  2. A central assumption of Bayesianism is that, typically, Posterior(H)=Prior(H|E). As Jesus notes, this is not a strict requirement - but Bayesianism loses all bite if conditionalization isn't at least the usual and presumptive way to change degrees of belief. As you note in the original post, this doesn't necessarily give Bayes' theorem any special significance. Add the fact that Prior(H|E) is usually impossible to assess directly - and that the prior for the hypothesis, the likelihood of the evidence, and the expectedness of the evidence are easier to assess - and Bayes' theorem becomes an important tool for actually figuring out what Posterior(H) ought to be.
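
    A minimal numeric sketch of that division of labour (the numbers are made up, and I've assumed a value for Pr(E|~H) so that the expectedness of the evidence can be computed via the law of total probability):

    ```python
    # Made-up numbers: the pieces that are relatively easy to assess directly.
    prior_H = 0.1            # Pr(H): prior for the hypothesis
    likelihood_H = 0.9       # Pr(E|H): how strongly H predicts the evidence
    likelihood_not_H = 0.2   # Pr(E|~H): how expected E is if H is false (assumed)

    # The "expectedness" of the evidence, via the law of total probability.
    prob_E = likelihood_H * prior_H + likelihood_not_H * (1 - prior_H)

    # Bayes' theorem then delivers the quantity that is hard to assess directly.
    posterior_H = likelihood_H * prior_H / prob_E
    print(round(prob_E, 2), round(posterior_H, 3))  # 0.27 0.333
    ```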

  3. Jesus,

    As far as I can see, some (most?) Bayesian epistemologists think that synchronic and diachronic coherence are jointly necessary and sufficient conditions for epistemic rationality. If diachronic coherence were not necessary, the priors would never "wash out", as one could adopt new priors all the time instead of conditionalizing. In any case, my question was: 'In what sense does Bayes' theorem provide us with anything other than a synchronic constraint?' and I'm not sure I see how what you say answers that question.

    P.D.,

    As I mentioned above, I don't think that this can be just a matter of 'typically' for Bayesians. If "non-typical" updates were allowed from time to time, one could always occasionally reupdate her degrees of belief so as to match her initial priors and the priors would never wash out. In any case, you seem to agree that BT just provides us with a synchronic constraint on coherence, so I guess you aren't disagreeing with what I am saying.

  4. Gabriele,
    what I was trying to say is that, taking into account the mathematical content of the Pr function, 'prior' and 'posterior' do not refer at all to time (we interpret them that way because we make the assumption that E is 'added' at a certain point in time, but nothing in the mathematical structure of the function implies that this is a necessary assumption).
    .
    On the other hand, I don't know of any argument that shows that two people can have different Pr's at the same time and be rational, but one single person is necessarily irrational if she has different Pr's at different times. After all, could one not LEARN that his Pr was 'defective', and so try to replace it?

  5. I suspect that any reference to "updating by Bayes' Theorem" or the like has to be some sort of mistake. Bayes' Theorem itself is basically just a restatement of the ratio formula for conditional probability. However, it does have an advantage in being phrased in terms of the prior and the likelihood, both of which seem to be easier to get a grip on than the posterior.

    I don't get the sense that most Bayesian epistemologists require the washing-out-of-the-priors results. Maybe the majority of subjective Bayesian philosophers of science do, but I think objective Bayesians might say that if you start with the right function then washing out is unnecessary, and epistemologists outside philosophy of science seem not to mind as much if there is persistent disagreement. Thus, both of these groups seem to have room to allow for updates by means other than conditionalization (especially in cases where one "unlearns" things one thought one had properly learned).

  6. Personally, as a Bayesian, I don't think Bayes' theorem has any particular epistemic significance per se beyond the (Cox) axioms of probability theory. However, as a trivial consequence of those axioms, it does highlight rather well the way in which those axioms can be used to update our degrees of belief when new evidence comes to light.

    So basically, on re-reading your post, I agree with you, and most of the following is pretty much a long-winded expansion of what you said. However, it might be helpful anyway, as an explanation of why Bayes' theorem is seen as significant by a Bayesian who happens to agree with you. I think I have clarified my own views somewhat in writing it.

    To a Bayesian (at least a sensible one who has read Cox and Jaynes) all probabilities represent degrees of belief about something, and hence there is no such thing as a non-conditional probability: all probabilities are conditional on some background body of knowledge and belief. So the four relevant probabilities are

    p(E|U),
    p(H|U),
    p(E|HU) and
    p(H|EU).

    As you rightly point out, Bayes' theorem relates these quantities in a synchronic fashion, regardless of what E, H and U represent. However, we can regard U as representing someone's state of knowledge (call them person A). We can assume that that person has some level of uncertainty regarding the truth or falsity of H and E, so that p(H|U) and p(E|U) are strictly between zero and one. In that case, EU (strictly, E⋀U) represents the state of knowledge of someone (call them B) who has the exact same set of knowledge and beliefs as A, except that they also assume, for whatever reason, that E is true. If we assume that A and B are both epistemically rational, then Bayes' theorem puts a (synchronic) constraint on the relation between A's and B's belief in H.

    But here's the thing: if we want, instead of thinking of A and B as two different people, we could choose to think of them as the same person, before and after learning that E is in fact true. Thus, at time t1, before observing E, A's state of knowledge is U; whereas at time t2, after observing E, it is EU. In this case, Bayes' theorem puts a non-synchronic constraint on how A must change their level of belief in H after learning that E is true.

    As you say, "it is only if we assume that, in observing that E, our degree of belief in H ... becomes identical to our previous degree of belief in H given E ... that we can use BT to find out what that degree of belief was equal to." However, like Bayes' theorem itself, this is a necessary condition for epistemic rationality according to the Bayesian point of view. So we don't really have any choice about making that additional assumption.

    I guess in many practical cases, we wouldn't bother to figure out the value of p(H|EU) until we observed that E was the case. In that situation, although we could technically be said to be using BT to figure out what our degree of belief in H given EU should have been all along, it makes more practical sense to say we're using it to figure out how to update our degree of belief in H, since our state of knowledge changed from U to EU before performing the calculation. Perhaps this causes some Bayesians to place more epistemological significance on Bayes' theorem than it strictly deserves.

    In fact I think I would say that it's really that additional assumption --- that upon observing E our degree of belief in H should become identical to our prior belief in H given E --- which really deserves that significance.

  7. Jesus,

    I don't know of any argument that shows that two people can have different Pr's at the same time and be rational, but one single person is necessarily irrational if she has different Pr's at different times.

    Dutch book arguments purport to provide such an argument.

    After all, could one not LEARN that his Pr was 'defective', and so try to replace it?

    As far as I can see, an orthodox Bayesian should reply 'No!', for all learning is done by conditionalizing.

    Kenny,

    objective Bayesians might say that if you start with the right function then washing out is unnecessary

    I really can't understand "objective Bayesians". What do they think the "right" function is? As far as I can see, the only "right" function would be one that assigns 1 to all truths and 0 to all falsities, but then, if one is simply omniscient, there is no need for updating. In any case, I don't think it is reasonable to assume that one can assign a priori the "right" degrees of belief to any a posteriori truths and falsities, whatever the right degrees of belief are supposed to be.

    Nathaniel,

    In fact I think I would say that it's really that additional assumption --- that upon observing E our degree of belief in H should become identical to our prior belief in H given E --- which really deserves that significance.

    I agree; what I was saying is that this assumption is based on two substantial epistemological assumptions--i.e. that, in all cases, Pr(E)t+1=1 and Pr(H|E)t+1=Pr(H|E)t.

  8. I'm not quite sure what you mean by "in all cases" Pr(E)t+1=1 and Pr(H|E)t+1=Pr(H|E)t. I guess you mean something like "in all cases of learning that E is true." If so then I agree, but I also think both of these assumptions are quite easily justified.

    The first comes from the fact that assigning Pr(E)=1 is just what it means to believe that E is true. Since probability theory reduces to propositional logic in the case where all probabilities are 0 and 1, we can say that statements "Pr(E)=1" and "E" are equivalent. So if we learned before time t+1 that E is true then our Pr(E)t+1 must be 1.

    The second comes from logical consistency: Pr(H|E) is a statement of how certain we would be about H if we assumed E were true. In effect it says "if I believed E then I would be x% certain about H." (I think that's its only meaning, but perhaps there are other ways to interpret it? I would be interested to hear if so). So in order to be consistent we had better update our certainty about H in line with that statement.


    I also wanted to comment on the "objective Bayesian" theme. There are some situations where there clearly is an objectively correct prior, even when one is not omniscient. The best example I can think of is this: a ball lies under one of three cups. By assumption, you know that there is exactly one ball and that it is under one of the cups, but (by assumption) you have no information about which cup it might be under. Looking up from under the glass table, I can see that it's a matter of objective fact that the ball is under cup 3, but from your point of view the only rational thing to do is to assign a probability of 1/3 to each cup. The problem is invariant under permutation of the three cups, and {1/3, 1/3, 1/3} is the only probability distribution that respects that invariance. (also, intuitively, it would be irrational for you to bet on finding the ball under any one of the three cups unless the payoff was at least three times the stake.)

    I'm not sure how "objective" that is, but the general idea (mostly due to Jaynes) is that in this type of problem probability distributions are in a one-to-one correspondence with states of knowledge, so if you can formulate your state of knowledge about a problem clearly enough then you can calculate the corresponding probability distribution, which is unique. This reasoning leads to an extremely neat justification for maximum entropy reasoning, among other things.
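
    For what it's worth, the uniqueness claim is easy to check numerically. A quick sketch (coarse grid, nothing hangs on the exact code): maximizing the Shannon entropy over credence assignments to the three cups, subject only to normalization, picks out the uniform {1/3, 1/3, 1/3} distribution.

    ```python
    from math import log
    from itertools import product

    def entropy(p):
        """Shannon entropy; zero-probability terms contribute nothing."""
        return -sum(x * log(x) for x in p if x > 0)

    # Coarse grid over candidate credence assignments (p1, p2, p3) to the three cups.
    grid = [i / 100 for i in range(101)]
    candidates = [(p1, p2, 1 - p1 - p2) for p1, p2 in product(grid, grid) if p1 + p2 <= 1]

    print(max(candidates, key=entropy))        # ~(1/3, 1/3, 1/3), up to the grid spacing
    print(round(entropy((1/3, 1/3, 1/3)), 4))  # 1.0986 = log(3), the maximum
    print(round(entropy((0.5, 0.3, 0.2)), 4))  # 1.0297, strictly lower
    ```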

  9. This comment has been removed by the author.

  10. This comment has been removed by the author.

  11. Here are my thoughts: After all, Bayes's theorem is a synchronic constraint and so cannot tell us how to update our degrees of belief (hereafter: credences) by itself.

    Surely, when combined with conditioning, it implies the diachronic constraint that (BTC) Pt+1(H)=[Pt(E/H)Pt(H)]/Pt(E). However, one may complain that if we have accepted conditioning, we already have the recipe for updating, namely (C) Pt+1(H)=Pt(H/E). So, unless (BTC) is somehow a better recipe for updating than (C), Bayes's theorem seems to be redundant at best.

    However, I suspect that people have attributed such a great importance to Bayes's theorem for a reason. Note that (C), conditioning itself, is formulated in terms of Pt(H/E), but (BTC), derived from Conditioning and Bayes's theorem, is formulated in terms of Pt(E/H).

    And, the latter conditional credence, Pt(E/H), will often be easier to calculate. Here is an example:
    Let K be agent A's background knowledge at t.
    Let H be the theory of quantum mechanics.
    Let E be the proposition that particle p decays at t+1.
    Suppose that H&K entails Ch(E)=.9. Then,

    Pt(E/H)=
    Pt(E/H&K)= (the agent knows K)
    Pt(E/H&K&Ch(E)=.9)= (H&K entails Ch(E)=.9)
    .9. (the Principal Principle)

    Additionally, assume that A's credence in H was .3 at t and A's credence in E was .5 at t. Then,

    Pt+1(H)=
    [Pt(E/H)Pt(H)]/Pt(E)= (BTC)
    (.9*.3)/.5=.54.

    In this way, A's credence in H increases from .3 at t to .54 at t+1. The point is that when the hypothesis is a scientific theory, it will typically predict the chance of a particular event's occurrence in the future. In such a case, the agent can easily calculate Pt(E/H) by using the predictive entailment and the Principal Principle. There is no comparably convenient procedure for calculating Pt(E/H). In my opinion, this is the main attraction of Bayes's theorem.
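
    For what it's worth, the arithmetic in (BTC) is easy to check mechanically; a short sketch with the same illustrative numbers:

    ```python
    p_E_given_H = 0.9   # Pt(E/H), fixed via the Principal Principle by Ch(E)=.9
    p_H = 0.3           # Pt(H)
    p_E = 0.5           # Pt(E)

    # (BTC): conditionalization with Bayes's theorem supplying the right-hand side.
    p_H_next = p_E_given_H * p_H / p_E
    print(round(p_H_next, 2))  # 0.54
    ```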

  12. Sorry, I meant Pt(H/E) by the last occurrence of "Pt(E/H)."

  13. You respond to my comment above that "this can [not] be just a matter of 'typically' for Bayesians. If 'non-typical' updates were allowed from time to time, one could always occasionally reupdate her degrees of belief so as to match her initial priors and the priors would never wash out."

    This is a non sequitur. From the fact that non-typical updates are sometimes allowed, it does not follow that one could always make that kind of non-typical update. Perhaps that kind of update is never allowed at all.

    Consider a specific example: Someone proposes a theory one had not considered before. One didn't have a well-defined prior, and it would be disastrous to insist that the credence for the new proposal must come from the prior for the catch-all hypothesis. So one makes a non-typical update. Allowing for such a possibility does not obviously license stubborn refusal to learn from evidence.

  14. Nathaniel,

    I don't think those two constraints are as epistemologically innocent as you suggest. To give just one example, you say 'assigning Pr(E)=1 is just what it means to believe that E is true'. As far as I can see this is not the case. In a Bayesian framework one can believe E to be true (to some degree) without having a degree of belief of 1. In fact, I think Bayesians have good reasons never to assign degree of belief 1 to any a posteriori proposition, including most propositions that get treated as E. Just to give you an example. Suppose that I see my friend A from across the street (which according to you should send my degree of belief in that proposition to 1) and I come to believe that A is on the other side of the street (which means that my degree of belief in that proposition becomes more than .5). However, on that same day, B, whom I believe to be well-informed about A's whereabouts, tells me that A is out of town (in fact B is not as well-informed as I believe and I did in fact see A from across the street). That should make me revise my belief that I have seen A across the street, but, if my degree of belief in that proposition is 1, then, according to Bayesians, I can't do so.

    P.D.,

    This is a non sequitur. From the fact that non-typical updates are sometimes allowed, it does not follow that one could always make that kind of non-typical update. Perhaps that kind of update is never allowed at all.

    I'm afraid I wasn't sufficiently clear. In 'one could always occasionally reupdate her degrees of belief so as to match her initial priors' I was using 'always' as in 'we could always go to the lake'. As far as I can see that doesn't mean 'we could go to the lake all the time' but something along the lines of 'we could go to the lake if necessary'.

    But let me put my point as a question: if, as you suggest, according to Bayesians, it is only typically (but not always) the case that Pr(H)t+1=Pr(H|E)t, what are the circumstances under which Pr(H)t+1 may not be equal to Pr(H|E)t? And what is Pr(H)t+1 equal to in those cases? I was assuming that there were no rules for those "atypical" cases but if there are rules, as you now seem to suggest, I'd be curious to hear what they are (btw, I don't think the case you mention is a case in point).

  15. Hi,

    Firstly let me clarify that I see Bayesian reasoning as a model for how we as scientists should reason in the ideal case, rather than as a model for how humans actually reason on a day-to-day basis. It plays a similar role to logic in that respect. In idealised cases there may be times when it makes sense to believe things with 100% certainty (which is what I meant by "believe something is true"), but you are right to say that in reality there may never be --- and of course the whole power of Bayesian reasoning comes from the fact that you can believe something to some degree, without being certain about it.

    However, Bayesian reasoning can handle the case where you thought you saw your friend fairly well. After seeing your friend, you then believe with 100% certainty that you thought you saw your friend across the street. However, unless you have an unrealistic degree of faith in your senses, you might decide that Pr( "my friend actually was across the street from me" | "I thought I saw my friend across the street" ) = 0.99, so in the absence of other evidence you now believe with 0.99 probability that your friend A was across the street. Later, B tells you that A is out of town. If you have a prior for how likely B is to be well-informed and telling the truth in this situation, you can then update your assigned probability for A having been across the street.

    For a great example of this kind of reasoning, see this text-book chapter by Jaynes. The example concerns two rational Bayesians, one who believes (with some probability) in extra-sensory perception and one who doesn't (or rather, believes with only a low probability, less than 0.5). Some experimental results are then published which claim to support the existence of ESP. The first Bayesian becomes more certain that ESP exists as a result, but the second actually becomes more certain that ESP doesn't exist. This is because the second (quite rationally) believes the result is likely to be a deception, whereas the first (quite rationally) believes it is unlikely to be a deception: the differing priors cause a divergence of beliefs. I mention this only because it's a good example of Bayesian reasoning based on degrees of trust in the evidence, though it seems quite relevant to the philosophy of science in general as well.
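
    To give a rough numerical flavour of that divergence, here is a toy sketch in the spirit of Jaynes's example (the hypotheses, priors and likelihoods are my own made-up simplification, not his numbers): the same published report raises the believer's credence in ESP while slightly lowering the sceptic's, because the sceptic's prior puts most of its weight on the deception hypothesis.

    ```python
    def posterior(priors, likelihoods):
        """Posterior over hypotheses after conditionalizing on the published report E."""
        joint = {h: priors[h] * likelihoods[h] for h in priors}
        total = sum(joint.values())
        return {h: joint[h] / total for h in joint}

    # Shared likelihoods of a positive published report E (made-up numbers):
    # R = ESP is real, D = deception or experimental error, N = sound null experiment.
    likelihoods = {"R": 0.80, "D": 0.95, "N": 0.05}

    believer = {"R": 0.50, "D": 0.05, "N": 0.45}   # low prior on deception
    sceptic  = {"R": 0.05, "D": 0.90, "N": 0.05}   # high prior on deception

    print(round(posterior(believer, likelihoods)["R"], 3))  # 0.851 -- up from 0.50
    print(round(posterior(sceptic, likelihoods)["R"], 3))   # 0.045 -- down from 0.05
    ```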

    Gabriele: as far as I know, Dutch book arguments refer either to probability functions not satisfying the axioms of probability theory (which is not the case here), or, sometimes, to cases in which your probability estimates don't fit the objective frequencies of facts (but this would be a problem also for a case in which we have TWO people with different probability functions). So, my question remains: what is it that makes it irrational for ONE PERSON AT TWO DIFFERENT TIMES to have different subjective probability functions, but NOT for TWO PEOPLE AT ONE SINGLE TIME?
    .
    Regarding the second point, I know orthodox Bayesians not only have long beards, but also accept the dogma that all learning is conditionalisation. But once we admit that probability functions can be different for different people, I don't see any reason not to admit as sensible the question 'is my probability function the one I must have, or should I change it?'.
    Obviously, if you ASK this question, you are not a Bayesian, but a member of some different sect. But my point is that there is nothing IN THE MATHEMATICAL CONTENT of Bayes' theorem that forces the 'orthodox' interpretation.

  17. Nathaniel,

    I see Bayesian reasoning as a model for how we as scientists should reason in the ideal case, rather than as a model for how humans actually reason on a day-to-day basis. It plays a similar role to logic in that respect.

    I don't see why logic (or probability for that matter) is a model for reasoning only in ideal cases as opposed to actual ones.

    Bayesian reasoning can handle the case where you thought you saw your friend fairly well. After seeing your friend, you then believe with 100% certainty that you thought you saw your friend across the street. However, unless you have an unrealistic degree of faith in your senses, you might decide that Pr( "my friend actually was across the street from me" | "I thought I saw my friend across the street" ) = 0.99, so in the absence of other evidence you now believe with 0.99 probability that your friend A was across the street.

    That's exactly the problem--once you allow that Pr(E)t+1 is not 1, you open a big can of worms. First of all, if Pr(E)t+1 is not 1, then there is no rationale for Pr(H)t+1 to be equal to Pr(H|E)t. Pr(H)t+1 should rather be equal to Pr(H|E)t+1Pr(E)t+1+Pr(H|~E)t+1(1-Pr(E)t+1). Second, why should Pr(E)t+1 be equal to 0.99 and not 0.999 or 0.89? What are the grounds for deciding what is the appropriate degree of belief? These two considerations put together seem to show that all the epistemological heavy-lifting is performed at the stage at which you update your degree of belief in E in the light of new evidence (and let it "decohere" from your other degrees of belief) and that the axioms of probability only tell you how to restore coherence. (A small numeric sketch of this point appears at the end of this comment.)

    Also, just one quick comment on your point on objective Bayesianism. You said: 'There are some situations where there clearly is an objectively correct prior, even when one is not omniscient.' I do not necessarily deny that (although I don't know of any good example in which one can determine the "objectively correct" prior entirely a priori and certainly your example presupposes a lot of a posteriori knowledge of the situation). What I deny is that those cases are at all typical.

    Jesus,

    as far as I know, Dutch book arguments refer either to probability functions not satisfying the axioms of probability theory (which is not the case here), or, sometimes, to cases in which your probability estimates don't fit the objective frequencies of facts (but this would be a problem also for a case in which we have TWO people with different probability functions). So, my question remains: what is it that makes it irrational for ONE PERSON AT TWO DIFFERENT TIMES to have different subjective probability functions, but NOT for TWO PEOPLE AT ONE SINGLE TIME?

    So-called diachronic Dutch book arguments purport to show that unless you conditionalize on the evidence a diachronic Dutch Book can be made against you. (See for example Alan Hájek's entry on Dutch Book Arguments).
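
    Here is the sketch promised above (made-up numbers, and assuming the conditional credences stay rigid, i.e. Pr(H|E)t+1=Pr(H|E)t): once Pr(E)t+1 is allowed to fall short of 1, the update is Jeffrey-style, and its output depends on a choice of Pr(E)t+1 that the probability axioms themselves do nothing to fix.

    ```python
    p_H_given_E = 0.9      # Pr(H|E)t, assumed rigid across the update
    p_H_given_notE = 0.2   # Pr(H|~E)t, assumed rigid across the update

    def jeffrey_update(p_H_given_E, p_H_given_notE, new_p_E):
        """Pr(H)t+1 = Pr(H|E)t*Pr(E)t+1 + Pr(H|~E)t*(1 - Pr(E)t+1)."""
        return p_H_given_E * new_p_E + p_H_given_notE * (1 - new_p_E)

    # Strict conditionalization is the special case Pr(E)t+1 = 1 ...
    print(round(jeffrey_update(p_H_given_E, p_H_given_notE, 1.00), 3))  # 0.9
    # ... but nothing in the axioms settles whether the new Pr(E) is 0.99 or 0.89.
    print(round(jeffrey_update(p_H_given_E, p_H_given_notE, 0.99), 3))  # 0.893
    print(round(jeffrey_update(p_H_given_E, p_H_given_notE, 0.89), 3))  # 0.823
    ```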

  18. Gabriele,

    In the example of seeing your friend across the street, the hypothesis is that your friend was across the street, whereas the evidence is that you thought you saw her. So Pr(E)t+1 is still 1 (you know with certainty that you thought you saw her) while Pr(H)t+1 is still equal to Pr(H|E)t, the probability that your friend is there, conditional on you having thought you saw her. Obviously that only works if you can assign a value for that conditional probability - in this case I chose 0.99 arbitrarily.

    One reason that I say probability theory is only a model of ideal reasoning is that, if I were to be philosophically rigorous, I might say that I can't be absolutely certain that I thought I saw my friend, since I might have imagined it after the fact. This leads to a nasty infinite regress whereby I can't be 100% sure of anything (except Pr(I am)=1 according to Descartes, though I don't really even buy that argument), which would make probability theory as I'm using it here impossible to apply.

    Another reason is the difficulty of assigning priors to things like Pr(My friend was across the street | I thought I saw her).

    However, the main reason is simply that people consistently reason incorrectly in certain situations (there are a number of psychological experiments that show this). What do we mean by "they reason incorrectly?" We mean that they don't reason consistently with respect to logic and/or probability theory. Probability theory (or logic) is a definition of ideal reasoning, not a reliable model of actual human reasoning. As scientists we strive to meet the ideal but as humans we sometimes fail.

  19. Nathaniel,

    In the example of seeing your friend across the street, the hypothesis is that your friend was across the street, whereas the evidence is that you thought you saw her. So Pr(E)t+1 is still 1 (you know with certainty that you thought you saw her) while Pr(H)t+1 is still equal to Pr(H|E)t, the probability that your friend is there, conditional on you having thought you saw her.

    I disagree - the evidence is expressed by the sentence 'I saw my friend across the street at t+1' (call it S), not by 'I thought I saw my friend across the street at t+1' (call it T), as T is perfectly compatible with a wide range of hypotheses, ranging from my seeing a lookalike of my friend to there being an evil demon deceiving me into thinking I have a friend and that I saw her. So, even if Pr(T)t+1=1, my degree of belief that my friend is across the street (let's call it A) is still likely to be quite low, as Pr(A)t was low before seeing my friend and Pr(A|T)t+1 and Pr(~A|T)t+1 are both relatively high.

  20. Gabriele,

    The Bayesian is committed to saying that typically one may only change one's credences by conditionalizing. When does "typically" break down, such that one can update in some other way? As you suggest, Bayesians might try to articulate rules for this. I don't have a good rule to suggest, only some examples (btw, why don't you think that novel hypotheses are a case in point?).

    Another option is to say that there are no rules for when to do otherwise. Rather: Conditionalization is the default, and updating in some other way is constrained by reasonability and craft knowledge. Your suggested stubbornness is unreasonable and so prohibited. (I don't know of any Bayesian who explicitly appeals to tacit knowledge in this way, but it's a coherent position. I am tempted by it myself.)

  21. Here is a better way of formalising this, which makes the time dependence more explicit and requires fewer arbitrary-seeming assumptions. In this scheme, probabilities do not change over time, but are instead conditioned on a state of knowledge, which does change over time. The advantage of this is that the two "substantial epistemological assumptions" you mentioned a while back are no longer assumptions, because they can be derived from the formalism.

    (Gabriele, I started writing this before your last comment. I will address that comment in another comment due to the size restriction on replies.)

    Let us think of probabilities in the manner of Cox (1946). In Cox's axiomatisation of probability theory (which is an alternative to Kolmogorov's, although written less formally), all probabilities are conditional. In the statement p(B|A)=x, A and B are taken to be statements of (Boolean) propositional logic. The probability statement p(B|A)=1 is (by definition) identical in meaning to the logical statement A -> B (A implies B). For example, p(Socrates is mortal | Socrates is a man ∧ All men are mortal) = 1. Conditional probabilities can thus be thought of as expressing the degree to which one statement implies another. The statements that can be expressed in probability theory are a subset of those that can be expressed in the underlying logic. Cox proved that probability theory is the uniquely consistent extension of Boolean logic to reasoning under uncertainty.

    Since in this scheme all probabilities are conditional and conditional probabilities are taken to represent relationships between statements of logic in this way, it follows that probabilities cannot change over time. However, we can consider probabilities that are conditioned on my state of knowledge, represented as U(t). U(t) is to be thought of as a huge conjunctive statement consisting of all the facts that I know (or believe with certainty) to be true. (Again I emphasise that I consider probability theory to be an idealisation of reasoning - I'm not suggesting for a second that human knowledge can really be represented by a giant logical statement in this way)

    Anyway, the point of all this: Let's say I learn (with certainty) a new fact, call it E. (Let's ignore the difficulty of how I can come to know anything with absolute certainty for now.) After I receive this new piece of knowledge E, I now know everything I knew before, but in addition I also know E. So U changes from U(t) to U(t+1) = U(t) ∧ E. Thus, instead of writing p(H)t+1 = p(H|E)t (which looks like an ad hoc assumption), we write

    p(H | U(t+1)) = p(H | E ∧ U(t)),

    which has been derived from the previous assumptions. Similarly, the assumption p(E)t+1 = 1 becomes the tautology

    p(E | E ∧ U(t)) = 1.
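
    A toy sketch of the same point (the four-world model and its numbers are made up purely for illustration): if learning E just means replacing the distribution given U(t) by the one given U(t) ∧ E, then both "assumptions" come out as consequences of conditioning on the enlarged state of knowledge.

    ```python
    # Credences over (E, H) truth-value pairs, already conditioned on U(t) (made up).
    worlds = {
        (True, True): 0.35,    # E and H
        (True, False): 0.15,   # E and not-H
        (False, True): 0.10,   # not-E and H
        (False, False): 0.40,  # not-E and not-H
    }

    def prob(event, dist):
        """Probability of an event (a predicate on worlds) under a distribution."""
        return sum(p for w, p in dist.items() if event(w))

    def condition(dist, event):
        """Restrict the distribution to event-worlds and renormalize: U(t) and E."""
        total = prob(event, dist)
        return {w: (p / total if event(w) else 0.0) for w, p in dist.items()}

    E = lambda w: w[0]
    H = lambda w: w[1]

    # Synchronic: p(H | E ∧ U(t)), computed from the time-t distribution.
    print(round(prob(lambda w: E(w) and H(w), worlds) / prob(E, worlds), 3))  # 0.7

    # "Learning E" = moving to the distribution conditioned on U(t) ∧ E = U(t+1).
    new_worlds = condition(worlds, E)
    print(round(prob(H, new_worlds), 3))  # 0.7 -- equals p(H | E ∧ U(t))
    print(round(prob(E, new_worlds), 3))  # 1.0 -- the tautology p(E | E ∧ U(t)) = 1
    ```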

  22. This addresses Gabriele's last comment, about how likely I am to believe I saw my friend, given that I thought I saw her.

    The statement T ("I thought I saw my friend") is indeed compatible with a wide range of hypotheses. So let's use Bayes' theorem to work out how likely they are. For each hypothesis H_i, we have

    p(H_i | U(t+1)) = p(H_i | T ∧ U(t)) = p(H_i | U(t)) * p(T | H_i ∧ U(t)) / p(T | U(t))
    = p(H_i | U(t)) * p(T | H_i ∧ U(t)) / ∑_j ( p(H_j | U(t)) * p(T | H_j ∧ U(t)) ).

    So for each H_i, its posterior probability is proportional to its prior probability, multiplied by how likely it is to make me think I saw my friend, conditioned in each case on everything I happened to know at the moment I thought I saw her.

    If the circumstances are such that I think it's fairly likely that I'll see my friend and I have little reason to believe that I'm hallucinating or being deceived etc. then this probability distribution will be dominated by the hypothesis that my friend was indeed across the street. If I believe (to some degree) that my friend is out of town then the probability of that hypothesis will drop and the others will become more likely. This matches up to everyday experience: on a few occasions I have thought I saw a friend who I knew to be in another country, and in those cases I tended to assume I was mistaken.

    In any case it is not the sheer number of hypotheses that is important, but their prior distribution. If one has a sensible prior then treating T as the evidence will lead to sensible answers for the posterior.
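
    To make that concrete, here is a small sketch with entirely made-up priors and likelihoods for the hypotheses that might lie behind T; the normalisation sum is written out so the role of the prior is visible.

    ```python
    # Made-up priors, conditional on my background knowledge U(t), for what could
    # have produced T ("I thought I saw my friend across the street").
    priors = {"friend there": 1e-3, "double there": 1e-5,
              "hallucination": 1e-6, "no one like her there": 1 - 1e-3 - 1e-5 - 1e-6}
    # Made-up likelihoods p(T | H_i and U(t)).
    likelihoods = {"friend there": 0.95, "double there": 0.90,
                   "hallucination": 0.90, "no one like her there": 1e-5}

    # Bayes' theorem with the normalisation sum over all hypotheses written out.
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    posterior = {h: round(joint[h] / total, 4) for h in joint}
    print(posterior)
    # {'friend there': 0.9795, ...}: the hypothesis with the highest prior dominates,
    # even though every prior here is small in absolute terms.
    ```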



    Reference which I accidentally left off my previous comment:
    R. T. Cox, "Probability, Frequency, and Reasonable Expectation," Am. Jour. Phys., 14, 1–13, (1946). For a good explanation that's available online, see Chapters 1 and 2 of E. T. Jaynes' Probability Theory: the Logic of Science.

  23. There is a typo in my last comment but one: the statements that can be made in probability theory are a superset of those that can be made in the underlying logic.

  24. Nathaniel @11:54am,

    The probability statement p(B|A)=1 is (by definition) identical in meaning to the logical statement A -> B (A implies B).

    I thought Lewis had proved that '|' cannot be a connective and in particular it cannot be the material conditional.

    So U changes from U(t) to U(t+1) = U(t) ∧ E. Thus, instead of writing p(H)t+1 = p(H|E)t (which looks like an ad hoc assumption), we write p(H|U(t+1)) = p(H|E∧U(t))

    Let me put aside my qualms about conditioning on one's "background knowledge"; the problem is that, as you mention, this applies when "[you] learn (with certainty) a new fact, call it E". So it is not true that "[the "substantial epistemological assumptions" I mentioned] are no longer assumptions, because they can be derived from the formalism"--they are still assumptions, but they are disguised in the formalism.

    Nathaniel @11:56,

    In any case it is not the sheer number of hypotheses that is important, but their prior distribution. If one has a sensible prior then treating T as the evidence will lead to sensible answers for the posterior.

    I agree this is why I said "Pr(A)t was low before seeing my friend" (which if you live in a large enough city seems to be the case no matter what your occurrent beliefs as to their whereabouts are). Incidentally, since my point was that T does not provide much evidential support to A, whether the prior really matters depends on one's preferred measure of evidential support and some measures (such as the log ration) ignore the prior completely.

  25. P.D.,

    Conditionalization is the default, and updating in some other way is constrained by reasonability and craft knowledge. Your suggested stubbornness is unreasonable and so prohibited. (I don't know of any Bayesian who explicitly appeals to tacit knowledge in this way, but it's a coherent position. I am tempted by it myself.)

    So there would seem to be higher standards of epistemic (or practical?) rationality than those of Bayesian epistemology. I don't know how many subjective Bayesians would accept that, as much of the appeal of their position seems to stem from the fact that it tells you that, no matter the prior, if you keep updating by conditionalization you'll get closer and closer to believing only truths and disbelieving only falsehoods. (Or at least this is how I understand its appeal, but as you have probably realized by now I'm quite immune to it.)

    (btw, why don't you think that novel hypotheses are a case in point?)

    Because a novel hypothesis at t+1, H*, is, by assumption, one to which you don't attach any degree of belief at t and, a fortiori, one for which you don't have any conditional degree of belief Pr(H*|E) for any E. The point is that you haven't shown me a case in which my degree of belief in H* is not updated in the light of some new evidence E so as to be equal to Pr(H*|E)t. You just showed me a case in which I come to have a degree of belief in a proposition I have never considered before. If at t+1 we both consider a novel hypothesis H* and a new piece of evidence E, then as a matter of fact we are not updating our belief that H* on the basis of the new evidence; we are just assigning it a prior. If at t+1 we consider a novel hypothesis H* and at t+2 a new piece of evidence E, then we should be updating our belief that H* on the basis of the new evidence by conditionalization. If at t+1 we consider a new piece of evidence E and at t+2 a new hypothesis H*, then again we are just assigning a prior to H*.

  26. Of course I meant to type 'log ratio' not 'log ration' in my response to Nathaniel above.

  27. I thought Lewis had proved that '|' cannot be a connective and in particular it cannot be the material conditional.

    Could you supply me with a reference? So far I've only found a suggestion that he showed P(B|A) is not the same as P(A->B) except in a restricted class of rather trivial probability distributions. This is true but it's not what I meant.

    I didn't mean "P(B|A) is the same as P(A->B)" but only "P(B|A)=1 if and only if A->B," which is a weaker statement and as far as I can see incontrovertibly true.

    They are still assumptions, but they are disguised in the formalism.

    It's a matter of taste whether to call it "disguised" or "made clearer," but to my mind the definition "to learn E with certainty is to update your state of knowledge U(t) to U(t+1) = U(t)∧E" seems much clearer than "to learn E with certainty is to move to a new probability distribution Pt+1 such that, for any hypothesis H, P(H)t+1=P(H|E)t." You're right that the two are equivalent, but to me the former seems much less arbitrary.

    (aside: in the second definition, setting H=E gives P(E)t+1 = P(E|E)t = 1, so the additional assumption P(E)t+1 = 1 is not required.)

    I agree this is why I said "Pr(A)t was low before seeing my friend" (which if you live in a large enough city seems to be the case no matter what your occurrent beliefs as to their whereabouts are).

    This is true. But (assuming we're using Bayesian reasoning) what matters is not whether Pr(A)t was low in an absolute sense, but whether it was low relative to the sum of the probabilities of all the other hypotheses that could have caused you to think you saw your friend. This is what I was trying to get at with the second line in the equation in my 11:56 post: there is a normalisation term which comes from the sum of the probabilities of all such hypotheses, multiplied in each case by how likely they are to cause me to see my friend.

    I think that if I look across a city street, the (prior) probability of a particular friend being there is very low. However, I also think that the (prior) probability of me hallucinating an image of my friend, or seeing her double, or being deceived by a demon into seeing her, is quite a bit lower. Since these are priors they represent my personal opinion of each hypothesis, but I think they're reasonable since most of my friends don't have doubles and I'm not prone to hallucinations.

    This all adds up to my prior probability of thinking I see my friend on a given city street being very low. But when I do think I see my friend the (posterior) probability distribution of hypotheses that I think might have caused it is dominated by whichever had the highest prior, even though all of them had very low priors. This is because I have to divide them all by my (small) prior probability for thinking I saw my friend. This normalisation process is really the essence of Bayes' theorem. (The formula Thomas Bayes originally wrote down contained an explicit sum along the lines of the one in my 11:56 post.)

    whether the prior really matters depends on one's preferred measure of evidential support and some measures (such as the log ration) ignore the prior completely.

    As a Bayesian, I consider such measures to be wrong ;)

    More seriously, my aim is to show that, with sensible priors, Bayesian reasoning can give sensible results in the case of thinking you saw your friend across the street, since the claim (in your post of July 20th) was that it can't. Other methods of reasoning may or may not give different answers to the same problem, but that doesn't affect the point I'm trying to make here.

  28. I think that if I look across a city street, the (prior) probability of a particular friend being there is very low. However, I also think that the (prior) probability of me hallucinating an image of my friend, or seeing her double, or being deceived by a demon into seeing her, is quite a bit lower. Since these are priors they represent my personal opinion of each hypothesis, but I think they're reasonable since most of my friends don't have doubles and I'm not prone to hallucinations.

    The point I was trying to make is that Pr(A|T)/Pr(T) is likely to be much closer to 1 than Pr(A|S)/Pr(S) and that therefore T provides us with little if any support for A. Note that the reasons why I think that I saw my friend even if I didn't could in fact be much more down-to-earth than my being deceived by an evil demon or my friend's having a double. Many factors influence Pr(A|T), including how good my eyesight is, how wide and busy the road is, how good my view of my friend was from where I was, whether she was wearing anything (sunglasses, hat, etc.) that hides part of her face, whether the visibility conditions were ideal, how big the city we live in is, and how common her looks are. As far as I can see, by varying these factors, we can get to a situation in which Pr(A|T)/Pr(T) is as close to 1 as one pleases.

    Moreover, I don't think Pr(T) should ever be 1 either. This is why. If T is something along the lines of 'I believe (to a degree p) that I saw my friend across the street' and Pr(T)=1, then, when I'm told that my friend is out of town my degree of belief in 'I saw my friend across the street' decreases in the light of the new evidence, so should my degree of belief in 'I believe (to a degree p) that I saw my friend across the street'. But if my degree of belief is equal to 1 this cannot happen.

  29. Damn! I came into this one late!

    Jesus: After all, could one not LEARN that his Pr was 'defective', and so try to replace it?

    Gabriele: As far as I can see, an orthodox Bayesian should reply 'No!', for all learning is done by conditionalizing.

    De Finetti's view was certainly that the probabilities (e.g. P(a,b) for me) are always fixed. Here's my favourite quote:

    'Whatever be the influence of observation on predictions of the future, it never implies and never signifies that we correct the primitive evaluation of the probability P(En+1) after it has been disproved by experience and substitute for it another P*(En+1) which conforms to that experience and is therefore probably closer to the real probability; on the contrary, it manifests itself solely in the sense that when experience teaches us the result A on the first n trials, our judgment will be expressed by the probability P(En+1) no longer, but by the probability P(En+1|A), i.e. that which our initial opinion would already attribute to the event En+1 considered as conditioned on the outcome A. Nothing of this initial opinion is repudiated or corrected; it is not the function P which has been modified (replaced by another P*), but rather the argument En+1 which has been replaced by En+1|A, and this is just to remain faithful to our original opinion (as manifested in the choice of the function P) and coherent in our judgment that our predictions vary when a change takes place in the known circumstances.'

    Anyway, I agree with the general tenor of Gabriele's comments on Bayesianism. The Dutch Book argument doesn't work even in a synchronic incarnation, incidentally, as Hájek and I (among others) have argued.

    Why not use the corroboration function instead? ;-)

    The point I was trying to make is that Pr(A|T)/Pr(T) is likely to be much closer to 1 than Pr(A|S)/Pr(S) and that therefore T provides us with little if any support for A.

    I can't make any sense out of p(A|T)/p(T), so I'm assuming you meant p(T|A)/p(T) and p(S|A)/p(S), which is the same as comparing p(A|T)/p(A) with p(A|S)/p(A). If you're using the definitions of A, S and T that you defined in your 22nd July post then

    S = I saw my friend across the street at t+1
    T = I thought I saw my friend across the street at t+1
    A = My friend actually was across the street at t+1

    In this case, as far as I can see, S = T∧A (what else could it mean to say you saw your friend other than that you thought you did [perhaps with some justification] AND she was actually there at the time?). So

    p(S|A) = p(T∧A|A) = p(T|A)p(A|A) = p(T|A)
    and
    p(S) = p(T∧A) = p(T)p(A|T) ≤ p(T),

    so p(T|A)/p(T) is indeed smaller (closer to 1) than p(S|A)/p(S). We could have seen this more directly by noting that p(A|S)=1 whereas p(A|T) is in general less than 1.

    This is to be expected because S includes information that a rational Bayesian agent couldn't have access to. The agent is trying to determine the plausibility of A based on the evidence available to it, which is T. A rational agent who knows S will decide that A is true (with probability 1) because S -> A. But that's just because the second agent has been given an unreasonable piece of information: it includes not only what her senses told her, but also what the actual state of the world was at the time. I don't think any real agent of any kind ever has access to that kind of evidence.

    ...by varying these factors, we can get to a situation in which Pr(A|T)/Pr(T) is as close to 1 as one pleases.

    Again I assume you mean p(T|A)/p(T). This is true, but in order to vary these factors you have to change the meaning of T. I was assuming T was something along the lines of

    T1 = "I had a really good look at the person (or robot/hallucination etc.) that appeared to be my friend. I noted several distinctive features of her face, gait, hairstyle and clothes, all of which were familiar to me from seeing my friend before."

    However, if you have in mind something more like

    T2 = "I had a very brief glimpse of someone (or something) of approximately the same height and build as my friend, but due to the huge crowd, my poor eyesight and the fact that they were facing the other way I wasn't able to see any more details than that."

    then I agree that p(A|T2) is low, mostly because p(T2) is large compared to p(A). However, the fact that T2 provides little evidence for A is not a problem with Bayes' theorem. Rather, it's a strength. Any system of reasoning that claimed to be able to confidently reach the conclusion A from the evidence T2 would be very strongly at odds with intuition and common sense.

  31. Moreover, I don't think Pr(T) should ever be 1 either. This is why. If T is something along the lines of 'I believe (to a degree p) that I saw my friend across the street' and Pr(T)=1, then, when I'm told that my friend is out of town my degree of belief in 'I saw my friend across the street' decreases in the light of the new evidence, so should my degree of belief in 'I believe (to a degree p) that I saw my friend across the street'. But if my degree of belief is equal to 1 this cannot happen.

    That's a fair point. The only statements that should be assigned a probability of 1 are those which we can never have any possible reason to want to question. We therefore need to state T in a form for which this is the case. I can see two ways of doing this.

    Firstly, we can say that T is not
    "I believe (to a degree p) that I saw my friend across the street"
    but
    "I believed at time t+1 (to a degree p) that I saw my friend across the street."

    The second, which is probably better, is to say that T is not a statement about belief, but simply a summary of the sense data which led to the belief - this is the tack I took when stating T1 and T2 above.

    In either of these cases, the only reason to doubt T is if you forget what you saw (or believed), or you become sceptical about your own memory. I'm happy to concede that Bayesian reasoning probably doesn't have a neat way of dealing with these possibilities --- but they're probably not the sort of things an idealised rational agent should be able to do.

  32. Nathaniel,

    Sorry this is going to be quick because I'm in a hurry. Hope it's not too unclear (and there aren't too many typos) ;-)

    I can't make any sense out of p(A|T)/p(T), so I'm assuming you meant p(T|A)/p(T) and p(S|A)/p(S)

    Ooops! Sorry, yes that's what I meant.

    the fact that T2 provides little evidence for A is not a problem with Bayes' theorem.

    I didn't claim it was a problem for BT. Indeed I don't think that there are any "problems" for BT--it's a theorem of probability theory and I don't think that anyone denies that. My claim was that BT does not play any special epistemological role and that all the work is done by the assumption that p(H|E)t+1=p(H|E)t and p(E)t+1=1 hold universally. You are trying to defend the view that those assumptions do hold universally and you are appealing to the fact that there is always some proposition E* to which I assign degree of belief 1 when I receive new evidence. And I was arguing that the more the proposition you identify is rationally unrevisable, the less differential support it provides to different hypotheses.

    The second, which is probably better, is to say that T is not a statement about belief, but simply a summary of the sense data which led to the belief

    That's exactly the direction in which I wanted to push the Bayesian--in Bayesian epistemology all the real work is done by substantial epistemological assumptions, but very little epistemological work is done to substantiate those assumptions. Most philosophers today, for example, seem to think that sense data are a non-starter, so if Bayesian epistemologists really rely on sense data they would seem to have to do a lot of work to convince us there is anything like that. (Moreover, sense data are private but if you want the prior to wash out all agents need to conditionalize on the same evidence)

  33. Sorry this is going to be quick because I'm in a hurry. Hope it's not too unclear (and there aren't too many typos) ;-)

    No need to apologise! I'm really very grateful for the time you've put into this discussion. I'm sorry this is another long reply.

    My claim was that BT does not play any special epistemological role and that all the work is done by the assumption that p(H|E)t+1=p(H|E)t and p(E)t+1=1 hold universally. You are trying to defend the view that those assumptions do hold universally and you are appealing to the fact that there is always some proposition E* to which I assign degree of belief 1 when I receive new evidence. And I was arguing that the more the proposition you identify is rationally unrevisable, the less differential support it provides to different hypotheses.

    Right, got it - I hadn't quite realised that was what you were getting at, so I'd misinterpreted some of what you were saying, sorry about that.

    However, I don't think that the last statement - that the more rationally unrevisable a statement is, the less support it provides to different hypotheses - is necessarily true. I think that, for an ideal rational agent, those diminishing returns do come to a stop somewhere around the reception of sense data, and you do ultimately end up conditioning on statements that are both rationally unrevisable and provide support for non-trivial hypotheses.

    However, this can only happen if you have a prior that includes beliefs about how the world determines your sensory experiences. I think the real epistemic problem is in justifying such a prior -- and I do think it's a big problem. (note that I'm using "prior" to mean "[prior] system of beliefs" here.)

    For example, let's suppose I am a rational agent. Suppose I have a prior P which, among other things, includes the following statements with a high probability:
    S1 "There is a world outside which exists independently of me;"
    S2 "That world is primarily the cause of my sensory experiences;"
    S3 "In particular, my sensory experience of daylight is caused by the sun;"
    S4 "The sun is in the sky during daytime."
    Let's say I just woke up and I'm not sure if it's daytime yet or if I woke up in the middle of the night - so my prior also includes the following statement with (let's say) 50% probability:
    S5 "It is daytime."
    (or "it was daytime when I woke up," to put it in a time-independent form)

    [to be continued]

  34. [continued]


    Now let's say that when I open my eyes I have the experience of daylight. On the prior P, the experience of daylight provides quite a lot of support for the statement that it is daytime. So we have the statement
    S6 "I experienced daylight when I opened my eyes after waking up this morning"
    which is both rationally unrevisable (for example, even if I later discover myself to be living in the Matrix I don't want to question the experience, only its meaning) and which provides support for the non-trivial hypothesis S5.

    But as I said, this can only happen with a prior such as P that asserts (with some degree of certainty) the existence of an outside world that is the cause of sensory experience. It's this which allows us to interpret sensory experience as data about the world. I think that most philosophical objections to the idea of sense data (at least the sensible ones) come down to the idea that no amount of sensory experience can fully justify the belief in a prior like P.

    In other words I think the true problem concerns the justification of statements like S1 and S2. They certainly can't be arrived at by deductive reasoning from any amount of sensory experience. Inductive reasoning might help: every time I have an experience that's compatible with a self-consistent outside world it lends a little support to the hypothesis that there is, in fact, a self-consistent world that causes my experiences. But you still have to start with a prior in order to do inductive reasoning, so this cannot solve the problem entirely. I also don't think an appeal to the washing out of priors can help (see below).

    So the problem comes down to the one that's always present when reasoning with Bayes' theorem: what is the prior, and how can it be justified? In some cases these questions have definite answers, but I suspect that isn't the case in this situation. I suspect the problem of justifying S1 and S2 is probably fundamentally unsolvable. But this is a problem for all approaches to epistemology, not just Bayesian ones.

    (Moreover, sense data are private but if you want the prior to wash out all agents need to conditionalize on the same evidence)

    I'm very much not a fan of the "washing out" of priors as a way to justify Bayesian reasoning. To me it just feels like an attempt to save frequentist intuitions, and I think it's ultimately doomed to failure. Two agents with different priors conditionalising on the same data will converge to the same posterior in some circumstances, but diverge to different posteriors in other circumstances. As I mentioned earlier, Jaynes gave a dramatic example of two agents diverging to different sets of beliefs when exposed to the same data.

