Driving drunk is illegal for a good reason: it’s far riskier than driving sober. This article isn’t about drunk driving itself, though; it’s about the sloppy thought processes that can too easily muddle something as obvious as that first sentence. Here’s an example of a bogus argument that appears to support the idea that drunk driving is actually safer:

So the argument is as follows: In 2012, 10,322 people were killed in alcohol-impaired driving crashes, accounting for nearly one-third (31%) of all traffic-related deaths in the United States [1]. That means that approximately one third of traffic-related deaths involve drunk driving, meaning that two thirds of traffic-related deaths don’t involve drunk driving. Therefore, sober drivers are twice as likely to die in a traffic accident.

If you think something is wrong with that argument, you are right, but it’s not just that the conclusion intuitively seems wrong; the argument involves a mistake in conditional probability. To see the mistake, it helps to introduce a little notation. We will define:

  • P(D) to be the probability that a person is drunk
  • P(A) to be the probability that a person will die in a traffic-related accident
  • P(D | A) (pronounced “the probability of D given A”) to be the probability that a person is drunk, given that there was a death in a traffic-related accident they were in

So using the 2012 CDC data, we can assign P(D | A) = 0.31. This is the probability that a drunk driver was involved, given that there was a deadly driving accident.

The first thing to point out is that the statement ‘sober drivers are twice as likely as drunk drivers to die in an accident’ is really a statement about P(A | D), that is, the probability of a deadly driving accident given that the person is drunk. We don’t know this yet; however, we can figure it out using Bayes’ theorem.

Bayes’ Theorem

Bayes’ Theorem is unusual in that it is extremely useful and easy to prove, but hard to really understand. It is something I learned several times in college, but I never really understood its importance until much later. To see how easy it is to prove, we go back to the definition of conditional probability:

P(X|Y) = P(X ∩ Y) / P(Y)

where P(X ∩ Y) is the probability of both X and Y occurring. Since this is true for any pair of events X and Y, we can reverse them and get

P(Y|X) = P(Y ∩ X) / P(X)

Also, remember that AND is commutative, so P(X ∩ Y) = P(Y ∩ X). We can therefore multiply the two equations above by P(Y) and P(X), respectively, to get:

P(X|Y) P(Y) = P(X ∩ Y) = P(Y ∩ X) = P(Y|X) P(X)

This relates P(X|Y) to P(Y|X), P(X), and P(Y); solving for P(X|Y) gives:

P(X|Y) = P(Y|X) P(X) / P(Y)

And that’s it: we took the definition of conditional probability, did a little algebra, and out popped Bayes’ theorem. We can now apply it to the drunk driving fallacy above and calculate the probability we are actually interested in, P(A | D).

Since we know P(D|A), we just need to find P(A) and P(D). Because the CDC data we are using is annual, we take the number of people killed in traffic accidents in the United States in 2012 (33,561) and divide by the number of licensed drivers (211,814,830) [2]. That gives an estimate of P(A) = 33,561/211,814,830 = 0.0001584, which is about 1 in 6,313.

Next, we need to find the probability that a driver is drunk, P(D). We will use the data from the study referenced in [3], and define ‘drunk’ to be a BAC of ≥ 0.1%. Then P(D) = 0.00387, or about 1 in 258 (more on this calculation in the notes below).

Now that we have:

P(D|A) = 0.31 (the probability of a driver being drunk, given they were involved in an accident where someone died),

P(A) = 0.0001584 (the probability of a driver being involved in an accident where someone died), and

P(D) = 0.00387 (the probability of a driver being drunk),

we can figure out P(A|D), the probability of a drunk driver getting into a deadly accident:

P(A|D) = P(D|A)P(A)/P(D) = (0.31*0.0001584)/0.00387 = 0.0127 (about 1.3%)

A 1.27% chance of being involved in a deadly accident is significant. Now, let’s compare that to sober driving; we just need to calculate P(A|Dc), the probability of a deadly accident given that the driver is not drunk. We can use the law of total probability and shuffle a few terms to get:

P(A|Dc) = (P(A) - P(A|D)P(D))/P(Dc) = (0.0001584 - 0.0127*0.00387)/(1 - 0.00387) = 0.000109, which is about 1 in 9,118.
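
To sanity-check the arithmetic, here is a short Python sketch (the variable names are mine) that recomputes these quantities from the figures cited above:

# Sanity check of the conditional-probability arithmetic above.
deaths = 33561             # people killed in US traffic accidents, 2012 [1]
drivers = 211814830        # licensed drivers in the US, 2012 [2]

p_d_given_a = 0.31         # P(D|A): driver drunk, given a deadly accident [1]
p_a = deaths / drivers     # P(A): deadly accident, per driver per year
p_d = 0.00387              # P(D): driver is drunk (derived in the notes on [3])

# Bayes' theorem: P(A|D) = P(D|A) * P(A) / P(D)
p_a_given_d = p_d_given_a * p_a / p_d

# Law of total probability, solved for P(A|Dc)
p_a_given_not_d = (p_a - p_a_given_d * p_d) / (1 - p_d)

print(f"P(A)       = {p_a:.7f}")              # ~0.0001584
print(f"P(A|D)     = {p_a_given_d:.4f}")      # ~0.0127 (about 1.3%)
print(f"P(A|Dc)    = {p_a_given_not_d:.7f}")  # ~0.00011
print(f"risk ratio = {p_a_given_d / p_a_given_not_d:.0f}")  # ~116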

Conclusion

So the probability of getting into a deadly accident, given that you are drunk, is about 1.3%, and the probability of getting into a deadly accident, given that you are not drunk, is about 0.011%. That means you are roughly 116 times more likely to get into a deadly accident while drunk.

References

[1] Impaired Driving: Get the Facts, Centers for Disease Control. http://www.cdc.gov/Motorvehiclesafety/impaired_driving/impaired-drv_factsheet.html

[2] Total Licensed Drivers, U.S. Department of Transportation, Federal Highway Administration. http://www.fhwa.dot.gov/policyinformation/statistics/2012/dl22.cfm

[3] George A. Beitel, Michael C. Sharp, William D. Glauz, Probability of arrest while driving under the influence. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1730617/pdf/v006p00158.pdf

Notes on [3]: we don’t directly have P(D), but we do have P(D|A1), P(A1), and P(A1|D), where A1 is the event that a person is arrested. We can then find P(D) = P(D|A1)P(A1)/P(A1|D) = (0.06 × 0.000374)/0.0058 = 0.00387.
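
As a quick check, here is the same back-calculation as a Python sketch, using the figures quoted from [3]:

# Deriving P(D) from the arrest statistics in [3] via Bayes' theorem
p_d_given_arrest = 0.06      # P(D|A1): drunk, given arrested
p_arrest = 0.000374          # P(A1): probability of being arrested
p_arrest_given_d = 0.0058    # P(A1|D): arrested, given drunk

p_d = p_d_given_arrest * p_arrest / p_arrest_given_d
print(f"P(D) = {p_d:.5f}")   # ~0.00387, about 1 in 258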

In the Coursera class on Mining Massive Datasets, the problem of computing word frequency for very large documents came up. I wanted some convenient tools for breaking documents into streams of words, as well as a tool to remove common words like ‘the’, so I wrote up words and decommonize. The decommonize script is just a big grep -v '(foo|bar|baz)', where the words foo, bar, and baz come from the words in a file. I made a script generate_decommonize that reads in a list of common words and builds the regex for grep -v.

Example usage of words and decommonize

The full source code is available here on GitHub.

After running make install, you should have words and decommonize in your PATH. You can use them to find key words that are characteristic of a document. I chose:

  • the U.S. Declaration of Independence:
$ words < declaration_of_independence.txt | decommonize  | sort | uniq -c | sort -n | tail
   4 time
   5 among
   5 most
   5 powers
   6 government
   6 such
   7 right
   8 states
   9 laws
  10 people
  • Sherlock Holmes
$ words < doyle_sherlock_holmes.txt | decommonize  | sort | uniq -c | sort -n | tail
 174 think
 175 more
 177 over
 212 may
 212 should
 269 little
 274 mr
 288 man
 463 holmes
 466 upon
  • Working with Unix Processes (by @jstorimer)
$ words < working_with_unix_processes.txt | decommonize  | sort | uniq -c | sort -n | tail
  73 signal
  82 system
  88 ruby
  90 exit
 100 code
 100 parent
 143 its
 146 child
 184 processes
 444 process

So words breaks up the document into lower-case alphabetic words, decommonize greps out the common words, sort and uniq -c count the instances of each remaining word, and the final sort -n orders the results by count.
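
For comparison, here is a rough Python sketch of what the pipeline does. This is not the implementation from the repo, and the common_words.txt filename and wordfreq.py script name are just placeholders:

# Rough equivalent of: words < doc.txt | decommonize | sort | uniq -c | sort -n | tail
import re
import sys
from collections import Counter

# Load the list of common words to filter out (one word per line; filename assumed)
with open("common_words.txt") as f:
    common = {line.strip().lower() for line in f if line.strip()}

# 'words': break the input into lower-case alphabetic words
tokens = re.findall(r"[a-z]+", sys.stdin.read().lower())

# 'decommonize': drop the common words, then count and print the most frequent
counts = Counter(w for w in tokens if w not in common)
for word, n in counts.most_common(10):
    print(f"{n:6d} {word}")

Running it as python wordfreq.py < doyle_sherlock_holmes.txt should give counts comparable to the tail of the pipeline output above.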

The White House just released the first-ever open source budget proposal. It is published on GitHub as a bunch of CSV files. This is not technically difficult, requiring only a few extra clicks when exporting an Excel spreadsheet, but hosting it on GitHub also opens it up to pull requests, which I’ve talked about before as being a much better tool for 21st-century democracy. Instead of paper and a bunch of politicians in a room following procedure, we should have a digital system where all citizens can contribute as easily as they can update a Facebook status or apply an Instagram filter.

One huge caveat is in order, though: there is no reason to assume that the White House and Congress will even consider pull requests, let alone apply them. That aside, I will experiment with this; I’ve already modified textql so that I can easily query these CSV files from a SQLite database. If I have an idea about how I’d like to change the budget, I’ll submit a pull request and then follow its response, if any.

Caveats aside, I am impressed with the choice of technologies for making these public issues more accessible.

I modified my tipcalc program to handle expressions of arbitrary depth, so now it can handle input like ((($100 + 2%) + 2%) - 3%) + 3.5%.

The trick was to change the start symbol to match binary_expression, and then define binary_expression recursively, like so:

binary_expression:
    dollars OP_PLUS percentage
    |
    dollars OP_MINUS percentage
    |
    LPAREN binary_expression RPAREN OP_PLUS percentage
    |
    LPAREN binary_expression RPAREN OP_MINUS percentage
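
To make the left-nested semantics concrete, here is a tiny Python sketch (not the actual tipcalc code) that applies each percentage to the running total, the way the grammar groups things:

# Evaluate ((($100 + 2%) + 2%) - 3%) + 3.5% the way the grammar groups it:
# each percentage applies to everything accumulated so far.
def add_pct(dollars, pct):
    return dollars * (1 + pct / 100.0)

total = 100.00
for pct in (2, 2, -3, 3.5):
    total = add_pct(total, pct)
    print(f"running total: ${total:.2f}")
# final result: $104.45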

This recursion is what makes the new version a context-free grammar rather than a regular grammar. Now, if you think you could still handle this input with a regular expression, notice that applying percentages in sequence does not behave like ordinary addition. For example, you might think we could drop the parens and just parse $100 + 2% + 2% + 2% using /\$\d+ (\+ \d\%)+/

\$\d+ (\+ \d\%)+

Regular expression visualization (Debuggex demo).

However, suppose we instead wrote $100 + 2% - 2% + 2%. If these operations behaved like ordinary addition, the -2% and +2% would cancel, reducing the expression to $100 + 2%. But when grouped to the left as (($100 + 2%) - 2%) + 2%, the result is clearly different from $100 + 2%.
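
A quick numeric check of that, again as a Python sketch rather than tipcalc itself:

# Percentages applied in sequence do not cancel the way plain addition would.
def add_pct(dollars, pct):
    return dollars * (1 + pct / 100.0)

naive = add_pct(100.00, 2)                               # $100 + 2%
grouped = add_pct(add_pct(add_pct(100.00, 2), -2), 2)    # (($100 + 2%) - 2%) + 2%
print(f"${naive:.2f} vs ${grouped:.2f}")                 # $102.00 vs $101.96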

For as long as I’ve been able to do arithmetic, I’ve been able to calculate taxes and tips; it’s easy. Given a dollar value of $17.91, we can figure out the total with an 18% tip as $17.91*(1.18) = $21.13.

However, it would be nice just to enter $17.91 + 18% and have the computer figure it out. So one day at lunch, after calculating the tip for a burrito, I decided to learn lex and bison, which can be used together to create a mini language.

The grammar I used was the following:

start:
    dollars OP_PLUS percentage
    |
    dollars OP_MINUS percentage

dollars:
    TOKDOLLAR NUMBER

percentage:
    NUMBER TOKPERCENT

Here OP_PLUS and OP_MINUS are the tokens for + and -, TOKDOLLAR and TOKPERCENT are the tokens for $ and %, and NUMBER is a numeric literal.

Then, below each grammar rule, I added some C code that runs when the input matches that rule:

start:
    dollars OP_PLUS percentage
    {
        double dollars = $1;
        double percentage = ($3)/(100.0);
        double total = dollars + dollars*percentage;
        printf("$%.2f", total);
    }
    |
    dollars OP_MINUS percentage
    {
        double dollars = $1;
        double percentage = ($3)/(100.0);
        double total = dollars - dollars*percentage;
        printf("$%.2f", total);
    }

The full source code is available here.

Now, it is true that this is no more powerful than a regular expression; however, I intend to modify it to allow nested expressions like (($2 + 4%) + 4%), which would be useful for compound interest calculations. That would be more powerful than regular expressions, meaning it would require at least a context-free grammar.

Update: I later wrote about implementing this.