Driving drunk is illegal for a good reason: it’s far riskier than driving sober. This article isn’t really about drunk driving, though; it’s about the sloppy reasoning that can muddle even something as obvious as that first sentence. Here’s an example of a bogus argument that appears to show that drunk driving is actually safer:

So the argument is as follows: In 2012, 10,322 people were killed in alcohol-impaired driving crashes, accounting for nearly one-third (31%) of all traffic-related deaths in the United States [1]. That means that approximately one third of traffic-related deaths involve drunk driving, meaning that two thirds of traffic-related deaths don’t involve drunk driving. Therefore, sober drivers are twice as likely to die in a traffic accident.

If you think something is wrong with that argument, you are right. It’s not just that the conclusion seems intuitively wrong; the argument makes a mistake in conditional probability. To see the mistake, it helps to introduce a little notation. We will define:

• P(D) to be the probability that a person is drunk
• P(A) to be the probability that a person will die in a traffic-related accident
• P(D | A) (pronounced probability of D given A) is the probability that a person is drunk, given that there was a death in a traffic-related accident they were in

Using the 2012 CDC data, we can assign P(D | A) = 0.31. This is the probability that a drunk driver was involved, given that there was a deadly driving accident.

The first thing to point out is that the statement ‘sober drivers are twice as likely as drunk drivers to die in an accident’ is really a statement about P(A | D), that is, the probability of a deadly driving accident given that the driver is drunk, and not about P(D | A). We don’t know P(A | D) yet, but we can figure it out using Bayes’ theorem.

## Bayes’ Theorem

Bayes’ Theorem is unusual in that it is extremely useful and easy to prove, but hard to really understand. I learned it several times in college, but never really understood its importance until much later. To see how easy it is to prove, we go back to the definition of conditional probability:

P(X | Y) = P(X ∩ Y) / P(Y)

where P(X ∩ Y) is the probability of both X and Y occurring. Since this is true for any pair of events X and Y, we can reverse them and get:

P(Y | X) = P(Y ∩ X) / P(X)

Also, remember that AND is commutative, so P(X ∩ Y) = P(Y ∩ X). We can therefore multiply the above two equations by P(Y) and P(X), respectively, to get:

P(X | Y) P(Y) = P(X ∩ Y) = P(Y | X) P(X)

This relates P(X | Y) to P(Y | X), P(X) and P(Y). Solving the above equation for P(X | Y), we get:

P(X | Y) = P(Y | X) P(X) / P(Y)

And that’s it: we took the definition of conditional probability, did a little algebra, and out popped Bayes’ theorem. We can now apply this to the drunk driving fallacy above and calculate the probability that we are interested in, P(A | D).

Since we know P(D|A), we just need to find P(A) and P(D). The CDC data we are using is annual, so we take the number of casualties from deadly accidents in the United States in 2012 (33,561) and divide by the number of licensed drivers (211,814,830) [2]. That gives an estimate of P(A) = 33,561/211,814,830 = 0.0001584, which is about 1 in 6,313.

Next, we need to find the probability that a driver is drunk, P(D). We will use the data from the study referenced in [3] and define ‘drunk’ to be a BAC of ≥ 0.1%. Then P(D) = 0.00387, or about 1 in 258 (more on this calculation in the notes below).

Now that we have:

P(D|A) = 0.31 ( probability of a driver being drunk, given they were involved in an accident where someone died ),

P(A) = 0.0001584 ( probability of a driver being involved in an accident where someone died ), and

P(D) = 0.00387 ( probability of a driver being drunk )

We can figure out P(A|D) ( probability of a drunk driver getting into a deadly accident )

P(A|D) = P(D|A)P(A)/P(D) = (0.31*0.0001584)/0.00387 = 0.0127 (1.27%)

A 1.27% chance, about 1 in 79, is significant. Now let’s compare that to sober driving: we just need to calculate P(A|D^c), where D^c is the event that the driver is not drunk. Using the law of total probability, P(A) = P(A|D)P(D) + P(A|D^c)P(D^c), and shuffling a few terms, we get:

P(A|D^c) = (P(A) - P(A|D)P(D))/P(D^c) = (0.0001584 - 0.0127*0.00387)/(1 - 0.00387) = 0.00011, which is about 1 in 9,100.

## Conclusion

So the probability of getting into a deadly accident given that you are drunk is 1.27%, and the probability of getting into a deadly accident given that you are not drunk is about 0.011%. That means you are about 116 times more likely to get into a deadly accident while drunk.
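The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that plugs in the figures quoted in the article; the function name `bayes` and the variable names are illustrative and not part of the original post.

```python
# Inputs quoted in the article (2012 U.S. data).
p_d_given_a = 0.31              # P(D|A): share of traffic deaths involving a drunk driver
p_a = 33_561 / 211_814_830      # P(A): traffic deaths divided by licensed drivers
p_d = 0.00387                   # P(D): probability a driver is drunk (BAC >= 0.1%)


def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# P(A|D): probability of a deadly accident given that the driver is drunk.
p_a_given_d = bayes(p_d_given_a, p_a, p_d)

# P(A|D^c), rearranged from the law of total probability:
#   P(A) = P(A|D) P(D) + P(A|D^c) P(D^c)
p_a_given_sober = (p_a - p_a_given_d * p_d) / (1 - p_d)

# How many times more likely a deadly accident is when drunk.
risk_ratio = p_a_given_d / p_a_given_sober

print(f"P(A|D)   = {p_a_given_d:.4%}")
print(f"P(A|D^c) = {p_a_given_sober:.4%}")
print(f"ratio    = {risk_ratio:.1f}")
```

With these inputs the script reports P(A|D) near 1.27%, P(A|D^c) near 0.011%, and a ratio of roughly 116. Because it uses the unrounded value of P(A), the last digits can differ slightly from the hand calculation.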

### References

[1] Impaired Driving: Get the Facts Centers for Disease Control http://www.cdc.gov/Motorvehiclesafety/impaired_driving/impaired-drv_factsheet.html

[2] Total licensed drivers U.S. Department of Transportation Federal Highway Administration http://www.fhwa.dot.gov/policyinformation/statistics/2012/dl22.cfm

[3] Probability of arrest while driving under the influence (George A Beitel, Michael C Sharp, William D Glauz) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1730617/pdf/v006p00158.pdf

Notes on [3]: we don’t have P(D) directly, but we do have P(D|A1), P(A1), and P(A1|D), where A1 is the event that a person is arrested. Applying Bayes’ theorem, we can then find P(D) = (P(D|A1)P(A1))/P(A1|D) = (0.06×0.000374)/0.0058 = 0.00387.

While taking the Coursera class on Mining Massive Datasets, the problem of computing word frequency for very large documents came up. I wanted some convenient tools for breaking documents into streams of words, and also a tool to remove common words like ‘the’, so I wrote `words` and `decommonize`. The `decommonize` script is just a big `grep -v '(foo|bar|baz)'`, where the words foo, bar and baz come from a file. I made a script, `generate_decommonize`, that reads in a list of common words and builds the regex for `grep -v`.
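The idea behind `generate_decommonize` can be sketched in Python. This is not the original script: the function names and the short word list are hypothetical. Anchoring the pattern so that it matches whole words only is also an assumption, made because the input is one word per line.

```python
import re


def build_common_word_pattern(common_words):
    """Build an alternation regex such as ^(?:foo|bar|baz)$ from a word list."""
    escaped = [re.escape(w.strip().lower()) for w in common_words if w.strip()]
    return re.compile(r"^(?:" + "|".join(escaped) + r")$")


def decommonize(words, pattern):
    """Keep only the words that do NOT match the pattern, like grep -v."""
    return [w for w in words if not pattern.match(w)]


# Hypothetical stand-in for the file of common words.
common = ["the", "of", "and", "to"]
pattern = build_common_word_pattern(common)

filtered = decommonize(["the", "course", "of", "human", "events"], pattern)
print(filtered)  # → ['course', 'human', 'events']
```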

## Example usage of `words` and `decommonize`

The full source code is available here on GitHub.

After running `make install`, you should have `words` and `decommonize` in your PATH. You can use them to find key words that are characteristic of a document. I chose:

• the U.S. Declaration of Independence
• Sherlock Holmes
• Working with Unix Processes (by @jstorimer)

So `words` breaks up the document into lower-case alphabetic words, then `decommonize` greps out the common words, and `sort` and `uniq -c` are used to count instances of each decommonized word, and then the results are sorted.
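The same pipeline can be approximated in Python. This is a rough sketch, not a port of the actual tools: the function names, the sample text and the tiny common-word set are made up for illustration. `collections.Counter` stands in for `sort | uniq -c` followed by the final sort, and a set lookup replaces the `grep -v` filter.

```python
import re
from collections import Counter


def words(text):
    """Split text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, common_words, n=5):
    """Count the words that are not common and return the n most frequent."""
    common = set(common_words)
    counts = Counter(w for w in words(text) if w not in common)
    return counts.most_common(n)


# Hypothetical sample input and common-word list.
sample = "The cat and the hat. The cat sat; the cat ran."
common = ["the", "and"]

for word, count in top_words(sample, common):
    # Print in the "count word" layout that uniq -c produces.
    print(f"{count:7d} {word}")
```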

The White House just released the first ever open source budget proposal. It is released on GitHub, and it’s a bunch of CSV files. This is not very difficult; it requires only a few extra clicks when exporting an Excel spreadsheet. But hosting it on GitHub also opens it up to Pull Requests, which I’ve talked about before as being a much better tool for 21st century democracy. Instead of paper and a bunch of politicians in a room following procedure, we should have a digital system where all citizens can contribute as easily as they can update a Facebook status or apply an Instagram filter.

One huge caveat is in order, though: there is no reason to assume that the White House and Congress will even consider pull requests, let alone apply them. That aside, I will experiment with this. I’ve already modified textql so that I can easily query these CSV files from a SQLite database. If I have an idea about how I’d like to change the budget, I’ll submit a pull request and then follow its response, if any.

Caveats aside, I am impressed with the choice of technologies for making these public issues more accessible.

I modified my tipcalc program to handle expressions of arbitrary depth, so now it can handle input like `(((\$100 + 2%) + 2%) - 3%) + 3.5%`.

The trick was to change the `start` symbol to match `binary_expression`, and then define `binary_expression` recursively, like so:

This is what makes this new version a context-free grammar and not a regular grammar. Now, if you think that you could still handle this input with a regular expression, notice that adding percentages is not associative. For example, you might think we could drop the parens and just parse `\$100 + 2% + 2% + 2%` using `/\\$\d+ (\+ \d\%)+/`


However, suppose we instead wrote `\$100 + 2% - 2% + 2%`. Treating the operations as associative would let us cancel `+ 2% - 2%` and reduce the expression to `\$100 + 2%`, which is \$102. When associated to the left, as `((\$100 + 2%) - 2%) + 2%`, the result is about \$101.96, which is clearly different.
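To make the recursion and the left-to-right evaluation concrete, here is a small recursive-descent evaluator in Python. It is a sketch under assumptions, not the bison grammar from tipcalc. The two rules in the comments are a guess at an equivalent grammar, the function names are hypothetical, and the tokenizer silently skips characters it does not recognize.

```python
import re


def evaluate(text):
    """Evaluate expressions such as '(($100 + 2%) + 2%) - 3%', left to right."""
    tokens = re.findall(r"\$|\d+(?:\.\d+)?|[-+%()]", text)
    pos = 0

    def take(expected=None):
        """Consume and return the next token, optionally checking its value."""
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def term():
        # term := '$' NUMBER | '(' expression ')'
        if tokens[pos:pos + 1] == ["("]:
            take("(")
            value = expression()  # the recursion that a regular grammar lacks
            take(")")
            return value
        take("$")
        return float(take())

    def expression():
        # expression := term (('+' | '-') NUMBER '%')*
        value = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = take()
            percent = float(take())
            take("%")
            factor = (1 + percent / 100) if op == "+" else (1 - percent / 100)
            value *= factor  # each percentage applies to the running total
        return value

    result = expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


print(round(evaluate("$100 + 2% - 2% + 2%"), 2))
print(round(evaluate("$100 + 2%"), 2))
print(round(evaluate("((($100 + 2%) + 2%) - 3%) + 3.5%"), 2))
```

Each percentage is applied to the running total, so `+ 2% - 2%` multiplies by 1.02 × 0.98 rather than cancelling out. That is why the first two results differ.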

As long as I’ve been able to do arithmetic, I’ve been able to calculate taxes and tips; it’s easy. Given a dollar value of \$17.91, we can figure out the total with an 18% tip as \$17.91 × 1.18 ≈ \$21.13.

However, it would be nice just to enter in `\$17.91 + 18%` and have the computer figure it out. So one time at lunch after calculating the tip for a burrito I decided to learn lex and bison, which can be used together to create a mini language.

The grammar I used was the following:

Where `OP_PLUS` and `OP_MINUS` come from `+` and `-`. Also, `TOKDOLLAR` and `TOKPERCENT` are `\$` and `%`.

Then, below each grammar rule, I added some C code that would be generated if the input matches that rule:

The full source code is available here.

Now, it is true that this is no more powerful than a regular expression. However, I intend to modify it to allow nested expressions like `((\$2 + 4%) + 4%)`, which would be useful for compound interest calculations. That would be more powerful than regular expressions, meaning it would need at least a context-free grammar.

Update: I later wrote about implementing this.
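To illustrate the point about regular expressions, here is how the flat form could be handled with a single regex in Python. This is only a sketch: it assumes the input is one dollar amount followed by exactly one percentage adjustment, and the names `FLAT` and `tip_total` are made up for this example.

```python
import re

# One dollar amount followed by a single +/- percentage, e.g. "$17.91 + 18%".
FLAT = re.compile(r"^\s*\$(\d+(?:\.\d+)?)\s*([+-])\s*(\d+(?:\.\d+)?)%\s*$")


def tip_total(expr):
    """Return the total for a flat expression such as '$17.91 + 18%'."""
    m = FLAT.match(expr)
    if m is None:
        raise ValueError(f"not a flat tip expression: {expr!r}")
    amount = float(m.group(1))
    percent = float(m.group(3))
    sign = 1 if m.group(2) == "+" else -1
    return amount * (1 + sign * percent / 100)


total = tip_total("$17.91 + 18%")
print(f"${total:.2f}")  # → $21.13
```

A nested input such as `(($2 + 4%) + 4%)` is rejected by this pattern with a `ValueError`. Matching arbitrarily deep, balanced parentheses is exactly what a regular expression cannot do.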