While taking the Coursera class on Mining Massive Datasets, I ran into the problem of computing word frequencies for very large documents. I wanted some convenient tools for breaking documents into streams of words, and also a tool to remove common words like ‘the’, so I wrote up words and decommonize. The decommonize script is just a big grep -v '(foo|bar|baz)', where the words foo, bar, and baz come from the words in a file. I made a script, generate_decommonize, that reads in a list of common words and builds the regex for grep -v.
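A minimal sketch of what generate_decommonize might do, assuming the common words live one per line in a file (the file name common_words.txt is just for illustration; the real script is in the repo linked below):

$ cat generate_decommonize
#!/bin/sh
# Sketch: join the common words (one per line in common_words.txt) with '|'
# into an alternation like (the|and|of), then emit a decommonize script that
# filters out any line matching one of them.
pattern=$(tr '\n' '|' < common_words.txt | sed 's/|$//')
cat > decommonize <<EOF
#!/bin/sh
# Drop lines that consist entirely of a common word.
grep -v -E -x '($pattern)'
EOF
chmod +x decommonize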

Example usage of words and decommonize

The full source code is available here on GitHub.

After running make install, you should have words and decommonize in your PATH. You can use them to find key words that are characteristic of a document. I chose:

  • the U.S. Declaration of Independence:
$ words < declaration_of_independence.txt | decommonize  | sort | uniq -c | sort -n | tail
   4 time
   5 among
   5 most
   5 powers
   6 government
   6 such
   7 right
   8 states
   9 laws
  10 people
  • Sherlock Holmes
$ words < doyle_sherlock_holmes.txt | decommonize  | sort | uniq -c | sort -n | tail
 174 think
 175 more
 177 over
 212 may
 212 should
 269 little
 274 mr
 288 man
 463 holmes
 466 upon
  • Working with Unix Processes (by @jstorimer)
$ words < working_with_unix_processes.txt | decommonize  | sort | uniq -c | sort -n | tail
  73 signal
  82 system
  88 ruby
  90 exit
 100 code
 100 parent
 143 its
 146 child
 184 processes
 444 process

So words breaks the document into a stream of lower-case alphabetic words, decommonize greps out the common ones, sort and uniq -c count the instances of each remaining word, and the final sort -n | tail shows the most frequent ones.
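For reference, the core of words can be approximated with a short tr pipeline; this is only a sketch of the idea, not necessarily the exact implementation in the repo:

$ cat words
#!/bin/sh
# Sketch: replace every run of non-alphabetic characters with a newline,
# lower-case everything, and drop any empty lines.
tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | grep -v '^$'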

The White House just released the first-ever open-source budget proposal. It is hosted on GitHub as a bunch of CSV files. This is not very difficult technically, requiring only a few extra clicks when exporting an Excel spreadsheet, but hosting it on GitHub also opens it up to pull requests, which I’ve talked about before as being a much better tool for 21st-century democracy. Instead of paper and a bunch of politicians in a room following procedure, we should instead have a digital system where all citizens can contribute as easily as they can update a Facebook status or apply an Instagram filter.

One huge caveat is in order, though: there is no reason to assume that the White House and Congress will even consider pull requests, let alone apply them. That aside, I will experiment with this; I’ve already modified textql so that I can easily query these CSV files from a SQLite database. If I have an idea about how I’d like to change the budget, I’ll submit the pull request and then follow its response, if any.
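As an alternative to textql, the CSV files can also be loaded and queried straight from the sqlite3 shell; the file name and column names below are made up for illustration:

$ sqlite3 :memory:
sqlite> .mode csv
sqlite> .import outlays.csv outlays
sqlite> SELECT agency, SUM(amount) FROM outlays GROUP BY agency;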

Caveats aside, I am impressed with the choice of technologies for making these public issues more accessible.

I modified my tipcalc program to handle expressions of arbitrary depth, so now it can handle input like ((($100 + 2%) + 2%) - 3%) + 3.5%.

The trick was to change the start symbol to match binary_expression, and then define binary_expression recursively, like so:

binary_expression:
    dollars OP_PLUS percentage
    |
    dollars OP_MINUS percentage
    |
    LPAREN binary_expression RPAREN OP_PLUS percentage
    |
    LPAREN binary_expression RPAREN OP_MINUS percentage
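
Each alternative also needs an action that passes its computed value up through $$; roughly like this sketch (the exact actions are in the tipcalc source):

binary_expression:
    dollars OP_PLUS percentage
    { $$ = $1 + $1 * ($3 / 100.0); }
    |
    dollars OP_MINUS percentage
    { $$ = $1 - $1 * ($3 / 100.0); }
    |
    LPAREN binary_expression RPAREN OP_PLUS percentage
    { $$ = $2 + $2 * ($5 / 100.0); }
    |
    LPAREN binary_expression RPAREN OP_MINUS percentage
    { $$ = $2 - $2 * ($5 / 100.0); }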

This recursion is what makes the new version a context-free grammar and not a regular grammar. Now, if you think that you could still handle this input with a regular expression, notice that adding and subtracting percentages does not cancel out the way ordinary addition and subtraction do. For example, you might think we could drop the parens and just parse $100 + 2% + 2% + 2% using /\$\d+ (\+ \d\%)+/:

\$\d+ (\+ \d\%)+

(Regular expression visualization: Debuggex demo.)

However, if instead we wrote $100 + 2% - 2% + 2%, you might expect the +2% and -2% to cancel, reducing it to $100 + 2% = $102. But evaluated left to right as (($100 + 2%) - 2%) + 2%, it works out to $100 * 1.02 * 0.98 * 1.02 ≈ $101.96, which is clearly different from $102.

As long as I’ve been able to do arithmetic, I’ve been able to figure out taxes and tips; it’s easy. Given a dollar value of $17.91, we can figure out the total with a tip of 18% as $17.91*(1.18) = $21.13.

However, it would be nice to just enter $17.91 + 18% and have the computer figure it out. So one time at lunch, after calculating the tip for a burrito, I decided to learn lex and bison, which can be used together to create a mini language.

The grammar I used was the following:

start:
    dollars OP_PLUS percentage
    |
    dollars OP_MINUS percentage

dollars:
    TOKDOLLAR NUMBER

percentage:
    NUMBER TOKPERCENT

Here OP_PLUS and OP_MINUS come from + and -, while TOKDOLLAR and TOKPERCENT come from $ and %.
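On the lexer side, the token rules might look roughly like this; this is only a sketch (the header name y.tab.h and the choice of double as the semantic value type YYSTYPE are assumptions; the real lex file is in the source linked below):

%{
#include <stdlib.h>
#define YYSTYPE double      /* assume token values (numbers) are doubles */
#include "y.tab.h"
%}
%option noyywrap

%%
"$"                 { return TOKDOLLAR; }
"%"                 { return TOKPERCENT; }
"+"                 { return OP_PLUS; }
"-"                 { return OP_MINUS; }
[0-9]+(\.[0-9]+)?   { yylval = atof(yytext); return NUMBER; }
[ \t\n]+            { /* skip whitespace */ }
%%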

Then, below each grammar rule, I added some C code that is executed when the input matches that rule:

start:
    dollars OP_PLUS percentage
    {
        double dollars = $1;
        double percentage = ($3)/(100.0);
        double total = dollars + dollars*percentage;
        printf("$%.2f", total);
    }
    |
    dollars OP_MINUS percentage
    {
        double dollars = $1;
        double percentage = ($3)/(100.0);
        double total = dollars - dollars*percentage;
        printf("$%.2f", total);
    }

The full source code is available here.

Now, it is true that this is no more powerful than a regular expression. However, I intend to modify it to allow nested expressions like (($2 + 4%) + 4%), which would be useful for compound interest calculations. That would be more powerful than regular expressions, meaning the language would require at least a context-free grammar.
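For example, two rounds of 5% interest on $1000 could be written as (($1000 + 5%) + 5%), which works out to $1000 * 1.05 * 1.05 = $1102.50.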

Update: I later wrote about implementing this.