tobilehman.com: a blog on computing, structure and math

Reinventing the Wheel: Or How I Learned to Stop Coding and Read the Manpages

About a month ago I wrote about a command line utility I made that calculates word and character frequencies. It was packaged as a Ruby gem and it worked well in Unix pipelines.

Then, about two or three weeks later, I came across a post on Twitter from @UnixToolTip.

And I realized that I could construct a one-liner that does what my gem did, and probably faster too. I know about uniq and sort, and I’ve used awk a little bit, but I’m not really familiar with most of its capabilities.

The two features I implemented in Ruby were (1) counting word frequencies and (2) counting character frequencies. I defaulted everything to lower case and stripped out non-alphanumeric characters.

Using @UnixToolTip’s suggestion of uniq -c, I came up with this alternative:

tr -s '[:space:]' '\n' < filename | sed 's/[^a-zA-Z0-9]//g' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | head

This just outputs the file, splits everything up by whitespace, strips out anything that isn’t alphanumeric, then lowercases, sorts, and counts the number of repetitions using uniq -c. The result is then sorted numerically in descending order, so the most frequent items are at the top of the output, and head displays just the first 10 lines. There are some small numerical differences between this and my gem, most likely because I split on word boundaries in Ruby but on whitespace in the bash one-liner.
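To see how the choice of splitting rule can shift the counts, here is a minimal sketch that puts the two approaches side by side. The function names are hypothetical, and the "word boundary" variant only approximates what a regex-based split might do. It is not taken from the gem. It also drops empty lines with grep, which the one-liner above does not do.

```shell
# Hypothetical helper: split on every run of non-alphanumeric characters,
# which roughly approximates splitting on word boundaries.
word_freq_boundaries() {
  tr -cs '[:alnum:]' '\n' |        # each run of non-alphanumerics becomes one newline
    tr '[:upper:]' '[:lower:]' |   # lowercase everything
    grep -v '^$' |                 # drop the empty line that leading punctuation can leave
    sort | uniq -c | sort -nr | head
}

# Hypothetical helper: split on whitespace first, then strip punctuation,
# as the one-liner above does.
word_freq_whitespace() {
  tr -s '[:space:]' '\n' |
    sed 's/[^a-zA-Z0-9]//g' |
    tr 'A-Z' 'a-z' |
    sort | uniq -c | sort -nr | head
}

# "Don't" becomes "don" and "t" in the first variant, but "dont" in the second.
printf "Don't stop. Don't go.\n" | word_freq_boundaries
printf "Don't stop. Don't go.\n" | word_freq_whitespace
```

Contractions and hyphenated words are the obvious cases where the two variants disagree, so a text with many of them would show the largest differences.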

For the problem I was trying to solve, I could have saved some time by digging through the manpages instead of writing another gem. I did enjoy working with the RubyGems packaging system, but I am starting to think that was overkill.

NOTE: For the character count feature, all I have to do is output one character per line and feed that into the same pipeline to get the desired output:

(CONTENTS OF FILENAME, 1 CHARACTER PER LINE) | sed 's/[^a-zA-Z0-9]//g' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | head

I’m not sure how to do this at the moment. I think awk can do it pretty simply, and I’ll read the manpages, but for now I have to get to work.
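One possible way to get one character per line, sketched here with fold rather than awk: fold -w 1 wraps every input line at a width of one column. The function name is hypothetical, and the sketch assumes plain ASCII input, since fold's handling of multi-byte characters varies between implementations. It also adds a grep step to drop the empty lines left behind by spaces and punctuation, which would otherwise be counted as a blank entry.

```shell
# Hypothetical helper: character frequencies from standard input.
# Assumes ASCII text; multi-byte characters may be split incorrectly by fold.
char_freq() {
  fold -w 1 |                    # wrap at width 1, giving one character per line
    sed 's/[^a-zA-Z0-9]//g' |    # strip anything that isn't alphanumeric
    tr 'A-Z' 'a-z' |             # lowercase
    grep -v '^$' |               # skip lines emptied by the sed step
    sort | uniq -c | sort -nr | head
}

# Example: "l" appears three times and "o" twice.
printf 'Hello, World!\n' | char_freq
```

To run it against a file instead, redirect the file into the function, for example char_freq < filename.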
