Get All Phone Numbers From All Word Documents

There is a common interview question where you have a directory structure containing, let’s say, Word documents, and you’re asked to find all of the phone numbers in all of them.  I decided to poke around and devise a solution to this problem (credit goes to this post for getting me started).

Searching raw text files for phone numbers is easy, but Word documents are slightly harder since a .docx file is a zip archive of XML rather than plain text.  I found a neat SourceForge project called docx2txt, a handy little Perl program that converts .docx files to easily searchable text files.  A bit of setup is needed to configure the unzip program location in the docx2txt.config file, but that’s it.  🙂

Warning: doing this on Windows is a bit more difficult than on Linux, since you’ll have to install Perl and Cygwin to run my solution.

My solution:

Download and set up docx2txt from SourceForge, cd to the root of the folder containing the .docx files, then run this:

$ find . -name "*.docx" |
      xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {};
      find . -name "*.txt" |
      xargs -i cat {} |
      grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" |
      sort -u

Voila!  Yikes…

This is a long, multi-part statement though.  Here’s a bit about each part:

  • find . -name "*.docx" --> recurse through all directories, starting at the current directory, and find all *.docx files
  • | xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {} --> for each .docx file found by the previous step, pass it as an argument to docx2txt.pl
  • ; find . -name "*.txt" --> wait for the previous statement to finish, then find all of the .txt files generated by the second part
  • | xargs -i cat {} --> output the contents of each text file
  • | grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" --> search for all phone numbers of the format ###-###-#### (with dashes, periods, or spaces as separators, and optional parentheses around the area code). Regular expressions are not fun, but they are made a lot easier by tools like Just Great Software’s RegexBuddy
  • | sort -u --> finally output each of the phone numbers, sorted and with duplicates removed
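If you want to poke at the regular expression on its own, here is a minimal Java sketch of just the grep and sort -u steps. The class name PhoneNumberFinder, the method findAll, and the sample text are made up for illustration; only the pattern comes from the command above. Note that Java sorts strings by character code, which may order results slightly differently than sort does under some locales.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are illustrative only.
class PhoneNumberFinder {
    // Same pattern as the grep step: optional "(", three digits, optional ")",
    // an optional separator, three digits, an optional separator, four digits.
    private static final Pattern PHONE = Pattern.compile(
            "\\(?\\b[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\\b");

    // Returns each distinct match in sorted order, roughly like `grep -o ... | sort -u`.
    static List<String> findAll(String text) {
        TreeSet<String> unique = new TreeSet<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            unique.add(m.group());
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the converted .txt files.
        String sample = "Call 888-555-0001 or (888)555-0003.\n"
                + "Fax: 888.555.0004, again 888-555-0001";
        for (String number : findAll(sample)) {
            System.out.println(number);
        }
    }
}
```

The TreeSet does the work of sort -u here: it drops the repeated 888-555-0001 and keeps the remaining matches in order.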

You should end up with something like this:

$ find . -name "*.docx" |
> xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {};
> find . -name "*.txt" |
> xargs -i cat {} |
> grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" |
> sort -u
(888)555-0003
888-555-0001
888-555-0002
888-555-0006
888.555-0005
888.555.0004

What’s the Last Thing You’d Think Of?

Airplane flights are good times to think

I have a fun mental exercise I like to run through called “what’s the last thing I’d ever think of.” It falls along the lines of the Zen Kōans (e.g. “what is the sound of one hand clapping”), and I find it helps me open my mind when I feel like I’m too deep in a project. It’s something I like to do when relaxing on the back porch, or sitting in an airplane like I’m doing now. It’s a fun way to pass the time that obviously can never be completed, because you can always find something more obscure than your previous thought.

I used to try to think of the last thing I’d ever think of at the most intense times in school, so even during my high school graduation I tried to force myself to remember something totally far away (in time and/or distance), like what the blender at home would be experiencing at that very moment, or the little newts my sister and I used to find under rocks in my backyard in Virginia. Even when seemingly important events are going on around me, or a problem is consuming my every thought, things like blenders at home and amphibians under rocks in backyards in Virginia still exist.  Maybe next time you’re stuck on a problem, try to broaden your thoughts to give your brain a chance to relax, then let it refocus on the problem from a different angle.

Human Readable Code

I recently got my virtual wrist slapped when a developer asked about good coding practices, and I over-generally recommended that he should “comment like crazy”. Immediately I received several responses implying that comments are evil, that truly well written code should need few or no comments to explain what it is doing, and that the variable and function names should do most of the explaining. Could this be correct? Some people even say that inline comments are a code smell.  In school, we were forced to heavily comment all code in our introductory classes, and once in upper-division classes we were no longer required, but still strongly encouraged, to do so. Good comments were always brought up when lectures briefly mentioned coding styles and standards. How, then, could very readable but non-commented code exist?

I did a brief Google search on “Human Readable Code”, and was delightfully impressed when I discovered a brief excerpt from Joshua Kerievsky’s book Refactoring to Patterns.  In this excerpt, he uses some code written by Ward Cunningham which gives the perfect example of human-readable code:

november(20, 2005)

Afterwards, Kerievsky goes on to state in general terms that human readable code:

  • Reads like spoken language
  • Separates important code from distracting code

His example is beautiful!  It perfectly illustrates what perfect human readable code is (assuming the function returns a Date object initialized to Nov. 20th, 2005).

However, I have never seen production code like this.  I wondered why, but a look around at work makes it obvious.  When would the average production programmer have time to write and rewrite function names so that someone unfamiliar with his work (or even himself 12 months later) could view the function and know at a glance exactly what was being done?  Kerievsky’s example above is extremely beautiful code, but the mere fact that he has a november function implies there are most likely january, february, march, april, may, june, july, august, september, october, and december functions as well.  Realistically though, it is just bad code.  Though descriptive, it will never be used in production environments.  I won’t get into the potential headache that could arise from having to refactor each of these functions if the underlying code of each function had to be changed.  What I did notice is that Kerievsky said that the above code is nicer than this:

java.util.Calendar c = java.util.Calendar.getInstance();
c.set(2005, java.util.Calendar.NOVEMBER, 20);
c.getTime();

While that is certainly true, production coders just don’t get into the proper state of mind to write the best, most human readable name for the function.  The problem with function names is that they need to be specific enough to be understood by other people reading or using the code, but vague enough that the function can be used for multiple purposes (such as getting any date).  Most programmers would wrap the above code in a function like this:

Date getDate(int year, int month, int day)
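To make the comparison concrete, here is a rough sketch of how both styles might look side by side. The class name Dates, the 1-based month parameter, and both method bodies are assumptions for illustration; the excerpt only shows the november(20, 2005) call, and this is not Cunningham’s or Kerievsky’s actual code.

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical class name; neither method body appears in the excerpt.
class Dates {
    // General-purpose wrapper. The month is 1-based here (1 = January),
    // which is an assumption; Calendar itself counts months from 0.
    static Date getDate(int year, int month, int day) {
        Calendar c = Calendar.getInstance();
        c.clear(); // reset the time-of-day fields so equal dates compare equal
        c.set(year, month - 1, day);
        return c.getTime();
    }

    // Readable wrapper in the style of november(20, 2005).
    static Date november(int day, int year) {
        return getDate(year, 11, day);
    }
}
```

Letting the month-named function delegate to the general one is one possible way to keep the Calendar details in a single place, although it still leaves eleven more one-line functions to write.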

While the november(20, 2005) sample is a perfect example of human readable code, the getDate example will be seen over and over again simply because it is good enough.  The problem arises with method names that need to be generic (I’m specifically thinking of functions declared in interfaces or headers), along the lines of establishConnection() or initializeVariables().  These names are nearly always too generic to explain themselves, yet the interfaces that declare them usually require them to be.

Enter the comment!

Comments not only increase code’s human readability, but are essential to it.  I firmly believe that stigmas against commenting arise from comments like this:

//initialize the variables
initializeVariables()

That’s just nasty.

I propose a solution: Write comments before you even begin programming.  Describe exactly what you are about to do in your code, both to yourself and to the future code maintainer who will see the code (and who may even look back at the source-control logs to see which developer is punishing them from months or years in the past, who may even be yourself :D).  After your first draft, rewrite it in language that is as short and concise as possible, summarizing what you are going to do into actionable steps or verbs.  Finish off by filling in the code.

This absolutely should be done before you write any code.  It works much better because your “English brain” is still capable of coming up with complete, intelligible sentences, and your engineer brain hasn’t quite taken over.  Furthermore, the code will be much more readable in the future when you have to come back to find what went wrong or got overlooked.  If you start commenting after writing code, you are likely to write comments like the one above.  Write what you mean to do in comments first, in human-readable English, then start programming.

I feel like this article would benefit from a trivial example, so this is a C# function I wrote for my server-side development class to parse Twitter-like “mentions” of a user out of a user’s message (i.e. “blah blah blah @brianesserlieu”):

Phase 1 – write a summary of what I want to do:

private static List<RegisteredUser> ParseMessage(string message)
{
    //I need to check the message and see if there are any mentions at all.
    //If there are, I need to parse each mention, and check to see if the
    //parsed users exist.  If they do exist, I need to add them to the list
    //of users to be tagged to this message and return that list of users.
    return null;
}

Phase 2 – break it down into short, concise “verbs”:

private static List<RegisteredUser> ParseMessage(string message)
{
    //parse the string, get each "@..."

    //check to see if any parsed users exist

    //if they do, then check to see if they are valid users

    //if user found, verify the user is an existing user

    //if user exists then add them to the list of users

    //finally return the list of users

    return null;
}

Phase 3 – Fill in the code

private static List<RegisteredUser> ParseMessage(string message)
{
  //needs System.Collections.Generic and System.Text.RegularExpressions
  List<RegisteredUser> userList = new List<RegisteredUser>();
  List<string> errors = new List<string>(); //element type assumed; it was lost from the original listing

  //parse the string, get each "@..."
  string[] splitStrings = message.Split('@');

  //check to see if any parsed users exist
  if (splitStrings.Length > 1)
  {
    //skip the first piece: it is the text before the first '@', not a mention
    for (int i = 1; i < splitStrings.Length; i++)
    {
      //if they do, then check to see if they are valid users
      Match match = Regex.Match(splitStrings[i], @"\A([A-Za-z0-9]+)", RegexOptions.IgnoreCase);

      if (match.Success)
      {
        //if user found, verify the user is an existing user
        string parsedUsername = match.Groups[1].Value;

        RegisteredUser user = DALUser.GetUser(parsedUsername, ref errors);
        if (user != null)
        {
          //if user exists then add them to the list of users
          userList.Add(user);
        }
      }
    }
  }

  //return the list of users
  return userList;
}
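For comparison, the parsing step can also be written as one regular expression that looks for the “@” and the name together, instead of splitting on '@' first. Below is a minimal Java sketch written comments-first in the same way. The names MentionParser and parseMentions are made up for illustration, and the sketch only pulls out candidate usernames; it does not look them up the way DALUser.GetUser does above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it only extracts candidate usernames and performs
// no user lookup.
class MentionParser {
    // An "@" followed by one or more letters or digits.
    private static final Pattern MENTION = Pattern.compile("@([A-Za-z0-9]+)");

    static List<String> parseMentions(String message) {
        List<String> usernames = new ArrayList<>();

        //parse the string, get each "@..."
        Matcher m = MENTION.matcher(message);
        while (m.find()) {
            //keep only the name, without the "@"
            usernames.add(m.group(1));
        }

        //finally return the list of candidate usernames
        return usernames;
    }
}
```

Calling MentionParser.parseMentions("blah blah blah @brianesserlieu") returns a list containing only brianesserlieu.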

Just remember, when you press the build button or type in make, all sorts of magical things happen. The compilation process removes all the comments, expands macros, inlines inline functions, and links everything into one wonderful computer-readable program. Before all of that, computer programmers need to focus on everything that happens before that build button gets pressed, and that is writing human readable code.

Here’s a great discussion on writing and commenting good code.

On Your Mark, Get Set…

So I’ve come to a strange point in my life where, maybe for the first and only time, I can be employed but can also openly say that I’m thinking of looking for a full-time job away from the one that I’m at now.  I have put a lot of thought into this.  I have asked close friends, coworkers, and mentors, and have even posted the question to the programmers.stackexchange.com website on whether or not I should stay with the company I have interned at for the past 3 years.  I have heard differing advice from as many people as I have asked.  A few say I should stay, a few say I should go, but all agree I should at least interview for other jobs, if only to keep my options open.

My name is Brian Esserlieu, and I am a Computer Science undergraduate at the University of California, San Diego.  I am doing exceptionally well academically (with a 3.9 GPA), and professionally I have had internships in the mobile application, web, and PC application areas since 2007.  After deciding to interview for other potential positions, I realized I would like to start a blog, not only to collect all of my ideas and experiences with interview preparation, but also to try to share as much of it as I can with anyone who may be starting out fresh in the programming world.

I have never really had a personal blog to capture my own interests (besides the one I worked on for 9 weeks last winter for a digital communications class), but I hope it will be as interesting to read as it will be fun to write.