Get All Phone Numbers From All Word Documents

A common interview question gives you a directory structure full of, let's say, Word documents and asks you to find all of the phone numbers in all of them.  I decided to poke around and devise a solution to this problem (credit goes to this post for getting me started).

Searching raw text files for phone numbers is easy, but Word documents are slightly harder, since a .docx file is a zip-compressed archive of XML rather than plain text.  I found a neat SourceForge project called docx2txt, a handy little Perl program that converts .docx files to easily searchable text files.  A bit of setup is needed to configure the unzip program location in the docx2txt.config file, but that's it.  🙂

Warning: doing this on Windows is a bit more difficult than on Linux, since you'll have to download Perl and Cygwin to run my solution.

My solution:

Get and set up docx2txt from SourceForge, cd to the root of the folder containing the .docx files, then run this:

$ find . -name "*.docx" | 
      xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {}; 
      find . -name "*.txt" | 
      xargs -i cat {} | 
      grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" | 
      sort -u

Voila!  Yikes…

This is a long, multi-part statement though.  Here’s a bit about each part:

  • find . -name "*.docx" --> recurse through all directories, starting at the current directory, and find all *.docx files
  • | xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {} --> for each .docx file found by the previous part, pass its path as an argument to docx2txt.pl
  • ; find . -name "*.txt" --> wait for the previous statement to finish, then find all of the .txt files generated by the second part (along with any other .txt files already in the tree)
  • | xargs -i cat {} --> output the contents of each text file
  • | grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" --> search for all phone numbers of the format ###-###-#### (using -, ., or spaces as separators, with optional parentheses around the area code); -o prints only the matched text and -P enables Perl-style regular expressions. Regular expressions are not fun, but they are made a lot easier by tools like Just Great Software's RegexBuddy
  • | sort -u --> finally output each of the phone numbers, sorted and with duplicates removed
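
If you want to try the regular expression on its own before pointing it at real documents, here is a minimal sketch that feeds it text on standard input. The names PHONE_RE and extract_phone_numbers are made up for this example, and it assumes a grep that supports -P (GNU grep does). LC_ALL=C is only there to make the sort order predictable.

```shell
# The pattern from the pipeline above, kept in one variable for reuse.
# PHONE_RE and extract_phone_numbers are illustrative names, not part of any tool.
PHONE_RE='\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b'

# Read text on stdin and print each distinct phone number once.
extract_phone_numbers() {
  # -o prints only the matched text, -P enables Perl-style regex syntax.
  grep -oP "$PHONE_RE" | LC_ALL=C sort -u
}
```

For example, printf 'Call (888)555-0003 or 888.555.0004\n' | extract_phone_numbers prints the two numbers on separate lines. The trailing \b is what keeps a longer run of digits, such as an order number, from being reported as a phone number.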

You should end up with something like this:

$ find . -name "*.docx" | 
> xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {};
> find . -name "*.txt" |
> xargs -i cat {} | 
> grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" | 
> sort -u
(888)555-0003
888-555-0001
888-555-0002
888-555-0006
888.555-0005
888.555.0004
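
One wrinkle with the one-liner is that the second find picks up every .txt file in the tree, not just the ones docx2txt generated. A rough sketch of a variant that only reads the .txt sitting next to each .docx is below. It assumes, as the one-liner does, that the converter writes name.txt beside name.docx. The function names convert_docx and find_phone_numbers are invented for this example, and the converter path is the one used earlier in this post, so adjust it for your machine.

```shell
# Pattern copied from the one-liner above.
PHONE_RE='\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b'

# Hypothetical wrapper around the converter. The path is the one from this
# post; change it to wherever docx2txt.pl lives on your machine.
convert_docx() {
  perl c:/downloads/docx2txt-1.0/docx2txt.pl "$1"
}

# Convert every .docx under the given root (default: current directory),
# then search only the .txt file generated next to each one.
# Assumption: the converter writes name.txt beside name.docx.
find_phone_numbers() {
  find "${1:-.}" -name '*.docx' |
    while IFS= read -r docx; do
      # Keep the converter away from the file list on stdin and hide its chatter.
      convert_docx "$docx" </dev/null >/dev/null || continue
      cat "${docx%.docx}.txt"
    done |
    grep -oP "$PHONE_RE" | LC_ALL=C sort -u
}
```

Run it as find_phone_numbers . from the root of the folder containing the .docx files. Reading the file list line by line also copes with spaces in file names, which are common for Word documents.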
