A common interview question goes like this: you have a directory structure containing, let's say, Word documents, and you need to find all of the phone numbers in all of them. I decided to poke around and devise a solution to this problem (credit goes to this post for getting me started).
Searching raw text files for phone numbers is easy, but Word documents are slightly harder since a .docx file is a zip-compressed archive rather than plain text. I found a neat SourceForge project called docx2txt, a handy little Perl program that converts .docx files to easily searchable text files. A bit of setup is needed to configure the unzip program location in the docx2txt.config file, but that's it. 🙂
Warning: doing this on Windows is a bit more difficult than on Linux, since you'll have to download Perl and Cygwin to run my solution.
My solution:
Get and set up docx2txt from SourceForge, cd to the root of the folder containing the .docx files, then run this:
$ find . -name "*.docx" | xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {}; find . -name "*.txt" | xargs -i cat {} | grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" | sort -u
Voila! Yikes…
This is a long, multi-part statement though. Here’s a bit about each part:
- find . -name "*.docx" --> recurse through all directories, starting at the current directory, and find all *.docx files
- | xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {} --> for each .docx file found by the previous command, pass its name as an argument to docx2txt.pl
- ; find . -name "*.txt" --> wait for the previous command to finish, then find all of the .txt files generated by the conversion step
- | xargs -i cat {} --> output the contents of each text file
- | grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" --> search for all phone numbers of the format ###-###-#### or (###)###-####, using hyphens, periods, or spaces as separators. -o prints only the matched text, -h suppresses file names, and -P enables Perl-compatible regular expressions. Regular expressions are not fun, but they are made a lot easier by tools like Just Great Software's RegexBuddy
- | sort -u --> finally, sort the phone numbers and output each one once (removing duplicates)
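To see the regular expression at work without any Word documents, here is a minimal sketch that wraps the grep and sort steps in a shell function and feeds it a few lines of text. The function name extract_phones and the sample text are made up for illustration, and the sketch assumes GNU grep built with PCRE support (the -P option).

```shell
# Print each distinct phone number found on stdin, one per line.
# Assumes GNU grep with PCRE support (-P); extract_phones is a
# hypothetical helper name, not part of docx2txt or grep.
extract_phones() {
  grep -oP '\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b' | sort -u
}

# Made-up sample text. The 13-digit order number is skipped because
# the trailing \b requires the match to end at a word boundary.
printf '%s\n' \
  'Call (888)555-0003 or 888.555.0004 today.' \
  'Fax: 888-555-0001, order 1234567890123.' \
  'Call (888)555-0003 again.' | extract_phones
```

This prints the three distinct numbers, with the repeated (888)555-0003 collapsed to a single line. The order of the lines depends on the locale that sort runs under.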
You should end up with something like this:
$ find . -name "*.docx" |
> xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {};
> find . -name "*.txt" |
> xargs -i cat {} |
> grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" |
> sort -u
(888)555-0003
888-555-0001
888-555-0002
888-555-0006
888.555-0005
888.555.0004
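One caveat with the conversion step: xargs -i reads file names line by line and still interprets quotes and backslashes, so a file such as Bob's resume.docx would trip it up. Here is a minimal sketch of a more defensive variant, assuming GNU or BSD find and xargs (-print0 and -0 are extensions, not POSIX). The name convert_all is a hypothetical helper, not something docx2txt provides.

```shell
# Run a converter command once for every .docx file under a directory.
# Usage: convert_all DIR COMMAND [ARGS...]
# Each file name is appended as the last argument to COMMAND.
# convert_all is a hypothetical helper name used for illustration.
convert_all() {
  dir=$1
  shift
  # -print0 and -0 separate file names with NUL bytes, so quotes,
  # backslashes, and newlines in names are passed through untouched.
  find "$dir" -name '*.docx' -print0 | xargs -0 -I {} "$@" {}
}
```

With the same docx2txt location as above, the first half of the one-liner would become: convert_all . perl c:/downloads/docx2txt-1.0/docx2txt.pl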