In this example, I try something a bit more complicated. This was originally done in Ruby, as an exercise for an interview, but I’ve decided to make it a bake-off post to follow up on last month’s. What follows is mostly a linguistic analysis. In the coming months, I’ll start doing performance and reliability comparisons as well.
The Test
Working with only the standard libraries for Python and Ruby, I read in three files, each delimited differently, and each with different column orders and date formats. The goal: parse the data, combine it all into a single array-of-arrays, and provide a mechanism for printing formatted output to the console.
Here are the input data specifics:
You will be given 3 files, each containing records stored in a different format.
The pipe-delimited file lists each record as follows:
LastName | FirstName | MiddleInitial | Gender | FavoriteColor | DateOfBirth
The comma-delimited file looks like this:
LastName, FirstName, Gender, FavoriteColor, DateOfBirth
And lastly, the space-delimited file:
LastName FirstName MiddleInitial Gender DateOfBirth FavoriteColor
You may assume that the delimiters (commas, pipes and spaces) do not appear anywhere in the data values themselves. Write a program to read in records from these files and combine them into a single set of records.
And here is what the console output should look like:
An output record consists of the following 5 fields: last name, first name, gender, date of birth and favorite color.
o Output 1 – sorted by gender (females before males) then by last name ascending.
o Output 2 – sorted by birth date, ascending.
o Output 3 – sorted by last name, descending.
* Ensure that fields are displayed in the following order: last name, first name, gender, date of birth, favorite color.
* Display dates in the format MM/DD/YYYY.
My Thinking
I decided to write this as a rudimentary class library, to see how each language handles object orientation. Also, I decided to include a Perl version of the project, just for kicks, and to provide a “reference” comparison of the functionality of the newer languages with something a little more mature (I’ll spare y’all the REXX version).
The first step in the process was to decide what the objects were, and what I needed to do to them. In a nutshell, the exercise is really just asking for an array object containing all the data from the three files, merged into one list. The things we need to do to the list are: reorganize the columns, sort the rows, and reformat the date strings.
But not all of these things need to be methods. Printing the data in various ways could just be done in an application (or script) that uses the object. But I didn’t have this in mind when I started. You’ll see that the Ruby version has a sort, a print(format), and various forms of get methods. The Python version, which I wrote next, also has a basic get, a sort, and a print(format) method. But with the Perl version, I pared the class down to nothing more than a get and a sort method, and did my print layout in the execution script.
So, the basic class would look like:
Class: DataParser
Method: get
Method: sort
Method: print (sometimes omitted)
Notes:
It turns out there are a few bits and bobs missing from each language (as I expected would be the case). First, I couldn’t figure out how to tell the date formatters to use slashes instead of dashes on the returned dates. So, in all three, I had to roll my own. Second, I discovered that, even after all these years, Perl still has no trim (“strip”) method on strings. That’s just bizarre.
Again, I had to roll my own.
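For what it's worth, Python's datetime.strftime does accept literal slashes in its format string, so a hand-rolled formatter may not be strictly necessary there. A minimal sketch, assuming the two input formats used in this exercise; the helper name to_slash_date is made up for illustration:

```python
from datetime import datetime

def to_slash_date(raw, input_format):
    """Parse raw using input_format and return it as MM/DD/YYYY."""
    # strptime accepts months and days without leading zeros;
    # strftime zero-pads them on the way out.
    return datetime.strptime(raw, input_format).strftime("%m/%d/%Y")

print(to_slash_date("2-13-1943", "%m-%d-%Y"))  # dash-delimited input
print(to_slash_date("4/23/1967", "%m/%d/%Y"))  # slash-delimited input
```

The same idea should carry over to Ruby's Date#strftime, though I have only sketched the Python side here.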
Classes and Methods
One of the more confounding differences between the dialects is the way in which code blocks are bundled in each language. To me, Python has the most straightforward scheme: Package, Module, Class, Method/Function. I used to complain about the strict spacing requirement in Python, but I’ve actually found it to be a helpful tool when trying to read my code. How do I know which definitions are methods and which are functions? If they are indented under the current class, they are methods owned by it. What’s more, just about anything can be easily imported into any other module in Python, just by including its directory in your PYTHONPATH and importing the specific classes you want. With Ruby and Perl, getting things recognized was a bit of a chore.
File Handling
A feature I really appreciated in both Python and Perl was the ability to abstract directory lookups and filespec construction (os.path.join and glob in Python, File::Spec in Perl). This makes it easy to port those scripts to other (non-*nix) environments. This is probably possible in Ruby as well. I just haven’t had the time to comb over the Ruby version with the best ideas I culled from Python and Perl.
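As a rough illustration of that abstraction on the Python side, here is a small sketch that builds the search pattern with os.path.join rather than hard-coding a path separator. The function name list_input_files and the file names are invented for the demonstration:

```python
import glob
import os
import tempfile

def list_input_files(input_dir, file_mask):
    """Return the sorted paths in input_dir that match file_mask."""
    # os.path.join inserts whatever separator the host platform uses.
    return sorted(glob.glob(os.path.join(input_dir, file_mask)))

# Demonstration against a throwaway directory (file names are invented).
demo_dir = tempfile.mkdtemp()
for name in ("comma.txt", "pipe.txt", "space.txt", "notes.md"):
    open(os.path.join(demo_dir, name), "w").close()

found = [os.path.basename(p) for p in list_input_files(demo_dir, "*.txt")]
print(found)  # the three .txt files; notes.md is filtered out by the mask
```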
Data Manipulation
Hands-down, the easiest language with which to build and sort arrays was Ruby. The selection of methods available for manipulating strings and tabular data in Ruby is breathtaking! Python and Perl, by comparison, were quite a bit more difficult because of the esoterics necessary to make the sorts work the way I expected them to (lambdas in Python, and the need for nested comparisons in Perl).
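To make the Python side of that comparison concrete, here is a hedged sketch of the three required sort orders using key functions only. The sample rows are invented, and sort_records here is a standalone function for illustration rather than the method from my class. Note that the date sort parses the whole date, so two people born in the same year still come out in the right order:

```python
from datetime import datetime

# Invented sample rows: last, first, gender, date of birth, favorite color.
records = [
    ["Clark", "Ann", "Female", "04/02/1979", "Green"],
    ["Adams", "Bob", "Male", "12/01/1975", "Red"],
    ["Baker", "Cy", "Male", "03/03/1985", "Blue"],
    ["Davis", "Dee", "Female", "06/03/1975", "Tan"],
]

def sort_records(rows, sort_type):
    """Return a new list sorted according to the exercise's three outputs."""
    if sort_type == 1:
        # "Female" sorts before "Male" alphabetically, then by last name.
        return sorted(rows, key=lambda r: (r[2], r[0]))
    if sort_type == 2:
        # Compare real dates rather than strings or years alone.
        return sorted(rows, key=lambda r: datetime.strptime(r[3], "%m/%d/%Y"))
    if sort_type == 3:
        return sorted(rows, key=lambda r: r[0], reverse=True)
    raise ValueError("Invalid Sort Type")

for sort_type in (1, 2, 3):
    print(sort_type, [r[0] for r in sort_records(records, sort_type)])
```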
The Python and Ruby standard libraries come equipped with a lot of intuitive, easy-to-use methods for which the analogous functionality in Perl requires a lot of deciphering, and a lot of building from scratch. So, I gave up and decided to cheat on the Perl version: I used a few of the most commonly accepted add-ons for Perl. One of those was a library that dynamically identifies the field separator in CSV files. While it was easy to use, and put little extra burden on the script as a whole, you can see that it didn’t save me much typing when compared to the Python version of the script. In hindsight, I probably could have done a grep-like lookup the way I did in the Python version.
The other thing the Perl version taught me was that I probably didn’t need a separate date formatting function to accomplish what I was trying to do with the raw data. As you can see there, with no method calls at all (except for basic string manipulation), I was able to get the date format I wanted in one small line of code.
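One way the per-file differences (column order and date format) might be collapsed into a single lookup is a small table keyed on the keyword in the file name. This is only a sketch under the assumption that file names contain "comma", "space", or "pipe", as they do in my data directory; LAYOUTS, layout_for, and parse_row are names I made up for it, and the sample row is invented:

```python
import re

# keyword in file name -> (input date format, column order)
LAYOUTS = {
    "comma": ("%m/%d/%Y", ["last", "first", "gender", "color", "dob"]),
    "space": ("%m-%d-%Y", ["last", "first", "middle", "gender", "dob", "color"]),
    "pipe": ("%m-%d-%Y", ["last", "first", "middle", "gender", "color", "dob"]),
}

def layout_for(filename):
    """Pick the layout whose keyword appears in the file name."""
    for keyword, layout in LAYOUTS.items():
        if keyword in filename:
            return layout
    raise ValueError(filename + " is not an acceptable input file.")

def parse_row(row, columns):
    """Split a row on any of the known delimiters and label the fields."""
    values = [v for v in re.split(r'[ ,|;"\t]+', row.strip()) if v]
    return dict(zip(columns, values))

datefmt, columns = layout_for("pipe.txt")
print(parse_row("Smith | Steve | D | M | Red | 3-3-1985\n", columns))
```

The trade-off is that adding a fourth file type becomes a one-line change to the table instead of another branch.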
Comments, Doc, and Testing
One of the nicest features of Python is the inclusion of a doc generator that works from docstrings. Ruby and Perl offer the same as add-ons, but it came out of the box with my instance of Python 2.6.4. With both Python and Ruby, setting up unit tests for these classes was also a snap, as both languages offer that out of the box as well. With Perl, setting that up was a good deal more complicated.
And, while I’m on the subject, I did indeed include a set of unit tests in my experiment, for both the Python and the Ruby versions. But I’m not going to address those here, as I want to do a full separate post on testing later.
The (Somewhat) Finished Products
So, without further ado, here is the code —
First, Ruby:
###
# REQUIRED
###
require 'date'
###
# FUNCTIONS
###
# Converts an ISO date string (YYYY-MM-DD) to MM/DD/YYYY.
def format_date(date)
  year, month, day = date.split(/-/) # splits date into 3 variables
  # Leading zeros are kept so the result matches the required MM/DD/YYYY.
  return month + "/" + day + "/" + year # concatenates and returns date
end
module CSV_Processor
  class DataParser
    attr_accessor :input_dir, :file_mask
    def initialize(input_dir, file_mask)
      @all_records = []
      @input_dir = input_dir
      @file_mask = file_mask
      Dir[@input_dir + "/" + @file_mask].each {|fname|
        case
        when fname.match('comma') then sep = ','
        when fname.match('space') then sep = ' '
        when fname.match('pipe') then sep = '|'
        # in case a stray file enters the data directory
        else raise "Unable to identify delimiter for " + fname
        end
        # In case we can't open the file.
        begin
          f = File.open(fname,'r')
        rescue StandardError => e
          puts e.message
          exit
        end
        f.each_line {|row|
          fields = row.split(sep).collect {|x| x.chomp().strip}
          case
          when fname.match('comma')
            last, first, gender, color, dobraw = fields
            dob = format_date(Date.strptime(dobraw,"%m/%d/%Y").to_s)
          when fname.match('space')
            last, first, unused, gender, dobraw, color = fields
            dob = format_date(Date.strptime(dobraw,"%m-%d-%Y").to_s)
          when fname.match('pipe')
            last, first, unused, gender, color, dobraw = fields
            dob = format_date(Date.strptime(dobraw,"%m-%d-%Y").to_s)
          else raise "Invalid File Type."
          end #case
          gender = 'Male' if gender == 'M'
          gender = 'Female' if gender == 'F'
          out_record = [last,first,gender,dob,color]
          @all_records.push(out_record)
        }#each line
        f.close
      } #foreach file
    end#initialize
    def sort_records(sort_type)
      case
      # 1: gender (Female before Male), then last name ascending
      when sort_type == 1 then @all_records.sort_by{|e| [e[2],e[0]]}
      # 2: full date of birth ascending (not just the year)
      when sort_type == 2 then @all_records.sort_by{|e| Date.strptime(e[3],"%m/%d/%Y")}
      # 3: last name descending
      when sort_type == 3 then @all_records.sort_by{|e| e[0]}.reverse
      else raise "Invalid Sort Type."
      end#case
    end#sort_records
    def format_records(sort_type)
      print_records = []
      print_records << "Last, First \t Gender \t Date of Birth \t Favorite Color \n"
      print_records << "----------- \t ----------- \t ------------- \t -------------- \n"
      sort_records(sort_type).each {|line|
        print_records << "#{line[0]}, #{line[1]} \t #{line[2]} \t #{line[3]} \t #{line[4]} \n"
      }
      # join rather than to_s: on Ruby 1.9+ Array#to_s returns the inspect form
      print_records.join
    end
    def to_array()
      @all_records
    end
    def to_s()
      # The raw accumulated total of all records in all files, without sorting,
      # But with data formatting.
      @all_records.to_s
    end#to_s
    def dir_to_s()
      @input_dir + "/" + @file_mask
    end
  end#class
end#module
And next, the Python version:
###
# REQUIRED / IMPORTS
###
import os, glob #needed for the directory listing
from datetime import datetime #needed for the date parsing/formatting
import re #regex, needed for the delimiter parsing
###
# FUNCTIONS
###
def format_date(datestr,fmtstr):
    """format_date: reconfigure the normal dash syntax date formatting, to
    slash syntax, and reorder the elements to conform to the exercise requirements
    @param datestr: the input datestr from the data file
    @param fmtstr: the format
    """
    pydate = str(datetime.strptime(datestr, fmtstr)).split()[0] #python adds the time
    datechunks = pydate.split("-")
    return datechunks[1] + "/" + datechunks[2] + "/" + datechunks[0]
class DataParser(object):
    """DataParser: Parses and outputs data from various delimited data files."""
    def __init__(self, input_dir, file_mask):
        self._input_dir = input_dir
        self._file_mask = file_mask
        self._all_records = []
        for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
            #couldn't quite figure out a way to make this a single block
            #(rather than three separate if/elifs). But you can see the split is
            #generalized already, so if anyone can come up with a better way,
            #I'm all ears!!
            with open(infile,'r') as datafile: #closes the file when done
                for row in datafile:
                    if not row.strip(): continue #skip blank lines
                    fields = [x.strip() for x in re.split(r'[ ,|;"\t]+', row.strip())]
                    if infile.find('comma') > -1:
                        datefmt = "%m/%d/%Y"
                        last, first, gender, color, dobraw = fields
                    elif infile.find('space') > -1:
                        datefmt = "%m-%d-%Y"
                        last, first, unused, gender, dobraw, color = fields
                    elif infile.find('pipe') > -1:
                        datefmt = "%m-%d-%Y"
                        last, first, unused, gender, color, dobraw = fields
                    #There is also a way to do this with csv.Sniffer, but the
                    #spaces around the pipe delimiter also confuse sniffer, so
                    #I couldn't use it.
                    else: raise ValueError(infile + " is not an acceptable input file.")
                    dob = format_date(dobraw,datefmt)
                    if gender == 'M': gender = 'Male'
                    if gender == 'F': gender = 'Female'
                    self._all_records.append([last,first,gender,dob,color])
    def get_records(self):
        return self._all_records
    def sort_records(self,sort_type):
        self._sort_type = sort_type
        #By gender ascending, then by last-name ascending, using key sort
        #Unlike Ruby, this can be extended to sort every element hierarchically
        #I stopped at sorting by gender and name, to meet the exercise requirements.
        if sort_type == 1: self._all_records.sort(key=lambda row: (row[2],row[0]))
        #By date-of-birth ascending, using a key sort on the parsed date
        #(the whole date, not just the year; key= also works where cmp= does not)
        elif sort_type == 2: self._all_records.sort(key=lambda row: datetime.strptime(row[3], "%m/%d/%Y"))
        #By last-name descending, using reverse parm instead of reverse method, for efficiency
        elif sort_type == 3: self._all_records.sort(reverse=True)
        else: raise ValueError("Invalid Sort Type")
        return self._all_records
    def format_records(self,sort_type):
        self._print_records = []
        self._print_records.append("Last, First \tGender \tDate of Birth \tFavorite Color")
        self._print_records.append("----------- \t----------- \t------------- \t--------------")
        for record in self.sort_records(sort_type):
            self._print_records.append(record[0] + ", " + record[1] + " \t" + record[2] + "\t\t" + record[3] + "\t" + record[4])
        return self._print_records
And, for a bonus round, here’s the entire thing rewritten in Perl:
#!/usr/bin/perl -w
package DataParser;
use autodie;
use File::Spec;
use Text::CSV_XS;
use Text::CSV::Separator qw(get_separator);
sub new {
    my $class = shift;
    my $self = {
        _input_dir => shift,
        _file_mask => shift,
    };
    $self->{ALL_RECORDS} = []; #the output array
    # get the directory file list (note: _file_mask is stored but not applied here)
    opendir(my $dh, $self->{_input_dir}); #relies on autodie
    my @filelist = grep {$_ ne '.' && $_ ne '..'} readdir $dh;
    closedir $dh;
    # process each file in the directory
    foreach my $file (@filelist){
        my $fullfile = File::Spec->catfile( $self->{_input_dir}, $file );
        #Identify separators
        my @chars = get_separator( path => $fullfile,
                                   include => [" ","|",","],);
        my $sep;
        if (@chars) {
            if (@chars == 1) {$sep = $chars[0];} #first character in array
            else {$sep = $chars[1];} #second character in array
        } else { die "Couldn't detect the field separator in $fullfile: $!\n";}
        my $csv=Text::CSV_XS->new({ sep_char => $sep });
        #parse the file, and generate the new array
        open(my $fh,'<',$fullfile); #relies on autodie
        while (<$fh>){
            $csv->parse($_);
            my @columns = $csv->fields();
            my $last_name = trim($columns[0]);
            my $first_name = trim($columns[1]);
            my ($middle, $gender, $color, $dob);
            if ($sep eq " "){
                $middle = trim($columns[2]);
                $gender = trim($columns[3]);
                $dob = join('/', split(/-/, trim($columns[4])));
                $color = trim($columns[5]);
            } elsif ($sep eq "|"){
                $middle = trim($columns[2]);
                $gender = trim($columns[3]);
                $color = trim($columns[4]);
                $dob = join('/', split(/-/, trim($columns[5])));
            } elsif ($sep eq ","){
                $gender = trim($columns[2]);
                $color = trim($columns[3]);
                $dob = join('/', split(/\//, trim($columns[4])));
            }
            else{ die "Couldn't identify data columns in $fullfile: $!\n"}
            if ($gender eq 'M'){$gender='Male'}
            if ($gender eq 'F'){$gender='Female'}
            # each record is stored as an array reference
            push(@{$self->{ALL_RECORDS}}, [$last_name,$first_name,$gender,$color,$dob]);
        }
        close $fh;
    }
    bless ($self,$class);
    return $self;
}
sub get_records {
    my $self = shift;
    return @{ $self->{ALL_RECORDS} };
}
# Builds a sortable YYYYMMDD key from an M/D/YYYY date string.
sub date_key {
    my ($month, $day, $year) = split(/\//, shift);
    return sprintf("%04d%02d%02d", $year, $month, $day);
}
sub sort_records {
    my $self= shift;
    my $sort_type = shift;
    my @sorted_records;
    if ($sort_type == 1) {
        #By gender ascending, and then by last-name ascending.
        for my $record ( sort {($a->[2] cmp $b->[2]) || ($a->[0] cmp $b->[0])}
                         @{ $self->{ALL_RECORDS} } ) {
            push(@sorted_records,$record);
        }
    }
    elsif ($sort_type == 2) {
        #By date-of-birth ascending (full date, not just the year)
        for my $record ( sort {date_key($a->[4]) cmp date_key($b->[4])}
                         @{ $self->{ALL_RECORDS} } ) {
            push(@sorted_records,$record);
        }
    }
    elsif ($sort_type == 3) {
        #by last-name descending
        for my $record ( reverse sort { $a->[0] cmp $b->[0] }
                         @{ $self->{ALL_RECORDS} } ) {
            push(@sorted_records,$record);
        }
    }
    else {die "Sort Type $sort_type Not Valid: $!";}
    return @sorted_records;
}
# Shockingly, Perl still has no builtin method for this.
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;#trim the front
    $string =~ s/\s+$//;#trim the back
    return $string;
}
###############################################################################
# SHORT SCRIPT TO INSTANTIATE THE OBJECT, AND MANIPULATE THE DATA
###############################################################################
package main;
my $dp = DataParser->new("Data", "*.txt");
my @recds = $dp->get_records();
my $recds = @recds; # an array in scalar context gives the record count
print "\nTotal Unsorted Records: ",$recds,"\n";
my @sort_types = (1,2,3);
foreach my $type (@sort_types){
    print "\n--------------> SORT TYPE [",$type,"] <----------------------\n";
    print "Last, First \tGender \tColor \tBirthday\n";
    print "------------------\t------\t------\t-------------\n";
    my @sortrecs = $dp->sort_records($type);
    foreach(@sortrecs){
        print $_->[0], ", ", $_->[1]," \t", $_->[2], "\t", $_->[3], "\t", $_->[4], "\n";
    }
}