Programming languages are the medium through which we turn our intentions into actions. Ideally, the choice of language to use should be a neutral decision. In particular, the language itself shouldn’t get in the way: easy things should be easy, obvious things should be obvious and it shouldn’t be too idiosyncratic. I am going to write a simple program in five different programming languages and comment on how easy and pleasant or otherwise it is to work with them.
The five languages chosen are, in alphabetical order: C++, Java, Node.js, Perl and Python. The program is a simplified version of the head utility. head prints the first few lines of its input or of one or more files passed on the command line. Our utility will have some restrictions compared to the real version: it will only process a single file and will not have the ability to print all but the last few lines. The problem is what I’d call bread-and-butter programming: reading files, processing them line-by-line and producing output. I don’t think that this is a problem designed to favour one particular language over another; this is the sort of activity that any language should be suited for. At the end, I will also compare performance of the different solutions, just for fun.
I will point out that if I were to write a real implementation of head, I wouldn’t read input one line at a time. I’d read input in conveniently sized chunks, search for newline characters and do whatever needed to be done. However, in the more general case, we want to treat a file as a sequence of lines rather than a bucket of bytes so this is what I’ll do in the different programs.
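For the curious, the chunk-based approach might look something like the following. This is only a minimal sketch in Python 3, not what any real head does internally; the function name head_chunked and the chunk_size parameter are my own inventions for illustration.

```python
import io


def head_chunked(stream, count=10, chunk_size=65536):
    """Return the bytes that make up the first `count` lines of a binary stream."""
    out = bytearray()
    remaining = count
    while remaining > 0:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of input reached before `count` newlines were seen
            break
        start = 0
        # Walk the newlines in this chunk until we have counted enough lines
        while remaining > 0:
            nl = chunk.find(b"\n", start)
            if nl == -1:
                break
            start = nl + 1
            remaining -= 1
        # Keep the whole chunk unless the final wanted line ended inside it
        out += chunk[:start] if remaining == 0 else chunk
    return bytes(out)


data = io.BytesIO(b"one\ntwo\nthree\n")
print(head_chunked(data, count=2, chunk_size=4))  # b'one\ntwo\n'
```

The deliberately tiny chunk_size in the usage line forces lines to straddle chunk boundaries, which is the case that makes this approach fiddlier than line-at-a-time reading.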
The Basics
The default behaviour of head is to print the first ten lines of the file passed on the command line so this is the behaviour that we’ll implement as a first cut. File errors will be reported on stderr and will cause an abnormal program exit. I have plans for when no arguments are passed on the command line so this condition will result in a normal program exit.
C++
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <cstring>
    using namespace std;

    const int lines = 10;

    int main(int argc, char **argv) {
      if (argc < 2) {
        return 0;
      }
      ios_base::sync_with_stdio(false);
      ifstream file(argv[1]);
      if (!file.is_open()) {
        cerr << "Unable to open " << argv[1] << ": " << strerror(errno)
          << "\n";
        return 1;
      }
      string line;
      for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
          cout << "\n";
        }
      }
      return 0;
    }
C++ is, by a small margin, the second most verbose of the solutions (585 bytes vs. 580 bytes for the Java version). The first few lines are fixed overhead, however, so I expect this to change. Note the following:
- Line 5: I wouldn’t do this in a complex project but in a simple program, not having to qualify everything as std::cout, std::ifstream and so on is a hassle-saver
- Line 13: Unless you explicitly disable the feature, C++ allows you to freely mix calls to the C++ native I/O objects, cin and cout, and legacy C stdio functions, scanf and printf. To make this possible, a lot of work is done under the hood to keep different stream pointers synchronized, which kills performance unless you turn this questionable feature off. I’ll show the performance benefit of this call at the end, but for the moment, -1 for requiring obscure knowledge and for having what is arguably a non-sensible default
- Lines 14-19: On the other hand, C++ does make easy things easy: opening a file and checking for an error condition is straightforward. We are also able to access the operating system error and we have full control of how the error is reported
- Line 21: We use the getline function from the <string> header. ifstream objects have a getline method as well but that one reads into a fixed-size buffer, so it does not handle arbitrarily long lines
- Lines 23-25: The action of getline is to read up to a newline, and then advance the read pointer past it. This means that we have to take care not to report a newline character that does not exist if we read to the end of the file. The real head utility doesn’t and I would consider doing so a bug
- Line 24: "\n" rather than endl since endl guarantees a flush of the underlying output handle which might hurt performance. Otherwise, it doesn’t have, say, a portability benefit over a newline character which will be translated to the line terminator pattern of the system automatically.
- Line 27: Although it’s not obvious, C++ file objects close the underlying handle in their destructors when they go out of scope. This means that we don’t have to explicitly call the close method. This pattern is known as RAII (Resource Acquisition Is Initialization) and is a nice feature of well-written C++ classes.
Java
    import java.io.*;

    class Head {
      private static final int lines = 10;

      public static void main(String[] args) {
        if (0 == args.length) {
          System.exit(0);
        }

        try {
          BufferedReader reader = new BufferedReader(new FileReader(args[0]));
          String line = null;

          for (int i = 0; (line = reader.readLine()) != null && i < lines; ++i) {
            System.out.println(line);
          }
        }
        catch(Exception e) {
          System.err.println(e);
          System.exit(1);
        }
      }
    }
Although the Java solution has fewer lines than the C++ one, this is down to lower fixed overhead (one import statement as opposed to four include statements) since otherwise it is more verbose and long-winded. Note the following:
- The compiled binary cannot be run directly from the command line. Instead, I have to invoke it like this:
$ java Head /path/to/file
- Line 4: <qualifier> static final; that’s a very long-winded way to say const
- Line 7: Java is the only language of my acquaintance that doesn’t treat 0 or null as false in an if statement. Hence, the explicit test
- Line 12: Java is the only language that does unbuffered file I/O by default which means that we need to wrap an object that reads files inside an object that does buffering. Wrapping a thing inside another thing to do useful work is an unpleasant aspect of programming in Java
- Line 16: Never mind the long-winded way of saying “print”, we have a bug that we cannot easily fix in the Java version. readLine behaves like the C++ getline in that it discards the newline and moves the file pointer. However, there is no way, as far as I can tell, to detect EOF other than readLine returning null, by which time we’ve already emitted an erroneous newline. We could get around this by using read and finding the newline characters ourselves, but then Java would fail the “easy things should be easy” requirement utterly. -1 for no way to detect EOF condition other than attempting to read from the handle again
- Line 20: Error comes out as something like “java.io.FileNotFoundException: doesnotex.ist (No such file or directory)” which is not very nice. Calling getMessage does not improve things greatly. -1 for limited control over error reporting
- Line 23: Although it’s not obvious, objects that hold filehandles do not close them automatically, at least not in a deterministic manner. When the object goes out of scope, it will be marked for garbage collection. When the object is garbage collected, the file will be closed but that might not happen for a long time. As it happens, exiting the program closes all filehandles anyway.
Exiting the program means that filehandles are closed, but if we had to close files explicitly, Java would force a horrible pattern on us:
    BufferedReader reader = null;
    try {
      reader = new BufferedReader(...);
      // Do stuff and then close when we're done
      reader.close();
      reader = null;
    }
    catch (IOException e) {
      // Handle error
    }
    finally {
      // Clean up if necessary
      if (reader != null) {
        try {
          reader.close();
        }
        catch (IOException e) {
          // close threw an error; what exactly am
          // I supposed to do with it?
        }
      }
    }
When I regularly programmed in Java, close throwing a checked exception used to drive me nuts!
Node.js
    #!/usr/bin/env node

    if (process.argv.length < 3) {
      process.exit();
    }

    var fs = require('fs'),
      readline = require('readline'),
      strm = fs.createReadStream(process.argv[2]),
      lines = 10,
      eof = false;

    strm.on("error", function(e) {
      console.error(e.message);
      process.exit(1);
    });
    strm.on("end", function() {
      eof = true;
    });

    var rd = readline.createInterface({
      input: strm,
      terminal: false
    });

    rd.on("line", function(line) {
      if (!eof) {
        line += "\n";
      }
      process.stdout.write(line);
      if (--lines === 0) {
        rd.close();
      }
    });

    rd.on("close", function() {
      process.exit();
    });
Node.js isn’t a programming language in itself, but rather a JavaScript engine taken out of the browser context and run as a script interpreter. This was by some measure the most difficult program to write which is reflected in both the greatest number of lines and the largest file size. There were also a couple of surprises and when it comes to software, surprising is never good. The twisted program structure is down to the way that Node.js works. It seems that file operations are done in a separate execution thread so that the main program thread is not blocked. Note the following:
- Line 3: First surprise. Conventionally, arguments to your script or program start at offset 0 (for example, Java and Perl) or offset 1 (for example, C/C++ and Python). For a script invoked by Node.js, “node” is the first argument followed by the script name so that the arguments for your script start at offset 2
- Line 13: This is rather a convoluted way of getting at a file error. -1 for being neither easy nor obvious
- Line 14: The error comes out as something like “ENOENT: no such file or directory, open ‘doesnotex.ist'” which is detailed but not very attractive. -1 for no control over the output
- Line 26: Again, strange but this seems to be the Node.js way. line does not include the newline
- Lines 27-29: This is how a non-existent newline is suppressed. The “end” event on the stream object fires before the last “line” event on the reader so that the eof flag gets set before we receive the last line
- Line 30: Long-winded way of saying “print”
- Line 32: As far as I can tell, this statement doesn’t actually close anything, since the “line” events keep on coming in. What it does do is cause the “close” event to be fired. This is extremely surprising behaviour and not terribly useful so -1
- Line 37: Need to exit the program to stop reading the file.
Perl
    #!/usr/bin/perl

    use strict;
    use warnings qw/all/;

    my $lines = 10;
    exit 0 unless (@ARGV);
    open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
    while (defined(my $line = <$fh>) && $lines--) {
      print($line);
    }
I must confess to having a dog in this particular fight since I enjoy programming in Perl. This sample should show why since it is the shortest and simplest solution by a long way, even taking into account a couple of lines of fixed overhead. Note the following:
- Line 7: This is idiomatic Perl. If you don’t like the “unless” idiom, you can rewrite this as if (!@ARGV)
- Line 8: open ... or die is also idiomatic. We can access the system error via the special variable $! and have full control over the error output
- Line 9: Perl has a dedicated operator, <>, to read lines of text from a filehandle. The line returned includes any newline character. Note that there’s a subtle bug that could bite us if the input ended in “0” without a newline. This would evaluate to false which means that we need the defined function to avoid this being a problem
- Line 10: Now that’s how to name a print function! The Perl print function doesn’t do anything funny to its input like adding a newline
- Line 11: It’s not obvious, but filehandles in Perl are closed when the last reference to them goes out of scope. This means that we do not need to explicitly call close.
Python
    #!/usr/bin/python
    import sys

    if len(sys.argv) < 2:
        sys.exit(0)
    lines = 10
    try:
        with open(sys.argv[1]) as f:
            for line in f:
                sys.stdout.write(line)
                lines -= 1
                if not lines:
                    break
    except IOError as e:
        sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
        sys.exit(1)
I’m not such a fan of programming in Python, although it’s perfectly pleasant to do so. The syntax is very different from the other four since C++, Java, JavaScript and Perl are all C-like languages while Python is somewhat idiosyncratic. Note the following:
- Line 7: Python forces a try...except structure, much like Java’s try...catch, since I/O operations throw rather than return error values. If you don’t catch exceptions, the error output is very ugly
- Line 8: Python objects are garbage collected which means that their destruction is non-deterministic. However, Python also offers an RAII pattern: the with keyword opens a “controlled execution” block which scopes the file object f. This ensures that the file is closed without having to explicitly close it
- Line 9: File objects are iterable which makes one-line-at-a-time processing extremely easy. The lines include a trailing newline character when present
- Line 10: Python does have a print function but that adds a newline character to the output which is behaviour that we don’t want
- Line 15: We have full control of error output and can access the actual system error conveniently.
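The newline handling that keeps cropping up in the notes above boils down to one question: does the read routine hand back the line terminator or not? The following is a rough Python 3 sketch of the two behaviours, with function names of my own choosing. It uses str.splitlines for brevity, which also splits on a few terminators besides "\n", so treat it as an illustration rather than a faithful model of any of the five programs.

```python
def head_stripping(text, count):
    # Mimics a strip-then-println loop: every line gets a newline appended,
    # whether or not the input had one.
    return "".join(line + "\n" for line in text.splitlines()[:count])


def head_keeping(text, count):
    # Keeps each line's own terminator, so the output is an exact prefix
    # of the input.
    return "".join(text.splitlines(keepends=True)[:count])


sample = "first\nsecond"  # note: no newline after the last line
print(repr(head_stripping(sample, 10)))  # 'first\nsecond\n'
print(repr(head_keeping(sample, 10)))    # 'first\nsecond'
```

The first function shows the spurious trailing newline that the Java version emits; the second shows why the Perl and Python versions get the right answer without any extra bookkeeping.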
Part 2: Command-line Options
Well behaved programs can have their behaviour changed by reading options from the command line. Options can be short form (-?) or long form (--help). When there are more than a couple of options, parsing them becomes laborious and tedious so we’ll want to use a library routine to do so when available. The options we’ll accept are:
- --help / -?: Print a usage message and exit
- --count / -n <number>: Print the first <number> lines of the file instead of 10
In addition, the real head utility can take, for example, -2 as shorthand for -n 2, so we’ll accept the same shorthand in place of -n 2 or --count 2.
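One way to deal with the -2 shorthand, sketched here in Python 3, is to rewrite it into the equivalent -n form before any option parser sees it. Note that this is a different technique from the one used in the programs below, which pull the number out and remove the argument; the function name is hypothetical.

```python
import re


def normalize_count_shorthand(argv):
    """Return a copy of argv with a leading -NUM rewritten as -n NUM."""
    args = list(argv)
    if args and re.fullmatch(r"-\d+", args[0]):
        # "-2" becomes the two arguments "-n" and "2"
        args[0:1] = ["-n", args[0][1:]]
    return args


print(normalize_count_shorthand(["-2", "notes.txt"]))  # ['-n', '2', 'notes.txt']
print(normalize_count_shorthand(["notes.txt"]))        # ['notes.txt']
```

The appeal of rewriting rather than removing is that validation of the count then lives in one place, namely wherever -n is handled.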
C++
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <cstring>
    #include <regex>
    #include <getopt.h>
    using namespace std;

    const int deflines = 10;

    void usage(const char *name, const char *msg) {
      if (msg) {
        cerr << msg << "\n";
      }
      cerr << "Usage:\n " << name << " [--count|-n <lines>] [FILE]\n";
    }

    int main(int argc, char **argv) {
      int lines = deflines, opt;
      const char *err = NULL;
      bool needHelp = false;
      string countOpt;
      regex numeric("\\d+");
      struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
      };
      if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
      }
      while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
          case 'n':
            if (regex_match(optarg, numeric)) {
              countOpt = optarg;
            }
            else {
              err = "Bad count argument";
              needHelp = true;
            }
            break;
          case '?':
          default:
            needHelp = true;
            break;
        }
      }
      if (needHelp) {
        usage(argv[0], err);
        return 1;
      }
      if (argc == optind) {
        return 0;
      }
      if (!countOpt.empty()) {
        lines = stoi(countOpt);
      }

      ios_base::sync_with_stdio(false);
      ifstream file(argv[optind]);
      if (!file.is_open()) {
        cerr << "Unable to open " << argv[optind] << ": " << strerror(errno)
          << "\n";
        return 1;
      }
      string line;
      for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
          cout << "\n";
        }
      }
      return 0;
    }
Ouch! Our program has just grown 3x bigger! GNU getopt is the standard for parsing command-line options and it’s part of the standard C/C++ library, libc. Windows developers will have a harder time since Windows has a different option syntax (for example, /n rather than -n) and there is no standard library routine that I know of for parsing options. We could also have used argp which provides additional bells and whistles (such as generating a usage message for you), but the overhead is higher and the learning curve somewhat steeper. Having paid the high cost of entry for option parsing, the cost of adding additional options is low: typically one variable, one entry in the options array and one case statement. Note the following:
- Lines 11-16: Using getopt requires us to write a usage function. If our program grew to take many options, the cost of using a more complex argument parser like popt or boost::program_options that takes care of generating a help screen may be worthwhile
- Line 23: C++11 gives us native regular expressions. Hurrah! This regex is used to test for numeric options
- Lines 24-28: Program options; note the “NULL terminator” at the end
- Line 29: Check for an option that looks like “-number”. Note that getopt will choke on this so we need to remove it from the arguments array
- Lines 30-33: Pointers for the win! While this is the sort of stuff that programmers unfamiliar with C/C++ find maddening, it is very much idiomatic and natural once you’re familiar with the basics. The result is that we splice the non-standard argument from the front of argv
- Line 35: getopt_long returns -1 when it has processed all the options that it knows how to. The index of the first unprocessed argument is available via optind
- Lines 56-58: I still have plans for when no file argument is supplied, so exit normally if this is the case
- Line 60: Override lines. stoi throws if fed bad input but we’ve already checked that input is good, so we don’t need to catch the exception
- Line 64: Our file argument is at optind, not 1.
Java
    import java.io.*;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    class Head {
      private static final int deflines = 10;

      private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
          System.err.println(msg);
        }
        System.err.println("Usage:\n java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
      }

      public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
        Pattern customOption = Pattern.compile("\\-(\\d+)"),
          numericOption = Pattern.compile("^\\d+");

        if (args.length > 0 && args[0].startsWith("-")) {
          Matcher m = customOption.matcher(args[0]);
          if (m.find()) {
            lines = Integer.parseInt(m.group(1));
            ++optIdx;
          }
        }
        for (; optIdx < args.length; ++optIdx) {
          if (args[optIdx].equals("--help") || args[optIdx].equals("-?")) {
            needHelp = true;
          }
          else if (args[optIdx].equals("--count") || args[optIdx].equals("-n")) {
            Matcher m = numericOption.matcher(args[optIdx + 1]);
            if (m.find()) {
              lines = Integer.parseInt(args[optIdx + 1]);
              ++optIdx;
            }
            else {
              err = "Bad count argument";
              needHelp = true;
              break;
            }
          }
          else {
            break;
          }
        }
        if (needHelp) {
          usage(err);
        }
        if (optIdx == args.length) {
          System.exit(0);
        }

        try {
          BufferedReader reader = new BufferedReader(new FileReader(args[optIdx]));
          String line = null;

          for (int i = 0; (line = reader.readLine()) != null && i < lines; ++i) {
            System.out.println(line);
          }
        }
        catch(Exception e) {
          System.err.println(e);
          System.exit(1);
        }
      }
    }
Java doesn’t come out of this round terribly well. Java lacks a native command-line option parse routine. Given that it has a library to parse X.509 certificates and given that I have to work with certificates far less often than I have to handle command-line options, one wonders why. There are several dozen option parsing libraries on GitHub that claim to be GNU getopt compatible. A popular choice seems to be args4j but then Java throws another obstacle in our way. We have an option that a standard options parser will choke on. In all the other languages under consideration we can make this a non-issue by removing the offending entry from the arguments array. In Java, arrays cannot be resized once created, so an element can’t simply be spliced out because, clearly, reasons. We could get round this by copying the array members that we want to keep to a container object then turning that back into an array or we could say that this is too much like pointless busywork for a recreational programming project, parse the options the hard way and mark Java down accordingly. Therefore, -2 for failing the “obvious should be obvious” and “the language itself shouldn’t get in the way” criteria. Even without the fixed overhead of using an options parser, the Java solution has overtaken the C++ one in terms of typing required if not in terms of line count. Note also the following:
- Lines 20-21: Regular expressions are somewhat painful to use in Java but by now I’m not surprised
- Lines 30-35: Custom option parsing code. Unlike the C++ solution, this will not scale at all well
- Line 31: Who in their right mind wouldn’t want ‘==’ to do the obvious thing here? The Java language designers could have made testing a String object against a string literal do the right thing yet they very deliberately chose not to. Instead, the ugly x.equals(y) pattern is required. Interestingly, an ‘==’ test compiles; it just compares object references rather than string contents, so it doesn’t work. -1 again for failing the “the language itself shouldn’t get in the way” criterion.
Node.js
    #!/usr/bin/env node

    var lines = 10,
      i,
      optIdx = 2,
      needHelp = false,
      err = null,
      args = process.argv;

    if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
      lines = parseInt(args[optIdx].substr(1));
      args.splice(optIdx, 1);
    }
    for (; optIdx < args.length; ++optIdx) {
      if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
      }
      else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
          lines = parseInt(args[optIdx + 1]);
          ++optIdx;
        }
        else {
          err = "Bad count argument";
          needHelp = true;
          break;
        }
      }
      else {
        break;
      }
    }
    if (needHelp) {
      if (err) {
        console.error(err);
      }
      console.error("Usage:\n " + args[1] + " [--count|-n <lines>] [FILE]");
      process.exit(1);
    }
    if (optIdx === args.length) {
      process.exit();
    }

    var fs = require('fs'),
      readline = require('readline'),
      strm = fs.createReadStream(args[optIdx]),
      eof = false;

    strm.on("error", function(e) {
      console.error(e.message);
      process.exit(1);
    });
    strm.on("end", function() {
      eof = true;
    });

    var rd = readline.createInterface({
      input: strm,
      terminal: false
    });

    rd.on("line", function(line) {
      if (!eof) {
        line += "\n";
      }
      process.stdout.write(line);
      if (--lines === 0) {
        rd.close();
      }
    });

    rd.on("close", function() {
      process.exit();
    });
Node.js also lacks a standard option parser, so -1. I tried out yargs which parses options OK but doesn’t appear to allow you to specify “-n” as a shortened alias for “--count”. Handling short options after yargs did its thing took exactly as many lines as handling the options myself. There isn’t a lot of point in pulling in a third party library if it’s not buying you anything. The JavaScript code for handling options is a direct port of the Java code. It is a lot more readable by virtue of, for example, regular expressions being first class data types and by the JavaScript syntax itself being less obtrusive.
Perl
    #!/usr/bin/perl

    =head1 SYNOPSIS

        headpl [--count|-n <lines>] [FILE]

    =cut

    use strict;
    use warnings qw/all/;
    use Getopt::Long;
    use Pod::Usage;

    my ($needHelp, $err, $lines) = (0, "", 10);

    if (@ARGV && $ARGV[0] =~ /\-(\d+)$/) {
      $lines = $1;
      shift(@ARGV);
    }
    GetOptions(
      "help|?" => \$needHelp,
      "count|n=s" => \$lines,
    );
    unless ($lines =~ /^\d+$/) {
      $err = "Bad count argument";
      $needHelp = 1;
    }

    pod2usage( exitval => 1, message => $err ) if ($needHelp);
    exit 0 unless (@ARGV);
    open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
    while (defined(my $line = <$fh>) && $lines--) {
      print($line);
    }
The Perl solution remains admirably compact despite being formatted for readability. As you can see, Perl’s reputation as a “write-only” language is undeserved. Readability or otherwise of Perl code is entirely down to the author. Note the following:
- Lines 3-7: This is POD (plain old documentation). The documentation serves double duty as the usage message which is very Perlish
- Lines 20-23: The getopt implementation in Perl is very elegant and has a low cost of entry. The function modifies the ARGV array so that what is left over represents non-option arguments. The arguments to GetOptions are pattern-reference pairs. We could replace the “big-arrow” (=>) operators with “,”, although the former is more idiomatic
- Line 22: We could have specified this option as “count|n=i” and then GetOptions would discard any non-numeric value with a warning, leaving $lines unmodified. However, the other solutions error on a non-numeric argument and since we need to reject a negative value anyway, I have chosen to check the value myself
- Line 29: pod2usage makes usage messages easy and your messages are as good as your documentation.
Python
    #!/usr/bin/python
    import sys
    import re
    import argparse

    def usage():
        return '''Usage:
     headpy [--count|-n <lines>] [FILE]
    '''

    lines = 10
    if len(sys.argv) > 1:
        customOption = re.compile(r"\-(\d+)")
        m = customOption.match(sys.argv[1])
        if m:
            lines = int(m.group(1))
            del sys.argv[1]
    parser = argparse.ArgumentParser(
        usage = usage(), add_help = False
    )
    parser.add_argument("-?", "--help", required = False, action="store_true")
    parser.add_argument("-n", "--count", required = False)
    parser.add_argument("argv", nargs = argparse.REMAINDER)
    args = parser.parse_args()
    err = None
    needHelp = args.help
    if args.count:
        numericOption = re.compile(r"^\d+$")
        m = numericOption.match(args.count)
        if m:
            lines = int(args.count)
        else:
            err = "Bad count argument"
            needHelp = True
    if needHelp:
        parser.error(err)
    if not args.argv:
        sys.exit(0)

    try:
        with open(args.argv[0]) as f:
            for line in f:
                sys.stdout.write(line)
                lines -= 1
                if not lines:
                    break
    except IOError as e:
        # Report the file argument, which is no longer guaranteed to be sys.argv[1]
        sys.stderr.write("Unable to read %s: %s\n" % (args.argv[0], e.strerror))
        sys.exit(1)
Python has a number of facilities for parsing options, including a getopt implementation. The module recommended in the pydoc is argparse. I wouldn’t say that using argparse is easy but it’s usable. Note the following:
- Line 16: Unlike Perl and JavaScript, Python requires an explicit cast to an integer. This halfway-house to strong typing makes Python a less friendly scripting language
- Lines 18-20: This is the code that constructs the parser. argparse has the ability to generate a help screen in response to -h/--help but I didn’t like the appearance of the help. These overrides, along with the usage function, make the usage output look similar to the output of the other four solutions
- Line 22: If we added “type = int” to the argument list, argparse would coerce the count value to an integer. However, as with the Perl version, we also want to reject negative values so I’m choosing to parse the value myself
- Line 23: This is a remarkably non-obvious way to get the option parser to not barf over non-option arguments, so -1. As coded, non-option arguments will appear in an array property of the parsed arguments called argv.
Part 3: Reading from stdin
A well behaved program that takes its input from a file should also be able to read from stdin. This allows it to form part of a pipeline where the output of another program forms our program’s input. For example, we might want to take the output of the sort utility to show the top ten results. Conventionally, no file argument or an argument specified as “-” indicates that we should read from stdin.
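The convention is easy to capture in a small helper. The following is a sketch in Python 3 under my own assumptions: open_input and head are hypothetical names, and the stdin parameter exists only so that the example can be run without a real pipe.

```python
import contextlib
import io
import sys
from itertools import islice


def open_input(name, stdin=None):
    """Return a context manager for the named file, or for stdin.

    An empty name or "-" selects stdin.
    """
    if not name or name == "-":
        stream = sys.stdin if stdin is None else stdin
        # nullcontext leaves the stream open when the with-block ends
        return contextlib.nullcontext(stream)
    return open(name)


def head(stream, count, out):
    # islice stops after `count` lines without reading any further
    out.writelines(islice(stream, count))


fake_stdin = io.StringIO("3\n1\n2\n")
with open_input("-", stdin=fake_stdin) as stream:
    head(stream, 2, sys.stdout)
```

Wrapping stdin in a do-nothing context manager means the calling code can use the same with-block for both cases without closing a stream that it doesn’t own.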
C++
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <cstring>
    #include <regex>
    #include <getopt.h>
    using namespace std;

    const int deflines = 10;

    void usage(const char *name, const char *msg) {
      if (msg) {
        cerr << msg << "\n";
      }
      cerr << "Usage:\n " << name << " [--count|-n <lines>] [FILE]\n";
    }

    void printstream(istream &in, int lines) {
      string line;
      for (int i = 0; i < lines && getline(in, line); ++i) {
        cout << line;
        if (!in.eof()) {
          cout << "\n";
        }
      }
    }

    int main(int argc, char **argv) {
      int lines = deflines, opt;
      const char *err = NULL;
      bool needHelp = false;
      string countOpt;
      regex numeric("\\d+");
      struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
      };
      if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
      }
      while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
          case 'n':
            if (regex_match(optarg, numeric)) {
              countOpt = optarg;
            }
            else {
              err = "Bad count argument";
              needHelp = true;
            }
            break;
          case '?':
          default:
            needHelp = true;
            break;
        }
      }
      if (needHelp) {
        usage(argv[0], err);
        return 1;
      }
      if (!countOpt.empty()) {
        lines = stoi(countOpt);
      }

      ios_base::sync_with_stdio(false);
      string fName;
      if (argc > optind) {
        fName = argv[optind];
      }
      if (fName.empty() || "-" == fName) {
        printstream(cin, lines);
      }
      else {
        ifstream file(fName);
        if (!file.is_open()) {
          cerr << "Unable to open " << fName << ": " << strerror(errno) << "\n";
          return 1;
        }
        printstream(file, lines);
      }
      return 0;
    }
Nothing difficult here. There’s a slight hindrance in that stream objects cannot be assigned. For example, this wouldn’t work:
istream p; ... p = cin;
We could use pointers:
istream *p = NULL; ... p = &cin;
However, I chose to refactor, moving the code that does the actual work into a function that takes an istream reference. It is then a simple matter of calling it with the correct input stream object.
Java
    import java.io.*;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    class Head {
      private static final int deflines = 10;

      private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
          System.err.println(msg);
        }
        System.err.println("Usage:\n java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
      }

      public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
        Pattern customOption = Pattern.compile("\\-(\\d+)"),
          numericOption = Pattern.compile("^\\d+");

        if (args.length > 0 && args[0].startsWith("-")) {
          Matcher m = customOption.matcher(args[0]);
          if (m.find()) {
            lines = Integer.parseInt(m.group(1));
            ++optIdx;
          }
        }
        for (; optIdx < args.length; ++optIdx) {
          if (args[optIdx].equals("--help") || args[optIdx].equals("-?")) {
            needHelp = true;
          }
          else if (args[optIdx].equals("--count") || args[optIdx].equals("-n")) {
            Matcher m = numericOption.matcher(args[optIdx + 1]);
            if (m.find()) {
              lines = Integer.parseInt(args[optIdx + 1]);
              ++optIdx;
            }
            else {
              err = "Bad count argument";
              needHelp = true;
              break;
            }
          }
          else {
            break;
          }
        }
        if (needHelp) {
          usage(err);
        }

        String fName = "";
        if (optIdx < args.length) {
          fName = args[optIdx];
        }
        try {
          Reader r = null;
          if (fName.isEmpty() || fName.equals("-")) {
            r = new InputStreamReader(System.in);
          }
          else {
            r = new FileReader(fName);
          }
          BufferedReader br = new BufferedReader(r);
          String line = null;

          for (int i = 0; (line = br.readLine()) != null && i < lines; ++i) {
            System.out.println(line);
          }
        }
        catch(Exception e) {
          System.err.println(e);
          System.exit(1);
        }
      }
    }
Java placed no obstacles in my way here. We just need to switch the Reader object that we use to construct the BufferedReader used to split the input into individual lines.
Node.js
    #!/usr/bin/env node

    var lines = 10,
      i,
      optIdx = 2,
      needHelp = false,
      err = null,
      args = process.argv;

    if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
      lines = parseInt(args[optIdx].substr(1));
      args.splice(optIdx, 1);
    }
    for (; optIdx < args.length; ++optIdx) {
      if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
      }
      else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
          lines = parseInt(args[optIdx + 1]);
          ++optIdx;
        }
        else {
          err = "Bad count argument";
          needHelp = true;
          break;
        }
      }
      else {
        break;
      }
    }
    if (needHelp) {
      if (err) {
        console.error(err);
      }
      console.error("Usage:\n " + args[1] + " [--count|-n <lines>] [FILE]");
      process.exit(1);
    }

    var fs = require('fs'),
      readline = require('readline'),
      fName = "",
      strm,
      eof = false;

    if (optIdx < args.length) {
      fName = args[optIdx];
    }
    if (!fName || fName === "-") {
      strm = process.stdin;
    }
    else {
      strm = fs.createReadStream(fName);
    }
    strm.on("error", function(e) {
      console.error(e.message);
      process.exit(1);
    });
    strm.on("end", function() {
      eof = true;
    });

    var rd = readline.createInterface({
      input: strm,
      terminal: false
    });

    rd.on("line", function(line) {
      if (!eof) {
        line += "\n";
      }
      process.stdout.write(line);
      if (--lines === 0) {
        rd.close();
      }
    });

    rd.on("close", function() {
      process.exit();
    });
Again, no problems. JavaScript’s weak typing makes this easier than the Java version since strm is just a reference to a thing that behaves in a stream-like manner.
Perl
#!/usr/bin/perl

=head1 SYNOPSIS

headpl [--count|-n <lines>] [FILE]

=cut

use strict;
use warnings qw/all/;

use Getopt::Long;
use Pod::Usage;

my ($needHelp, $err, $lines) = (0, "", 10);

if (@ARGV && $ARGV[0] =~ /\-(\d+)$/) {
    $lines = $1;
    shift(@ARGV);
}

GetOptions(
    "help|?" => \$needHelp,
    "count|n=s" => \$lines,
);

unless ($lines =~ /^\d+$/) {
    $err = "Bad count argument";
    $needHelp = 1;
}

pod2usage( exitval => 1, message => $err ) if ($needHelp);

my $fh = *STDIN;
if (@ARGV && $ARGV[0] ne "-") {
    open($fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
}

while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}
The Unix model that everything is a file applies to Perl: stdin is simply a filehandle, which means that all we have to do is change what $fh refers to. Note the following:
- Line 34 (my $fh = *STDIN;): The syntax may be unfamiliar. STDIN is an entry in the symbol table, but it doesn’t have a defined type (i.e., there isn’t a $STDIN or a %STDIN). *STDIN is called a typeglob and means “the thing that STDIN refers to”.
- Line 35 (the $ARGV[0] ne "-" test): Perl treats string values and numeric values interchangeably, depending on the context. Distinguishing numeric equality from string equality requires different operators: ==/!= for numeric equality, eq/ne for string equality.
Python
#!/usr/bin/python

import sys
import re
import argparse

def usage():
    return '''Usage:
    headpy[--count|-n <lines>] [FILE]
'''

def printstream(f, lines):
    for line in f:
        sys.stdout.write(line)
        lines -= 1
        if not lines:
            break

lines = 10
if len(sys.argv) > 1:
    customOption = re.compile(r"\-(\d+)")
    m = customOption.match(sys.argv[1])
    if m:
        lines = int(m.group(1))
        del sys.argv[1]

parser = argparse.ArgumentParser(
    usage = usage(),
    add_help = False
)
parser.add_argument("-?", "--help", required = False, action="store_true")
parser.add_argument("-n", "--count", required = False)
parser.add_argument("argv", nargs = argparse.REMAINDER)
args = parser.parse_args()

err = None
needHelp = args.help
if args.count:
    numericOption = re.compile(r"^\d+$")
    m = numericOption.match(args.count)
    if m:
        lines = int(args.count)
    else:
        err = "Bad count argument"
        needHelp = True

if needHelp:
    parser.error(err)

fName = None
if args.argv and args.argv[0] != "-":
    fName = args.argv[0]

try:
    if fName:
        with open(fName) as f:
            printstream(f, lines)
    else:
        printstream(sys.stdin, lines)
except IOError, e:
    sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
    sys.exit(1)
As with the C++ solution, I refactored the Python script to move the core functionality into its own function that takes some kind of iterable object. Unlike the C++ solution, I’m not seeing other ways that I could make this work. The with statement (the controlled execution block that manages the lifetime of the file object) is not something that I can change to use sys.stdin, so we’re stuck with having to treat a file and stdin differently. Note the following:
- The if args.argv and args.argv[0] != "-": test: Things like unfamiliar logical operators make programming in a language more difficult than it needs to be. Given that Python is written in C, would it have killed GvR to have used the familiar “&&”?
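For what it is worth, the two cases can probably be folded into a single with block. The following is a minimal sketch and not the script measured in this article: it assumes Python 3 (the article uses 2.7), and the names open_input, head_lines and print_head are hypothetical helpers introduced purely for illustration. The idea is a small context manager, built on contextlib.contextmanager, that closes a file it opened itself but leaves sys.stdin alone.

```python
import contextlib
import itertools
import sys


@contextlib.contextmanager
def open_input(name):
    """Yield a readable text stream for name.

    An empty name or "-" means standard input, which is handed back as-is
    and never closed; a real file is closed when the with block exits.
    """
    if not name or name == "-":
        yield sys.stdin
    else:
        f = open(name)
        try:
            yield f
        finally:
            f.close()


def head_lines(stream, count):
    """Return at most count lines from any iterable of lines."""
    return list(itertools.islice(stream, count))


def print_head(name, count, out=None):
    """Write the first count lines of the named input to out (default stdout)."""
    out = sys.stdout if out is None else out
    with open_input(name) as f:
        for line in head_lines(f, count):
            out.write(line)
```

With something like this in place, print_head("words.txt", 10) and print_head("-", 10) would go through the same code path, at the cost of one extra helper.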
Part 4: Relative Performance
This section is for fun. I believe that performance should be the primary criterion in only a few problem domains. Assuming your algorithms are good, most programs these days are fast enough. The correct target for early optimization is the programmer rather than the CPU, since CPU time, unlike programmer time, is always getting cheaper. Therefore, clearer code that takes a few more microseconds to execute is a worthwhile investment. That said, I/O is something you want to be fast since that is one of the performance limiters of any program.
The software versions are the ones hanging around on my MacBook:
$ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
$ java -version
java version "1.6.0_51"
$ node --version
v4.2.2
$ perl --version
This is perl 5, version 18, subversion 2 (v5.18.2) ...
$ python --version
Python 2.7.10
The C++ implementation was built as follows:
$ g++ -o headcpp -O3 -Wall head.cpp
To gauge performance, I ran the following for each implementation:
$ time for x in {1..10}; do <HEADCMD> -200000 </usr/share/dict/words >/dev/null; done
This means that we are reading two million lines with the various getline implementations. Redirecting output to /dev/null means that we are not left waiting on the terminal. For each implementation, I did three runs and took the best of the three.
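The measurement procedure could also be scripted. This is a rough sketch under stated assumptions, not what was actually run for the figures below: it assumes Python 3, the function names time_command and best_of are made up for illustration, and it measures wall-clock time only (the equivalent of the real figure from time), including the overhead of launching each process.

```python
import subprocess
import time


def time_command(cmd, input_path, loops=10):
    """Run cmd loops times with input_path on stdin; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(loops):
        with open(input_path, "rb") as src:
            # Discard output so that we never wait on the terminal.
            subprocess.run(cmd, stdin=src, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def best_of(cmd, input_path, runs=3, loops=10):
    """Repeat the whole measurement runs times and keep the fastest."""
    return min(time_command(cmd, input_path, loops) for _ in range(runs))
```

For example, best_of(["./headc", "-200000"], "/usr/share/dict/words") would mirror the shell loop above for one implementation. Taking the minimum rather than the mean is the same choice as "best of three": it discards runs disturbed by whatever else the machine happened to be doing.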
head
We’ll use the real head implementation for reference. As I said earlier, were I writing a serious implementation, I wouldn’t read one line at a time and neither does the actual head utility. Anyway:
real 0m0.296s
user 0m0.268s
sys 0m0.026s
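The chunked approach mentioned above (read a conveniently sized block, search it for newline characters, stop once enough have been seen) might look roughly like this. It is an illustrative sketch only and makes no claim about how the real head is written; it assumes Python 3 and binary streams, and the name head_chunked and the 64 KiB default chunk size are my own choices for the example.

```python
def head_chunked(src, dst, lines, chunk_size=65536):
    """Copy the first `lines` lines from binary stream src to dst.

    Input is read in fixed-size chunks and scanned for newline bytes,
    so no per-line objects are created.
    """
    remaining = lines
    while remaining > 0:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of input before we saw enough lines
        pos = 0
        while remaining > 0:
            nl = chunk.find(b"\n", pos)
            if nl == -1:
                # No further newline in this chunk: emit all of it and
                # carry on counting in the next chunk.
                pos = len(chunk)
                break
            pos = nl + 1
            remaining -= 1
        dst.write(chunk[:pos])
```

A caller would pass binary streams, for example sys.stdin.buffer and sys.stdout.buffer. The trade-off is the one described earlier: this treats the file as a bucket of bytes, which is fast but less convenient than a sequence of lines when each line needs processing.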
C++
real 0m4.848s
user 0m3.947s
sys 0m0.872s
Remember that there were a couple of source level optimizations. Let’s see how performance changes if we undo them. First, using endl instead of “\n”:
real 0m4.860s
user 0m3.978s
sys 0m0.856s
No meaningful difference, so not using endl was premature optimization. Now let’s see how keeping C++ streams synchronized with C stdio affects performance:
real 0m4.683s
user 0m3.841s
sys 0m0.819s
Again, no change. However, it did make a big difference on a Linux system, so this is an optimization worth keeping.
Java
real 0m8.358s
user 0m9.526s
sys 0m2.219s
Node.js
real 0m6.597s
user 0m5.592s
sys 0m1.099s
Perl
real 0m1.278s
user 0m1.140s
sys 0m0.106s
Python
real 0m1.448s
user 0m1.196s
sys 0m0.211s
The real surprise here is the performance of the C++ program, which is between three and four times slower than the Python and Perl programs. Not so surprising is that Java is the worst performer, taking around 1.7 times as long as the C++ version. This is despite Java compiling to bytecode and then using a JIT compiler to turn the bytecode into native code. The Node.js implementation lies somewhere between the C++ and Java implementations, which is not very impressive given that it, too, compiles to native code and considering also the demented event-driven file handling that was imposed on us. The Python and Perl performance is extremely impressive, not a million miles from the reference figures, with Perl just edging out its Dutch rival.
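To put numbers on those comparisons, the ratios can be worked out directly from the real times reported above. The snippet below is just that arithmetic, written for Python 3; the dictionary name REAL_SECONDS and the helper slowdown are hypothetical names used only for this illustration.

```python
# Best-of-three "real" times in seconds, copied from the runs above.
REAL_SECONDS = {
    "head": 0.296,
    "C++": 4.848,
    "Java": 8.358,
    "Node.js": 6.597,
    "Perl": 1.278,
    "Python": 1.448,
}


def slowdown(times, baseline):
    """Return each time divided by the baseline implementation's time."""
    base = times[baseline]
    return {name: seconds / base for name, seconds in times.items()}


if __name__ == "__main__":
    ratios = slowdown(REAL_SECONDS, "head")
    # Print fastest first, as a multiple of the reference head utility.
    for name, factor in sorted(ratios.items(), key=lambda kv: kv[1]):
        print("%-8s %5.1fx" % (name, factor))
```

Changing the baseline argument to "Perl" or "C++" gives the pairwise comparisons discussed in the paragraph above.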
I was so surprised at the terrible performance of the C++ program that I rewrote it as a straight C program using stdio:
#include <stdio.h>
#include <getopt.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define DEFLINES 10

void usage(const char *name, const char *msg) {
    if (msg) {
        fprintf(stderr, "%s\n", msg);
    }
    fprintf(stderr, "Usage:\n %s [--count|-n <lines>] [FILE]\n", name);
}

int numericOption(const char *s) {
    const char *p = s;
    for (; *p; ++p) {
        if (!isdigit(*p)) {
            return -1;
        }
    }
    return atoi(s);
}

int main(int argc, char **argv) {
    int lines = DEFLINES, needHelp = 0, opt, nOpt;
    size_t len = 0;
    ssize_t read;
    const char *err = NULL;
    char *fn = NULL, *line = NULL;
    FILE *fp = stdin;
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };

    if (1 < argc && argv[1][0] == '-') {
        nOpt = numericOption(&argv[1][1]);
        if (nOpt >= 0) {
            lines = nOpt;
            argv[1] = argv[0];
            ++argv;
            --argc;
        }
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
        case 'n':
            nOpt = numericOption(optarg);
            if (nOpt >= 0) {
                lines = nOpt;
            } else {
                err = "Bad count argument";
                needHelp = 1;
            }
            break;
        case '?':
        default:
            needHelp = 1;
            break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }

    if (argc > optind && strcmp(argv[optind], "-")) {
        fn = argv[optind];
    }
    if (fn) {
        fp = fopen(fn, "r");
        if (!fp) {
            fprintf(stderr, "Unable to open %s: %s\n", fn, strerror(errno));
            return 1;
        }
    }
    while ((read = getline(&line, &len, fp)) != -1 && lines--) {
        fputs(line, stdout);
    }
    free(line);
    if (fn) {
        fclose(fp);
    }
    return 0;
}
Let’s see how it performs:
$ time for x in {1..10}; do ./headc -200000 </usr/share/dict/words >/dev/null; done
real 0m0.306s
user 0m0.270s
sys 0m0.026s
That’s more like what you’d expect from a compiled binary and isn’t noticeably slower than the native head utility. Importantly, it’s around sixteen times faster than the C++ implementation (0.306s against 4.848s) and, given that the only real difference between the two is the I/O routines, I have to conclude that the performance of C++ stream I/O is rather dismal. The C++ code is measurably faster when reading the file directly rather than redirecting stdin, which points the finger at cin.
Part 5: Conclusions
C++ is a pleasant language to work with and the syntax is unobtrusive, if a little long-winded. The reward for the extra effort is that the compiled program runs at native speed. One goal of C++ is the “zero overhead principle”, by which it is meant that you won’t be able to get better performance by programming in another language. When it comes to I/O, however, that simply isn’t true. C++ stream I/O is not only much slower than the legacy C stdio but also slower than two scripting languages, Perl and Python. For bread-and-butter programming, plain C would appear to be the better choice.
In a former life, I spent around three years as a Java programmer. At the time it felt like a breath of fresh air, but given that I’d spent some previous years doing Windows development using COM, that’s not surprising. On reflection, the difference between them was like the difference between warm faeces and cold vomit. If forced to express an opinion, you might favour one over the other but you’d rather have neither in your mouth. Java is simply unpleasant to work with: fussy, verbose syntax combining the developer overhead of a statically-typed, compiled language with the poor performance of an interpreted one. Somewhere along the way, Java got entrenched as the “Enterprise” development language but I don’t quite understand how: if it’s no good at the basics, how can it really be any good for the enterprise? The supposed benefit of binary portability is questionable, since Perl and Python are every bit as portable and don’t have Java’s shortcomings. In fairness to Java, I will acknowledge that I am not using the latest and greatest version, although I doubt that a single version bump has suddenly made Java run quickly. I am also aware that Java 1.7 has introduced a “try-with-resources” statement that brings a touch of RAII to the Java world. This would have made working with Java slightly less unpleasant.
Programmers love novelty as much as the next person which is why, I’m guessing, Node.js has gained traction. Otherwise, I can’t quite see what it’s bringing to the table. Event-driven programming makes absolute sense for code running in the browser, since you’re responding to mouse clicks and form submissions and so on. It doesn’t make quite so much sense in a scripting language, adding needless complexity to basic tasks. Proponents would say that the problem domain for Node.js is not basic scripting but scalable network applications and, to be fair, a non-blocking event model makes a lot more sense for sockets than it does for files. However, I would reply that Python (with Twisted) and Perl (with POE) can do scalable network applications just fine and are really good at the basic stuff as well.
I presume that love of novelty is also why Perl has lost mindshare in the last decade or so. To many programmers, Perl is what that crotchety old Unix guy in the corner reaches for when awk runs out of steam. Perl is, indeed, the Unix philosophy applied to language design. But that’s a good thing, because it helps make Perl economical and elegant. Perl’s manifesto is that easy things should be easy and hard things should be possible, and it succeeds in its aims. Interestingly, the file size of the Perl solution is half that of its nearest rival, which translates to greater programmer productivity: if I have to write less, I’m going to get more done. It gets better. If you’ve ever read ‘Code Complete’, you’ll know that the number of defects per 1000 lines of code is roughly constant, so fewer lines of code means fewer defects. Get more done with fewer defects: who wouldn’t want that?
Python is also a very solid choice as a general-purpose programming language, being both economical and performant. There is a hidden cost, however: the very idiosyncratic syntax is a barrier to any programmer who has anything like a C background. Python fans have an adjective for idiomatic Python code, “pythonic”. If you’re not “pythonic”, writing in Python can be more of a chore than a pleasure. Given that performance compared to Perl is a wash and that it’s not as economical (file size is twice that of the Perl solution), I won’t be switching to Python any time soon. That said, if I had to use Python for the bulk of my work, I wouldn’t start looking for another job. You can’t say that about Java!
At the start of this article, I said that choice of programming language should be a neutral one. Of course, that is far from the truth. With the possible exception of target platform, choice of language is the single most important engineering decision you can make. Make the wrong choice and you’re halving team productivity while doubling the number of software defects. Don’t go for the latest fashionable craze (which this year is Rust) and try to avoid the sort of mindset that believes “enterprise” means Java. If raw speed is not a primary criterion (and generally, it isn’t), give very serious consideration to a good scripting language. You’ll be a happier engineer for doing so.