csvgrep

Easy and funny searches on text delimited files

Download .zip Download .tar.gz View on GitHub

csvgrep is a command line program which enables users to execute searches on text delimited files using a rudimentary query language. Its query language is very bounded to simplicity and expressivity (in order to be easily comprehensible). It is simple and easy-to-run that csvgrep is committed to be. It aims at replacing both grep and awk when you are challenged to retrieve information of a text delimited file based on the content of a specific field (or column), you can get what you want using the semantic already presented into the file's underlying structure. It is a safe pair of hands to quickly and precisely gather the information you need from a bunch of text files.

Using csvgrep

Simply type csvgrep -h to exhibit all the csvgrep's options:

Usage: csvgrep [OPTIONS] CODE [FILE]

OPTIONS:
    -a      [print-all] prints all fields, except with @hide
    -C      [check] checks whether the number of columns is constant
    -c      [close] puts one delimiter at the end of each output line
    -IS     [input-delimiter] sets the input delimiter
    -OS     [output-delimiter] sets the output delimiter
    -SS     [io-delimiter] sets input and output delimiters
    -f      [filename] prints input filename
    -N      [number] prints line numbers
    -p      [print-headers] prints headers
    -s      [headers-only] prints only headers
    -u      [unquote] unquote output fields
    -v      [verbose] verbose mode on
    -V      [version] prints csvgrep's version
    -h      [help] prints this message

---
Version 1.0.0
Mantained by Wilson Freitas
http://code.google.com/p/csvtools/

The CODE above stands for the expressions used by csvgrep to filter the desired content. These expressions will be discussed in detail in the next sections, we are going to start with several real world examples focusing on the understanding of each of the csvgrep's features.

The way csvgrep processes files

csvgrep processes a file exactly the same way grep does, line by line. Once we are expecting structured files, they must be structured in columns and also have to have a delimiter separating columns inside the file. Despite the file contains columns, when we are processing a line we are dealing with fields, because when we broke the file into lines each line is split into fields. These are the components that usually appear while a file is being processed by csvgrep.

The basic

As said before, csvgrep works pretty much like the well known grep command, so you have text delimited lines and want translate it into useful information. It's simple like that, that's the csvgrep's purpose, filter content on text delimited files. But where is the difference between csvgrep and grep? Since the file is structured in columns csvgrep can filter it based on the content of a specific column. Considering the file contacts.csv

$ cat contacts.csv 
Name, Email, Notes, IM, Phone
Wilson, wilson@popstar.com, smart guy, wndf, 555-5555
Lorie Cabucio, lorie@lovestory.com, Hot, lolo, 555-5435
Jeckill, jeck@monster.com, , dr.jeckill,
Nando, nando@nano.com, quant guy, nando,
Rafaela, rafa@popstar.com, the girl with the dragon tattoo, rafa, 676-9876
Andrea Martins, andy@popstar.com, likes Dave Mathews Band, andy, 576-0912

This file is structured in columns and may represent a contact list where the first line has the header (with the name of each field) and the following lines have valuable information, also organized in columns. In order to obtain the complete list of names we do

$ csvgrep '$1' contacts.csv
Name
Wilson
Lorie Cabucio
Jeckill
Nando
Rafaela
Andrea Martins

The expression $1 tells csvgrep to print the first column of contacts.csv and here we have an important thing being assumed. In that example we assumed that the delimiter used into contacts.csv is a single comma (,), if it didn't, csvgrep would print the whole file. For example, if we had other file called contacts2.csv, using | as delimiter, we would get the following output.

$ cat contacts2.csv 
Name| Email| Notes| IM| Phone
Wilson| wilson@popstar.com| smart guy| wndf| 555-5555
Lorie Cabucio| lorie@lovestory.com| Hot| lolo| 555-5435
Jeckill| jeck@monster.com| | dr.jeckill|
Nando| nando@nano.com| quant guy| nando|
Rafaela| rafa@popstar.com| the girl with the dragon tattoo| rafa| 676-9876
Andrea Martins| andy@popstar.com| likes Dave Mathews Band| andy| 576-0912

$ csvgrep '$1' contacts2.csv 
Name| Email| Notes| IM| Phone
Wilson| wilson@popstar.com| smart guy| wndf| 555-5555
Lorie Cabucio| lorie@lovestory.com| Hot| lolo| 555-5435
Jeckill| jeck@monster.com| | dr.jeckill|
Nando| nando@nano.com| quant guy| nando|
Rafaela| rafa@popstar.com| the girl with the dragon tattoo| rafa| 676-9876
Andrea Martins| andy@popstar.com| likes Dave Mathews Band| andy| 576-0912

So, for this file, we need to specify the input delimiter using the -I option.

$ csvgrep -I'\|' '$1' contacts2.csv 
Name
Wilson
Lorie Cabucio
Jeckill
Nando
Rafaela
Andrea Martins

Note that since input separator is interpreted as a regular expression in Perl, so if we use | as a delimiter it must be escaped \|. The default delimiter for both input and output is the comma character (,).

If we wanted to print also the second column we could simply add $2 to our expression to have it done. Therefore, when we write the expression, what we do is define which columns we want to print and this behavior reinforces the idea of being-like-grep. The columns into the expression are delimited by the semicolon character (;).

$ csvgrep '$1;$2' contacts.csv 
Name,Email
Wilson,wilson@popstar.com
Lorie Cabucio,lorie@lovestory.com
Jeckill,jeck@monster.com
Nando,nando@nano.com
Rafaela,rafa@popstar.com
Andrea Martins,andy@popstar.com

At this point we can see that the expression or either, the code, is formed by references to columns delimited by semicolons. So, once you wanted to print the other columns all you have to do is to declare their references. If you put a reference to an inexistent column, don't worry, csvgrep won't go off on you, an empty column is printed. For example, we know contacts.csv has 5 columns, so if we try to print the column $10 between $1 and $2, we get the following result

$ csvgrep '$1;$10;$2' contacts.csv 
Name,,Email
Wilson,,wilson@popstar.com
Lorie Cabucio,,lorie@lovestory.com
Jeckill,,jeck@monster.com
Nando,,nando@nano.com
Rafaela,,rafa@popstar.com
Andrea Martins,,andy@popstar.com

It doesn't make much sense now but who knows someday it might be useful, perhaps as placeholder generator, think about and if you figure out something, please tell me.

In the end, sometimes you may want to print all columns and it can be done through the option -a. I confess that it seems to be useless unless you apply a filter to at least one column, but if you want to change the file delimiter, it can be useful. Considering we intend to use semicolon (;) instead of comma (,), the whole file could be processed with the following command:

$ csvgrep -a -O';' '' contacts.csv
Name;Email;Notes;IM;Phone
Wilson;wilson@popstar.com;smart guy;wndf;555-5555
Lorie Cabucio;lorie@lovestory.com;Hot;lolo;555-5435
Jeckill;jeck@monster.com;;dr.jeckill;
Nando;nando@nano.com;quant guy;nando;
Rafaela;rafa@popstar.com;the girl with the dragon tattoo;rafa;676-9876
Andrea Martins;andy@popstar.com;likes Dave Mathews Band;andy;576-0912

I know it appears too much simple and of course it could be easily done with grep, fortunately that's the simpler case. csvgrep handles quoted fields and in such cases it is a pain to change the delimiters when they are supposed to appear inside a quoted field. Take a look at the example below:

$ cat quoted.csv 
Name, Comment
Hyde, "The guy who didn't sleep, do you believe"
Loren, "The power, the strength, the energy what was left to us"

$ csvgrep '$2' quoted.csv 
Comment
"The guy who didn't sleep, do you believe"
"The power, the strength, the energy what was left to us"

$ csvgrep -a '' quoted.csv 
Name,Comment
Hyde,"The guy who didn't sleep, do you believe"
Loren,"The power, the strength, the energy what was left to us"

$ csvgrep -a -O: '' quoted.csv 
Name:Comment
Hyde:The guy who didn't sleep, do you believe
Loren:The power, the strength, the energy what was left to us

Note that

  • The file was correctly parsed, even with delimiters inside quoted fields.
  • The fields which contain the delimiter inside were printed surrounded with the quotation marks. If we wanted to unquote the fields we should use the -u option (unquote).
  • An empty parameter was given. That is the expression parameter and since no filter is being applied it goes empty. You can also declare a reference to a column, but it would be ignored because of the -a option use. It seems a little weird or even unnecessary, but that's the way it is.

As we could see csvgrep is extremely simple to use and offers a lot of ways for solving annoying problems. On next sessions we will look deeply at csvgrep's syntax and its functionalities.

csvgrep expression

We saw that after its command-line options csvgrep receives one string delimited by single quotation marks ('). This parameter is the expression, or either, the code which tells to csvgrep what to do. This expression contains a list of rules separated by semicolons and each rule is formed by a statement that may be defined under the following pattern:

[EXP-FUNC] REFERENCE [EXP-OP EXP-VALUE]

where

  • REFERENCE: a reference in csvgrep is a variable which either refers to a field in a row or has its own value as we will see further is the situation of the special references. The section on [References][csvgrepReferences] below explains detailedly its use.
  • EXP-FUNC: are expression function that are flags attached to a rule in order to affect the way it works. Details on section [Expression functions][csvgrepExpressionFunctions].
  • EXP-OP: are the expression operator, since the intent is to process the file interacting with its fields, the csvgrep's operators offers different ways for achieving that target. A complete list of operators can be found at section [Operators][csvgrepOperators].
  • EXP-VALUE: once we use an operator we must pass it one value, so the expression value is the content defined by users in order to find the desired content. An explanation about this is on the same section of [Operators][csvgrepOperators].

So csvgrep has these 4 entities and we will see further how easy is to define them for filtering files content. Once understood these entities we will be able to define properly the rule, that in its essence is a boolean statement, so it's valued to true or false. If all rules in the expression are true the line is printed otherwise it isn't so. The next sections will go into details in order to clarify the usage of the expression and its components.

References

The reference is the most important component of the rule. When we declare a reference we are telling to csvgrep what we is important for us and what we want to see at the output. References are declared starting with a $ followed by:

  • a number ($N numbered references)
  • the name of column surrounded with curly braces (${Column Name} named references)
  • an identifier as in case of the special references that are variables updated while the lines are being processed.

Numbered references : The numbered references start with $1 which indicates the first column and go to infinity since csvgrep doesn't go off on you for using an invalid numbered references. : The reference $0 indicates the whole line being processed.

Named references : These references must match exactly the column name inside the curly braces. For example, the reference ${Email} represents the column in which the header has a field that matches Email. : We will see soon that the header must be defined with @header to properly use named references. : Once defined the header you can still used numbered references.

Special references : Can be considered runtime-variables that are alive during the file processing and are ready for being consumed. : $NR is the number of the line currently being processed. The lines start in 1. : $NF is the number of fields in the current line. : $#{.} or $#N evaluates to the length of the referenced text. : $?{.} or $?N evaluates the rule to false if the reference is an empty string, values 0.0 or equals /^false$/i

Operators

The operators are used to compare a field to a user defined value (expression value) for getting the rule evaluated as it does a boolean statement. At this moment the expression value cannot be a reference, it only accepts the types handled by csvgrep. csvgrep can handle 2 different types of variables: strings and numbers; and each type has its own set of operators. As defined on previous section, the reference might be followed by an operator and by an expression value, always in this order. We cannot put the expression value before the operator, it won't work. So, the expression values can be defined as:

  • simple text surrounded with double quotation marks ""
  • regular expression delimited by //
  • numbers since it matches any decimal number pattern
  • shell command surrounded with that is processed and the result is interpreted as text

Internally both "" and // are converted to regular expressions. Now we will see the operators and some examples on how to use them.

String operators : eq: evaluates to true if the field matches the regular expression : ne: evaluates to true if the field doesn't match the regular expression : in: evaluates to true if the text contains the field

Numeric operators : ==: equals : !=: not equals : > : greater than : >=: greater than or equals to : < : less than : <=: less than or equals to

Expression Functions

Sometimes you want that one specific rule behaves in a different way than just being a boolean statement. For achieving this target csvgrep uses expression functions that are just flags defined into a rule. Until now we have only 2 expression functions, so we are accepting suggestions.

Expression functions : @hide is used to omit the field at the output : @header is used to specify the header, when it is present in a rule, this rule is used to match the header. A common use is @header $NR == 1 telling csvgrep that the header is in the first line.

Using csvgrep

Let's get back to our contacts.csv and now we want to use the header for defining columns. To do so we simply use:

$ csvgrep '@header $NR == 1; ${Name}' contacts.csv 
Wilson
Lorie Cabucio
Jeckill
Nando
Rafaela
Andrea Martins

Note that I use ${Name} to refer to the column Name and the header line is no longer printed. If we wanted to print the header we should use the option -p. If I try to refer to a field using its name without to specify how to find the header (with @header), csvgrep raises a fatal error message. The reference $NR is valued to the current line number (it is similar to NR special variable in AWK), so we are informing that the header is in the first line.

$ csvgrep '${Name}' contacts.csv 
Fatal: The @header field must be defined at ../csvgrep.pl line 160, <INPUT_FILE> line 1.

We can define the header using other expressions

$ csvgrep '@header $1 eq "Name" ; ${Name}' contacts.csv
Wilson
Lorie Cabucio
Jeckill
Nando
Rafaela
Andrea Martins

But it's always important to be consistent with the header definition because the header is reset each time the header-rule is true.

We can go forward and see an example where a boolean expression is used to grep file's contents based on its columns, that's the most common use to csvgrep, at least in my opinion. Let's find the email of a girl with the name Lorie:

$ csvgrep '@header $NR == 1; ${Name} eq "Lorie" ; ${Email}' contacts.csv
Lorie Cabucio,lorie@lovestory.com

Since csvgrep compiles the quoted text as a Perl regular expression all types of regular expressions supported by Perl can be used within csvgrep.

We also have the operator ne (that negates the operator eq) and we can use it to find the contacts with non-empty phone number.

$ csvgrep '@header $NR == 1; ${Name} ; ${Phone} ne "^$"' contacts.csv 
Wilson,555-5555
Lorie Cabucio,555-5435
Rafaela,676-9876
Andrea Martins,576-0912

$ csvgrep '@header $NR == 1; ${Name} ; ${Phone} eq "^$"' contacts.csv 
Jeckill,
Nando,

csvgrep printed lines which the regular expression "^$" didn't match the phone number. The trailing spaces in the fields values are removed before it's processed, so blank spaced fields are empty for csvgrep.

Every column declare as a reference in one rule is printed, but sometimes it's interesting to omit that column from the output. It can be done using the expression function @hide.

$ csvgrep '@header $NR == 1; ${Name} ; @hide ${Phone} eq /^$/' contacts.csv 
Jeckill
Nando

In this example we are saying to csvgrep to hide the phone column, so it's not sent to the output and only the contacts that have an empty phone number were printed. We can stack as many expression as we want and build more sophisticated queries

$ csvgrep '@header $NR == 1; ${Name} ; ${Phone} ne /^$/ ; @hide ${Email} eq "@popstar" ; @hide ${Notes} eq "tattoo"' contacts.csv
Rafaela,676-9876

In some cases it is useful to find contents inside a list and for this task we have the in operator.

$ csvgrep '@header $NR == 1; ${IM} in "rafa lolo manu laura ana joana" ; ${Name} ; ${Phone}' contacts.csv
lolo,Lorie Cabucio,555-5435
rafa,Rafaela,676-9876

it could also be done with a regular expression

$ csvgrep '@header $NR == 1; ${IM} eq "rafa|lolo|manu|laura|ana|joana" ; ${Name} ; ${Phone}' contacts.csv
lolo,Lorie Cabucio,555-5435
rafa,Rafaela,676-9876

The in operator is most useful when applied to search for a content into a file.

$ csvgrep '@header $NR == 1; ${IM} in `cat names.txt` ; ${Name} ; ${Phone}' contacts.csv
lolo,Lorie Cabucio,555-5435
rafa,Rafaela,676-9876

where

$ cat names.txt 
rafa
lolo
manu
laura
ana

The string delimited by the grave accent is executed as a shell command and its output text is used as text into the expression value.


Let's take a look at another file.

Date,Open,High,Low,Close,Volume,Adj Close
2008-11-17,1494.74,1526.96,1481.70,1482.05,1831540000,1482.05
2008-11-14,1560.59,1587.76,1513.09,1516.85,2243750000,1516.85
2008-11-13,1503.06,1596.70,1428.54,1596.70,3009550000,1596.70
...
1971-02-09,100.76,100.76,100.76,100.76,000,100.76
1971-02-08,100.84,100.84,100.84,100.84,000,100.84
1971-02-05,100.00,100.00,100.00,100.00,000,100.00

This files was downloaded from Yahoo Finance and it has the stock values of some company. The dates range from 05/02/1971 to 11/17/2008 and we want to find the dates where the Close price has broken the $1000 level.

$ csvgrep '@header $NR == 1 ; $NR ; ${Date} ; ${Close} >= 999 ; @hide ${Close} <= 1001' table1.csv
3344,1995-08-10,1000.61
3355,1995-07-26,1000.18
3363,1995-07-14,999.33

We can see the on date 07/14/1995 the Close price was very close to $1000.

One important thing to mention about operators that handle numeric operands is the coercion rules:

  1. the column's content comes as text.
  2. the text is coerced to a number following Perl's coercion engine: any text that doesn't represents a number values 0.
  3. Missing fields are empty strings, so in a numeric context it values 0.

Let's suppose now that we want get all points after the break of $1000. We saw that the break had occurred approximately on 07/14/1995, so we can get all data after this date. This date is in the line number 3363, so as the data is listed with decreasing date, we want all lines with $NR less than 3363:

$ csvgrep '@header $NR == 1 ; $NR <= 3363 ; ${Date} ; ${Close}' table1.csv
2,2008-11-17,1482.05
3,2008-11-14,1516.85
4,2008-11-13,1596.70
5,2008-11-12,1499.21
...
3360,1995-07-19,952.87
3361,1995-07-18,988.53
3362,1995-07-17,1005.89
3363,1995-07-14,999.33

the line number is the first field at the output, to hide it use @hide within the first rule.


A simple way to quickly evaluate a field regarding whether or not it has a valid content is using the question mark reference $?N or $?{.}.

$ csvgrep -aN '' bool.csv 
1:Zero,0.0
2:Minus Zero,-0.0
3:False,false
4:True,true
5:One,1
6:One Decimal,1.0
7:Minus One Decimal,-1.0
8:Empty,

$ csvgrep '$NR ; $1 ; $?2' bool.csv 
4,True,true
5,One,1
6,One Decimal,1.0
7,Minus One Decimal,-1.0

$ csvgrep '$NR ; $?1 ; $2' bool.csv 
1,Zero,0.0
2,Minus Zero,-0.0
4,True,true
5,One,1
6,One Decimal,1.0
7,Minus One Decimal,-1.0
8,Empty,

As we can see all fields with

  • 0.0
  • false (ignoring case)
  • /^$/ (empty)

were evaluated to false.

Pay attention to the first command, it uses the option -N to print the line numbers.


For checking whether the number of fields remains constant in the file respecting the defined header or not, we have the option -C that returns 0 for the shell in case of success or 2 in case of failure indicating that at some point into the file the number of fields is different from the number of columns defined by the header.

$ csvgrep -C '@header $NR == 1' table1.csv

$ echo $?
0

$ csvgrep -C '@header $NR == 1' table2.csv 

$ echo $?
2

$ cat table2.csv 
Date,Open,High,Low,Close,Volume,Adj Close
2008-11-17,1494.74,1526.96,1481.70,1482.05,1831540000,1482.05
2008-11-14,1560.59,1587.76,1513.09,1516.85

Sometimes it is useful to put a delimiter at the end of each line sent to output. The option -c, close-line command, solves that problem.

$ csvgrep -c -O':' '@header $NR == 1; ${Name} ; ${Phone} ne "^$"' contacts.csv 
Wilson:555-5555:
Lorie Cabucio:555-5435:
Rafaela:676-9876:
Andrea Martins:576-0912:

As seen before, when the field contains a delimiter it is printed surrounded with quotation marks. In order to have that field unquoted we should use the option -u (unquote).

$ csvgrep -a '' quoted.csv 
Name,Comment
Hyde,"The guy who didn't sleep, do you believe"
Loren,"The power, the strength, the energy what was left to us"

$ csvgrep -au '' quoted.csv 
Name,Comment
Hyde,The guy who didn't sleep, do you believe
Loren,The power, the strength, the energy what was left to us

Though its simplicity csvgrep makes a great job solving some well defined problems when handling with structured files. I will continue looking forward to find ways to improve csvgrep.