This website presents a corpus analysis of rock harmony. We analyzed 200 songs from Rolling Stone magazine's list of the 500 Greatest Songs of All Time. Each of us (TdC and DT) analyzed all 200 songs. The songs are analyzed in Roman numeral notation, showing the relationship of each chord to the current key. We used a recursive notation that allows a repeated pattern or section to be encoded as a single symbol. We wrote a program that expands such a "reduced" analysis into a list of chords, and we created tools for extracting aggregate statistics from such data.
Our aim in doing this was to gather statistics about patterns in rock harmony, such as the frequency of different chords and chord progressions. The results are reported in our article, "A Corpus Analysis of Rock Harmony" (Popular Music 30 [2011], 47-70) [PDF]. (The results in that article only reflect a 99-song subset of the 200-song set presented here.)
At this website, you can access
Our complete analyses (both TdC's and DT's)
Programs (for expanding an analysis into a chord list, comparing two chord lists, and extracting aggregate statistics from a chord list)
Documentation (for all of the above)
We analyzed 200 songs from the RS 500 list. This set was selected in a slightly complex way. First, we took the top 20 songs on the RS 500 list from each decade, the 1950s through the 1990s. (We did it this way to give a bit more chronological balance to the set, since the original list is somewhat skewed towards earlier decades.) One of these songs (Public Enemy's "Bring the Noise") was judged not to contain any triadic harmony, so we excluded it, leaving 99 songs. (This 99-song set is the set that we analyzed for the Popular Music article.) Then we added the highest-ranked 101 songs on the RS 500 list that were not in this 99-song set, thus creating a set of 200 songs.
Here is a list of the 200 songs we analyzed. The list is formatted: Artist, title, then the abbreviated title we used in filenames. An asterisk at the end indicates that the song was part of the 99-song set analyzed in the Popular Music article.
Each of us (DT and TdC) analyzed all 200 songs on the list. We did the analyses on our own, by ear, without consulting each other or any printed sources (e.g. lead sheets). When we were done, we compared our analyses. Any differences in meter or barlines were resolved (in this respect, the two sets of analyses are identical). Other differences were mostly not resolved. When a difference was clearly due to an error on one of our parts, we corrected it; but differences that reflected a real difference of opinion about the harmony (or key or form) were left standing. For more detail on our analytical process, see the Popular Music article.
With regard to chromatic root (the root of each chord in relation to the key), our analyses are in agreement 92.4% of the time. (This was calculated using the compare.pl script, available below.)
TdC's analyses of the 200 songs
(A tarred file of all 200 analyses)
DT's analyses of the 200 songs
(A tarred file of all 200 analyses)
Below are some programs we used for extracting aggregate data from our analyses. These programs are all described in the documentation below. Click on a link to see the code, or right-click (control-click) to download.
expand6.c
tally.pl
compare.pl
compare-meter.pl
trigram.pl
A. Overview
B. Basic Syntax
C. Chord Symbols
D. Special Symbols
E. Measures and Dots
F. Using the Expander Program
2. Extracting Aggregate Statistics
tally.pl
compare.pl
compare-meter.pl
trigram.pl
---------------
Our analyses use a recursive notation: the analysis of a section may be defined with a single symbol, and that symbol may then be used in a higher-level expression. For example, an analysis (for a hypothetical song) might look like this:
VP: I IV |
Vr: $VP $VP I ii | V |
Ch: I V | vi IV |
S: [C] $Vr $Ch $Vr $Ch $Ch I |
"VP" is a short (one-measure) harmonic progression, consisting of the chords I and IV; "Vr" (verse) contains two repetitions of VP, and some other chords; "Ch" (chorus) likewise contains a series of chords; and "S" (the entire song) contains a pattern of alternating verses and choruses, ending with a I chord. "[C]" indicates a key of C.
The program expand6 (which we call the expander) takes such a reduced analysis and expands it like this (for the reduced analysis above):
[C] I IV | I IV | I ii | V | I V | vi IV | I IV | I IV | I ii | V | I V | vi IV | I V | vi IV | I |
The expander can also output the above representation as a list of chords, on a timeline defined by measures. The following shows the "chord list" for the beginning of the song defined above. (The integers at right will be explained below; see section F.)
0.00 0.50 I 0 1 0 0
0.50 1.00 IV 5 4 0 5
1.00 1.50 I 0 1 0 0
1.50 2.00 IV 5 4 0 5
(etc.)
We also provide tools for extracting aggregate data from such a list.
In this documentation, we explain the syntax we use, how the expansion works, and the tools for extracting aggregate data.
An analysis file is a text file consisting of a series of rules, one on each line.
A rule consists of a left-hand-side (LHS) and a right-hand-side (RHS). The LHS consists of a string, followed by a colon. (Unless otherwise indicated, a "string" here implies any series of letters, numbers, or punctuation symbols, except for a few symbols with special meanings, described below.)
The RHS is a series of nonterminals, defined measures, and key/meter symbols. A nonterminal is a string preceded by "$"; each nonterminal in an RHS must be defined somewhere else as the LHS of a rule (where the $ is omitted). A "defined measure" is a series of one or more terminals (chord symbols or special symbols) followed by a barline "|". So in the second rule of the hypothetical song above (restated here)
Vr: $VP $VP I ii | V |
"I ii |" constitutes one defined measure; "V |" constitutes another. The following rule is invalid
Vr: | $VP $VP I vi | V
for two reasons: 1) it starts with a barline (which is not a defined measure or part of one), and 2) it ends with a harmonic symbol not followed by a barline (which is not a defined measure or part of one).
One exception is that, following a defined measure, another defined measure may be indicated with a single barline (meaning that the previously stated chord continues through the entire measure). So this is a valid rule:
Vr: $VP $VP I vi | V | |
A key/meter symbol is surrounded by square brackets and indicates key or meter; these will be discussed further below.
The top level symbol (representing the entire song) is assumed to be "S". The expander searches for the rule with S as the LHS, and then outputs its RHS, recursively expanding any nonterminals.
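The recursive expansion described above can be illustrated with a minimal Python sketch. This is not the code of expand6.c; the function names `parse_rules` and `expand` are hypothetical, and the sketch ignores repetitions ("*"), key/meter scoping, and error cases such as circular definitions.

```python
def parse_rules(text):
    """Map each LHS name to its RHS token list, dropping '%' comments."""
    rules = {}
    for line in text.splitlines():
        line = line.split("%", 1)[0].strip()
        if not line:
            continue
        lhs, rhs = line.split(":", 1)
        rules[lhs.strip()] = rhs.split()
    return rules


def expand(rules, name="S"):
    """Return the fully expanded token list for the rule called `name`."""
    if name not in rules:
        raise ValueError(f"undefined nonterminal: {name}")
    out = []
    for token in rules[name]:
        if token.startswith("$"):
            # A nonterminal: splice in the expansion of its own rule.
            out.extend(expand(rules, token[1:]))
        else:
            out.append(token)
    return out


ANALYSIS = """\
VP: I IV |
Vr: $VP $VP I ii | V |
Ch: I V | vi IV |
S: [C] $Vr $Ch $Vr $Ch $Ch I |
"""

print(" ".join(expand(parse_rules(ANALYSIS))))
```

Run on the hypothetical song from section A, this prints the same one-line expansion shown there.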
Note that the names of nonterminals are arbitrary; they have no meaning for any of the programs described below. However, we try to use meaningful symbols such as Vr for verse and Ch for chorus; in this way, the definition of S becomes a kind of formal analysis of the song.
Each chord symbol is assumed to have this syntax (using "regular expression" notation):
(RN).*(/RN)?
where .* may not contain a slash. RN is a Roman numeral, which must be one of the following: I #I bII II #II bIII III IV #IV bV V #V bVI VI #VI bVII VII (or the lower-case versions of these). (We assume upper-case for major triads, lower-case for minor triads.)
In other words: A harmonic label must begin with a Roman numeral symbol. After that, anything can happen (as long as it doesn't contain a slash); this is to allow all kinds of additional symbols such as "o", "7", "63", "b9", etc. After that, there may be an optional "/" plus Roman numeral to indicate an applied chord, e.g. "V7/IV".
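As an illustration, the chord-symbol syntax can be written as a concrete regular expression. The following Python sketch is an assumption about how one might do this, not the expander's actual parser; `parse_chord` is a hypothetical helper.

```python
import re

ALLOWED = "I #I bII II #II bIII III IV #IV bV V #V bVI VI #VI bVII VII".split()
# Accept the lower-case versions too, and try longer numerals first so that
# "bVII" is not read as "bVI" followed by a stray "I".
NUMERALS = sorted(ALLOWED + [rn.lower() for rn in ALLOWED], key=len, reverse=True)
RN = "(?:" + "|".join(re.escape(rn) for rn in NUMERALS) + ")"
CHORD = re.compile(rf"(?P<root>{RN})(?P<extra>[^/]*)(?:/(?P<applied>{RN}))?")


def parse_chord(symbol):
    """Split a chord symbol into (root, extra, applied); None if malformed."""
    m = CHORD.fullmatch(symbol)
    return (m["root"], m["extra"], m["applied"]) if m else None


print(parse_chord("V7/IV"))   # root V, extra "7", applied to IV
print(parse_chord("iih65"))   # root ii, extra "h65", not applied
```

The middle group `[^/]*` corresponds to the ".*" of the syntax above, with the restriction that it may not contain a slash.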
The portion of the chord symbol after the first Roman numeral (before the slash, if any) can be used to indicate what might be called "subcategorical" information about harmony, such as chord quality, inversion, and extensions. For the most part, our tools for aggregate data extraction look only at root and key, not at subcategorical information. And we did not attempt to fully standardize our treatment of subcategorical information in our analyses. However, we did agree on certain conventions, most of which are quite standard:
Inversions: 6 = first inversion triad, 64 = second inversion triad; 7 = root-position seventh chord, 65 = first inversion seventh chord, 43 = second inversion seventh chord, 42 = third inversion seventh chord
Triads: Upper-case for major triads, lower-case for minor triads, lower-case plus "o" for diminished, upper-case plus "a" for augmented.
Seventh chords: capital Roman numeral plus 7 (e.g. IV7) is a major seventh, lower-case Roman numeral plus 7 (e.g. iv7) is a minor seventh, capital RN plus d7 (e.g. Id7) is a dominant seventh, lower-case RN plus h7 is a half-diminished seventh, lower-case RN plus x7 is a fully diminished seventh. The exception is V7, which indicates a dominant seventh chord. Inversions may be used with any of these: for example, iih65 is a first-inversion half-diminished ii chord.
Miscellaneous: "s" indicates a suspended note: for example, "Vs4" indicates a triad with a suspended fourth (and no third). V11 indicates a IV triad over 5 in the bass. Other symbols may also be used occasionally.
A few symbols have special meanings.
The asterisk is used to represent repetitions of a nonterminal; for example, "$Ch*3" means Ch three times in a row. This notation may only be used with nonterminals and barlines (e.g. "|*3").
Strings surrounded by square brackets are key/meter symbols. Keys must be pitch names such as C or C#. (Major/minor distinctions are not recognized. For black-note keys, either of the two common spellings may be used, e.g. G# or Ab.) The time signature string must be N/D, where N is an integer 1-12 and D is 2, 4, 8, or 16. The S statement must start with a key symbol; a time signature symbol is optional (if no time signature is stated, 4/4 is assumed). Key and time signature statements may also be inserted in other RHS expressions (at the beginning or in the middle) to indicate changes of key and meter. (Time signature symbols may only occur at the beginning of a measure.) Key/meter symbols stated in a rule apply recursively to all descendant nonterminals, but may be overridden by a symbol stated in a descendant rule; at the end of the descendant span, the key/meter reverts to that stated in the parent rule.
'.' indicates the continuation of the previous chord. See section E below for explanation. This symbol may not be used at the beginning of an RHS.
'R' means a segment of "rest" that seems to have no harmony. This may occur at the beginning of the song (e.g. if there is an intro with just drums) or elsewhere. (In the chord-list output, R's are ignored; the previous chord is assumed to continue over them. However, R's are recognized as taking time at the beginning of the piece, e.g. if the analysis starts "R | I |" then the first chord statement is assumed to start at 1.0.)
'%' means that everything afterwards on that line is a comment. ('%' need not be at the beginning of a line.)
To summarize, the following are the symbols with special meanings:
':' must be used after the LHS of a rule, nowhere else.
'%' means that everything afterwards on that line is a comment.
'*' may only be used immediately following a nonterminal or barline, and must be immediately followed by an integer.
'$' must be used at the beginning of a string in an RHS expression that is defined elsewhere; it may not be used anywhere else.
'[' and ']' may be used around a key/meter symbol, nowhere else.
'.' (as a complete string) indicates the continuation of the previous chord, and may not be used at the beginning of an RHS.
'R' means rest and may be used anywhere that a chord symbol may be used.
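The asterisk notation can be sketched in Python as follows. This is only an illustration: `expand_repeats` is a hypothetical name, and the sketch assumes that "|*3" stands for three barlines in a row.

```python
def expand_repeats(tokens):
    """Replace each 'X*N' token with N copies of X."""
    out = []
    for token in tokens:
        base, star, count = token.partition("*")
        if not star:
            out.append(token)
            continue
        # '*' is only allowed directly after a nonterminal or a barline.
        if not (base == "|" or (base.startswith("$") and len(base) > 1)):
            raise ValueError(f"'*' must follow a nonterminal or barline: {token}")
        if not count.isdigit():
            raise ValueError(f"'*' must be followed by an integer: {token}")
        out.extend([base] * int(count))
    return out


print(expand_repeats("$Vr $Ch*3 I |".split()))
```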
The chords stated in a defined measure are assumed to partition the measure evenly. So this
I vi IV V |
indicates I in the first quarter of the measure, vi in the second, IV in the third and V in the fourth. (There is currently no check to ensure that the number of divisions of the measure makes sense given the time signature.)
For uneven divisions, the dot may be used, e.g.
I . IV V |
This implies that the I chord takes up the first half of the measure.
A dot has the same meaning as simply repeating the previous symbol. This may be done anywhere, even when it is redundant (except at the beginning of an RHS). So the following are all legal and equivalent:
I | |
I | . |
I I . . | . I . I |
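The even partition of a measure, and the treatment of dots, can be sketched in Python as below. This is a hedged illustration rather than the expander's code; `measure_segments` and its parameters are hypothetical, and time is counted in measures as in the chord list.

```python
from fractions import Fraction


def measure_segments(measure, index, previous=None):
    """Return [(start, end, chord)] for the symbols of one defined measure.

    `measure` holds the symbols before the barline, `index` is the measure's
    position on the timeline, and `previous` is the chord sounding before it.
    """
    symbols = measure.split() or ["."]  # a bare barline continues the chord
    step = Fraction(1, len(symbols))
    segments = []
    for i, symbol in enumerate(symbols):
        chord = previous if symbol == "." else symbol
        if chord is None:
            raise ValueError("'.' has no previous chord to continue")
        start = index + i * step
        if segments and segments[-1][2] == chord:
            # Same chord as the previous slot: extend that segment.
            segments[-1] = (segments[-1][0], start + step, chord)
        else:
            segments.append((start, start + step, chord))
        previous = chord
    return segments


for start, end, chord in measure_segments("I . IV V", 0):
    print(float(start), float(end), chord)
```

For "I . IV V |" the I chord spans the first half of the measure, and IV and V take one quarter each.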
The expander program takes an analysis file - a list of rules, written in the syntax defined above. It searches for the rule with "S" as the LHS, and then outputs its RHS, recursively expanding any nonterminals. If there is an error in the syntax or the analysis cannot be interpreted - for example, because a nonterminal symbol is used in a definition but not defined elsewhere - the program outputs an error message and quits.
The program is in C and requires a C compiler. Compile the program like this (in a Unix window, e.g. the Mac "terminal" window):
cc expand6.c -o expand6
Run it like this:
./expand6 -v [verbosity] [input file]
If verbosity=0, the output is just a "chord list" - a list of chord statements, like this:
0.00 4.00 I 0 1 4 4
4.00 6.00 IVb7 5 4 4 9
6.00 7.00 V7 7 5 4 11
7.00 12.00 I 0 1 4 4
---
(The "---" at the end is to separate one song from the next if multiple chord lists are concatenated.)
Each chord statement has the form
[start] [end] [Roman numeral] [chromatic root] [diatonic root] [key] [absolute root]
start = the start time of the chord segment, in relation to measures, e.g. 0.0 = start of m. 1, 0.5 = halfway point of m. 1, etc.
end = end time of chord segment
Roman numeral = complete chord label for chord, exactly as in input file
chromatic root = integer of root in relation to the current key, adjusted for applied chords (e.g. I=0, bII=1, II=2; V/ii = VI = 9)
diatonic root = diatonic category of chromatic root, e.g. VI = 6
key = integer of current tonic, e.g. C = 0, C#/Db = 1
absolute root = chromatic root + key, e.g. V in D = A = 9
(Note: When two successive chords have the same root and key, they are collapsed into a single chord in the chord list.)
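The four integer fields can be illustrated with a short Python sketch. It is not taken from expand6.c: the names are hypothetical, the wrap-around modulo 12 and the rule used for the diatonic root of an applied chord are assumptions chosen to reproduce the examples above (V/ii = VI, V in D = A = 9).

```python
SEMITONES = {"I": 0, "II": 2, "III": 4, "IV": 5, "V": 7, "VI": 9, "VII": 11}
DEGREES = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7}
KEYS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
        "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
        "Bb": 10, "B": 11}


def split_numeral(rn):
    """Return (semitones above the tonic, scale degree) for a bare numeral."""
    shift = {"#": 1, "b": -1}.get(rn[0], 0)
    base = rn.lstrip("#b").upper()
    return (SEMITONES[base] + shift) % 12, DEGREES[base]


def chord_roots(root, applied, key):
    """Return (chromatic root, diatonic root, key, absolute root)."""
    chromatic, degree = split_numeral(root)
    if applied:
        # An applied chord is measured from its target, e.g. V/ii = VI.
        target_pc, target_degree = split_numeral(applied)
        chromatic = (chromatic + target_pc) % 12
        degree = (degree + target_degree - 2) % 7 + 1
    tonic = KEYS[key]
    # Assumption: the absolute root wraps around the octave (modulo 12).
    return chromatic, degree, tonic, (chromatic + tonic) % 12


print(chord_roots("V", "ii", "C"))
print(chord_roots("IV", None, "E"))
```

The second call corresponds to the "IVb7" line of the sample chord list above (chromatic root 5, diatonic root 4, key 4, absolute root 9).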
If verbosity = 1, the output also includes the expanded one-line analysis shown in section A, and other information.
If verbosity = -1, the program simply outputs a list of measure numbers with the time signature for each measure. A time signature is represented with an integer, which is (100 x numerator) + denominator, e.g. 2/4 is 204.
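The integer encoding of time signatures is simple enough to show in a few lines of Python. The function names are hypothetical; the range checks follow the N/D restrictions given in section D.

```python
VALID_DENOMINATORS = {2, 4, 8, 16}


def encode_meter(signature):
    """Turn a string such as '3/4' into the integer code 304."""
    numerator, denominator = (int(part) for part in signature.split("/"))
    if not 1 <= numerator <= 12 or denominator not in VALID_DENOMINATORS:
        raise ValueError(f"unsupported time signature: {signature}")
    return 100 * numerator + denominator


def decode_meter(code):
    """Turn an integer code such as 204 back into the string '2/4'."""
    numerator, denominator = divmod(code, 100)
    return f"{numerator}/{denominator}"


print(encode_meter("2/4"), decode_meter(304))
```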
We provide some tools for extracting aggregate statistics about harmony from an analysis or series of analyses. Most of these tools assume input in the form of a "chord list", as output by the expander program described above; they can also take multiple concatenated chord lists (separated by "---"). (These programs were used to generate the statistics presented in the Popular Music paper.) All the programs described below are Perl scripts.
tally.pl. This script takes in a chord list and outputs aggregate data as follows. ("Time" is measured in terms of measures on the timeline, i.e. one measure is one unit.) Specify the input file on the command line, e.g. "./tally.pl [input-file]" (or pipe in using the UNIX "pipe" command).
1. Overall statistics: Total chord count, total time, number of major/minor/diminished/augmented chords (including all sevenths with whatever triad type they are based on), number of root-position/inverted chords.
2. The number of occurrences of each chromatic root
3. The total amount of time spent on each chromatic root
4. The count of each chromatic-root transition between one chord (the "antecedent") and the next (the "consequent"). (This assumes the same key for both chords; key-changing transitions are skipped.)
5. For each possible consequent chord, the proportional frequency of each antecedent chord
6. For each possible antecedent chord, the proportional frequency of each consequent chord
8. The distribution of chromatic root intervals. Pitch-class notation is used: each interval is represented by its size in semitones, and all intervals are assumed to be ascending. Thus, 0 is a repetition; 1 is an ascending minor second (or descending major seventh); 2 is an ascending major second (or descending minor seventh); etc.
9. The distribution of diatonic root intervals (so minor and major seconds are lumped together). In this case, each interval is represented by its smallest form. So "+M/m2" means an ascending major/minor second (or descending M/m seventh); "-M/m2" means a descending major/minor second (or ascending M/m seventh); etc.
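To make a few of these statistics concrete, here is a minimal Python sketch that computes root counts, time per root, and the chromatic root interval distribution from a chord list. It is not a port of tally.pl (which is a Perl script); `read_chord_list` and `tally` are hypothetical names, and the sketch assumes that transitions across a key change or across a "---" separator are skipped.

```python
from collections import Counter


def read_chord_list(text):
    """Yield one list of (start, end, chromatic root, key) tuples per song."""
    song = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "---":  # separator between concatenated chord lists
            yield song
            song = []
            continue
        song.append((float(fields[0]), float(fields[1]),
                     int(fields[3]), int(fields[5])))
    if song:
        yield song


def tally(text):
    """Return (count per root, time per root, count per ascending interval)."""
    counts, time_on, intervals = Counter(), Counter(), Counter()
    for song in read_chord_list(text):
        for start, end, root, _key in song:
            counts[root] += 1
            time_on[root] += end - start
        for (_, _, a, key_a), (_, _, b, key_b) in zip(song, song[1:]):
            if key_a == key_b:  # skip key-changing transitions
                intervals[(b - a) % 12] += 1
    return counts, time_on, intervals


SAMPLE = """\
0.00 4.00 I 0 1 4 4
4.00 6.00 IVb7 5 4 4 9
6.00 7.00 V7 7 5 4 11
7.00 12.00 I 0 1 4 4
---
"""

counts, time_on, intervals = tally(SAMPLE)
print(dict(counts), dict(time_on), dict(intervals))
```

In the sample, both I to IV and V to I count as interval 5, since all intervals are treated as ascending.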
compare.pl. This script takes two chord-list files (specified on the command line) and compares them, outputting the total amount of time for which they are in agreement.
If $v (verbosity) = 1, the program outputs parallel chord lists indicating differences. If $v=0, it just outputs the total number of measures found, the number of measures in agreement, and the latter as a proportion of the former. So the output "50.00 40.00 (0.800)" means, 50 measures were found; the analyses were in agreement on 40 measures; and 40 / 50 = 0.8.
The script requires that the start times of the two chord lists (i.e. the start times of their first chords) are the same. If the end times of the two lists (i.e. the end times of their final chords) are not the same, the behavior depends on the verbosity: if $v=0, the script simply outputs an error message and exits; if $v=1, it adjusts the earlier end time to match the later one and then does the comparison, but outputs a warning as well.
The script can be used to compare chromatic roots, absolute roots, or key, depending on the value of $cf ("compared feature"), set at the top of the code. If $cf = 3, it compares chromatic roots; $cf = 5, keys; $cf = 6, absolute roots.
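The agreement measure can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of compare.pl: the names are hypothetical, each chord list is assumed to cover its timeline without gaps, and mismatched start or end times are simply rejected (as the script does at verbosity 0).

```python
def parse_rows(text):
    """Split a chord list into rows of fields, skipping '---' separators."""
    return [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("---")]


def agreement(rows_a, rows_b, cf=3):
    """Return (total time, time in agreement, proportion) for field `cf`."""
    if (float(rows_a[0][0]) != float(rows_b[0][0])
            or float(rows_a[-1][1]) != float(rows_b[-1][1])):
        raise ValueError("chord lists must start and end at the same time")
    total = float(rows_a[-1][1]) - float(rows_a[0][0])
    agreed = 0.0
    for a in rows_a:
        for b in rows_b:
            # Length of the stretch during which both segments are sounding.
            overlap = min(float(a[1]), float(b[1])) - max(float(a[0]), float(b[0]))
            if overlap > 0 and a[cf] == b[cf]:
                agreed += overlap
    return total, agreed, agreed / total


LIST_A = "0.00 2.00 I 0 1 0 0\n2.00 4.00 IV 5 4 0 5\n"
LIST_B = "0.00 2.00 I 0 1 0 0\n2.00 3.00 IV 5 4 0 5\n3.00 4.00 V 7 5 0 7\n"

total, agreed, ratio = agreement(parse_rows(LIST_A), parse_rows(LIST_B))
print(f"{total:.2f} {agreed:.2f} ({ratio:.3f})")
```

The field index `cf` counts from zero, so 3 selects the chromatic root, 5 the key, and 6 the absolute root, matching the values listed above.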
compare-meter.pl. This script takes two chord-lists and compares their time signatures (using the time-signature list of the kind output by expand6 with verbosity = -1). If the lists are identical - the same number of measures, with the same time signature for each measure - it outputs "OK". If there are mismatches, it identifies them, e.g. "Mismatch on m. 56 (304, 404)".
(In our paper, we wanted to compare our harmonic analyses, but there didn't seem to be any point in doing this unless the analyses were identical metrically, i.e. with the barlines in the same places. So we used compare-meter.pl to check this before comparing the harmony.)
trigram.pl. This script takes a chord list and extracts "trigrams", sequences of three successive chromatic roots (all within the same key; trigrams spanning a key boundary are ignored). Basic Unix commands can then be used to get aggregate data, e.g.
./trigram.pl [input file] | sort | uniq -c | sort -nr
This gives you a list of all the trigram types, with counts, ranked by count.
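For readers who prefer to stay in one language, the same extract-and-count pipeline can be approximated in Python. This is a sketch, not a port of trigram.pl; `trigrams` is a hypothetical name, and the input is assumed to be the (chromatic root, key) pairs of one song's chord list.

```python
from collections import Counter


def trigrams(chords):
    """Return every run of three successive roots that share one key."""
    found = []
    for (a, key_a), (b, key_b), (c, key_c) in zip(chords, chords[1:], chords[2:]):
        if key_a == key_b == key_c:  # ignore trigrams spanning a key boundary
            found.append((a, b, c))
    return found


# Hypothetical song: I IV V I IV V I, all in key 0.
SONG = [(0, 0), (5, 0), (7, 0), (0, 0), (5, 0), (7, 0), (0, 0)]

# Counter.most_common plays the role of "sort | uniq -c | sort -nr".
for trigram, count in Counter(trigrams(SONG)).most_common():
    print(count, *trigram)
```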