Filtering for only the first instance of all duplicate items

If you’ve been using Excel for a long time (i.e., at least since Excel 2003), then you will know that the Filter tool has come a long way. It was already a very useful tool in the earlier versions of Excel (i.e., Excel 2003 and earlier), but it underwent a big makeover in Excel 2007, which was not without its problems, and Microsoft ironed those out in Excel 2010 to give us a really nice tool. But even now, it still has its shortcomings. I quite enjoyed Dick Kusleika’s posts on Daily Dose of Excel about “An Even better AutoFilter” (although I don’t quite see it as something that I will be adding to my armoury). See:

In this post, I will present another interesting use case that extends the capabilities of the filter.

Consider the following list (Figure 1). It presents the exported data from a great time management tool I use (ManicTime). The export is of the application-level data recorded by the program, exported as CSV and imported into Excel (I also modified the column arrangement). There are a lot of duplicates (quite understandably), such as those in E5, E7, E10 and E12. But what if I only wanted to see one instance of each item, regardless of whether it is duplicated or not?

Figure 1    Sample data list to be filtered

One method could be to make use of a helper column, but I thought it might be better to do without that.

Filtering for only duplicates or only singles is easy—just apply conditional formatting to highlight duplicate items, and then filter to show only those items with or without formatting, respectively. But filtering for only one of each item requires showing all the singles, as well as the first of all duplicates.

Working with ranges in dialogs

Illustrating this also allows me to highlight some tips for using the Range text box (I am actually not certain whether this is its “official” name, but you will see what I am referring to), which appears in numerous places, amongst others the New Formatting Rule dialog. The Range text box is identified by the Range Select tool (ditto about the “official” name), which appears to its right. Clicking on this tool shrinks the current dialog (and another click expands it). While the I-beam is in this text box, you can select a range on the sheet, and the address of the selected range is then entered into the text box. Essentially, clicking in the Range text box is akin to editing a cell’s contents (you can enter a formula, select ranges, etc.).

Figure 2    New Formatting Rule dialog, showing Range text box

Range text box tip #1:
Sometimes, when you are typing a formula in the Range text box, you make a mistake and want to go back and correct it. If you’re like me, that does not entail abandoning the keyboard, grabbing the mouse, clicking where you want to edit (sometimes a few clicks to get it in the exact right place) and then returning to the keyboard to edit, but rather just a few quick left arrows (with or without Ctrl) to get to where you want to be and edit. But there is a problem. Because you are in the Range text box, Excel assumes you want to select a range (starting from the Active cell), so when you press the left arrow (as an example), Excel dutifully selects the cell to the left of the Active cell and enters its address in the middle of your formula. But remember that working in this text box is akin to editing a cell formula: in a cell, you switch (toggle) between moving within the cell contents and moving across worksheet ranges by pressing F2. Same here. So press F2, and then you can happily move around within the formula you are typing.

Range text box tip #2:
Whenever you do select a range while in the Range text box (whether with the keyboard or the mouse), Excel enters the range address as an absolute address, and this is definitely by design, as most of the time, this is exactly what you need. But there are times that you want relative addresses. Again, you can edit the formula just like in a cell: Move to the range address, and press F4 until you get the reference type—Relative, Mixed (x2), or Absolute—that you want. But here is the real tip: When using relative references (or, for the relative parts of mixed references), the formula is defined for each cell in the selected range in the same relative way that the formula applies to the Active cell. So always look to see what the Active cell is, and write your formula accordingly.

Now that may be confusing (I am finding it hard to explain), so let me illustrate.

The data shown in Figure 1 actually covers this range: $A$1:$E$607 (it was a quiet day on my laptop, and I did most of my work for the day on my office PC). If I select $E$2:$E$607 by starting at $E$2 and pressing Ctrl+Shift+Down, Excel then extends the selection all the way down to row 607, showing me, then, those bottom rows; but the Active cell remains $E$2 (Figure 3). So when I enter my formula into the Conditional formatting dialog, I must keep in mind that it is relative to E2 for E2, and that it is adjusted accordingly for every cell in E3:E607.

Figure 3    Active cell in a selected range

This is important, because even though E2 is the Active cell, I can’t see it. And if, for example, I had selected E607 and pressed Ctrl+Shift+Up, I would then see E2, but E607 would be my Active cell.
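To make the "relative to the Active cell" idea a little more concrete, here is a rough Python sketch of how a rule written for E2 could be rewritten for a cell further down the selection. It is only an illustration of the principle, not how Excel does it internally: the names `adjust`, `col_to_num` and `num_to_col` are made up for this example, and the simple pattern ignores sheet names and would trip over function names that end in digits.

```python
import re

# Matches A1-style references such as E2, $E$2, E$2 or $E2.
REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)(\d+)")


def col_to_num(col):
    """Convert a column label to a number (A -> 1, Z -> 26, AA -> 27)."""
    num = 0
    for ch in col:
        num = num * 26 + ord(ch) - ord("A") + 1
    return num


def num_to_col(num):
    """Convert a column number back to a label (1 -> A, 27 -> AA)."""
    label = ""
    while num > 0:
        num, rem = divmod(num - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def adjust(formula, row_offset, col_offset=0):
    """Shift the relative parts of every reference; leave $-anchored parts alone."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:
            col = num_to_col(col_to_num(col) + col_offset)
        if not row_abs:
            row = str(int(row) + row_offset)
        return f"{col_abs}{col}{row_abs}{row}"
    return REF.sub(shift, formula)


# The rule is written for the Active cell E2; E7 is 5 rows further down.
print(adjust("=COUNTIF($E$2:$E$607,E2)>1", row_offset=5))
# prints: =COUNTIF($E$2:$E$607,E7)>1
```

The absolute range stays put, while the relative E2 becomes E7. Had E607 been the Active cell instead, the same typed formula would have been shifted upwards for the cells above it, which is exactly why it pays to check the Active cell first.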

Filtering for only the first instance of all duplicate items

Now after all that, we can turn our attention to the filter problem. This is, of course, actually very simple. I add a new conditional format, using a formula, to mark all duplicates from the second on (note that the formula is relative to E2):

=AND(COUNTIF($E$2:$E$607,E2)>1,MATCH(E2,$E$2:$E$607,0)<>ROW(E2)-(ROW($E$2)-1))

Let me unpack that.

The AND function ensures that both criteria must be met: 1) It must be a duplicate, and 2) it must be a subsequent occurrence of the duplicate, not the first occurrence.

COUNTIF($E$2:$E$607,E2)>1 uses $E$2:$E$607 as the comparison range (we keep that absolute, because we don’t want it adjusting down for all the cells below E2), and counts all the occurrences of the current cell (because E2 as the criterion is relative). For each cell, if it occurs in the range more than once (i.e., >1), it is one of a set of duplicates. All singles will return FALSE for this, and all duplicates will return TRUE.
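For readers who think more easily in code, the COUNTIF test can be sketched in Python roughly as follows. The function name `is_duplicate` and the sample application names are invented for illustration. The sketch also compares text exactly, whereas Excel's COUNTIF ignores case, so treat it as an approximation.

```python
from collections import Counter


def is_duplicate(values):
    """For each value, report whether it occurs more than once in the list.

    Mirrors COUNTIF(<whole range>, <current cell>) > 1 evaluated per cell.
    """
    counts = Counter(values)          # count every value once, up front
    return [counts[value] > 1 for value in values]


# Hypothetical column E contents (not the real ManicTime export).
apps = ["Excel", "Chrome", "Excel", "Word"]
print(is_duplicate(apps))
# prints: [True, False, True, False]
```

Note that both occurrences of "Excel" come back as True, including the first one, which is why the second AND argument is still needed.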

The second AND argument is perhaps slightly more complex:
MATCH returns the position of the first instance of the criterion (E2, which is relative, so it is always adjusted for the current cell) in the lookup range ($E$2:$E$607, again absolute). Note that I am using exact matching (the third argument is 0). So it will always return the position of the first occurrence in the set of duplicates to which the current cell belongs, if the current cell is one of a set of duplicates, and it will always return the position of the current cell itself if the current cell is a single. However, MATCH returns the position in the range, not the row number (i.e., the first cell of $E$2:$E$607 is position 1, not 2). But I need to check whether the row number of the first instance of the duplicate set (not its position) corresponds to the row number of the current cell. The relative ROW(E2) gives me the row of the current cell (whichever one of the 606 cells in this example that may be). I then just need to compare that to the row corresponding to the position returned by the MATCH function.

Now I could have simplified things and said: MATCH(E2,$E$2:$E$607,0)<>ROW(E2)-1, but that would mean that my target range must always start in Row 2, as I have essentially hard-coded the row into the formula. Instead (perhaps somewhat pedantically), I avoid the hard-coding by supplying the first cell of my range ($E$2), returning its row with ROW($E$2), and then subtracting 1 to find the row just before it. I suppose I could also have used ROW(OFFSET($E$2,-1,0)) to eliminate any confusion about what this part of the formula does. Nonetheless, this gives me the more complex, but more flexible, MATCH(E2,$E$2:$E$607,0)<>ROW(E2)-(ROW($E$2)-1).
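Putting the two AND arguments together, the whole rule can be mimicked in Python along these lines. This is only a sketch under a few assumptions: the function name `later_duplicate_flags`, the `first_row` parameter and the sample data are all invented here, and the comparison is exact, whereas Excel's COUNTIF and MATCH ignore case.

```python
def later_duplicate_flags(values, first_row=2):
    """Return True for every cell the conditional format would highlight.

    values    -- the column contents, top to bottom
    first_row -- worksheet row of the first cell (2 for $E$2)
    """
    flags = []
    for index, value in enumerate(values):
        row = first_row + index                 # ROW(E2), adjusted per cell
        count = values.count(value)             # COUNTIF(range, cell)
        position = values.index(value) + 1      # MATCH(cell, range, 0), 1-based
        # Highlight only if it is a duplicate AND not the first occurrence.
        flags.append(count > 1 and position != row - (first_row - 1))
    return flags


# Hypothetical column contents (not the real export).
apps = ["Excel", "Chrome", "Excel", "Word", "Chrome", "Excel"]
print(later_duplicate_flags(apps))
# prints: [False, False, True, False, True, True]
```

Because the offset is derived from `first_row` rather than a literal 1, the result is the same wherever the range starts, which is the point of ROW($E$2)-1 in the formula.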

Now that we have the formula, we just select $E$2:$E$607 and click on: Home | Styles | Conditional Formatting | New Rule…. That opens the New Formatting Rule dialog (Figure 4) and we add our formula (note in Figure 4 that E2 is my Active cell).

Figure 4    Creating the Conditional formatting

Once this is added, all subsequent duplicates are highlighted, but not the first instance of the duplicate set, and not any singles (see Figure 5).

Figure 5    Conditional formatting successfully applied

Once the conditional formatting has been set, it is a simple matter to filter for those cells showing no formatting (Figure 6), which gives us all singles, and all first instances from all duplicate sets.
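As a sanity check on what the filtered list should contain, the same end result (all singles plus the first instance of every duplicate set) can be produced in a single pass with a "seen" set. This is a different technique from the formula above, shown only for comparison; `keep_first_instances`, `key_index` and the sample rows are made up for the example.

```python
def keep_first_instances(rows, key_index):
    """Keep each row whose key value has not appeared in an earlier row."""
    seen = set()
    kept = []
    for row in rows:
        key = row[key_index]
        if key not in seen:       # first time this value shows up
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical (start time, application) rows, not the real export.
rows = [
    ("09:00", "Excel"),
    ("09:05", "Chrome"),
    ("09:10", "Excel"),
    ("09:20", "Word"),
    ("09:30", "Chrome"),
]
for row in keep_first_instances(rows, key_index=1):
    print(row)
# prints:
# ('09:00', 'Excel')
# ('09:05', 'Chrome')
# ('09:20', 'Word')
```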

Figure 6    Filtering on formatting