Manipulating text with Excel: Take II

Quite some time ago, I did a post in which I demonstrated some techniques for manipulating repetitive text blocks. However, I will confess, as my Excel-L friend Brian Canes also pointed out, that the example is somewhat contrived, as it would probably have taken less time to use the CONCATENATE function and just to click on all the cells. What I don’t like about doing something like that is the possibility of error—of clicking on one cell twice, or skipping a cell.

I want to go just a bit further with the idea of using Excel to manipulate text (also using a less contrived example), and show how I use Excel to create repetitive (or even partially repetitive) text blocks. Granted, people tend to think of Excel as a tool for working with numerical data (which, of course, it is), and granted, Excel really has some big problems working with rich text, but people tend to miss the fact that the tabular layout of a spreadsheet can still be a useful palette for working with text. For example (and before I get to my post), if I have to create something like an attendance list (a table with space for names, signatures, and perhaps some other information to be completed, but nothing else), then I prefer to do it in Excel. The page layout is a bit trickier, but I find I can do it quicker in Excel than, for example, by creating a table in Word.

Here, then, are four somewhat less contrived (or not!) examples of using Excel to manipulate text. They all relate to using excel to create programming code. The first one I used to crea VBA code, and the last three SAS code.

Example 1

Here is an example where I used Excel to write VBA code for the Word uTIlities. I wanted to be able to determine exactly what is selected, especially if an object is selected. I copied a table from the Microsoft Word help file (and the example below is from the Word 2010 help file), but the same list can be found for Word 2013 at https://msdn.microsoft.com/en-us/library/office/ff860759.aspx. The table consists of three columns, the Name, the value, and a description. I wanted to take that and use it as the base for my programming. So all I did was copy the table and paste it into Excel, and then added some extra columns to do all my “writing” for me. The process is explained below.

Figure 1    Word Help file text pasted into Excel and modified for further programming

If you look at this, the process was simple. My three original columns from the table were pasted into the worksheet columns B:D. I then added two columns between B:C. Then I typed “elseif selection.shaperange(1).type=” into A2 and filled it down. I typed “Then” into C2 and filled that down. Then I added one double quote to D2 and filled that down, and then merged those values with the aforementioned uTIlity of mine (note that the actual values are now, in Figure 2, in Column D and no longer in E). I then typed “WhatIsSelected = WhatIsSelected & ” in an” in F2 and filled that down, and again one double quote to H2 and filled that down. The double quote in D2 deserves an explanation. Because the single quote is Excel’s cue to treat a number as text, if I added that, and copied it across to the VBE, only the number would be pasted, not the single quote (which, as you will see in the code snippet below, I wanted to use for commenting out the number). But a simple search and replace could change that double quote to a single quote once it had been copied and pasted. One more search and replace would also allow me to break the Excel rows into two code rows each. And I just had to delete the very first else to make it the starting if. So, in a matter of a minute or two, I was able, from the help file, to produce the code below. Counting only the text from Excel, that’s 367 words, 2603 characters in fifty lines of code—you would have to be a very fast typist to beat that (and I am not a very fast typist, so it helped me all the more).

intTestForError = .ShapeRange(1).Type
If Err.Number = 0 Then
  '1
  If .ShapeRange(1).Type = msoAutoShape Then
      WhatIsSelected = WhatIsSelected  _
       & " in an AutoShape"
  '2
  ElseIf .ShapeRange(1).Type = msoCallout Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Callout"
  '20
  ElseIf .ShapeRange(1).Type = msoCanvas Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Canvas"
  '3
  ElseIf .ShapeRange(1).Type = msoChart Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Chart"
  '4
  ElseIf .ShapeRange(1).Type = msoComment Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Comment"
	'21
  ElseIf .ShapeRange(1).Type = msoDiagram Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Diagram"
	'7
  ElseIf .ShapeRange(1).Type = msoEmbeddedOLEObject Then
      WhatIsSelected = WhatIsSelected  _
       & " in an Embedded OLE object"
	'8
  ElseIf .ShapeRange(1).Type = msoFormControl Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Form control"
	'5
  ElseIf .ShapeRange(1).Type = msoFreeform Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Freeform"
	'6
  ElseIf .ShapeRange(1).Type = msoGroup Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Group"
	'24
  ElseIf .ShapeRange(1).Type = msoSmartArt Then
      WhatIsSelected = WhatIsSelected  _
       & " in a SmartArt graphic"
	'22
  ElseIf .ShapeRange(1).Type = msoInk Then
      WhatIsSelected = WhatIsSelected  _
       & " in an Ink"
	'23
  ElseIf .ShapeRange(1).Type = msoInkComment Then
      WhatIsSelected = WhatIsSelected  _
       & " in an Ink comment"
	'9
  ElseIf .ShapeRange(1).Type = msoLine Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Line"
	'10
  ElseIf .ShapeRange(1).Type = msoLinkedOLEObject Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Linked OLE object"
	'11
  ElseIf .ShapeRange(1).Type = msoLinkedPicture Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Linked picture"
	'16
  ElseIf .ShapeRange(1).Type = msoMedia Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Media"
	'12
  ElseIf .ShapeRange(1).Type = msoOLEControlObject Then
      WhatIsSelected = WhatIsSelected  _
       & " in an OLE control object"
	'13
  ElseIf .ShapeRange(1).Type = msoPicture Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Picture"
	'14
  ElseIf .ShapeRange(1).Type = msoPlaceholder Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Placeholder"
	'18
  ElseIf .ShapeRange(1).Type = msoScriptAnchor Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Script anchor"
	'-2
  ElseIf .ShapeRange(1).Type = msoShapeTypeMixed Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Mixed shape type"
	'19
  ElseIf .ShapeRange(1).Type = msoTable Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Table"
	'17
  ElseIf .ShapeRange(1).Type = msoTextBox Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Text box"
	'15
  ElseIf .ShapeRange(1).Type = msoTextEffect Then
      WhatIsSelected = WhatIsSelected  _
       & " in a Text effect"
  End If
End If

 Example 2

The first SAS example concerns the scaling of values in a questionnaire for a paper we were writing. Essentially, we have thirty three items, each of which is divided into four sub-questions. The respondents had to indicate to what level they felt their mothers were involved in various aspects of their sport (A), to what level they wanted their mothers to be involved (B), and then again to what level they felt their fathers were involved in various aspects of their sport (C), to what level they wanted their fathers to be involved (D). For each parent, we had to calculate a discrepancy score—the degree to which the perceived behaviour differed from the desired behaviour (i.e., A-B for mothers, and C-D for fathers.

This could also, admittedly, be done in a different way. I must confess that my SAS macro skills are not what my Excel skills are, and even in Excel I’m probably still not at the true level of guru-like enlightenment. But presumably, this could be done with a SAS macro. However, the simple macro below doesn’t work because my incrementer variable I is used next to other characters, and so SAS doesn’t recognise it. Contrary to my own advice, I did not go and teach myself how to conflate the incrementer with the succeeding text into a new variable name, but rather thought to myself: “I’m under a tight time deadline, and I can do this in a minute or two in Excel.” Which is what I did.

So the first thing I did was enter this in Row 1:

Figure 2    Starter line of code in Excel

Next, copy all of that down to row 2 and change the value of B2, E2 and I2 to 2:

Figure 3    Second line of code, with 1 changed to 2

Now select it all, and use autofill to drag and fill it down to row 33.

Now copy all 33 rows, paste below, and in that, use search & replace to replace ‘Mom’ with ‘Dad.’

The last step can be done in one of two ways. If I copy all of that, and paste it into UltraEdit which I use for my SAS programming, each column’s values are added as tab-separated. So a simple search and replace in UltraEdit to replace tabs with nothing, gives me the final result I want:

B_1_Mom_Discrep=B_1A-B_1B;

However, I used one of my own Excel uTIlities (which I have still not released to the public!) to merge all that information into Column A (see the before and after views below). From there, I could just copy that into UltraEdit and I was good to go. Sixty six lines of code written in about two minutes from when I started typing in Excel, until I had the finished product in UltraEdit!

Before

After

Figure 4    Excel text before and after merging cell contents

Turns out, though, that a period is what joins the counter in a SAS macro to the text which comes after it, so I could just have written the macro like this (had I known this at the time):

%MACRO Discrep;
    %do i=1 %to 33;
        B_&i._Mom_Discrep=B_&i.A-B_&i.B;
    %end;
%MEND;

 Example 3

The third example, I will confess, is again a bit more contrived, but this is something I actually did, so it is worth mentioning. It also employs a technique I learned on Excel-G from the Excel guru, Bob Umlas, which, I think, is featured in his book This isn’t Excel, its Magic!

The scenario is this:

I was writing a SAS program to analyse a student’s data. The student found that some of the data points were incorrect (his error in the coding), and sent me an Excel file with the corrections. However, instead of taking the data file, and correcting it, he sent me a file containing only the corrections, and blanks for all the rest. The (de-identified) corrections look like this:

Figure 5    Data which has to be merged into SAS program

The file indicates the respondent number, as well as the items in which errors were found, and the correct values. Unfortunately, I could not simply read the data file into SAS and merge it with the existing data, because the layout was not ideal (SAS would read all the blank cells as missing values). I could certainly try to manipulate the layout and create a useable data set, and either import that from Excel, or read it into SAS as a data card. But that may have been about as much effort as the solution I chose:

The first thing I did was to delete all the empty columns, i.e., those columns to which no corrections applied. This could easily be done for a large dataset by adding a new top row, and using a COUNTA function in each column—those columns with a count of zero are empty and can be deleted (and even that process can be expedited by sorting horizontally). The result now looked like this:

Figure 6    Data with empty columns removed

Next, I needed labels for each row (this was optional, I could have written a more complex formula that pulled everything out of the top row, but I chose the simpler option for its speed). So here I used Bob Umlas’ trick. I added another column (optionally called “Order”). As before, I added 1 and 2 in the first two rows of that column (L2:L3):

Figure 7    Creating the order column

With autofill, that was quickly copied down all 33 rows. That was then copied again and pasted directly below itself, so that I had two number sequences across the rows, from 1 to 33, right below each other. Then I copied the labels from row 1 and pasted them below the data (i.e., next to the second 1 in my sequence, which was in row 35):

Figure 8    New labels added

Autofill copied that down all the remaining rows (up to row 67) and I quickly sorted on the order and No columns, to give me this:

Figure 5    Labels positioned above each row

Note that, in terms of time, this is probably about one minute’s worth of work.

Next, I added a new column A. Now I was ready to create my lines of SAS code.

The interesting thing is that I even wrote my cell formulas with the method I am describing in this post. I know it will seem contrived, but essentially, I wrote these formulas (note the cells in which I wrote them) in a blank worksheet (some cells contain text, other contain formulas, which I have highlighted, but note that B2 starts with an unseen single quote, so it is not counted as a formula). Note also that this is much less typing that it seems, as C2 & F3 contain exactly the same formula, which is only modified slightly for both B3 & D3 respectively:

Figure 9    Formulas to write an Excel formula

The result is shown below:

Figure 10    Results from formulas to write an Excel formula

Then I again use autofill to copy them down to row 12 (L is the 12th letter) and just add one snippet in B13:

Figure 11    Final text for Excel formula

Now comes the fun bit. I copy that (B2:G13), paste it into my handy text editor UltraEdit, copy it again, and paste it (F2, Ctrl+V) into cell A1 of my worksheet containing the data. The result is this:

Figure 12    Formula in first cell

Now I select A1:A2, and fill down (A2 so that the formula is filled down to every second row):

Figure 13    Final code

I can copy that, and paste it directly into my SAS program, and it will make all the corrections I want.

if no eq 8 then do;B1=3;B2=2;end;
if no eq 10 then do;B1=3;B2=3;B3=3;B6=3;B7=3;B16=3;B17=1;B18=2;B22=3;end;
if no eq 11 then do;B1=3;B2=3;B6=3;B7=3;B16=3;B18=2;B22=3;end;
if no eq 12 then do;B1=3;B2=3;B3=2;B6=3;B7=2;B16=3;B17=1;B18=2;B22=3;end;
if no eq 13 then do;B1=3;B2=3;B3=2;B6=3;B16=3;B17=2;B18=2;B22=3;end;
if no eq 14 then do;B1=3;B2=3;B3=2;B6=3;B16=3;B17=1;B18=3;B22=2;end;
if no eq 15 then do;B1=3;B2=3;B3=3;B6=3;B7=1;B16=3;B17=1;B18=2;B22=3;end;
if no eq 16 then do;B1=1;B2=3;B3=3;B6=3;B7=1;B16=3;B17=3;B18=3;end;
if no eq 17 then do;B1=3;B2=2;B3=2;B6=3;B7=3;B16=3;B17=3;B18=3;B22=3;end;
if no eq 18 then do;B1=3;B2=3;B3=3;B6=3;B16=3;B17=3;B18=2;B22=3;end;
if no eq 19 then do;B1=3;B2=3;B3=3;B6=3;B7=2;B16=3;B17=1;end;
if no eq 20 then do;B1=3;B2=3;B3=3;B6=2;B16=3;B17=1;B18=2;B22=3;end;
if no eq 21 then do;B1=3;B2=3;B3=2;B6=3;B7=3;B16=3;B17=3;B18=3;B22=3;end;
if no eq 22 then do;B1=3;B2=3;B3=3;B6=3;B7=2;end;
if no eq 26 then do;B16=3;end;
if no eq 29 then do;B1=3;B14=1;B16=3;B17=3;end;
if no eq 33 then do;B1=3;end;
if no eq 36 then do;B16=3;end;
if no eq 38 then do;B22=3;end;
if no eq 43 then do;B16=3;end;
if no eq 45 then do;B16=3;end;
if no eq 48 then do;B22=2;end;
if no eq 57 then do;B16=3;end;
if no eq 59 then do;B16=3;end;
if no eq 63 then do;B17=1;end;
if no eq 69 then do;B6=3;end;
if no eq 72 then do;B6=3;B16=3;B18=3;end;
if no eq 73 then do;B6=3;end;
if no eq 83 then do;B16=3;end;
if no eq 84 then do;B6=3;B18=1;end;
if no eq 89 then do;B3=2;end;
if no eq 99 then do;B17=1;end;
if no eq 111 then do;B18=2;end;

The process may be crude, but is surprisingly effective and efficient: This kind of programming takes just a minute or two to complete.

Example 4

This example is much the same as before: Data corrections (yes, it seems as if I spend more time correcting data than analysing data, but let’s just say that proper preparation leads to efficient execution).

The basic layout is as shown below–almost 70 rows of corrections that need to be made. The trick here is that it is not always the same variable that is to be corrected. The variables names are listed in B1:J1, and if a column has a value, that variable must be set to that value for that respondent (as listed in Column A).

Excel Text example 4b

Figure 14    Listed corrections

So I use this formula, and the results are shown in Column K in the final figure.

="if Number eq " & A2 & " and " & $B$1 & " eq " & B2 & " and " &OFFSET($B$1,0,MATCH(MAX(C2:H2),C2:H2,0),1,1) & " eq " & MAX(C2:H2)& " then do; " & $B$1 & "="&I2&";" &OFFSET($B$1,0,MATCH(MAX(C2:H2),C2:H2,0),1,1) & "="&J2&"; end;"

Excel Text example 4a

Figure 15    Formula result showing code which can be copied to SAS

Essentially, my formula finds which cells have values, and then pulls the column headings in for those cells, setting them equal to the value found in the cell. Using this method, I could get Excel to create 6557 characters of code for me by typing 228 characters (the length of the formula) and doing a little bit of layout.

Conclusion

In closing, this is definitely not a conventional usage of Excel, but in terms of getting “repetitive” (actually, “patterned” is a better term) text created, it is hard to beat for speed if you know Excel well. I can create hundreds of lines of code, if need be, in a fraction of the time it would take to type it.

Share

Manipulating text data with Excel: Using Excel to reverse engineer Mendeley’s documents database

Some introductory notes

Perhaps “reverse engineer” is a bit of a strong term. Bit it sketches the general idea I want to illustrate—how I tried to figure out in what way Mendeley stores document information in its database.

Also, this is a long blog post. I have a friend who has “coached” me a bit on bloging, and he warned that if I write long posts, people won’t read them. Something about short attention spans and clicking out of long web pages. While he doesn’t fit into that category himself, here’s my limerick to those people:

In an age of Twitter
With minds fully a-flutter
Every single Twit
Will continually want to flit
From this to that quick wit!

Nonetheless, be warned: This is a long post. The aim is to show you how some of Excel’s techniques and functions are used in practise to manipulate text data. Secondarily also, it explains how I derived the various Mendeley document field tables in my forthcoming book on Mendeley. For each Excel function or tool mentioned in this post, I have provided a link to the Microsoft support site explaining that function or tool. So because this post is so long already, I have not included explanations of the functions themselves, as that would really make it too long. But the reality is that the explanation below probably won’t make sense if you don’t have an inkling of how the various functions work.

In the discussion below, I have indicated each Excel function/technique using Strong style formatting.

Also, you will want to download the example workbook [Mendeley fields for different source types (example for book)] and look at it as you read through the discussion. The discussion uses the names of the worksheets for headings, so it helps you find your place in the workbook.

The following functions and techniques were used in the creation of this workbook (just kidding about CONCATENATE—no one in their right mind uses concatenate!). Each item contains a link to the Microsoft support site, so that you can learn more about that feature from there, which I hope will help you understand how I used it in the example workbook:  Mendeley fields for different source types (example for book):

And now, without further ado:

Background

For my book on Mendeley, I had to explain what all the fields in the Mendeley document database were about, so that readers could understand what information belongs where. The problem, of course, is that Mendeley has not made this kind of information freely available. So I had to do some guesswork and some sleuthing. And that’s where Excel comes in. I thought demonstrating the process I used would:

  1. Illustrate that Excel can be a powerful tool for working with text data
  2. Be a good demonstration of some advanced Excel techniques like macros, array formulas, conditional formatting, etc.

The first point is interesting. Excel gets used and abused with numerical data all the time. But few people see it as a good tool for working with text data, which is an unfortunate misconception. Excel has its flaws when it comes to text, but in my work as a research consultant, I work with a lot of text data together with a lot of numerical data. Knowing how to work with text in Excel can be very useful.

So here’s the first challenge—figuring out which fields are actually in the database. This is challenging, not only because Mendeley has not published their database design. Rather, two more complex reasons tend to befuddle things: Firstly it is clear on inspection that, for different document types, they have named the same fields differently (i.e., even though the field appears, in the interface, to be called one thing, the same field, for a different document type, is called something else). Secondly, the interface only gives you the relevant fields for each document type, which means that with no document type are you ever looking at all the fields. To get a clear picture of what all the fields are, you have to piece together the information from all 20 document types provided by Mendeley and build a coherent whole. That’s where Excel comes in.

Doing the work

Worksheets:
Document types–Mendeley to CSL
Mendeley to CSL Fields
Field mapping–Mendeley to CSL

So I started by going to this page, which maps the Mendeley document types against the CSL types:

https://github.com/citation-style-editor/csl-editor/wiki/CSL-Editor-Type-and-Field-Mappings-for-Mendeley-Desktop

The page contains two tables. The first maps the Mendeley document types against the CSL document types. The second maps the Mendeley fields against the CSL fields. I copied and pasted these into two worksheets (note that it is a general principle/best practice of mine to keep one block of data to one worksheet, hence the two sheets). The first I formatted as a table. The problem with the second one (the Mendeley to CSL Fields worksheet) is that some of the rows show that multiple Mendeley fields get mapped against single CSL fields. This means that Mendeley’s database has a lot more detail in terms of field distinctions than currently exists in the CSL specification. Although the process is not shown, a quick Text to columns using the commas as the delimiter, and then copy with Paste SpecialTranspose, allowed me to get each Mendeley field in one row. All I had to do then was use Fill Down (Ctrl+D) to fill down the CSL fields for the multiple Mendeley fields against which they were mapped (thus, from the Mendeley to CSL Fields worksheet, to the desired end result in the Field mapping–Mendeley to CSL worksheet). This was then also formatted as a table. Getting from the original table pasted into the worksheet, to the one-field-per-row version, in other words, took probably much less than two minutes (sorry, I didn’t record a screen cast, and I didn’t run my ManicTime stopwatch, so you’ll have to take my word for it).

This then accounts for the first three sheets.

Worksheets:
Generic
to Working paper (20 in total)

Next I had to create a worksheet for each document type. Fortunately, I had written a macro a long time ago that creates worksheets using the names in a selected range of cells (you don’t need to write these, you can search for one on the net easily). But the code could be as simple as:

Sub AddWorksheetsFromCells()
For Each c In Selection.Cells
With ActiveWorkbook.Worksheets
Set ws = .Add(After:=ActiveWorkbook.Worksheets(.Count))
ws.Name = c
End With
Next c
End Sub

Note that this code does the job, but doesn’t check for error such as worksheet names that contain illegal characters, are too long, or already exist, amongst others. But it does the job.

Using the table of Mendeley document types (from the Document types–Mendeley to CSL worksheet), I have my 20 worksheets in just a few seconds. Not too bad. Probably five minutes into the process in total, and I have 23 worksheets, 22 of which I will be using, all of which have been named, and two of which have already been populated.

The next step was more tedious, and took a lot longer. I went to the Mendeley web interface and stepped through the process of adding a document, for all twenty document types. I selected each document type in turn, activated the Additional Fields option, and copied all the fields that Mendeley gave me, pasting them into their host worksheet (the worksheet with the same name as the current document type). When you copy and paste information from the net like that, a lot of crud comes with it in the form of the html formatting, but this is quickly removed by applying the Normal Cell Style (on Excel’s home ribbon).

On each of these twenty sheets, I added two columns. Column C (Web: Basic/Additional) shows whether the field is in the initial listing, or only available once the Additional Fields button has been selected. This was easy to do, as the top items were the initial ones, and the lower items the additional ones. So it was a simply matter of adding the two terms (Basic or Additional) and then filling down for the other rows. Some minor changes were made later: The Note field was added (at the end), as was the Tags field (at the end of the basic list).

Column B was more work. I opened the Document Details pane in the Mendeley Options dialog on the Mendeley Desktop application, and noted which fields were displayed as marked. These were marked as “Core.” Then I also scanned the full list and indicated which fields I, personally, would recommend also using, or would also consider using. So far, so good.

The next column (D) was to show which CSL field this particular Mendeley field was mapped against. The Mendeley Support site lists the fields mapped for each document type here: http://support.mendeley.com/customer/portal/articles/364144-csl-type-mapping. I copied and pasted each list to the relevant worksheet. Using Text to Columns from the Data ribbon quickly gave me two columns (Columns G:H), one for Mendeley, one for CSL. Once I had that information, it became quite easy to grab that and populate column D with a VLOOKUP:

=IFERROR(VLOOKUP(A2,G:H,2,0),””)

However, note the use of the IFERROR function (introduced with Excel 2007), which is a great wrapper to use with VLOOKUP (I would guess that more than half of my VLOOKUPs these days are wrapped in IFERRORs). I also grouped the twenty sheets and added the formula once for all the worksheets together, thus the slightly excessive use of the whole of columns G:H for my lookup table (the exact number of rows in G:H varies amongst the different worksheets). The formula was filled down, and there it was. I had one problem, though. The main table in columns A:D was showing all possible fields, and the mapping provided by Mendeley was only for the fields they map against CSL for that document type. But the other fields get mapped for other document types, and these are not shown. This was corrected by adding the Filter, and filter out the successfully matched items (filter to show only blanks).

The information for the full field mapping was already available in the table on the Field mapping–Mendeley to CSL worksheet, so I just added a new VLOOKUP:

=VLOOKUP(A9,’Field mapping–Mendeley to CSL’!$A$1:$B$89,2,FALSE)

This formula was added in the top row after the filter had been applied, and filled down across all the unfiltered rows (thus not wiping out the formulas that have already returned something).

Thus, if you scan down column D, you will see these two different VLOOKUP formulas intermingled. Where they appear is determined by how the filter filtered out non-blank lookups from the first one.

One last column (E) was added, to indicate whether I wanted to include this field in the table for that document type in my book (see Appendix A of the book). Here the OR function comes in handy—if the Mendeley field is mapped against a CSL field, or is one of the fields listed in the new document dialog in the desktop application (‘Core’) or is not shown only when the Additional fields link in the web application is clicked (not ‘Additional’), then it must be added:

=OR(B2=”Core”,C2<>”Additional”,LEN(D2)>0)

I had an additional check in cell F1 (using the COUNTA and COUNTBLANK functions), but to tell the truth, it became moot when I started adding the extra ‘Also consider’ or ‘Recommended’ options in column B.

Worksheets:
 Field listings for each type
All fields from all types

 Fields by type

One thing I noticed was that for each of the twenty document types, the Web client showed, when additional fields were displayed, a total of 67 fields were returned (I also later added the Files field from the Desktop application). The worksheet Field listings for each type shows the use of an INDIRECT function to pull these fields over from each sheet (and the ROW function tells the INDIRECT function where to look):

=INDIRECT(“‘”&A$1&”‘!a”&ROW())

However, demystifying the Mendeley database was not simply a matter of reading across the rows, since the initial items (the Web client’s ‘Basic’ fields) were different. So the same item (e.g., Length, or Series Editor, to mention a few) would not always be in the same row). Because of the field name changes for different document types, an alphabetical sort would also not solve the problem.

So now the last trick was to pull all this together. The challenge, of course, was Mendeleys renamed fields—those ‘fields’ that are actually the same field, but have different field names for different document types—I had to find these, and that meant figuring out when the different name was an alias, or an actual different field.

First another macro that was gleaned (and then modified) many years ago from the Excel-G list and stored in my personal macro workbook copied the contents of each of the twenty worksheets and dumped it into one sheet—All fields from all types. I didn’t actually do all that much with this worksheet—it just served as a starting point for another worksheet. What I did do was to copy and paste the worksheet names (from column A of the same sheet) to Column H, and then use Remove duplicates to list only the unique items. Then a quick MATCH function confirms that each of them is in the document type mapping table (if the MATCH returns #N/A then the item is not found):

=MATCH(H2,’Document types–Mendeley to CSL’!$A$2:$A$21,0)

I then did the same Copy and Paste and then Remove duplicates operation with the fields, so that column K listed all the unique field names across all document types. At this stage, the list would consist of ‘real’ field names as well as aliases, but the important thing is that it was complete—it listed 90 entries, to Mendeley’s 67, meaning I had to find and eliminate 23 aliases.

This list of unique field names and aliases was copied and pasted to a new sheet (Fields by type), which would be the one mapping fields across document types. This list was somewhat longer than the previous lists, but it would help me sort out what was going on. Again a VLOOKUP wrapped in an IFERROR would tell if this field was, for this document type, a core field or not. But oh no! I now had to VLOOKUP to twenty different worksheets! That would mean twenty different formulas, one for each column, to refer to the right worksheet. Or maybe not…. After all, each worksheet is named for the document type (thanks, in part, to my macro, and also to good naming conventions). So with one formula that includes an INDIRECT function to build the worksheet name into the reference, I can VLOOKUP the right sheet for that column:

=IFERROR(VLOOKUP($A2,INDIRECT(“‘”&C$1&”‘!$a$1:$b$100″),2,FALSE),””)

Note that the function will return a blank if the field is not found (the IFERROR sees to that), but if the field is found, then the function will return either the text from column B of the lookup sheet (if it has text), or a zero. Zero thus means the item is on that sheet, but it is not a core field. I could have “translated” that into something else with my function, but I was happy to leave it as is for the moment. What I did want to do was to use Conditional Formatting to show me on which sheets a field is not found. This would help me to see how commonly a field was being used, which could also give me clues as to when the field name was actually an alias (it would not appear in many document types ‘under the alias’). To quantify this, Column B counts the various text options and displays them separated by slashes:

=COUNTIF(C2:V2,”Core”)&”/”&COUNTIF(C2:V2,”Recommended”)&”/”&COUNTIF(C2:V2,”Also consider”)

Worksheets:
Field listings for each typeFIN

Almost done. I just need to look at this, and figure out what was being used where. Some were obvious (e.g., the Abstract field, which is there in all twenty document types). The worksheet Field listings for each typeFIN is just the content of the Field listings for each type worksheet copied and pasted as values with Paste Special. Here I used the ‘Generic’ document type as my reference point, and I Sorted alphabetically. Then I looked for mismatches—fields that did not match the same name as was in the Generic field. This was easy to do with another application of Conditional Formatting. Note how the conditional format uses a partially relative reference, so that I could set up one conditional format, that is row-specific—each column in the table is compared to the field name for the Generic field in Column C. Now the gumshoe work came in. For the mismatches, I had to look at the previous sheet, and try to discern what was the most likely name for that field by reading across all the columns. So, for example, for the Case document type, I was able to figure out that Reporter is not an actual field, but is the alias for Authors, as was also Date Decided the alias for Last Updated. So Mendeley, when defining the document information capture for the Case document type, captures the Reporter data to the Authors field, and the Date Decided field to the Last Updated database field. The Film and Television Broadcast fields do the same. To make this task easier, I used the filter to help me zoom in on the fields that need attention. Column B counts how many of the field names for the various document types are the same as for the Generic document type—if it was nineteen, then all twenty agreed, and that field needed no work:

=COUNTIF(D3:V3,C3)

So I filtered out all the 19s, and worked on the rest. What the sheet does not show is that each time I looked at a name (I started with those with the lowest occurrences), I tried to figure out what that field was an alias for. I then entered that alias in the row for the field for which I believed it stood (e.g., adding the value ‘Reporter’ for the Case document type in the ‘Authors’ field), and then I deleted that field row. This process of manual elimination whittled down the fields until I was left with 68 (the 67 former field count, plus 1 for Files).

Worksheets:
Field listings for each typeCOR

The last thing to be done was to map all of this out sensibly, so that it could become a table in the book. The worksheet Field listings for each typeCOR shows again all the document types across the columns, and the now-finalised list of fields down the rows. Again a VLOOKUP to the previous worksheet with an INDIRECT to reference the right worksheet for that column, brings in the name of the document type when the field is listed as core or recommended (not when listed as “also consider”), but to execute that decision, it requires the use of the OR function, nested in an IF function: Quite a mouthful:

=IF(OR(VLOOKUP(‘Field listings for each typeFIN’!C3,INDIRECT(“‘”&D$2&”‘!$a$1:$b$100″),2,FALSE)=”Core”,VLOOKUP(‘Field listings for each typeFIN’!C3,INDIRECT(“‘”&D$2&”‘!$a$1:$b$100″),2,FALSE)=”Recommended”),D$2,””)

Then I used Column A to count for how many document types the field was listed. This is an interesting formula, as I could not simply use a COUNTA or even a COUNT minus a COUNTBLANK, because the VLOOKUP formulas meant that the cells were not actually empty. Thus I had to use an Array formula, testing the length of the value returned by each cell’s formula—one of those instances where an array formula is the only thing that can get a result:

{=SUM(IF(LEN(D3:W3)>0,1,))}

This told me in how many document types, that field was used by Mendeley. However, presenting such a big table (essentially 69 rows x 22 columns) in a book is problematic. For my final step, I had to summarise columns D:W of this worksheet. I would have thought to use CONCATENATE, except that I still believe it is the most useless Excel function ever (ill-)conceived. Now I have a confession to make: I actually used Charles Williams’ excellent CONCAT.RANGE (which forms part of the FastExcel SpeedTools which I was beta testing). But if I distributed this workbook as a demo, that wouldn’t do, since most people don’t have those tools (and I won’t soon, either!). So I had to come up with an alternative. So the original:

=IF(A3=20,”All”,CONCAT.RANGE(D3:W3,”; “, , TRUE,,))

became:

=IF(A3=20,”All”,IF(D3=””,””,D3&”; “)&IF(E3=””,””,E3&”; “)&IF(F3=””,””,F3&”; “)&IF(G3=””,””,G3&”; “)&IF(H3=””,””,H3&”; “)&IF(I3=””,””,I3&”; “)&IF(J3=””,””,J3&”; “)&IF(K3=””,””,K3&”; “)&IF(L3=””,””,L3&”; “)&IF(M3=””,””,M3&”; “)&IF(N3=””,””,N3&”; “)&IF(O3=””,””,O3&”; “)&IF(P3=””,””,P3&”; “)&IF(Q3=””,””,Q3&”; “)&IF(R3=””,””,R3&”; “)&IF(S3=””,””,S3&”; “)&IF(T3=””,””,T3&”; “)&IF(U3=””,””,U3&”; “)&IF(V3=””,””,V3&”; “)&IF(W3=””,””,W3&”; “))

Now that’s a huge formula, and I am way too lazy to do all that typing. But here’s the last technique I used: Excel can actually be used to (sort of) build its own formulas.

So I entered this into D73 (D72 shows a simplified version):

=”if(“&MID(CELL(“address”,D3),2,1)&ROW(D3)&”=”””””&”,”””””&”,”&MID(CELL(“address”,D3),2,1)&ROW(D3)&”&””; “”)&”

Note how the ” has to be preceded by another ” if it is to be displayed in the cell results (it always takes me a moment to get my head around it when I have to write formulas like that). The formula results in this text:

if(D3=””,””,D3&”; “)&

Now all I did was fill D73 across to W73, then copy D73:W73, paste it into my text editor (UltraEdit), where each cell is separated by a tab and do a search and replace to strip out the tabs. I copied the result, and went to cell A3, where I typed the =IF(A3=20,”All”, bit, pasted my text, backspaced out the last ampersand (I could have done that in UltraEdit too), and added the final parenthesis. All this is a lot quicker than typing all of what is in cell A3. The formula then just gets filled down, and there I have my information for the book’s table!

After all of this, I was in the position to simply copy the relevant parts out of this Excel workbook, and paste them into the Word document for the book, formatting them with a table style. If you want to see what these final tables look like, you would, of course, have to get your hands on the book (due out in about a month and a half). The table names (I won’t give table numbers right now, as these are not yet final) as they appear in the book are:
Mapping of Mendeley Document Types to CSL Document Types
M
endeley Document Information Fields
Mendeley Fields Mapped to CSL fields
Mendeley Field Types Showing Core Document Types and Alternative Names
And then all the tables in Appendix A: Mendeley fields per document source type.

Share