Saturday, 18 June 2011

Google Scholar and the Spectra of the Scientists

Introduction

Google Scholar can be used to construct a metric which can show the relative "merit" of scientists in their corresponding fields of research, based on the work they've done.

Assuming that an author's name is unique (which is not always the case), one can construct a characteristic publication number, or publication eigenvalue, for a given author, say "john doe", as follows:

Enter "author:j-doe" in the Google Scholar search field and press "Search".
a0=number of results for this author, shown at the upper right of the page.
a1=number of citations under the first result. Click on these citations. A new window opens.
a2=number of citations under the first result in the new window. Click on these citations. Another window opens.
...
Repeat until the first result shows no citations or the sequence falls into a cycle.
The publication eigenvalue for this author can then be taken to be the number C(john doe) with the continued fraction expansion:

C(john doe)=[a0;a1,a2,...,an,...].
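The steps above can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical numbers typed in by hand (nothing is fetched from Google Scholar), and the name continued_fraction_value is an assumption of this sketch. Every term after a0 is assumed positive, which matches the rule of stopping as soon as a result shows no citations.

```python
from fractions import Fraction

def continued_fraction_value(terms):
    """Evaluate [a0; a1, ..., an] exactly, working from the last term back."""
    if not terms:
        raise ValueError("need at least one term")
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        # a + 1/(rest of the expansion); the rest is positive by assumption
        value = a + 1 / value
    return value

# Hypothetical counts: 12 results, then 30, 7 and 2 citations along the chain.
terms = [12, 30, 7, 2]
c = continued_fraction_value(terms)
print(c, float(c))
```

Exact fractions are used instead of floats so that two authors with different sequences are never confused by rounding.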

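To simplify the ordering on the set {C(x):x\in author}, we can, without loss of generality, put a leading term 1 in front of the expansion and shift the indices, looking instead at the number:

C(john doe)=[1;a1',a2',a3',...,an',...], with an'=an-1 (the previous term of the original sequence), so that a1' is the number of results. For an author with at least one result (a1'≥1) this number lies in the interval (1,2]; for an author with no results (a1'=0) we use the convention 1+1/0=∞. The primes are dropped from here on.

Note that in this case supx{C(x):x\in author}=∞ and infx{C(x):x\in author}=1, the infimum being approached as the number of results grows.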

Adding a citation entry a>0 to an existing continued fraction expansion of C(x) can make C(x) either larger or smaller, depending on where a is added and the number of citations at level n[18]. Specifically:

Appending a as a new last term:
[1;a1,a2,a3,...,an,a]<[1;a1,a2,a3,...,an], if n is odd,
[1;a1,a2,a3,...,an,a]>[1;a1,a2,a3,...,an], if n is even.
Adding a to the existing last term an:
[1;a1,a2,a3,...,an+a]<[1;a1,a2,a3,...,an], if n is odd,
[1;a1,a2,a3,...,an+a]>[1;a1,a2,a3,...,an], if n is even.
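These inequalities are easy to check numerically. The sketch below uses made-up sequences and the hypothetical helper names cf and effect_of_new_entry. It compares the original value with the value after appending a new term a and after adding a to the last term.

```python
from fractions import Fraction

def cf(terms):
    """Exact value of [terms[0]; terms[1], ...], all later terms positive."""
    value = Fraction(terms[-1])
    for t in reversed(terms[:-1]):
        value = t + 1 / value
    return value

def effect_of_new_entry(terms, a):
    """Return (base, appended, bumped) for terms = [1, a1, ..., an], n >= 1."""
    base = cf(terms)
    appended = cf(terms + [a])                 # [1; a1, ..., an, a]
    bumped = cf(terms[:-1] + [terms[-1] + a])  # [1; a1, ..., an + a]
    return base, appended, bumped

# Made-up sequences of increasing length; n is the index of the last term.
for terms in ([1, 2], [1, 2, 3], [1, 5, 4, 9], [1, 5, 4, 9, 2]):
    n = len(terms) - 1
    base, appended, bumped = effect_of_new_entry(terms, 4)
    direction = "smaller" if appended < base else "larger"
    print(n, "odd" if n % 2 else "even", direction, bumped < base)
```

For odd n both modified values come out smaller than the original, and for even n both come out larger.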
The main "weight" of the number C(x) is then carried by the term a1, the number of publications of author x, which already gives a good approximation of C(x): C(x)~C2(x)=1+1/a1. This is fairly reasonable, since the remaining terms change the value by less than 1/a1^2.
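To see how good the approximation 1+1/a1 is, one can compare it with the exact value. The sketch below uses a hypothetical sequence and the assumed helper names cf and first_order_estimate. The error bound it reports, 1/(a1(a1a2+1)), is the standard estimate for a convergent of a continued fraction and is not taken from the text above.

```python
from fractions import Fraction

def cf(terms):
    """Exact value of [terms[0]; terms[1], ...], all later terms positive."""
    value = Fraction(terms[-1])
    for t in reversed(terms[:-1]):
        value = t + 1 / value
    return value

def first_order_estimate(terms):
    """Return (exact, estimate, error, bound) for terms = [1, a1, a2, ...]."""
    a1 = terms[1]
    exact = cf(terms)
    estimate = 1 + Fraction(1, a1)
    error = abs(exact - estimate)
    # Convergent bound 1/(q1*q2) with q1 = a1 and q2 = a1*a2 + 1.
    # With no a2 at all the estimate is exact, so the bound is 0.
    if len(terms) > 2:
        bound = Fraction(1, a1 * (a1 * terms[2] + 1))
    else:
        bound = Fraction(0)
    return exact, estimate, error, bound

# Hypothetical author: 40 results, then 25 and 3 citations along the chain.
exact, estimate, error, bound = first_order_estimate([1, 40, 25, 3])
print(float(exact), float(estimate), float(error), float(bound))
```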

The formal definition of C(x) is slightly more involved, mainly because C(x) has to be defined uniquely. Here, then, is the formal definition:

Let x be the name of an author in Google Scholar.
Search on x gives rise to a1 results.
Each result gives rise to a2,k citations, indexed by k.
Each of those results gives rise to a3,l citations, indexed by l, and so on.
Define C(x)=[a0=1;a1,...,an,...], with:
a2=supk{a2,k},a3=supl{a3,l},...,an=supw{an,w}.
It can now be seen that this definition gives rise to a unique number C(x), just as the first definition did for "john doe" above, because the suprema are taken over finite sets indexed by k,l,m,...,w and are therefore attained.
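The formal definition can be sketched as follows, assuming the citation counts at each level have already been collected by hand into lists. The function name publication_number and the input layout are assumptions of this sketch, and the sample numbers are invented. Since each level is a finite list, the supremum is simply the maximum.

```python
import math
from fractions import Fraction

def publication_number(num_results, citation_levels):
    """C(x) = [1; a1, a2, ...] with a1 = num_results and each later term
    the largest citation count found at that level."""
    if num_results == 0:
        return math.inf  # convention from the text: no publications gives infinity
    terms = [1, num_results]
    for counts in citation_levels:
        top = max(counts, default=0)
        if top == 0:
            break  # no citations at this level, so the expansion stops
        terms.append(top)
    value = Fraction(terms[-1])
    for t in reversed(terms[:-1]):
        value = t + 1 / value
    return value

# Invented data: 3 results; their citation counts; counts one level deeper; then none.
print(publication_number(3, [[10, 4, 0], [2, 7], [0]]))
```

Taking the maximum at each level removes the dependence on which result happens to be listed first, which is what makes the number unique for a given set of counts.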

A Metric Based On Google Scholar

The definition above gives rise to the metric: d(x,y)=|C(x)-C(y)|. Let's verify the metric's fundamental properties:

d(x,y)≥0: Follows from the definition of |.|.
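d(x,y)=0 <=> x=y: Let C(x)=[a0=1;a1,a2,...,an] and C(y)=[b0=1;b1,b2,...,bm]. If x=y, then m=n and C(x)=C(y), so |C(x)-C(y)|=d(x,y)=0.
Conversely, if d(x,y)=|C(x)-C(y)|=0 then C(x)=C(y), so m=n and ai=bi, for all i \in {0,1,2,...,n}, because a number has only one continued fraction expansion once the last term is normalized (recall that [...,an] and [...,(an)-1,1] denote the same number). Strictly speaking, this identifies x and y only as authors with the same expansion, so d is a metric once such authors are regarded as equal.
d(x,y)=d(y,x): Follows from |u|=|-u|.
For any w, d(x,y)=|C(x)-C(w)+C(w)-C(y)|≤|C(x)-C(w)|+|C(w)-C(y)|=d(x,w)+d(w,y), by the triangle inequality for |.|.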
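As a sanity check, the metric properties can be tested on a few sample expansions. The author labels and sequences below are hypothetical, and the helper names cf and d are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product

def cf(terms):
    """Exact value of [terms[0]; terms[1], ...], all later terms positive."""
    value = Fraction(terms[-1])
    for t in reversed(terms[:-1]):
        value = t + 1 / value
    return value

def d(cx, cy):
    """Distance between two authors, given their numbers C(x) and C(y)."""
    return abs(cx - cy)

# Hypothetical authors with made-up expansions [1; a1, a2, ...].
authors = {
    "author_a": [1, 12, 30, 7],
    "author_b": [1, 3, 5],
    "author_c": [1, 40, 2, 2, 9],
}
values = {name: cf(terms) for name, terms in authors.items()}

for x, y, w in product(values, repeat=3):
    assert d(values[x], values[y]) >= 0
    assert d(values[x], values[y]) == d(values[y], values[x])
    assert d(values[x], values[y]) <= d(values[x], values[w]) + d(values[w], values[y])

print(float(d(values["author_a"], values["author_b"])))
```

Passing on a handful of samples does not prove anything, but it catches slips in the definitions quickly.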
It is clear that a person with no publications will have a characteristic number equal to infinity, and that the more publications an author has, the closer C(x) is to 1. This gives rise to a distribution of the values C(x) over authors, and one can then define the publication percentile P(x) of a scientist x in this distribution to be: P(x)=100/C(x). With this definition P(x)=0 for a person with no publications, and P(x) approaches 100 as the number of publications grows.
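A minimal sketch of the percentile, assuming C(x) is given either as an exact fraction or as infinity for a person with no publications. The function name percentile is hypothetical, and the sample values use the approximation C(x)~1+1/a1.

```python
import math
from fractions import Fraction

def percentile(c):
    """P(x) = 100 / C(x), with P = 0 when C(x) is infinite."""
    if c == math.inf:
        return 0.0
    return float(100 / c)

# First-order values 1 + 1/a1 for a1 = 1, 9 and 99 publications.
for a1 in (1, 9, 99):
    print(a1, percentile(1 + Fraction(1, a1)))
print("none", percentile(math.inf))
```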
