Author: Riccardo "Jack" Lucchetti
Version: 1.0
Date: 2017-04-02
Description: Basic statistics about gaps in panel datasets
Tags: C23 C33
Menu attachment: MAINWIN/Sample
The paneldesc function was inspired by Stata's xtdescribe command, but
differs from it slightly. It checks for gaps in a panel dataset.
Unlike Stata's xtdescribe, however, its action can be restricted to a
list of variables, which is useful if you want to check data
availability for only a subset of your variables.
The syntax is:
pat_series = paneldesc(X, limit)
Both arguments are optional.
X: a list (possibly null, which is the default). An observation counts
as a gap if any of the variables contained in the list is missing. If
X is null, all the variables in the dataset are checked instead, so an
observation is deemed "gappy" if at least one variable in the dataset
is missing for that observation.
limit: a positive scalar. Controls how many patterns to print out
(see below; default = 10).
The function returns a series in which each pattern of missing/valid
observations is uniquely coded with a binary technique: for each unit
in the sample, the returned series has one non-missing observation, at
the unit's first valid observation, coded as (in LaTeX notation)
y_i = \sum_{t=1}^T 2^{t-1} I_{it}
where I_{it} is 0 if the corresponding observation is missing, and 1
otherwise.
For example, take a dataset with T=5; a unit that has observations at
t=2, 3 and 5 and missing observations at t=1 and t=4 would have y_i =
0 + 2 + 4 + 0 + 16 = 22, so that paneldesc(x) would return a series y
as in:
 t    x    y
-------------
 1
 2    5   22
 3    3
 4
 5   -1
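The coding can be inverted by repeatedly taking the remainder modulo 2.
The hansl fragment below is a minimal sketch of that idea and is not
part of the package: the names y, T and pat are illustrative only. It
rebuilds the pattern string for the example above, writing "1" for a
valid observation and "." for a missing one.

```hansl
scalar y = 22      # pattern code from the example above
scalar T = 5       # number of time periods in the example
string pat = ""

loop t=1..T --quiet
    # the lowest binary digit tells whether period t is observed
    if (y % 2) == 1
        pat ~= "1"
    else
        pat ~= "."
    endif
    # drop the digit just read and move on to the next period
    y = floor(y / 2)
endloop

printf "%s\n", pat
```

For y = 22 the digits read off, from t=1 to t=5, are 0, 1, 1, 0, 1, so
the string is ".11.1", which matches the rows of the table above.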
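# Signature inferred from the help text: both arguments are optional
function series paneldesc (list X[null], scalar limit[10])
    # o = 1 where the observation is usable, 0 otherwise
    scalar nolist = nelem(X) == 0
    if nolist
        # no list given: every series in the dataset must be valid
        series o = 0
        smpl --no-missing
        o = 1
        smpl full
    else
        series o = ok(X)
    endif

    series tt = time-1
    series ot = o ? tt : NA
    series z = ot == pmin(ot)             # z==1 -> first valid obs
    series x = z ? psum(2^tt * o) : NA    # binary code
    # print o tt z x -o

    discrete x
    # one row per distinct pattern (code, count), most frequent first
    matrix a = aggregate(null, x)
    a = -msortby(-a, 2)
    # binary digits needed for the largest code; ceil(log2()) would be
    # one digit short when that code is an exact power of 2
    scalar m = floor(log2(maxc(a[,1]))) + 1
    u = umat(a, m)

    if rows(u) == 1
        printf "Panel is balanced\n"
    else
        printf "\nDistribution of T_i:"
        s = quantiles(z ? psum(o) : NA)
        limit = xmin(limit+1, rows(u))
        printumat(u, sumc(a[,2]), limit)
    endif

    descstr = "Panel gap pattern"
    setinfo x --description="@descstr" --discrete
    return x
end function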
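# this function turns each pattern into its binary
# representation
# (signature inferred from the call umat(a, m) in paneldesc)
function matrix umat (matrix a, scalar m)
    matrix id = a[, 1]
    matrix wrk = id
    matrix ret = zeros(rows(a), m)
    loop i=1..m --quiet
        # peel off the lowest binary digit of every code at once
        matrix uno = wrk % 2
        matrix wrk = (wrk - uno) / 2
        matrix ret[,i] = uno
    endloop
    # first column: frequencies; then one 0/1 column per period
    return a[,2] ~ ret
end function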
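# take the binary pattern matrix and turn it into a ".1" string
# (signature inferred from the call printumat(u, sumc(a[,2]), limit))
function void printumat (matrix a, scalar N, scalar limit)
    r = rows(a)
    c = cols(a)
    string filler = ""
    loop i=1..c --quiet
        filler ~= "-"
    endloop

    # one line per pattern, with "." in place of each 0
    b = strsub(sprintf("%1.0f", a[1:limit,2:]), "0", ".")
    string l = ""
    scalar c = 0    # from here on, c is the cumulative percentage
    # header spacing follows the %8d%11.2f%8.2f widths used below
    printf "   Freq.    Percent    Cum. | Pattern\n"
    printf " ---------------------------+-%s\n", filler

    loop i=1..limit-1 --quiet
        getline(b, l)
        scalar n = a[i,1]
        f = 100 * n/N
        c += f
        printf "%8d%11.2f%8.2f | %s\n", n, f, c, l
    endloop

    # last row: either the final pattern or all the remaining ones
    if r == limit
        n = a[r,1]
        getline(b, l)
        l = " " ~ l
    else
        n = sumc(a[limit:,1])
        l = "(other patterns)"
    endif

    f = 100 * n/N
    c += f
    printf "%8d%11.2f%8.2f | %s\n", n, f, c, l
    printf " ---------------------------+-%s\n", filler
end function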
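# print selected quantiles of T_i, the number of valid observations
# per unit (signature inferred from the call in paneldesc)
function string quantiles (series o)
    matrix a = {o}
    scalar N = rows(a)
    a = msortby(a, 1)
    # positions of the minimum, the 5%...95% quantiles and the maximum
    matrix q = 1 ~ ceil(N*{0.05, 0.25, 0.5, 0.75, 0.95, 1})
    a = a[q]
    # header spacing chosen to line up with the %8d columns below
    l1 = "   min      5%     25%     50%     75%     95%     max"
    l2 = sprintf("%18s%8d\n", "", a')
    ret = sprintf("%s\n%s", l1, l2)
    print ret
    return ret
end function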
include paneldesc.gfn
open abdata.gdt
list X = n w
series pt = paneldesc(X, 10)
discrete pt
freq pt
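The returned series can also be used to select units by pattern. The
script below is a sketch under stated assumptions rather than part of
the package: the names code, T and allbits are illustrative, and it
relies on pmean() skipping missing values and on $pd giving the number
of time periods in a panel. It keeps only the units with no gaps in n
and w, whose code has all T binary digits set.

```hansl
include paneldesc.gfn
open abdata.gdt
list X = n w
series pt = paneldesc(X, 10)

# pt is non-missing only at each unit's first valid observation;
# pmean() skips missing values, so this copies the code to every row
series code = pmean(pt)

# a unit with no gaps has all T binary digits set, i.e. code 2^T - 1
scalar T = $pd
scalar allbits = 2^T - 1
smpl code == allbits --restrict
```

Any other pattern can be selected in the same way by comparing code
with the value that pattern takes in the formula given in the help
text.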