Filtering Ogren Data
This paper describes the filtering of data from John Ogren's measurements
(noaa.gov) of particle count and properties at three sites: Bondville, IL
(BND), an anthropogenically influenced area; Barrow, AK (BRW), a relatively
pristine site; and rural northern OK (SGP), an area occasionally subject to
anthropogenic pollution.
1. Why Filter?
It is immediately obvious that the data contains physically impossible
values. Examples include very low, even negative, measurements such as a
Bsp450 of -19.31 and a CPCcon of .2 (BRW 1998). Very low readings like these
are not real measurements and more likely reflect problems with equipment
design, operation, or maintenance. Though large outliers are less common, the
data also contains very high values, some several orders of magnitude larger
than expected. These values have no physical meaning, and measurements taken
in error can distort analysis. Instead, they alert us to the fact that the
data collection is flawed.
Our intention is to screen the data in such a way that extreme outliers and
irrational values are excluded from further analysis, while maintaining the
properties of the data as measured.
2. The Filter
The filter was developed using tables, graphs, and histograms. Each station
takes up to 17 measurements every minute. Though not completely uniform, all
three stations have a CPC count, Bap, Bsp450, Bsp550, Bsp700, Bbsp450,
Bbsp550, Bbsp700, and nephRH (except BND). The final filter uses three of
these measurements for screening.
10^-7 < Bap < 10^-4
10 < CPC < 10^5
10^-6 < Bsp550 < 10^-3
This allows three to four orders of magnitude for each checked measurement
(four for CPC, three for Bap and Bsp550) to account for differences in
monitoring sites, seasonal and natural variation, and anthropogenic influence.
The first check simply eliminates Bap values greater than 10^-4 or less than
10^-7. Bap was very prone to error, especially negative values, because of
the nature of the instrument. Its relatively high failure rate results in a
high number of data points being removed.
The other two checks are for CPCcon (CPCsam in SGP) less than 10
or greater than 10^5 and Bsp550 less than 10^-6 or greater than 10^-3. A bad value in either check results in the
entire line, or all data at that time, being deleted. In most cases, this is a relatively small
amount of data.
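
The three checks can be sketched in Python as follows. This is a minimal
sketch, not the code used for this analysis. The record layout (one dict per
line of data), the names BOUNDS, passes_filter, and apply_filter, and the use
of strict inequalities are assumptions made for illustration. It also assumes
that a failed Bap check drops the line in the same way as the other two
checks, and the sample values are made up.

```python
# Bounds taken from the filter above. The dict layout and all names here
# are illustrative assumptions, not part of the original analysis.
BOUNDS = {
    "Bap": (1e-7, 1e-4),
    "CPC": (10.0, 1e5),       # CPCcon at BND and BRW, CPCsam at SGP
    "Bsp550": (1e-6, 1e-3),
}


def passes_filter(record):
    """Return True if every checked value lies strictly inside its bounds."""
    for name, (low, high) in BOUNDS.items():
        value = record.get(name)
        # A missing value fails the check (an assumption of this sketch).
        if value is None or not (low < value < high):
            return False
    return True


def apply_filter(records):
    """Keep only the lines (records) that pass all three checks."""
    return [r for r in records if passes_filter(r)]


# Made-up sample lines, one per time step.
sample = [
    {"Bap": 2e-6, "CPC": 1500.0, "Bsp550": 3e-5},   # plausible line
    {"Bap": 2e-6, "CPC": 0.2, "Bsp550": 3e-5},      # CPC below 10
    {"Bap": 2e-6, "CPC": 1500.0, "Bsp550": -1e-6},  # negative Bsp550
]
kept = apply_filter(sample)
```

Only the first sample line survives, since one bad value is enough to delete
the whole line.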
3. Checking the Filter by Lognormal Distribution
As one might expect, most of the data is lognormally distributed. The
exceptions (wind speed, wind direction, and relative humidity (nephRH)) are
also expected.
The surprising distribution is that of Bap, which is skewed at every station,
with more values in the low range. This makes it difficult to check the
inclusiveness of the filter by whether it preserves the lognormal
distribution. Accordingly, the filter seems to truncate the tail caused by
low values. This is, of course, especially marked at BRW, where the values
tend to be lower. Moreover, a lognormal distribution ignores the often
substantial number of negative values, which would have further skewed this
distribution.
The highly irregular Bap distribution could reflect other influences on its
value, the nature of the instrument, or even the unreliability of this
measurement. Simply eliminating negative values truncates the tail at BRW, so
these measurements must reflect a constant error which might even offset the
measurements. Also at BRW, the filter does not significantly change the
lognormal properties of the CPC, though it is toward the lower end of the
distribution, as should be expected.
The Bsp550, though, is highly irregular. It fits a lognormal distribution
very poorly. It is very flat, and almost looks like two overlapping lognormal
distributions with a spike. This highly irregular shape was slightly
truncated by the filter, but it was also plagued by negative values and poor
distribution, so filtering the low values does not change the distribution
much.
At BND and SGP, the filter maintains the distribution, even the slightly
irregular Bap distribution, simply shifting it. This shows the filter made no
significant changes to the distribution at these stations.
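
One way to put a number on how lognormal a variable looks, before and after
filtering, is the skewness of its log-transformed values. A lognormal
variable has a symmetric distribution of logarithms, so this skewness should
be near zero, and a tail of low values makes it negative. The sketch below is
an illustration only; the distributions here were judged from histograms, and
the function name log_skewness and the sample values are hypothetical.

```python
import math
import statistics


def log_skewness(values):
    """Sample skewness of log10(values); near 0 for lognormal data.

    Non-positive values are skipped because their logarithm is undefined,
    which is the same limitation noted above for negative readings.
    """
    logs = [math.log10(v) for v in values if v > 0]
    mean = statistics.mean(logs)
    std = statistics.pstdev(logs)
    if std == 0:
        return 0.0  # all values identical: no skew to measure
    return sum(((x - mean) / std) ** 3 for x in logs) / len(logs)


# Made-up values: symmetric in log space, so the skewness is about 0.
symmetric = [1e-6, 1e-5, 1e-4]
skew_symmetric = log_skewness(symmetric)

# Made-up values with one very low reading: the low tail gives
# a negative skewness.
low_tail = [1e-9, 1e-5, 1e-5, 1e-5, 1e-5]
skew_low_tail = log_skewness(low_tail)
```

Comparing this value for the unfiltered and filtered data would show whether
the filter removes a tail or leaves the shape of the distribution alone.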
4. Checking the Filter by Percentage Removed
The restrictiveness of this filter can be partially analyzed by evaluating
the number of data points removed by it. I did this by using the most
complete data set (CPCcon for BND, BRW; OPC for SGP) and evaluating the
number of points removed from it as a percentage.
%removed  yr  station       comments
.01       97  SGP           1/7771 points
0         98  SGP
0         99  SGP
0         00  SGP
0         01  SGP
11        01  SGP(minute)   very low Bsp550 counts
1.8 (avg)
.02       97  BND           2/6902 points
.1        98  BND
15.7      98  BND(minute)
0         99  BND
14.5      99  BND(minute)
0         00  BND
13.6      00  BND(minute)
0         01  BND
9         01  BND(minute)
.024/13.2 (avg hour/min)
20        98  BRW           low CPC, some negative Bsp550
20.7      99  BRW           low CPC
18.1      00  BRW           low CPC, some negative Bsp550
20.9      01  BRW
29.2      01  BRW(minute)
21.7 (avg)
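
The percentages in the table can be reproduced from point counts with a few
lines of Python. This is a minimal sketch; the function name percent_removed
is hypothetical, and the averaging shown (a plain mean of the yearly
percentages) is an assumption, chosen because it matches the 13.2 listed for
the BND minute data.

```python
def percent_removed(n_before, n_after):
    """Percentage of points removed by the filter."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return 100.0 * (n_before - n_after) / n_before


# SGP 1997: 1 of 7771 points removed (from the table).
sgp_97 = percent_removed(7771, 7770)

# BND minute-data percentages from the table (98, 99, 00, 01).
bnd_minute = [15.7, 14.5, 13.6, 9.0]
bnd_minute_avg = sum(bnd_minute) / len(bnd_minute)
```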
SGP and BND:
It is obvious that the SGP and BND hourly data is barely filtered at all.
This makes the filtration itself not so useful there, but it does reveal the
minute data's significant weakness with respect to the hourly data. It
suggests that the variation on a shorter timescale is much larger, large
enough to give unrealistic values some of the time, even when the averages
(the hourly data) are realistic and fully within the filter. This short
scale variance can be attributed to instrumental noise or to short scale
cycles caused naturally or anthropogenically. A constant variation large
enough to cause values to fall outside the filter within one hour a
significant amount of the time (13.2% at BND) is very unlikely. If these
values were accurate, it follows that at least some of the hourly data would
be outside the filter. Since virtually none is, and a minute scale cycle is
unlikely, the measurements are probably in error. It can be assumed that a
certain amount will fall outside the filter for instrumentation reasons, such
as cleaning the Bap filter.
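
The effect described above, where minute values fall outside the filter while
the hourly average stays inside it, can be illustrated with a small sketch.
The values are made up, the function names are hypothetical, and the
assumption that an hourly value is a plain mean of 60 minute values is not
something stated in the data description.

```python
def fraction_outside(values, low, high):
    """Fraction of values not strictly inside (low, high)."""
    outside = sum(1 for v in values if not (low < v < high))
    return outside / len(values)


def hourly_means(minute_values, per_hour=60):
    """Average consecutive blocks of per_hour values (complete hours only)."""
    n_hours = len(minute_values) // per_hour
    return [
        sum(minute_values[h * per_hour:(h + 1) * per_hour]) / per_hour
        for h in range(n_hours)
    ]


# Made-up hour of Bsp550 minute data: mostly 2e-5, with 8 noisy readings
# below the lower bound of 1e-6 (four of them negative).
minute = [2e-5] * 52 + [-5e-6] * 4 + [5e-7] * 4
hourly = hourly_means(minute)

minute_frac = fraction_outside(minute, 1e-6, 1e-3)  # 8 of 60 minutes fail
hourly_frac = fraction_outside(hourly, 1e-6, 1e-3)  # the hourly mean passes
```

Here about 13% of the minute values are removed, yet the single hourly mean
stays well inside the Bsp550 bounds.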
BRW:
At Barrow, some data does need to be filtered. Barrow has a large number of
very low, even negative, Bsp550 readings. These results are obviously
invalid. Barrow also has up to 9.7% of CPC readings less than 10 (1998), with
an average of 3.75%. Since the filter consistently eliminates about 20% of
the data, this suggests that the very low readings are either true or caused
by a consistent instrumental error. This is also suggested by the lognormal
distribution of the unfiltered data, which seems to be truncated by the
arbitrary cutoff of 10 (becky/plots/brw/histhm13).
5. Using the Filter to Find Patterns
Filtered data can be used in a number of ways. One of the plots made after
filtration was a five year (four year for BRW) timeseries plot of each
criterion measurement on a log scale. BRW and SGP showed some local minimum
points toward the end of every fourth quarter. This is, of course, not enough
to suggest the distribution at that time is lower, but it does suggest that
the low points are concentrated at that time of year. This is not as clear
in BND, though it could easily be argued.
Other patterns are a close similarity of CPC and Bsp, though Bap is less
related.
6. The Structure Function
The structure function shows how closely related data points a certain
distance apart are. This distance is notated here as j:
str(j) = sum((x(i)-x(i+j))^2)/sum(x(i)^2)
where the sums run over the index i of the vector of measurements. The
structure function is useful for analyzing how this relation changes with
distance in time (lag). Also, it is never negative, and uniform data has a
structure function value of 0.
We found that applying the structure function to these variables resulted in
the same general pattern. The value starts small at small lag and then
approaches 1, probably because the data is bounded in a relatively small
range and is normalized by the norm of the vector in the denominator.
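
The structure function can be sketched in Python as below. This is a minimal
sketch rather than the code used for the analysis. It reads the numerator as
a sum of squared differences, the reading consistent with the statements that
the function is never negative and is 0 for uniform data. Summing the
denominator over the same range of i as the numerator is an assumption, and
the function name and sample values are hypothetical.

```python
def structure_function(x, j):
    """Normalized structure function of the sequence x at lag j.

    str(j) = sum((x[i] - x[i+j])^2) / sum(x[i]^2)

    Both sums use every i for which x[i+j] exists. Using the same range in
    the denominator is an assumption of this sketch.
    """
    if j < 0 or j >= len(x):
        raise ValueError("lag j must satisfy 0 <= j < len(x)")
    n = len(x) - j
    numerator = sum((x[i] - x[i + j]) ** 2 for i in range(n))
    denominator = sum(x[i] ** 2 for i in range(n))
    if denominator == 0:
        raise ValueError("all values are zero; structure function undefined")
    return numerator / denominator


# Uniform data: every difference is zero, so the value is 0.
uniform = structure_function([3.0, 3.0, 3.0, 3.0], 1)

# A steadily changing series (made up): points further apart differ more,
# so the value grows with the lag.
ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
lag1 = structure_function(ramp, 1)
lag2 = structure_function(ramp, 2)
```

Evaluating the function over a range of lags j gives the curve whose shape is
discussed for each variable.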
7. Noisiness of Data
This analysis of the structure function produced consistent results. The
noisiest variable (Bap) had a structure function that started very close to 1
and converged with a small slope, since the value for all lags was so close
to 1. When the minute data was analyzed too, it became clear that Bap values
are not closely related even at very small lags.
In contrast, every other variable, including all Bsp, Bbsp, and CPC values,
followed a standard form. The minimum structure function value was at a very
small lag. From there it increased monotonically (with some exceptions where
minute and hour data were overlaid) until it converged to 1. The data with
the lowest initial structure function value, and therefore the highest slope
as it approached 1, was the Bsp550. Its slope approached 1 also. The
magnitude of the structure function was at least an order of magnitude lower
than that of the CPC. This suggests a high amount of continuity in the Bsp550
data, and much less continuity in the Bap data. This is also consistent with
the skewed lognormal distribution and high percentage of irrational or
negative values.
[Figure panels: BRW, BND, SGP]