Filtering Ogren Data

 

   This paper examines data from John Ogren's measurements (noaa.gov) of particle count and properties at three sites: Bondville, IL (BND), an anthropogenically influenced area; Barrow, AK (BRW), a relatively pristine site; and rural northern OK (SGP), an area occasionally subject to anthropogenic pollution.

 

1.  Why Filter?

 

   It is immediately obvious that the data contains physically impossible values.  Examples include very low, even negative, measurements such as a Bsp450 of -19.31 and a CPCcon of .2 (BRW 1998).  Very low readings like these are not real measurements and more likely reflect problems with equipment design, operation, or maintenance.  Though large outliers are less common, the data also contains very high values, some several orders of magnitude larger than expected.  These values certainly have no physical meaning.  Instead, they alert us to the fact that data collection is inherently imperfect, and measurements taken in error can distort analysis.

    Our intention is to screen the data so that extreme outliers and irrational values are excluded from further analysis, while maintaining the properties of the data as measured.

 

2.  The Filter

 

   The filter was developed using tables, graphs, and histograms.  Each station takes up to 17 measurements every minute.  Though the stations are not completely uniform, all three record a CPC count, Bap, Bsp450, Bsp550, Bsp700, Bbsp450, Bbsp550, Bbsp700, and nephRH (except BND).  The final filter uses three of these measurements for screening.

 

      10^-7 < Bap    < 10^-4

      10    < CPC    < 10^5

      10^-6 < Bsp550 < 10^-3

 

This allows three to four orders of magnitude for each checked measurement (three for Bap and Bsp550, four for CPC) to account for differences in monitoring sites, seasonal and natural variation, and anthropogenic influence.

   The first check simply eliminates Bap values greater than 10^-4 or less than 10^-7.  Bap was very prone to error, especially negative values, because of the nature of the instrument.  Its relatively high failure rate results in a large number of data points being removed.

   The other two checks remove CPCcon (CPCsam in SGP) values less than 10 or greater than 10^5 and Bsp550 values less than 10^-6 or greater than 10^-3.  A bad value in either check results in the entire line, that is, all data at that time, being deleted.  In most cases, this is a relatively small amount of data.
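
   The three range checks can be sketched in Python as follows.  This is a minimal illustration, not the code used for this report.  The record layout (one dict per time stamp with keys "Bap", "CPC", and "Bsp550") and the function names are assumptions.  The sketch also drops the whole record when any of the three checks fails, although the text states this explicitly only for the CPC and Bsp550 checks.

```python
# Bounds from Section 2; each check is a strict inequality (low < value < high).
FILTER_BOUNDS = {
    "Bap": (1e-7, 1e-4),
    "CPC": (10.0, 1e5),       # CPCcon at BND and BRW, CPCsam at SGP
    "Bsp550": (1e-6, 1e-3),
}


def passes_filter(record, bounds=FILTER_BOUNDS):
    """Return True only if every checked value lies strictly inside its bounds.

    `record` is a hypothetical dict for one time stamp.  A missing or
    non-numeric value is treated as a failed check (an assumption).
    """
    for name, (low, high) in bounds.items():
        value = record.get(name)
        if not isinstance(value, (int, float)):
            return False
        if not (low < value < high):
            return False
    return True


def apply_filter(records, bounds=FILTER_BOUNDS):
    """Drop every record (the whole line) that fails any check.

    Assumption: a failed Bap check removes the whole line too.
    """
    return [r for r in records if passes_filter(r, bounds)]


# Hypothetical records, in the same units as the filter bounds.
records = [
    {"Bap": 2e-6, "CPC": 3500.0, "Bsp550": 4e-5},    # all values in range
    {"Bap": -1e-6, "CPC": 3500.0, "Bsp550": 4e-5},   # negative Bap
    {"Bap": 2e-6, "CPC": 0.2, "Bsp550": 4e-5},       # CPC below 10
]
kept = apply_filter(records)
print(len(kept))   # 1
```

   Keeping the bounds in one table makes it easy to try tighter or looser limits for a particular station without touching the filtering logic.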

 

3.  Checking the Filter by Lognormal Distribution

 

     As one might expect, most of the data is lognormally distributed.  The exceptions, wind speed, wind direction, and relative humidity (nephRH), are also expected.

     The surprising distribution is the universally skewed Bap distribution, which has more values in the low range at every station.  This makes it difficult to check the inclusiveness of the filter by whether it preserves the lognormal distribution.  Accordingly, the filter seems to truncate the tail caused by low values.  This is, of course, especially marked at BRW, where the values tend to be lower.  Moreover, a lognormal distribution ignores the often substantial number of negative values, which would have skewed the distribution further.  The highly irregular Bap distribution could reflect other influences on its value, the nature of the instrument, or even the unreliability of this measurement.  Simply eliminating negative values truncates the tail at BRW, so these measurements must reflect a constant error, which might even offset the measurements.  Also at BRW, the filter does not significantly change the lognormal properties of the CPC, though it acts on the lower end of the distribution, as should be expected.  The Bsp550, though, is highly irregular.  Its distribution is poorly lognormal: it is very flat and almost looks like two overlapping lognormal distributions with a spike.  The filter slightly truncated this irregular shape, but the distribution was already plagued by negative values and poor form, so filtering the low values does not change it much.

     At BND and SGP, the filter maintains the distribution, even the slightly irregular Bap distribution, simply shifting it.  This shows the filter made no significant changes to the distribution at these stations.
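
     The comparisons in this section were made by eye from histograms.  One hedged way to put a number on the same idea is the skewness of log10 of the positive values: a lognormal sample is roughly symmetric in log space, so its skewness is near 0, while a tail of low values makes it negative.  The sketch below is an illustration under that assumption, not the method used for this report.  The function name and sample values are hypothetical.

```python
import math


def log10_skewness(values):
    """Sample skewness of log10(values), using only the positive values.

    Non-positive values have no logarithm and are skipped, which is the
    same limitation the text notes for negative readings.
    """
    logs = [math.log10(v) for v in values if v > 0]
    n = len(logs)
    if n < 3:
        raise ValueError("need at least three positive values")
    mean = sum(logs) / n
    m2 = sum((x - mean) ** 2 for x in logs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in logs) / n   # third central moment
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical sample that is symmetric in log space.
symmetric = [1e-6, 1e-5, 1e-4]
# Hypothetical Bap-like sample with a tail of low values.
low_tail = [1e-8, 3e-8, 1e-7, 1e-6, 2e-6, 3e-6, 4e-6, 5e-6]

print(log10_skewness(symmetric))   # close to 0
print(log10_skewness(low_tail))    # negative: tail toward low values
```

     Comparing this number before and after filtering gives a rough indication of how much of the low tail the filter removes.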

 

 

4.  Checking the Filter by Percentage Removed

 

   The restrictiveness of this filter can be partially analyzed by counting the data points it removes.  I did this by taking the most complete data set (CPCcon for BND and BRW; OPC for SGP) and expressing the number of points removed from it as a percentage.

 

%removed   yr   station       comments

.01        97   SGP           1/7771 points
0          98   SGP
0          99   SGP
0          00   SGP
0          01   SGP
11         01   SGP(minute)   very low Bsp550 counts
1.8 (avg)

.02        97   BND           2/6902 points
.1         98   BND
15.7       98   BND(minute)
0          99   BND
14.5       99   BND(minute)
0          00   BND
13.6       00   BND(minute)
0          01   BND
9          01   BND(minute)
.024/13.2 (avg hour/min)

20         98   BRW           low CPC, some negative Bsp550
20.7       99   BRW           low CPC
18.1       00   BRW           low CPC, some negative Bsp550
20.9       01   BRW
29.2       01   BRW(minute)
21.7 (avg)
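
   Each entry in the table is the number of removed points divided by the number of points in the reference series, times 100.  The short sketch below reproduces that arithmetic for the two entries whose counts are given and for the BND minute-data average.  The function name is an assumption.

```python
def percent_removed(n_total, n_kept):
    """Percentage of points in the reference series removed by the filter."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * (n_total - n_kept) / n_total


# Counts quoted in the table: 1 of 7771 points removed (SGP 97, listed
# as .01) and 2 of 6902 points removed (BND 97, listed as .02).
print(round(percent_removed(7771, 7770), 3))   # 0.013
print(round(percent_removed(6902, 6900), 3))   # 0.029

# Mean of the four BND minute-data percentages listed in the table.
bnd_minute = [15.7, 14.5, 13.6, 9.0]
print(round(sum(bnd_minute) / len(bnd_minute), 1))   # 13.2
```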

 

SGP and BND:

It is obvious that the SGP and BND hourly data is barely filtered at all.  This makes the filtration itself less useful there, but it does reveal a significant weakness of the minute data relative to the hourly data.  It suggests that the variation on a shorter timescale is much larger, large enough to give unrealistic values some of the time even when the averages (the hourly data) are realistic and fully within the filter.  This short-scale variance can be attributed to instrumental noise or to short-scale cycles caused naturally or anthropogenically.  A persistent variation large enough to push values outside the filter within one hour a significant fraction of the time (13.2% at BND) is very unlikely.  If these values were accurate, it follows that at least some of the hourly data would be outside the filter.  Since virtually none is, and a minute-scale cycle is unlikely, the measurements are probably in error.  It can be assumed that a certain amount will fall outside the filter for instrumentation reasons, such as cleaning the Bap filter.
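
The arithmetic behind the difference between minute and hourly data can be illustrated with a small sketch.  The numbers are invented for illustration and are not taken from the station files.  One hour of minute readings contains a few values just below the lower Bsp550 bound, yet the hourly mean still passes the filter.

```python
LOW, HIGH = 1e-6, 1e-3   # Bsp550 bounds from Section 2


def fraction_outside(values, low=LOW, high=HIGH):
    """Fraction of values that fail the strict check low < value < high."""
    failed = sum(1 for v in values if not (low < v < high))
    return failed / len(values)


# One hypothetical hour of minute data: 54 in-range readings and
# 6 readings just below the lower bound.
minute_values = [2e-6] * 54 + [5e-7] * 6
hourly_mean = sum(minute_values) / len(minute_values)

print(fraction_outside(minute_values))   # 0.1
print(fraction_outside([hourly_mean]))   # 0.0
```

This shows only that averaging can hide short excursions.  It does not by itself say whether those excursions are real or instrumental.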

 

BRW:

    At Barrow, some data does need to be filtered.  Barrow has a large number of very low, even negative, Bsp550 readings.  These results are obviously invalid.  Barrow also has up to 9.7% of CPC readings less than 10 (1998), with an average of 3.75%.  Since the filter consistently eliminates about 20% of the data, this suggests that the very low readings are either true or caused by a consistent instrumental error.  This is also suggested by the lognormal distribution of the unfiltered data, which seems to be truncated by the arbitrary cutoff of 10 (becky/plots/brw/histhm13).

 

5.  Using the Filter to Find Patterns

 

     Filtered data can be used in a number of ways.  One of the plots made after filtration was a five-year (four-year for BRW) time series of each criterion measurement on a log scale.  BRW and SGP showed local minima toward the end of every fourth quarter.  This is, of course, not enough to show that the overall distribution is lower at that time, but it does suggest that the low points are concentrated in that time of year.  This is not as clear at BND, though it could easily be argued.

     Another pattern is a close similarity between CPC and Bsp, while Bap is less closely related.

 

6.  The Structure Function

 

     The structure function shows how closely related data points a certain distance apart are.  This distance is denoted here as j:

 

str(j) = sum( (x(i) - x(i+j))^2 ) / sum( x(i)^2 )

 

where the sums run over all indices i in the vector of measurements.  The structure function is useful for analyzing how this relationship changes with distance in time (lag).  Also, it is always nonnegative, and constant data has a structure function value of 0.

     We found that applying the structure function to these variables resulted in the same general pattern.  The value starts small at small lag and then approaches 1, probably because the data is bounded in a relatively small range and is normalized by the norm of the vector in the denominator.
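
     A minimal Python sketch of the structure function is given below, reading the numerator as the sum of squared differences between points j apart.  The function name and the handling of the ends of the series (both sums stop where x(i+j) no longer exists) are assumptions, since the text does not specify them.

```python
def structure_function(x, j):
    """Normalized structure function at lag j.

    Computes sum((x[i] - x[i+j])**2) / sum(x[i]**2), with both sums taken
    over the indices i for which x[i+j] exists (an assumption; the text
    does not say how the ends of the series are handled).
    """
    if j < 0 or j >= len(x):
        raise ValueError("lag must satisfy 0 <= j < len(x)")
    n = len(x) - j
    numerator = sum((x[i] - x[i + j]) ** 2 for i in range(n))
    denominator = sum(x[i] ** 2 for i in range(n))
    if denominator == 0:
        raise ValueError("series is identically zero over the summed range")
    return numerator / denominator


constant = [5.0] * 10                # constant data
ramp = [1.0, 2.0, 3.0, 4.0, 5.0]     # slowly varying data

print(structure_function(constant, 3))   # 0.0
print(structure_function(ramp, 1))       # 4/30
print(structure_function(ramp, 2))       # 12/14
```

     For the ramp, the value grows with lag because points farther apart differ more, which is the general behavior described above.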

 

7.  Noisiness of Data

 

     This analysis of the structure function produced consistent results.  The noisiest variable (Bap) had a structure function that started very close to 1 and converged with a small slope, since the value at all lags was so close to 1.  When the minute data was analyzed too, it became clear that Bap values are not closely related even at very small lags.

     In contrast, every other variable, including all Bsp values, Bbsp values, and CPC values, followed a standard form.  The minimum structure function value was at a very small lag.  From there it increased monotonically (with some exceptions where minute and hour data were overlaid) until it converged to 1.  The data with the lowest initial structure function value, and therefore the highest slope as it approached 1, was the Bsp550.  Its slope approached 1 also.  The magnitude of its structure function was at least an order of magnitude lower than that of the CPC.  This suggests a high degree of continuity in the Bsp550 data and much less continuity in the Bap data.  The latter is also consistent with Bap's skewed lognormal distribution and its high percentage of irrational or negative values.


[Figures for BRW, BND, and SGP]