EXCEL 2007 Lesson 2
January 25, 2011

Ecological and geological data are frequently quantitativeHow many salamanders live in Oak Openings?  How thick is the Oak Openings sand?  How many different species of fish have taken advantage of the Secor Dam removal?

Today's objectives: students will learn to view and characterize data sets to see how these measurements vary with time and how frequently data fall within certain ranges.  Students will use graphs to display data and interpret those graphs.  Students will be introduced to Excel array functions.

Note: DATA ARE PLURAL.  A single measurement is a datum.

Spreadsheets provide a handy way to manage, store, analyze and display data.  Lists of numbers are not particularly interesting except to accountants (see The Producers - movie or musical drama).  As scientists, we display numerical data graphically and use mathematical tools to analyze the data.

Today's data were kindly provided by Toledo's Channel 13 meteorologist Stan Stachak.  I imported the data to a database and used Access to generate the files you will display and interpret today.  Thank you, Stan.  There is nothing like real data to add interest to our efforts. 

If you open a data file in your browser, "save as" text (later on, we will used some really big files that will take all week to download as .html).  I will use April 1 data for my demonstration, you may download your data - see next paragraph.

Your data files were sent attached to a Message in Blackboard.  Log in at https://blackboard.utdl.edu/ and select EEES 2500 (of course!).  Click Tools and then Messages.  Save the file to your thumb drive.  This file is in text format, a format useful in moving tables from one application to another that might not recognize the file structure particular to that application.

Henceforth, please attach all assignments to Messages in Blackboard (and send them to me, of course!)  That way the assignments take up space on Blackboard's server instead of my Outlook Inbox, which is limited to 150 Mbytes.  I cannot send email if that mailbox is filled with assignments but there is no such problem in Blackboard.

Each record  consists of 3 fields: Year, High temperature, Low temperature, in degrees Fahrenheit.  Each of you has a different day of the year.  I will demonstrate these functions using data from April 1.  Your file name specifies which day of the year your data cover.  I have deleted month, day and precipitation from your file (but will share the whole works for any student who wants to use Stan's data for research).  How do I organize this sort of file?  Microsoft Access, a database application!  You will learn how to do that in a few weeks.

How can we look at these data?  Open the file with Excel <demonstrate>.  You may have to display additional file types in order to see the data file.

First, we might try plotting temperature as a function of year.  The graph looks kind of choppy - lots of spikes, both up and down.

What is the range of the data?  The 'range' is the difference between the maximum and the minimum values, usually specified by stating something like "The daily high on April 1 in Toledo, Ohio ranged from XX to YY degrees Farenheit between 1873 and 2009."  For a small data set, you might use 'data - sort' to rearrange the table.  <demonstrate>  Sorting tables can lead to problems, however, if you fail to put the table back as it was.  Excel has statistical functions Min and Max that look up minimum and maximum values from the selected cells.

Excel can also calculate the mean (average) value.  But is average enough?  Calculate also the standard deviation.  Check out my notes on statistics for the meaning of this term.  Other statistical terms sometimes useful in describing patterns in lists of values include mode (the most frequent value - probably not applicable here where values are expressed to several significant figures - you might use a 'round' function to round off numbers before seeking a mode) and median, meaning the central value (as many values greater than the median than less than).  

Politicians frequently use 'median' when they say 'average' - as Mark Twain said, there are lies, damned lies, and statistics.  So listen carefully and ask questions when speakers are using numbers to deceive or mislead.  We are in the midst of mid-term election campaigns - need I say more?

We will use an array function to sort these precipitation data into bins and then counting how many years rainfall was within the range of each bin.  The array function frequency requires 2 arguments: and array specifying the bins and an array of data values. <demonstrate - setting up bins, always from lowest value to highest, never highest to lowest - selecting the cells for bin counts - and now for the Top Secret Decoder Ring procedure.>

  1. Set up data and bins.
  2. Select 1 empty cell more than there are bins (for values too large for the bin).  A bin holds the number of data items up to and including the bin number that have not been counted already for bins holding smaller values.
  3. Click the f(x) button and select 'frequency'.
  4. Select the data array.
  5. Select the bins array.
  6. A number appears in the top cell of the 'frequency' column.  To complete the procedure
  7. Hit key F2 (Function Key 2)
  8. Press and hold down Ctrl - Shift - Enter (all three keys in that order, holding each down until all are pressed - then release all).

You should now see a number in each of the frequency function cells.

Note: changing a data value will change the count.  You cannot resize bins after executing the 'frequency' procedure, however.  If you need to do it over with different bins, you may use other blank parts of the spreadsheet or select and delete the bins and associated counts.

In the example below, the green cells are a series of values increasing top down.  The data included 1 year with high temperature 30 degrees or less and 2 years with high temperature between 75 and 80 degrees.

High Temp (degrees F) Frequency
30 1
35 9
40 14
45 20
50 27
55 19
60 21
65 10
70 9
75 5
80 2

Table 1: Frequency of high temperatures for April 1 between 1873 and 2009 in Toledo, Ohio.

To do: download and analyze your climate data day.  Use a column graph (vertical bars) to display the results of your 'frequency' count.  Do your results resemble a normal distribution

Try different bin sizes.  Note that each bin must represent a constant interval.  Please send me a short report that includes a graph showing precipitation as a function of calendar year and a graph showing frequency of annual precipitation falling with several quantity intervals.  In your report, tell me what the data are telling you - in other words, interpret the graphs.  Your report is due Friday at noon.

To do (2): You may use daily high, daily low, daily average (assumes daily average falls midway between the high and low) or all three and see how these temperatures have varied with time.  Note that you can chart several data sets on the same graph - you did this last week!  Plot high and low temperatures as a function of year (scatter plot, symbols only - no lines connecting so scattered a set of points, please!)

Dr.D.'s short notes on elementary statistics and probability.

Note that you can calculate a trend in the data by using the slope function.  Does it appear temperatures have been increasing, decreasing or show no change since records have been kept in Toledo?  Daily highs?  Daily lows?  Both?

----------------------------------------------------------------------------------------------------------------

Extra credit:  Are there periodic swings in temperature embedded in these data?  I've read of a 10-year sunspot cycle and a 30-year cycle in high and low stands in the Great Lakes.

Chart type: XY scatter plot (with 'year' as the independent variable).

Data can be smoothed using a running average.  A 5-point running average might take the average high temperatures for year i, year i-1, year i+1, year i+2 and year i-2, the average plotted as a function of year i.  When using a running average, always specify the length of your sample window.  So the running average plotted for 1940 would be the sum of temperatures for 1938, 1939, 1940, 1941 and 1942, divided by 5.  Of course, this average is plotted as a function of year for the entire data set, with care taken near each end.  A longer (15 point running average) window might reveal evidence of a 30-year cycle - a curious scientist looks at data several ways in order to discover which numerical analysis provides the most useful information.  A more cynical observer might say, if you look at any data set the right way, you can make it say what you want it to say (see Mark Twain's quote, above).  As a geophysicist, I'm well familiar with the joke whose punch line is, 'what do you want the answer to be?'

Click thumbnail to enlarge.  Note that the 3-point backward average generates the same numbers as the 3-point smoothing average that looks both ahead and back a year, only a year earlier.  Note also that the 'Average' function works at both ends of the data series even though it is not necessarily averaging 3 data points, as are those averages toward the center of the list.  These data are precipitation data from Los Angeles. 

Data can also be smoothed using a weighted average.  For example, I might take the temperature for year i and multiply it by 0.4, years i-1 and i+1 and multiply each of those by 0.2, and then years i-2 and i+2 and multiply each by 0.1, adding together the 5 products. 

2 rules for using weighted averages to smooth data: the sum of the coefficients (weighting amounts) must equal 1, and you must explain to the reader just what smoothing filter you used (in this case, 5-point moving average using (.1,.2,.4,.2,.1) as coefficients. 

Return to Syllabus

Hit Counter