if you have forgotten why you are here, read the original post in this thread:
http://www.30bananasaday.com/group/debunkingthechinastudycritics/forum/topics/bananas-stats-course
otherwise, here's what you can expect from this course:
1. learn enough statistics and study design to hold your own in conversation with anyone who throws numbers at you.
2. have the ability to conduct your own analyses so you don't have to take anyone else's word that this is what the data shows.
3. develop sufficient skills to acquire a technical grasp of the china study and other published items which use statistical analysis to make their point.
the primary course materials are the following:
1. the very well done carnegie-mellon open learning initiative statistics course (henceforth referred to as cmolis, not to be confused with cialis):
http://oli.web.cmu.edu/openlearning/forstudents/freecourses/statistics
this course is interactive, well-illustrated and even has video explanations.
2. supplementary exercises using actual data from the china study.
the bonus here is that you become familiar with this data by working on it directly.
3. statistical software
we recommend R because it is open source, gnu and excellent! you can learn how to get started with it by going here:
http://wiki.math.yorku.ca/index.php/R:_Getting_started
(but ask if you have any difficulties or have no clue what to do)
alternatively, you can use excel if you have forked out the money or pirated a copy and feel you must use it - the course does offer an excel stream as well.
if you have some other means of doing stats (eg scipy with python, spss or an abacus), there is nothing stopping you from using it, though you will obviously have to adapt the specific instructions to your software.
the routine will be as follows:
1. the module to be covered will be added under "bearings" beneath existing items in bold (see below).
2. you are to do the topic from cmolis and post any questions or interesting points of observation you may have in this thread where they will be addressed.
3. extra exercises using china study data will be presented for discussion and completion.
your course guides are:
1. normal though occasionally skewed
2. inviting of others to contribute and assist just for some variance
official start day is monday, july 26, 2010!
till then, you can still do some things ... see bearings below.
in friendship,
prad
======
bearings
======
get your software in place - ask for help if you have trouble
then work on what is below. note that bolded items show
the topic presently under discussion.
unit 1: Introduction
assignment #1
a solution in R
a solution in excel
unit 2: Exploratory Data Analysis (EDA)
Introduction
assignment #2a
module 1: Examining Distributions
Replies
(thanks to our anonymous expert R coder for providing excellent code)
1. summarize and create a histogram for the average number of live births per woman in the 1983 china data. describe the distribution. hint: this information is part of the questionnaire data. also, since we are interested in live births per woman, how should we subset the data?
*let’s first read in the data for 1983:
Q83data <- read.csv("CH83Q.CSV",as.is=TRUE,na.strings=".",strip.white=TRUE)
*now, since we’re interested in live births per woman, we need to subset our data differently than we have in the past, and include only females. So let’s subset our data:
Q83f <- subset(Q83data,Sex=='F' & Xiang==3)
*and now summarize the data:
summary(Q83f$Q192)
-- we get the following result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  3.000   4.000   4.300   4.359   4.700   6.900   5.000
this tells us that the minimum average number of live births among all counties was 3, and the maximum was 6.9 (the NA's column says 5 rows have no value recorded). the median is 4.3 and the mean is 4.4. since they are roughly the same, the distribution of average live births is probably close to symmetric, and may be close to normal, or “bell-shaped.”
*to check this, let’s plot our histogram (recall that xlab tells R the label for the x-axis, and similarly ylab tells R the label for the y-axis; note that hist puts counts, not percentages, on the y-axis by default, so we label it accordingly):
hist(Q83f$Q192,main="Distribution of average # livebirths (1983)",xlab="average # livebirths",ylab="frequency (# counties)")
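a side note on the y-axis: by default, hist draws counts (how many rows fall in each bar), not percentages. if a percent axis is wanted, one way is to rescale the counts before plotting. this is only a sketch on made-up numbers (the births vector below is simulated, not china study data):

```r
# simulated stand-in for county averages; NOT real china study values
set.seed(42)
births <- round(rnorm(60, mean = 4.3, sd = 0.6), 1)

# compute the histogram without drawing it
h <- hist(births, plot = FALSE)

# rescale the bar heights from counts to percent of all rows
h$counts <- 100 * h$counts / sum(h$counts)

# draw the rescaled histogram; the bars now add up to 100
plot(h, main = "Distribution of simulated averages",
     xlab = "average # livebirths", ylab = "percent (%)")
```

the same trick should work on Q83f$Q192 once any missing values are accounted for, since hist ignores them when counting.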
the distribution isn’t perfectly normal; it looks like it might be very slightly right-skewed (or positively skewed). the median is typically smaller than the mean for right-skewed distributions, because the long right tail pulls the mean upward, and indeed this is the case for our 1983 live birth data.
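to see why a long right tail tends to drag the mean above the median, here is a small sketch using simulated data (nothing here comes from the china study; the lognormal sample is just a convenient, strongly right-skewed example):

```r
# simulated right-skewed sample; the seed is fixed so the run is repeatable
set.seed(2010)
x <- rlnorm(10000, meanlog = 0, sdlog = 1)

mean(x)     # pulled upward by the few very large values
median(x)   # the middle value, barely affected by the tail

# for this strongly skewed sample the mean sits above the median
mean(x) > median(x)
```

the live birth data are far less skewed than this, which is why their mean and median are so close.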
2. summarize and create a histogram for the average number of live births per woman in the 1989 china data. describe the distribution.
*let’s repeat what we did for the 1983 data:
Q89data <- read.csv("CH89Q.CSV",as.is=TRUE,na.strings=".",strip.white=TRUE)
Q89f <- subset(Q89data,Sex=='F' & Xiang==3)
summary(Q89f$Q192)
hist(Q89f$Q192,main="Distribution of average # livebirths (1989)",xlab="average # livebirths",ylab="frequency (# counties)")
-- for the summary, we get the following result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.800   3.600   4.100   4.116   4.500   6.500
we see that the median and mean are almost identical and again, we suspect the data are roughly symmetric and possibly close to normally distributed.
-- when we look at our histogram, it indeed looks pretty “bell-shaped” but might still be a bit right-skewed.
3. describe the change in average number of live births per woman between 1983 and 1989. hint: use information on the mean, median, as well as the graphical figures to help you.
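-- we see that compared to 1983, there are, on average, fewer live births in 1989: the mean drops from 4.359 to 4.116 and the median from 4.3 to 4.1.
-- we also see that the mean and median sit closer together in 1989 than in 1983, so the 1989 data are slightly closer to being normally distributed than the 1983 data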
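the comparison can also be done as plain arithmetic on the two summaries. this sketch just retypes the median and mean printed by summary for each year (the variable names are made up for illustration):

```r
# values retyped from the summary() output for each year
s1983 <- c(median = 4.3, mean = 4.359)
s1989 <- c(median = 4.1, mean = 4.116)

# change from 1983 to 1989; negative numbers mean fewer live births
change <- s1989 - s1983
change

# gap between mean and median as a rough hint of right skew
gap_1983 <- s1983[["mean"]] - s1983[["median"]]
gap_1989 <- s1989[["mean"]] - s1989[["median"]]
c(gap_1983 = gap_1983, gap_1989 = gap_1989)
```

a smaller gap in 1989 is consistent with that year looking a bit more symmetric, though the gap alone is only a rough indicator and the plots should still be checked.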
bonus!
another way of looking at distributions is creating what is called "density plots." imagine a histogram with very, very (very, very) narrow bars. this might give us a pretty "jagged" looking histogram, which may be hard to interpret. so now imagine we "smooth" out the rough edges... and this is basically what a density plot is. let's just try a few lines of code:
par(mfrow=c(1,2))
plot(density(Q83f$Q192,bw=0.35,na.rm=TRUE),main="Density plot: 1983 livebirths",xlab="# livebirths")
plot(density(Q89f$Q192,bw=0.35,na.rm=TRUE),main="Density plot: 1989 livebirths",xlab="# livebirths")
check out the plots! we can see more clearly that both the 1983 and 1989 data for average livebirths are slightly right-skewed.
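to make the "narrow bars, then smooth" picture concrete, here is a sketch that draws a jagged histogram and lays the density curve on top of it. the data are simulated (a made-up, slightly right-skewed sample, not china study values), and bw=0.35 is simply reused from the code above:

```r
# simulated, slightly right-skewed stand-in for county averages
set.seed(1)
births <- rgamma(65, shape = 40, rate = 9.5)

# the smoothed curve; bw controls how much smoothing is applied
d <- density(births, bw = 0.35)

# many narrow bars; freq = FALSE puts the bars on the density scale
# so that the curve and the bars are directly comparable
hist(births, breaks = 30, freq = FALSE,
     main = "Narrow-bar histogram with density curve",
     xlab = "# livebirths (simulated)")
lines(d, lwd = 2)
```

a smaller bw gives a wigglier curve that follows the bars more closely, while a larger bw smooths more, so it is worth trying a few values.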
(all plots attached: key2a.pdf)
this is a 3 part module dealing with distributions of the two types of variables discussed in the introduction.
the first section introduces the idea of categorical variables whereas the second handles quantitative variables.
categorical variables are non-numerical in concept. they are essentially groupings into categories like apples and oranges (eg types of fruit), lending themselves nicely to displays like pie and bar charts.
quantitative variables, on the other hand, involve quantities that represent measurements such as the number of seeds different cantaloupes contain. there are various pictorial opportunities here such as histograms, stemplots and boxplots, as well as numerical analyses such as finding measures of central tendency (mean, median) and spread (variance, standard deviation). the concept of outliers comes into play as well: these are fringe results that can be puzzling, such as a cantaloupe having 1 seed or 30103 seeds (a prime example of palindromic excessiveness).
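for those who like to poke at numbers right away, here is a sketch of those measures in R on an invented batch of cantaloupe seed counts (all values are made up). the 1.5 x IQR fence used to flag outliers is a common rule of thumb, not the only possible definition:

```r
# invented seed counts for ten cantaloupes, including two oddballs
seeds <- c(412, 388, 450, 1, 397, 430, 405, 30103, 421, 399)

# central tendency: the mean is dragged far off by 30103, the median is not
mean(seeds)
median(seeds)

# spread
var(seeds)
sd(seeds)

# rule-of-thumb fences: 1.5 * IQR below the first quartile
# and 1.5 * IQR above the third quartile
q <- unname(quantile(seeds, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# values outside the fences are flagged as possible outliers
outliers <- seeds[seeds < lower | seeds > upper]
outliers

# a boxplot marks the same kind of points visually
boxplot(seeds, main = "Seed counts (invented)")
```

note how the median stays near the typical melon while the mean does not, which is one reason both are worth reporting.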
this is an important module forming the foundation for statistical mechanisms which will be utilized regularly!
in friendship,
prad
let's build on what we've learned from assignment 1, and on what you have covered in unit 2 (exploratory data analysis), module 1 (exploring distributions using graphs). the objectives for this assignment are to:
* plot histograms
* describe the distribution of a selected variable
* describe how that distribution might change over time
here are the tasks with some hints:
1. summarize and create a histogram for the average number of live births per woman in the 1983 china data. describe the distribution. hint: this information is part of the questionnaire data. also, since we are interested in live births per woman, how should we subset the data?
2. summarize and create a histogram for the average number of live births per woman in the 1989 china data. describe the distribution.
3. describe the change in average number of live births per woman between 1983 and 1989. hint: use information on the mean, median, as well as the graphical figures to help you.
hints for code
in assignment #1, we learned how to read in a CSV file containing data, summarize the data using the summary function, and create histograms using the hist function. we will use the same functions for this assignment.
one of our participants is also our acting "super master R coder" for this course and figured out a quirk of R: by default (in versions of R before 4.0), read.csv converts columns of character values into factors, which are stored internally as integer codes, unless we tell it not to (for this, R has been given a detention). so, to read in the data and make sure R keeps columns like "county" as plain character values, we should include as.is=TRUE when we use the function read.csv.
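here is a small sketch of that quirk on a made-up three-row csv (the column names and values are invented for illustration, not taken from the china study files). newer versions of R no longer convert to factors by default, so the first call asks for the old behaviour explicitly with stringsAsFactors=TRUE:

```r
# a tiny made-up csv; "." marks a missing value, as in na.strings="."
csv_text <- "county,Q192\nAA,4.3\nAB,.\nAC,3.9"

# old default behaviour, requested explicitly: text becomes a factor
as_factor <- read.csv(text = csv_text, na.strings = ".",
                      stringsAsFactors = TRUE)
class(as_factor$county)        # "factor"
as.integer(as_factor$county)   # the integer codes hiding underneath

# with as.is = TRUE the text column stays as plain character values
as_char <- read.csv(text = csv_text, na.strings = ".", as.is = TRUE)
class(as_char$county)          # "character"

# the "." was read as a missing value, so Q192 is still numeric
is.na(as_char$Q192)
class(as_char$Q192)
```

either way the numeric column is untouched; as.is=TRUE only changes how text columns are handled.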
things will be summarized in the original post and linked into the discussion so it should be easy to find things.
it is easy to pick up on the course anytime and go at your own pace too - so join in whenever you can.
in friendship,
prad
i was the one who helped to get you out of there!
vm puts you in, i help you get out and you complain about me!!
excellent!
i rather like that actually - helps to establish my fearsome reputation further (as in the good old days)!!
my horns are a-tingle!
in fiendship,
prad
Ok I trust you prad. No fear here. Sorry. Haha
however, it is always beneficial for you to complain about me because doing so helps maintain my fearsome reputation - without my having to actually do much. besides, right now you're the only student brave enough to post here that vm and i have to pick on. the others are hiding. :D
in friendship,
prad