as seen at stata; click to see larger pic |
Anyway, there were no citations! so I decided to go off and investigate on my own and treat it as a data visualization exercise (hi, upcoming IAP / startlabs / c2c / metrify).
EDIT: Derp! I data-failed. You can see civil/environmental engineering (course 1 -- see http://web.mit.edu/catalog/subjects.html if you're confused by the numbers) is definitely off. : / messed up copying and pasting somewhere. Will redo this for realz one day (they have data for multiple years, I want to clean it up automatically).
sorted by % |
sorted by course number |
http://www.quora.com/MIT-1/What-is-the-gender-ratio-for-each-of-the-majors-at-MIT
data source
Here's the data I used (I should just email them for the csv's, but I thought this would be a simple data scraping exercise):
http://web.mit.edu/registrar/stats/gender/index.html
http://web.mit.edu/registrar/stats/yrpts/index.html
What tripped me up is that the women vs. everyone listing of majors is ever so slightly off, and I have to decide how many distinctions I want to make. Do I really care about 2 vs 2A? not really. If I care about 6-1 vs 6-2 vs 6-3, do I just drop 6-7?
assorted google spreadsheets of data from those sources
cleaned data for charts
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdHFNZU5mbEdYU2VIOE1BUTRLZHJJVkE
raw data sorted for female / undergards
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFg0Vk14RW9JQjJtSFFsMXJBZ2hnSEE
raw scraped html table data
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFYzNUdEN2NKM2lGbHV6OVN4YkJNTmc
where i tried to (see script editor) automagically clean up the data and failed:
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFRkSXp5SW9teEJiT0t5U2FJeTBUVnc&hl=en_US#gid=0
the
google apps code
I used for that last gdoc
function loadImport() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getActiveSheet();
var data_sskey = "0Atr0HZeoMbmgdFYzNUdEN2NKM2lGbHV6OVN4YkJNTmc";
var row_init = "13";
var col_ug = "G";
var col_grad = "M";
var col_total = "O";
var col_courseName = "A";
var col_courseNumber = "Q";
var cols = [col_ug, col_grad, col_total];
//create arrayformulas for copying selected data over
//e.g. =ImportRange("0Atr0HZeoMbmgdFYzNUdEN2NKM2lGbHV6OVN4YkJNTmc","Women!G13:G")
//Add Course Names
var finalRanges = [ '=ImportRange("' + data_sskey + '","Women!' + col_courseName + "13:" + col_courseName+ '")'];
//Add Enrollment Data in alternating female / all
for (var i = 0; i < cols.length; i++){
femRange = '=ImportRange("' + data_sskey + '","Women!' + cols[i] + "13:" + cols[i]+ '")';
finalRanges.push(femRange);
everyoneRange = '=ImportRange("' + data_sskey + '","Everyone!' + cols[i] + "13:" + cols[i]+ '")';
finalRanges.push(everyoneRange);
}
//Add Course Numbers
finalRanges.push('=ImportRange("' + data_sskey + '","Women!' + col_courseNumber + "13:" + col_courseNumber+ '")');
// Sets the column header names to match copied data.
//sheet.getRange("A1").setValue(range);
var columnNames = ['Course Name', 'Fem UG', 'All UG', 'Fem Grad', 'All Grad', 'Fem Total', 'All Total', 'Course Number'];
var headersRange = sheet.getRange(1,1,1,columnNames.length);
headersRange.setValues([columnNames]);
//var dataRange = sheet.getRange(2,1,1,columnNames.length);
for (var j = 0; j < columnNames.length; j++){
sheet.getRange(2, j+1).setValue(finalRanges[j]);
}
}
//method getRange(row, column, optNumRows, optNumColumns)
Google Docs Scripting Tutorials
An excellent step-by-step free tutorial on ImportXML (similar to importHTML that I used), how I heard of xpath (which I didn't end up need to use due to importHTML magic): http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
other useful tutorials I referred to
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
http://blog.fastfedora.com/2011/08/how-to-use-open-apis.html *shudders* speaking of government apis, I really should be less terrible and fix my github codeforamerica repos which I left half-finished. AUGH. I FAIL. This is why I haven't applied to any jobs yet. Upcoming: post on why you shouldn't hire me (with the caveat that I feel okay listing these because these are past mistakes that I don't intend to make again).
http://blog.ouseful.info/2010/02/15/creating-a-winter-olympics-2010-medal-map-in-google-spreadsheets/
eeee shiny http://pipes.yahoo.com/pipes/
I also attempted to use google fusion tables but did not have a good experience.
http://code.google.com/googleapps/appsscript/articles/writing_spreadsheet_data.html
http://code.google.com/googleapps/appsscript/service_spreadsheet.html
To investigate: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
No comments:
Post a Comment