Archive for the ‘Statistical Analysis’ Category



In Medical Musings, Statistical Analysis, Statistics on June 14, 2011 by David

I just started playing around with Protovis. WordPress doesn’t allow JavaScript, so I will just post some .png images of the graphs I’ve made. It’s a shame, because some of them are interactive. Oh well!



Professionals at Work

In Statistical Analysis, Statistics on November 29, 2010 by David

As the internet becomes a larger and larger part of our everyday lives, our habits in front of the computer become increasingly important. Just fifty years ago, who would have thought that entire cottage industries would be devoted to ergonomics and our comfort in an increasingly sedentary lifestyle? As Hulu comes to bridge the divide between computer and television in one entertainment platform, and the internet becomes a more and more vibrant and engaging place, we will undoubtedly reexamine the evolution of the internet and the browser platform. Just as important as physical comfort, even slight gains in efficiency in browser/internet use will be compounded over greater and greater amounts of time in front of the desktop. I attempt to examine the relationship between self-reported proficiency, amount of time spent in front of the computer, and extension use in Firefox.

Mozilla Labs and the Metrics Team, together with the growing Mozilla Research initiative, are hosting an Open Data Visualization Competition based on Test Pilot data. This is my second attempt after sorting through demographic, use, and survey data. First, self-reported computer proficiency is not normally distributed. Many more individuals selected a high proficiency on the internet than selected a low one. Although I believe this is definitely a biased population (the individuals opted in, implying they understand how to opt in and care about the development of Firefox), I also think there is truth in the saying “90% of college graduates believe they graduated in the top half of their class”. The responses differ negligibly by the amount of time spent using Firefox or by age. When faceted both by age of the respondent and self-reported number of hours in front of the computer, the distribution remains roughly the same. However, there are substantive differences in terms of use of extensions and the types of activities performed; perhaps enthusiasm is more important than raw amount of time.


First, individuals self-identifying as more proficient are also more likely to use a greater number of Firefox extensions. This is to be expected, as Firefox prides itself on having a vibrant sandbox of developers and a large array of functional and entertaining extensions. I would guess that more proficient users take the time to create a more customized and optimized experience with the internet, and this begins with browser extensions. I would also venture to guess that at least a portion of the sample includes web developers and individuals intimately involved in the development of Firefox (Who needs 100+ extensions?!?) and they would self-identify as professional users.


The internet is not a one-way mirror. As we use the internet, the internet changes and influences our behaviors. There is also a strong positive relationship between self-reported proficiency and the number of activities one performs on the internet. One of the survey questions asks how individuals use the internet, binned into large categories of activities ranging from social networks to online shopping. As users become more proficient, perhaps they discover a greater variety of functions that can be performed online.


Finally, more frequent users use the internet for a greater number of reasons. This is intuitive: to use the internet for more functions, more time must be spent in front of the computer. As entrepreneurs continue to discover and create new uses for the internet, inevitably more and more of our lives will be tied to our online personas and internet use. As such, the browser experience is vitally important. (Carry on Mozilla!)


Further Exploration of Private Browsing

In Statistical Analysis, Statistics on November 27, 2010 by David

Mozilla Labs and the Metrics Team, together with the growing Mozilla Research initiative, are hosting an Open Data Visualization Competition based on Test Pilot data. I really enjoy reading their blog posts, and now that they’ve opened up their dataset, I wanted to have a go at it. On the Mozilla website, there is an option to enroll in a data-collection study on how individuals use their browsers. In addition to usage statistics such as how many tabs are open and how frequently they use their browser, there is a survey of demographics and self-described interests.

There was a really interesting blog post on how people use Private Browsing Mode based on usage data. I wanted to see if I could go one step further by testing their conclusions and cross-referencing individual usage data with survey responses. I was able to confirm some of their conclusions, such as the fact that most people spend about 10 minutes in Private Browsing Mode, but because location data was stripped from the dataset, I was not able to verify the spikes according to time of day. In PDT, it looks like there is a large peak throughout the afternoon, but this is probably skewed by different numbers of users in each time zone.

A Greater Proportion of Male Users than Female Users Use Private Browsing Mode

The dataset is skewed towards males (94% of the sample population is male), but on most metrics there is gender parity. Males and females use roughly the same number of extensions, have similar age distributions, and report very similar numbers of hours in front of the computer. Females do seem more modest in self-reporting proficiency with a computer. The most striking difference was in the use of Private Browsing Mode, with an almost four-fold higher proportion of males using it. Further statistical analysis based on gender, of either the duration or the frequency of use of Private Browsing Mode, seems suspect due to the small sample size.

Younger People Tend to Spend More Time in Private Browsing Mode

In addition to gender, there appears to be a slight, admittedly weak, relationship between the age of the individual and the average time spent in Private Browsing Mode. The data is colored based on gender, with blue for males and red for females. There appears to be a slight bump in the 18 to 25 age category, although this could be due to differences in sample size across different ages. Note: This plot covers only individuals who use Private Browsing Mode. If examining the population at large, there would be a ton of data points with a duration of 0.

Self-Identification Affects Private Mode Usage

Question 12 of the survey posed the question “What are your most frequently visited websites?”. The survey allowed for a variety of responses ranging from “Search engines” and “Social networking sites” to “Adult pages” and “Gambling and online betting”. I was curious whether this self-characterization would be a good metric to identify individuals who use Private Browsing Mode more often. I was able to separate the survey responders based on whether they chose each of the different website categories. For example, I subsetted the entire survey into individuals who chose “Social networking sites” vs. individuals who did not choose “Social networking sites”. A priori, if this self-identification did not matter, there should be little to no difference in the average time in Private Browsing Mode between the two populations.

For each category, here is the absolute difference in the two populations.

[1] 5.594305
[1] 3.84167
[1] 0.2490019
[1] 0.7180658
[1] 3.601866
[1] 0.3737534
[1] 1.408243
[1] 1.525824
[1] 3.27257
[1] 6.79226
[1] 4.651107
[1] 5.669116
[1] 1.027911

The smallest differences were between individuals who do and do not claim to use the internet for “News sites”, “Social networking sites”, and “Shopping”, while there appears to be a bigger difference for “Forums”, “Adult pages”, and “Gambling and on-line betting”. There appears to be a noticeable difference in usage between individuals who selected any of the 3 “riskiest” categories and individuals who did not.
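
The per-category comparison can be sketched roughly as follows in Python. This is only an illustration: the record layout (a set of chosen categories plus an average number of minutes in Private Browsing Mode for each respondent) and all the values are hypothetical stand-ins, not the Test Pilot data, and the original analysis was presumably done in R.

```python
from statistics import mean

# Hypothetical respondents: categories chosen in Question 12 and
# average minutes in Private Browsing Mode (made-up values).
respondents = [
    {"categories": {"Search engines", "Forums"}, "minutes": 14.0},
    {"categories": {"Search engines", "News sites"}, "minutes": 9.0},
    {"categories": {"Social networking sites"}, "minutes": 10.0},
    {"categories": {"Forums", "News sites"}, "minutes": 16.0},
    {"categories": {"Shopping"}, "minutes": 11.0},
]


def abs_mean_difference(rows, category):
    """Absolute difference in mean minutes between respondents who
    chose `category` and respondents who did not."""
    chose = [r["minutes"] for r in rows if category in r["categories"]]
    did_not = [r["minutes"] for r in rows if category not in r["categories"]]
    if not chose or not did_not:
        return None  # one group is empty, so there is nothing to compare
    return abs(mean(chose) - mean(did_not))


forums_gap = abs_mean_difference(respondents, "Forums")
news_gap = abs_mean_difference(respondents, "News sites")
print(forums_gap, news_gap)
```

A larger value for a category means that choosing it separates the two groups more strongly on this one metric; it says nothing about whether the gap is larger than chance would produce.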


US Half Marathon Statistics

In Statistical Analysis, Statistics on November 8, 2010 by David

Yesterday, I ran the US Half Marathon. It was my first half marathon, and I woke up at 4:30 AM to abysmal dark and pouring rain. Getting there around 5:30 AM, I waited outside (the nearby Starbucks didn’t open till 6 AM), and wondered whether I should really be attempting this. By 6:30 AM, I had met a few other runners waiting at Starbucks, and their energetic reassurances convinced me to go for it. All in all, it was a really good experience, and I was happy that I ran. I finished in 2:17:26, but by mile 11, the region around my right lateral epicondyle started hurting (the bony structure lower and to the outside of your knee). Try as I might, I couldn’t run anymore, and walked the final two miles. I will write more about my experience, but I wanted to share my statistical analysis of the results of the race.

Relationship between Bib Number and the Runner's Finish Time


Later that day, I went to check out the posted results online. The results page had an easy-to-scrape PHP/JavaScript database, so I downloaded the information and did a brief statistical analysis of the results. I graphed the relationships between gender, age, bib number, and finishing time using R and ggplot2. Some results are below:

  • I registered for the half marathon using Groupon, and all in all, 1298 people used Groupon for this event. This was approximately three weeks before the half marathon. I wondered if you could tell the Groupon participants from the regular, hypothetically more hardcore, racers. I assumed that the bib numbers (the bib is the paper you pin to yourself) were assigned based on registration, i.e., earlier registrants received lower numbers. Either that assumption is wrong or the late registrants are not slower, as participants with high bib numbers do not look slower than participants with low bib numbers. Rather, people with higher bib numbers seem to be slightly faster than people with lower bib numbers.
  • I also heard that they reserved a range of bib numbers for professionals, relatively famous marathoners, or people who are extra dedicated. This does not seem to be the case as I could not determine any clusters of bib numbers in the fast runners. The winner seems to have a bib number in the middle of the range.
Boxplot of the Relationship between Runner's Age and Finish Time (in Seconds)


  • There appears to be a positive correlation between finishing time (in seconds) and age. The older you are, the more likely you are to be slower. Makes sense.
Distribution of Runner's Home Towns Across US


  • The vast majority of participants are from California. Note that the dots are on a log scale: of around 2.7k participants, maybe 2.5k were from California.
Different Divisions of Runners Represented


Gender Ratio of US Half Marathon Runners


  • I examined the distribution of gender and age of the participants. There were more females than males running the race, and people were divided into divisions according to both age and gender. For example, M2029 would be males from 20 to 29 y/o.
Distribution of Times By Gender


  • Men are faster than women at this 13.1 mile race. The first 20 or so finishers were all men. Not sure if statistically significant.
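
One way to check whether the gender gap in finish times is statistically significant is a permutation test on the difference in means. The Python sketch below is not part of the original analysis; the function name and the finish times are made up for illustration, and the scraped results would take their place.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation test on the difference in mean finish time."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Shuffle the labels: any split of the pooled times is equally
        # likely if gender has no effect on finish time.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Made-up finish times in seconds, for illustration only.
men = [5400, 5900, 6100, 6600, 7000, 7300]
women = [6300, 6800, 7100, 7500, 7900, 8200]
p_value = permutation_p_value(men, women)
print(f"p = {p_value:.3f}")
```

A small p-value would suggest the observed gap is unlikely under random relabeling. With roughly 2.7k real finishers, the shuffle count and runtime may need adjusting.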

In summary, this was an interesting experience and it was nice to see that results were easily accessible online. I did not find any very striking relationships or hidden patterns in the results, but it is interesting to examine the relationships between the different demographics and running times.


Visualizing My Budget

In Statistical Analysis, Statistics on July 31, 2010 by David

Given an itemized budget, saved in a spreadsheet as such:

You can easily display a picture of how much money you are allocating in each category with a little help from R and ggplot2.

The code is as follows:

# Load ggplot2 for plotting
library(ggplot2)
# Read in budget file (the code below uses its Item and Cost columns)
budget <- read.csv("budget.csv", stringsAsFactors = FALSE)
# Plot pie chart: a filled bar wrapped into polar coordinates
# (stat = "identity" uses Cost directly as the bar height)
ggplot(budget, aes(x = factor(1), y = Cost, fill = factor(reorder(Item, -Cost)))) +
  geom_bar(stat = "identity", width = .9, position = "fill", colour = "black") +
  coord_polar(theta = "y") +
  scale_fill_hue(l = 70, c = 150)
# Plot stacked bar graph
ggplot(budget, aes(x = factor(1), y = Cost, fill = factor(reorder(Item, -Cost)))) +
  geom_bar(stat = "identity", width = .9, position = "fill", colour = "black") +
  scale_fill_hue(l = 70, c = 150)


Investing for the Future

In Statistical Analysis, Statistics on July 24, 2010 by David

This past spring, I took a class on statistical finance at Rice. It’s mindboggling how many different ways people seek to manipulate the market, perform arbitrage, or otherwise “game” the system. I guess in a way, that’s to be expected with large sums of money at stake, low cost of entry, and what appears to be a clear relationship between insight and profit. The mentality also seems like, “If I can outsmart the market, why shouldn’t I? Someone else would most certainly do it if they had the opportunity and would get an edge over me.”

That said, I think the stock market, options, and futures are full of fascinating economic problems with creative challenges and interesting questions. Stock markets are rather a black box: a lot of the approaches don’t really attempt to understand the underlying mechanisms behind fluctuations or why stock prices change, but rather seek to capture momentum or short-term movements to get a quick profit. Personally, I think there are serious problems with the idea of stocks. I don’t think a company can have constant changes in value when it’s the same company at either price A or price B. This makes it really prone to speculation, and you are always in need of more players to enter the market to keep the bubble going.

Anyways, with those caveats described, I just wanted to describe an interesting investing strategy that I learned in the class, called the MaxMedian Portfolio. The idea is simple. Each year, you look at the stocks in the S&P500, and choose the top stocks with regard to median daily return from the previous year. You hold the stocks for a year, and then rebalance the following year. This strategy seeks to capture momentum: if a company is on the upswing, chances are it’ll keep going up. There’s more inertia going up than coming down (remember my previous statement about bubbles? Generally, the majority of the players in the market want prices to keep going up.)
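
The selection step can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the class: the function names, the tickers, and the prices are all made up, and the number of stocks to hold (`top_n`) is left as a parameter.

```python
from statistics import median


def daily_returns(prices):
    """Simple daily returns from a list of consecutive closing prices."""
    return [(today - prev) / prev for prev, today in zip(prices, prices[1:])]


def max_median_portfolio(price_history, top_n):
    """Tickers with the highest median daily return, best first."""
    medians = {
        ticker: median(daily_returns(prices))
        for ticker, prices in price_history.items()
        if len(prices) >= 2  # need at least two prices for one return
    }
    return sorted(medians, key=medians.get, reverse=True)[:top_n]


# Hypothetical tickers and closing prices for the previous year (made up).
last_year = {
    "AAA": [100, 101, 102, 103, 104],
    "BBB": [50, 49, 50, 49, 50],
    "CCC": [20, 21, 20, 22, 23],
}
picks = max_median_portfolio(last_year, top_n=2)
print(picks)
```

Ranking by the median rather than the mean makes the choice less sensitive to a handful of extreme days. The picks would then be held for a year before the ranking is recomputed.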

The interesting thing is that the portfolio performs really well. If we simulate running this strategy for the past 20 years, we would have made back 25 times our initial investment. This is much better than the 5-fold return from investing in either the Dow Jones Industrial Average (DJIA) or the Standard and Poor’s 500 (S&P500). It does appear that this kind of strategy just amplifies the overall movement of the market. Between 2001 and 2003, with the bubble bursting, this strategy lost more money than investing in either of the indexes. I think this suggests that if you think the market will continue to grow, it might be good to try this portfolio selection strategy.

To make sure this is not due to random chance alone, I simulated what would happen if I randomly chose 20 stocks from the S&P500 each year and followed a strategy of rebalancing each year. The following graph is a distribution of results from following this random strategy. Although it is possible to perform better than the MaxMedian strategy, the MaxMedian strategy is at the 75th percentile. I take it to mean this is better than random chance, although this random strategy ironically performs better than the market. Perhaps it’s because the stocks are chosen from the S&P500, which are generally larger and more established companies.
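
A rough Python sketch of such a random baseline is below. It assumes an equally weighted portfolio that is re-drawn every year; the tiny universe of tickers, the annual returns, and the `strategy_result` placeholder are all made up, so only the mechanics carry over to the real S&P500 data.

```python
import random


def yearly_growth(picks, annual_returns):
    """Growth factor of an equally weighted portfolio held for one year."""
    return 1 + sum(annual_returns[t] for t in picks) / len(picks)


def simulate_random_strategy(returns_by_year, n_stocks, n_trials, seed=0):
    """Final growth multiples from re-drawing n_stocks at random each year."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        wealth = 1.0
        for annual_returns in returns_by_year:
            picks = rng.sample(sorted(annual_returns), n_stocks)
            wealth *= yearly_growth(picks, annual_returns)
        results.append(wealth)
    return results


def percentile_rank(value, samples):
    """Percentage of samples that are less than or equal to value."""
    return 100 * sum(s <= value for s in samples) / len(samples)


# Made-up annual returns for a tiny hypothetical universe, two years.
returns_by_year = [
    {"AAA": 0.10, "BBB": -0.05, "CCC": 0.20, "DDD": 0.02},
    {"AAA": -0.03, "BBB": 0.08, "CCC": 0.15, "DDD": 0.01},
]
random_results = simulate_random_strategy(returns_by_year, n_stocks=2, n_trials=1000)
strategy_result = 1.20  # placeholder multiple for the strategy being compared
print(percentile_rank(strategy_result, random_results))
```

The percentile rank of the strategy's final multiple among the random draws is what a statement like "at the 75th percentile" summarizes.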

Anyways, I’ve just begun to dabble and put my money where my mouth is. The top performers for the past year have been AIG, AIV, GNW, TIE, and GCI. I avoided AIG, because it really seems like it was a strong beneficiary of the bail-out and all the crap, but I put equal amounts of money into the other four stocks. So far, they seem to be doing well, but we’ll see how it is in a year or so!

The code to import past year’s data from Yahoo Finance, generate median daily returns, and sort by returns is posted at: So is the code for analyzing the past historical data and simulation based on choosing random stocks. Code is in R and python. Enjoy!


When To Walk Away From a Mortgage

In Statistical Analysis, Statistics on January 8, 2010 by David

Creative Commons

Summary: The financial repercussions of a foreclosure on a high-end mortgage are estimated to be $21,000. When your residential property is worth $21,000 less than your mortgage, it might be a rational choice to walk away from the mortgage.

Background: With the recent economic downturn, there are a significant number of people with mortgages that are worth more than their house. Particularly in the western states, there are a number of overvalued mortgages due to local housing bubbles and the sliding prices and volume of housing sales. Because these homeowners bought their house during a housing bubble and the house is now worth much less than what is being paid to the mortgage company, they are ‘underwater’. 65% of residential property mortgages in Nevada are underwater.

My question now is: at what point is it rational to walk away? With the overreaching and speculative lending of large banks, there are a significant number of mortgages held by individuals without the ability to afford them. At that point, with an overwhelming mortgage, would it be more financially viable to try and start over or to continue walking the harrowing tightrope?

My question is not a question of morality or ethics (this question is approached in this New York Times article), but about the practical implications if one chooses to walk away from a mortgage. In other words, how far underwater does a property need to get for it to be a rational choice to walk away? Obviously, if your house suddenly becomes relatively worthless, it does not make sense to keep paying your mortgage payments, but when should the homeowner decide it is worth the lowered credit score to be foreclosed on the mortgage?

Put another way, what is the value of the credit score? The value of a credit score, or FICO score, depends on the amount of debt one wants to hold after default (as it determines the amount of interest one would need to pay) and whether you would jump from one tranche or subdivision to the next. (If a bank offers certain loan conditions to individuals with scores of 650 – 720, a score drop from 716 to 655 does not make much of a difference, but a drop from 716 to 649 would make a world of difference). I am interested in getting a ballpark estimate of the cost of having a low credit score.


1. Estimate how much the FICO/credit score drops as a result of the foreclosure.

2. Calculate the present value of the difference in interest payments. The assumption is that once you walk away from your current mortgage, you will obtain another house at a lower price (although the mortgage would be on worse terms).


1. The financial repercussions will be limited to approximately 7 years. Line items on your credit report remain on your record for only 7 years, and cannot be considered after that timeframe. There is also a public record, but it can also be removed after 7 years if requested. So the difference should be the present value of the difference in interest for 7 years. After that time, one can refinance or approach different financial solutions.

2. Some websites overestimate the value of the FICO score (credit score) because they assume the difference persists for the lifetime of a 30-year loan. Because it is very reasonable, readily available, and in your best interests, we will assume you will refinance after your credit score improves in 7 years. This website estimates a 25-point difference to be worth $31,002, which is the difference between paying $1,189/month vs. $1,275/month for 30 years. But this also gives us an upper ballpark, because the previous example (1) chooses a credit drop that directly drops the borrower from one tranche to the next, (2) chooses a $200,000 30-year fixed-rate mortgage, a relatively large loan for the average American consumer, and finally (3) does not discount for the time-value of money. Taking into account the ability to refinance after 7 years and the time-value of money (with a discount rate of 0.50), the value of monthly payments of $86 over seven years is $5,886.96. (This is the upper ballpark for a 25-point drop in credit score).

3. There are limited other financial repercussions. Articles also mention differences in cost with respect to insurance, but I hypothesize that this would have a minimal impact compared to the difference in mortgage payments. As this is simply a rough ballpark, I will add an extra 5% to cover any additional, unforeseen credit repercussions.



  1. The impact on FICO score will be estimated to be 150 points.

The foreclosure’s actual point impact on an individual’s credit report is estimated to be from 125 to 175 points. The bigger impact is from the late payments on other bills which quickly mount up. The net effect is generally considered to be about a 240 point decline counting his late mortgage payments. Ironically, the lower your credit report to start, the less the impact of additional late payments, and if you get into the 400’s, it’s really hard to get much lower without almost trying to hurt yourself. Many of the items on any credit report can be removed over time. It requires persistence and it’s estimated that 30% of all items on credit reports are incorrect and can be removed just by an inquiry or showing a paid invoice. Also the credit score reduction for the foreclosure is reduced as time goes on, until it settles at a minimal deduction (50 to 75 points) after a few years.


We assume that since you are willfully walking away from your underwater mortgage, you can still afford your other payments, and you will limit other impacts to your credit score. We will estimate the impact on your credit score overall will be 150 points (the midpoint between 125 and 175). This would put your new FICO score at 573. A good number, as the lowest score to qualify for a mortgage is usually around 560, but then again the impact of the foreclosure on your FICO score would be less if you start initially with a bad credit score.

2. We estimate the difference in payments between a FICO score of 723 and 573.

From :

FICO score   Amount of loan   Interest rate (%)   Monthly payment
720-850      $200,000         5.922               $1,189
675-699      $200,000         6.584               $1,275
620-674      $200,000         7.734               $1,431
560-619      $200,000         8.531               $1,542

What is the present value of the difference between monthly payments of $1,542 and $1,189 for 7 years? This is a monthly difference of $353. Using the Present Value calculator, the present value is $19,996.90 when there is a discount rate of 1.000.

Note: This is a relatively low rate, assuming there is little/no inflation, where a more reasonable discount rate of 2.000 would have the present value be $14,305.51. So my ballpark MIN/ESTIMATE/MAX is $14,000/$20,000/$30,000 when the discount rate (similar to the interest/inflation rate) is at 2.00/1.00/0.00 respectively.

3. Add an extra 5% for a conservative estimate and any other unforeseen financial impacts of a reduced credit score. My point estimate would be that this foreclosure would cost you an estimated $21,000. This is under the assumptions of a $200,000 mortgage on your new house (this would be significantly less if you choose to rent or not borrow money during the seven years), a discount rate of 1.000, and an initial score of approximately 723.
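
The arithmetic in these steps can be reproduced with two standard formulas: the fixed-rate amortization payment and the present value of a level monthly payment. The Python sketch below is a back-of-the-envelope check, not the calculator used for the post. It assumes the discount rates are percentages per monthly period, which is an inference from the quoted figures rather than something the post states.

```python
def monthly_payment(principal, annual_rate_percent, years=30):
    """Standard fixed-rate amortization payment."""
    r = annual_rate_percent / 100 / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)


def present_value(payment, rate_percent_per_month, months):
    """Present value of a level payment made at the end of each month."""
    r = rate_percent_per_month / 100
    if r == 0:
        return payment * months  # no discounting: just the sum of payments
    return payment * (1 - (1 + r) ** -months) / r


high_score = monthly_payment(200_000, 5.922)  # 720-850 tier
low_score = monthly_payment(200_000, 8.531)   # 560-619 tier
gap = round(low_score) - round(high_score)    # extra dollars per month

# Seven years of the monthly gap at each discount rate (assumed % per month).
for rate in (2.0, 1.0, 0.0):
    print(rate, round(present_value(gap, rate, 84), 2))

estimate = present_value(gap, 1.0, 84) * 1.05  # add the extra 5%
print(round(estimate, 2))
```

Under that reading of the discount rate, the three scenarios land near the $14,000 / $20,000 / $30,000 ballpark, and the 5% cushion brings the middle case to roughly $21,000.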

Conclusion: Twenty thousand dollars is not a small sum of money. In most situations, it is not rational to simply walk away from a mortgage. But in select situations, where there is a significant downturn in property values, it can be better for your financial future to face foreclosure.

From the same NYTimes article:

And given that nearly a quarter of mortgages are underwater, and that 10 percent of mortgages are delinquent, White, of the University of Arizona, is surprised that more people haven’t walked. He thinks the desire to avoid shame is a factor, as are overblown fears of harm to credit ratings. Probably, homeowners also labor under a delusion that their homes will quickly return to value. White has argued that the government should stop perpetuating default “scare stories” and, indeed, should encourage borrowers to default when it’s in their economic interest. This would correct a prevailing imbalance: homeowners operate under a “powerful moral constraint” while lenders are busily trying to maximize profits. More important, it might get the system unstuck. If lenders feared an avalanche of strategic defaults, they would have an incentive to renegotiate loan terms. In theory, this could produce a wave of loan modifications — the very goal the Treasury has been pursuing to end the crisis.

Notice: The article does not constitute financial advice and comes with no warranty, guarantee, or liability. These are merely back-of-the-envelope calculations meant to highlight all possible options. Please consult a licensed professional before making significant financial decisions.