A Hugo website
/
Recent content on A Hugo website
Hugo -- gohugo.io
en-us
Tue, 06 Aug 2019 00:00:00 +0000

Mediation analysis in R and Stata
/2019/08/06/mediation-r-stata/
Tue, 06 Aug 2019 00:00:00 +0000
Mediation analysis
Traditionally, a mediation model is represented by the following equations:
\[ Y = a X + b M + \epsilon_1 \]
\[ M = c X + \epsilon_2 \]
That is, we’d like to study the effect of \(X\) on \(Y\), and we see the effect can be a direct effect, and an indirect effect, through \(M\).
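Substituting the second equation into the first makes this decomposition explicit (a short derivation using only the symbols defined above, and assuming both equations are linear as written):
\[ Y = a X + b (c X + \epsilon_2) + \epsilon_1 = (a + bc) X + (b \epsilon_2 + \epsilon_1) \]
Here \(a\) is the direct effect of \(X\) on \(Y\), \(bc\) is the indirect effect through \(M\), and \(a + bc\) is the total effect.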
Baron and Kenny’s method (http://davidakenny.net/cm/mediate.htm) proceeds in four steps. The modern approach tends to use SEM (structural equation modeling) to estimate these two equations directly.

Fixed or Random Effect, or Both?
/2019/05/23/fixed-effect-random-effect/
Thu, 23 May 2019 00:00:00 +0000
Panel data
When we have panel data (repeated observations over time, or observations clustered at a higher level), we usually think of two choices: random effects or fixed effects? Economists usually prefer fixed effect models, since they wipe out all unit-level (time-invariant) heterogeneity. Economists do not like random effect models since they rest on a big assumption: the random effects need to be uncorrelated with the other covariates in the model. To see this, suppose we have

Comparing Marginal effects with margins command
/2019/04/22/marginal-effects-in-margins/
Mon, 22 Apr 2019 00:00:00 +0000
Comparing Marginal effects
Stata’s margins command has been a powerful tool for many economists. It can calculate predicted means as well as predicted marginal effects. Sometimes we’d like to compare those marginal effects. People often use margins and marginsplot to generate marginal effects, then conclude whether the effects differ based on whether their confidence intervals overlap. However, that can be wrong: two intervals can overlap even when the difference between the effects is statistically significant. In this post, I’d like to introduce a way to compare effects.

Marginal effects in models with fixed effects
/2019/01/25/marginal-effects-in-models-with-fixed-effects/
Fri, 25 Jan 2019 00:00:00 +0000
Marginal effects in a linear model
Stata’s margins command has been a powerful tool for many economists. It can calculate predicted means as well as predicted marginal effects. However, we need to be careful when using it in models that include fixed effects. In a linear model, everything works out fine. In a non-linear model, however, you may not want to use margins, since it’s not calculating what you have in mind.

Premier League Soccer
/2019/01/17/soccer-epl/
Thu, 17 Jan 2019 00:00:00 +0000
Do you have to win head-to-head matches against top contenders to win a championship?
Manchester City recently beat Liverpool 2-1 on Jan 3. I was pleased. My favorite team in the English Premier League (EPL) is Arsenal, and City is my second favorite. My friend, who is a Liverpool fan, argued that the championship was never decided by the head-to-head matches between title contenders. I was like, “Really?”. If a team wins consistently against weaker teams, theoretically it could win the title even if it lost most of its head-to-head games against other top contenders.

Treatment effects and matching
/2019/01/10/treatment-effects-with-matching/
Thu, 10 Jan 2019 00:00:00 +0000
Treatment effects in observational studies
Despite the popularity of randomized experiments in economics nowadays, in most situations we have observational data in economic studies. One reason is that experiments are expensive; the other is that sometimes it is simply not feasible to run experiments. If we have observational data and we’d like to draw causal conclusions, then we face a few different situations. The worst situation is that we have an endogenous treatment.

Interaction term in a non-linear model
/2017/12/07/interaction-term-in-a-non-linear-model/
Thu, 07 Dec 2017 00:00:00 +0000
In a non-linear model (for example, a logit or Poisson model), the interpretation of the coefficient on the interaction term is tricky. Ai and Norton (2003) point out that the interaction term coefficient cannot be interpreted the way it is in a linear model, that is, as how much the effect of \(x1\) changes with the value of \(x2\). They interpret this as a cross

Interpreting interaction in a regression model
/2017/12/07/interpreting-interaction/
Thu, 07 Dec 2017 00:00:00 +0000
Interaction with two binary variables
In a regression model with an interaction term, people tend to pay attention only to the coefficient of the interaction term.
Let’s start with the simplest situation: \(x_1\) and \(x_2\) are binary and coded 0/1.
\[ E(y) = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2 \]
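Plugging the four 0/1 combinations into this equation gives the cell means directly (a worked illustration; note that the equation as written has no intercept, so the baseline mean is 0, and with an intercept \(\beta_0\) every cell would shift by \(\beta_0\)):
\[ E(y \mid x_1 = 0, x_2 = 0) = 0, \quad E(y \mid x_1 = 1, x_2 = 0) = \beta_1 \]
\[ E(y \mid x_1 = 0, x_2 = 1) = \beta_2, \quad E(y \mid x_1 = 1, x_2 = 1) = \beta_1 + \beta_2 + \beta_{12} \]
So the effect of \(x_1\) is \(\beta_1\) when \(x_2 = 0\) and \(\beta_1 + \beta_{12}\) when \(x_2 = 1\); \(\beta_{12}\) alone is the difference between those two effects.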
In this case, we have a saturated model; that is, we have three coefficients representing additive effects from the baseline situation (both \(x_1\) and \(x_2\) being 0).

What model to use for rare events
/2017/10/26/rare-event/
Thu, 26 Oct 2017 00:00:00 +0000
Introduction
In empirical studies, people worry about rare event situations, that is, when you have, for example, lots of 0’s and only a few 1’s, or vice versa. Do you run a logit model, or do you use a “rare event logit”? When should you use either approach? Or is there a third approach?
Paul Allison said in his blog (https://statisticalhorizons.com/logistic-regression-for-rare-events):
“Prompted by a 2001 article by King and Zeng, many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare.

Causal Forest in panel data
/2017/10/23/causal-forest-in-panel-data/
Mon, 23 Oct 2017 00:00:00 +0000
Introduction
In this simulation exercise, we use Causal Forest (now implemented in Generalized Random Forest) (https://github.com/swager/grf) to calculate the conditional average treatment effect (or heterogeneous treatment effect). We assume three different data generating processes. The first one is a linear interaction between a variable of interest and the treatment dummy. The second one assumes a nonlinear function (a step function) of a variable of interest, say \(X\), and the treatment dummy \(W\).

Which count data model to use
/2017/10/10/poisson/
Tue, 10 Oct 2017 00:00:00 +0000
A comparison of various count data models with extra zeros
In empirical studies, data sets with a lot of zeros are often hard to model. There are various models to deal with them: the zero-inflated Poisson model, the Negative Binomial (NB) model, the hurdle model, etc.
Here we follow the logic of a zero-inflated model: model the data with two processes. One is a Bernoulli process; the other is a count data process (Poisson or NB).

Using machine learning for causal effect in observational study
/2017/09/21/tmle/
Thu, 21 Sep 2017 00:00:00 +0000
A simulation for an OLS model
In an observational study, we need to assume we have the correct functional form to estimate the causal effect correctly, in addition to assuming that the treatment is exogenous.
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
##     select
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(tmle)
## Loading required package: SuperLearner
## Loading required package: nnls
## Super Learner
## Version: 2.

Dream of the Red Chamber: authorship analysis
/2017/09/19/red-chamber/
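Tue, 19 Sep 2017 00:00:00 +0000
Introduction
In this post we use machine learning and statistical methods to analyze the authorship of Dream of the Red Chamber, one of the most famous Chinese novels (https://en.wikipedia.org/wiki/Dream_of_the_Red_Chamber). Most people believe that the first 80 chapters were written by Cao Xueqin and that the last 40 chapters were a continuation by later writers. Unless archaeological evidence turns up, the only thing that can tell us is the novel itself. We can judge authorship from the style of word usage. Even if the continuing author tried hard to imitate the original author’s style, it is hard not to reveal one’s own habitual style between the lines. (We might be able to use the same method to tell genuine paintings from forgeries, but that would require a large body of the original artist’s work.)
I downloaded the original 120-chapter edition of Dream of the Red Chamber from here (I did not look closely into exactly which edition it is): http://www.shuyaya.cc/book/2034/#download
I used “Rwordseg” in R for Chinese word segmentation, that is, splitting a sentence into words and phrases. I then used “cleanNLP” to obtain the term frequency matrix, and “topicmodels” in the last model.
Reading in the text and segmenting it
This part reads in the text, splits it into 120 chapters, treats each chapter as one document, and then segments the words.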
# analysis starts here
library(rticles)
library(cleanNLP)
library(readr)
library(stringi)
library(ggplot2)
library(glmnet)
library(ggrepel)
library(viridis)
library(magrittr)
library(topicmodels)
library(tidyverse)
library(rJava)
library(Rwordseg)
library(RColorBrewer)
library(tm)
require(readtext)
honglou1 <- readtext("~/projects/honglongmeng/honglou1.txt", text_field = "texts")
# split into chapters using stringr's splitting functions
my_split <- function(text) {
  # matches chapter headings of the form '第…回 ' (that is, "Chapter N")
  pattern <- '第.{1,3}回 '
  x <- str_split(text, pattern)[[1]]
  y <- str_extract_all(text, pattern)[[1]]
  data.

About me
/about/
Thu, 05 May 2016 21:48:51 -0700
This is a blog mostly about statistics. I am a statistician at HBS. My LinkedIn page.