Machine Learning on the Career Paths of Chinese Political Leaders

Recently Junyan Jiang published the Chinese Political Elite Database (CPED), which contains demographic and career information on Chinese political leaders at multiple levels (ref https://www.junyanjiang.com/data.html). It’s a very interesting dataset. Here I use unsupervised machine learning to explore its structure and then apply a supervised learning model to predict the highest position each political leader reached from their background information. All the code is here: https://github.com/luciasalar/government_officials.git

The information in the CPED includes: name, gender, ethnicity, birthday, birthplace, educational background, whether the person joined the army, whether the person has been expelled from the Communist Party of China (CPC), current position, whether the person committed a crime, when the person joined the CPC, how long the person has worked in the government, when and where the person was relocated, job grade, name of the position and so on.

Before doing any learning, I did a bit of work to process the dataset: 1) I converted categorical variables to dummy variables. 2) I derived some new variables from the existing ones: age (the time difference between now and the birthday); relocation frequency (time worked in the government divided by the number of times they were relocated); number of times worked in the central government; central government positions as a percentage of all the positions each person held; number of times a person was relocated by a national institute or by the central government; number of years working in the government; and number of years in the CPC. 3) I also recoded the locations where they worked (developing countries and developed countries are assigned to two big groups). However, location later turned out to be a confusing variable in the cluster analysis. Maybe I need to identify the location where each person worked for the longest time and use the GDP of that location as a weight.
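
The derived variables in step 2 can be sketched in Python roughly as follows. This is only an illustration, not the code from the repository: the function name, the `postings` layout (one `(start_year, end_year, is_central)` tuple per position) and the `years_per_relocation` name are assumptions, and every posting is counted as one relocation.

```python
from datetime import date

def derive_features(birthday, cpc_join_date, postings, today):
    """Derive per-official features.

    postings is a hypothetical list of (start_year, end_year, is_central)
    tuples, one per position held; the real CPED columns may differ.
    """
    n_postings = len(postings)
    gov_working_yrs = sum(end - start for start, end, _ in postings)
    central_freq = sum(1 for _, _, is_central in postings if is_central)
    return {
        "age": (today - birthday).days / 365.25,
        "join_cpc": (today - cpc_join_date).days / 365.25,
        "join_cpc_age": (cpc_join_date - birthday).days / 365.25,
        "gov_working_yrs": gov_working_yrs,
        "central_freq": central_freq,
        # Share of all postings that were in the central government.
        "central_freq_perce": central_freq / n_postings if n_postings else 0.0,
        # Assumption: each posting counts as one relocation.
        "years_per_relocation": gov_working_yrs / n_postings if n_postings else 0.0,
    }

# Made-up official, for illustration only.
features = derive_features(
    birthday=date(1950, 1, 1),
    cpc_join_date=date(1975, 1, 1),
    postings=[(1975, 1985, False), (1985, 1995, True),
              (1995, 2005, True), (2005, 2010, False)],
    today=date(2020, 1, 1),
)
print(features)
```

Passing `today` explicitly keeps the ages reproducible instead of drifting with the date the script is run.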

To find out which variables can predict the highest position of a government official, I ran a cluster analysis on the variables. Here I use model-based clustering. Its advantage is that it can fit Gaussian components with non-spherical covariance. After a couple of attempts, I found that the extended variables I generated plus the job grade produce 9 clusters. Adding other variables, especially location, only confounds the cluster results.

Best BIC values:
             EEV,9     EEV,8     EEV,7
BIC      -40592.79 -44475.21 -48851.12
BIC diff      0.00  -3882.42  -8258.33
$mean
                            [,1]       [,2]       [,3]         [,4]        [,5]        [,6]       [,7]
central_freq          4.32237099  8.1333333  3.2921195 6.979184e-01  0.64672483  4.67187223  7.3296655
relocate_freq        12.61755297 19.6000000 18.5603415 1.779856e+01 13.79750420 19.30010818 19.5700083
nat_ins_relo          4.19084397  9.6000000  6.4393863 3.455870e+00  3.99058070  6.99442958  9.5765742
central_relo          0.00000000  1.8333333  0.5695582 6.904707e-04  0.18754764  0.33255111  1.9551820
级别_deputy_director  0.75474248  1.5333333  3.1303437 3.671810e+00  2.48911104  2.07742828  1.5326028
级别_deputy_leader    0.00000000  2.8666667  0.0000000 0.000000e+00  0.00000000  0.06536624  2.0542140
级别_deputy_dept      0.06884181  0.7333333  1.4554218 2.068585e+00  1.28341355  1.31928172  0.9915247
级别_vice_minister    2.49509706  3.2333333  3.2460501 5.323188e-01  0.32912791  4.29367926  3.5884937
级别_less_dept        0.61317397  2.0333333  2.7168478 3.333408e+00  2.39241514  2.71185414  2.4306439
级别_no_rank          0.89761799  1.6333333  1.7226345 1.670180e+00  1.90671226  1.70147894  1.6974206
级别_director         2.09337233  1.5333333  4.3141123 3.500887e+00  3.25762003  2.73690397  2.5425752
级别_national_leader  0.00000000  2.3666667  0.0000000 0.000000e+00  0.00000000  0.00000000  0.0000000
级别_dept             0.33755477  1.2666667  1.9749013 2.942800e+00  2.04231632  1.76069380  1.4218369
级别_minister         1.13393696  2.1333333  0.0000000 0.000000e+00  0.00000000  2.63342184  2.4670000
gov_working_yrs      34.77427264 89.1850000 52.8895299 5.021180e+01 51.41595474 70.20722067 65.0047612
age                  69.76954176 80.0189208 62.9543031 6.287416e+01 64.06984368 74.19215959 75.3007443
join_cpc             43.65866150 58.0666667 38.6672922 3.828230e+01 40.33759873 50.62669711 50.1089734
join_cpc_age         25.24244381 22.6220278 23.4355524 2.372276e+01 23.57874446 22.83489307 24.5402508
freq_change_pos_nor   3.02921911  4.9328471  2.9339499 2.935939e+00  4.01097470  3.75754392  3.4275268
central_freq_perce     0.36299367  0.4209026  0.1859861 3.710309e-02  0.04916415  0.25133780  0.3650817
                            [,8]         [,9]
central_freq         10.34535743  0.000000000
relocate_freq        19.34634107  6.950475840
nat_ins_relo         11.18794437  0.190331466
central_relo          0.56318475  0.164688275
级别_deputy_director  1.84379429  0.766288984
级别_deputy_leader    0.06257608  0.000000000
级别_deputy_dept      1.53382976  0.113210760
级别_vice_minister    3.28088365  0.609826329
级别_less_dept        1.31481036  0.081245263
级别_no_rank          1.65582924  0.130124281
级别_director         3.31431693  2.268522452
级别_national_leader  0.00000000  0.000000000
级别_dept             1.74969366  0.277415251
级别_minister         2.27820879  0.006609097
gov_working_yrs      47.96463094 17.907235176
age                  68.92297628 66.174523443
join_cpc             43.81333803 40.977279532
join_cpc_age         24.99038289 23.577696437
freq_change_pos_nor   2.51541676  3.108479572
central_freq_perce    0.53670229  0.000000000

The table above shows the per-cluster mean of every variable I used in clustering. The 级别_ prefix (‘rank’) marks the job grade variables. freq = frequency; nat_ins_relo = number of times relocated by a national institute; central_relo = number of times relocated by the central government; gov_working_yrs = number of years working in the government; join_cpc = number of years since they joined the CPC; join_cpc_age = age when they joined the CPC; central_freq = number of times they worked in the central government; freq_change_pos_nor = frequency of being relocated, normalized by the number of years they worked in the government; central_freq_perce = number of times working in the central government divided by the number of times being relocated.

It’s tedious to inspect these groups by hand, so I wrote a function to find which cluster has the highest mean score on each variable. A few clusters stand out. Group 2 contains most of the national leaders; let’s call it the ‘leader group’. Group 3 has the most directors; let’s call it the ‘director group’. Group 4 has the most deputy directors, deputy department heads, department heads and positions below department head; let’s call it ‘department heads’. Group 5 contains the most officials without rankings; let’s call it ‘no ranking’. Group 6 is the ‘ministers’ group, and group 8 contains the people who worked in the central government the greatest number of times. We can also see that officials in the leader group have the highest mean age. Group 1 contains the fewest high-level government officials, and officials in group 9 worked in the government for the shortest period of time. So unsupervised learning did manage to pick up some patterns in these variables.
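
A minimal Python sketch of that kind of helper is below. It is not the original R function; the names `cluster_means`, `top_cluster` and `bottom_cluster` are made up for illustration. The numbers are copied from a few rows of the $mean table above (clusters 1 to 9, left to right).

```python
# Per-cluster means for a few variables, copied from the $mean output above.
cluster_means = {
    "级别_national_leader": [0.0, 2.3666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "级别_director": [2.09337233, 1.5333333, 4.3141123, 3.500887, 3.25762003,
                    2.73690397, 2.5425752, 3.31431693, 2.268522452],
    "级别_minister": [1.13393696, 2.1333333, 0.0, 0.0, 0.0,
                    2.63342184, 2.4670000, 2.27820879, 0.006609097],
    "central_freq": [4.32237099, 8.1333333, 3.2921195, 0.6979184, 0.64672483,
                     4.67187223, 7.3296655, 10.34535743, 0.0],
    "age": [69.76954176, 80.0189208, 62.9543031, 62.87416, 64.06984368,
            74.19215959, 75.3007443, 68.92297628, 66.174523443],
    "gov_working_yrs": [34.77427264, 89.1850000, 52.8895299, 50.21180, 51.41595474,
                        70.20722067, 65.0047612, 47.96463094, 17.907235176],
}

def top_cluster(means):
    """Return the 1-based index of the cluster with the highest mean."""
    return max(range(len(means)), key=means.__getitem__) + 1

def bottom_cluster(means):
    """Return the 1-based index of the cluster with the lowest mean."""
    return min(range(len(means)), key=means.__getitem__) + 1

top = {name: top_cluster(means) for name, means in cluster_means.items()}
for name, cluster in top.items():
    print(f"{name}: highest mean in cluster {cluster}")
print("shortest time in government: cluster",
      bottom_cluster(cluster_means["gov_working_yrs"]))
```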

Now let’s run a regression and see whether these variables can predict the job grade. I selected the highest job grade of each official as the job grade label. All the variables I selected except nat_ins_relo (p = 0.63) are significant predictors.

lm(formula = job_grade ~ ., data = reg_fea)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2911 -0.4882 -0.0403  0.4308  5.1570 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          8.198359   0.159396  51.434  < 2e-16 ***
central_freq         0.098955   0.013897   7.121 1.27e-12 ***
Freq                -0.052931   0.005458  -9.697  < 2e-16 ***
nat_ins_relo         0.003095   0.006357   0.487 0.626389    
central_relo        -0.052582   0.014540  -3.616 0.000303 ***
time_diff           -0.019530   0.001595 -12.245  < 2e-16 ***
age                 -0.016151   0.005317  -3.038 0.002398 ** 
join_cpc            -0.016090   0.005559  -2.895 0.003817 ** 
join_cpc_age        -0.020595   0.006256  -3.292 0.001004 ** 
freq_change_pos_nor  0.182062   0.019311   9.428  < 2e-16 ***
central_freq_per    -2.759989   0.207682 -13.290  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8512 on 3906 degrees of freedom
Multiple R-squared:  0.4649,	Adjusted R-squared:  0.4635 
F-statistic: 339.3 on 10 and 3906 DF,  p-value: < 2.2e-16
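
As a quick sanity check on how to read the coefficient table, the sketch below recomputes the t values (estimate divided by standard error) and lists the predictors whose p-value is not below 0.05. The numbers are copied from the summary above; the threshold `alpha = 0.05` and the variable names in the code are assumptions for illustration, and `< 2e-16` is entered as the upper bound 2e-16.

```python
# (estimate, standard error, p-value) copied from the lm summary above.
coefs = {
    "central_freq": (0.098955, 0.013897, 1.27e-12),
    "Freq": (-0.052931, 0.005458, 2e-16),
    "nat_ins_relo": (0.003095, 0.006357, 0.626389),
    "central_relo": (-0.052582, 0.014540, 0.000303),
    "time_diff": (-0.019530, 0.001595, 2e-16),
    "age": (-0.016151, 0.005317, 0.002398),
    "join_cpc": (-0.016090, 0.005559, 0.003817),
    "join_cpc_age": (-0.020595, 0.006256, 0.001004),
    "freq_change_pos_nor": (0.182062, 0.019311, 2e-16),
    "central_freq_per": (-2.759989, 0.207682, 2e-16),
}

alpha = 0.05  # conventional significance threshold (an assumption here)

# The t value is the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se, _) in coefs.items()}
not_significant = [name for name, (_, _, p) in coefs.items() if p >= alpha]

for name, t in t_values.items():
    print(f"{name:20s} t = {t:8.3f}")
print("not significant at 0.05:", not_significant)
```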

OK, the final part is machine learning. Here I build a very, very basic SVM model in R; I’ll do a proper ML with Python on a lazy weekend. The basic model turns out to be really not bad! First, I recode the job grade to binary: anyone under the minister level is 0; ministers and national leaders are 1. This gives a balanced dataset.

   0    1 
1941 1982 

The F1 score is

[1] 0.7790393

confusion matrix

   predictions
y     0   1
  0 477  97
  1 156 446
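
The reported F1 score can be reproduced from the confusion matrix by hand. The sketch below assumes that rows (y) are the true labels, columns are the predictions, and class 1 (ministers and national leaders) is the positive class; under that reading the result matches the 0.7790393 printed above.

```python
# Counts read off the confusion matrix above
# (assumption: rows = true labels y, columns = predictions, positive class = 1).
tn, fp = 477, 97
fn, tp = 156, 446

precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of true 1s that were predicted as 1
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"F1        = {f1:.7f}")
print(f"accuracy  = {accuracy:.4f}")
```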

Cosine Similarity vs. Euclidean Distance

In NLP we often come across the concept of cosine similarity, especially when we need to measure the distance between vectors. I always wondered why we don’t use Euclidean distance instead. I used to picture cosine similarity as a 2D measurement (the angle between two arrows), whereas Euclidean distance adds up the differences over all the dimensions; in fact both are defined for vectors of any dimension.

d(p, q) = \sqrt{(p_1- q_1)^2 + (p_2 - q_2)^2+\cdots+(p_i - q_i)^2+\cdots+(p_n - q_n)^2}.

So here is a ‘Grok’ explanation I found on Quora.

https://www.quora.com/Is-cosine-similarity-effective

You are a very polite person and you liked my answer..so in the comment section you have written “good” 4 times and “helpful” 8 times(just numbers!! :))…something like….” a very good answer which is too much helpful. It will be helpful for good understanding. People who are not that good in maths..Can find the answer helpful…”…and so on….

A friend of you..Who doesn’t talk much..Might write just- “good and helpful..I found it helpful for my studies”

What is the count? “Good”-1, and “helpful”-2

If I try to find the cosine similarities between these comments(or..Documents, as told in a miner’s term :))..It will be exactly 1! (Refer Google to see the formula, it’s ultra easy)

There you go: with cosine similarity, you measure the similarity of direction instead of magnitude.
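
To make the Quora example concrete, here is a minimal Python sketch that treats each comment as a count vector (“good”, “helpful”). The function and variable names are made up for illustration. The long comment is (4, 8) and the short one is (1, 2): they point in the same direction, so the cosine similarity is 1 (up to floating-point rounding), while the Euclidean distance between them is clearly not zero.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between p and q: dot(p, q) / (|p| * |q|)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def euclidean_distance(p, q):
    """Square root of the summed squared differences over all dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Word counts as (good, helpful) vectors.
long_comment = (4, 8)
short_comment = (1, 2)

cos_sim = cosine_similarity(long_comment, short_comment)
dist = euclidean_distance(long_comment, short_comment)

print("cosine similarity:", cos_sim)
print("Euclidean distance:", dist)
```

Both functions work for vectors of any length, so the same code applies to a full vocabulary rather than two words.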

Workshop: Introduction to Qualitative Analysis II

I’ve always been interested in qualitative analysis out of curiosity, although I might or might not include it in my studies in the future. After years of doing quantitative studies, it might be a good idea to learn something from the other side. Therefore, I attended this workshop, ‘Introduction to qualitative analysis’, at the Wellcome Trust Research Centre. The first lecture was loaded with theory, most of which I have forgotten, but it’s not difficult to pick up again by reading a few papers. The main point of the theory is to identify which type of qualitative study you want to conduct: grounded theory, which focuses on finding facts and generating theory, or phenomenology, which focuses on exploring the feelings and unique experience of the participants.

Then came a very interesting part this afternoon, in which the students did a role play as interviewer and interviewee. We were divided into groups; each group had two interviewers (one grounded theory, one phenomenology) and one interviewee. We were all given the same research question: what factors might lead a student to drop out of or continue a PhD degree? I chose the grounded theory approach because I think exploring unique experience is at a more advanced level.

First of all, I came up with a bunch of questions directed at the possible factors that could influence a student to continue the degree. As a psychologist, I tend to put everything in a structure from the very beginning. “Which theory are you using? Do you know the structure of this theory?” I still remember that, years ago when I started my first research project, this was the question from my supervisor.

Motivation (external/internal), social support and financial support should be the important determinants. Other determinants might be quite diverse, but I guess this is why we need an interview instead of scales.

Here’s the list of questions from me:

Step one:

Brief the participant on the purpose of my study.

Step two:

Demographic questions: which degree she’s doing, which year, etc.

Step three:

Questions:

  1. Did you have research experience before you started your PhD? The participant said yes, so I went on to explore what it was and whether it was related to her current research field.
  2. Do you like the research area you are working on? Why or why not?
  3. Did you start a family before you began your PhD? If yes, do you find it difficult to cater to the needs of your family and the degree at the same time? Why?
  4. Do you have close friends who are doing a PhD or similar research? Do you often talk to them? Why or why not? Do you think your friends and family are supportive? Why?
  5. Do you think your supervisors are supportive? Why? The participant said she has 3 supervisors, so I went on to ask: do you have group meetings regularly? Do your supervisors have different opinions about your research topics?
  6. What do you think about your working environment? Why do you like your office?
  7. Do you think your colleagues are supportive? Do you work on a project together or do you have plans to work with them in the future?
  8. How’s your funding situation at the moment?
  9. What factors do you think would affect your decision not to continue your PhD, and why?

These questions mainly cover the hypotheses behind my research question and also encourage the participant to bring up other factors that my questions did not include.

Here are the questions from the phenomenology side:

Why did you want to start a PhD degree?

What are your expectations?

Which expectations are met and which are not met?

What went well?

How could your experience be improved?

What didn’t go well?

In general, we retrieved similar content, and the phenomenology approach managed to explore specific questions at a deeper level. Since the same participant did both interviews, her replies to the phenomenology questions might have been primed a bit by my questions. There were many moments when the participant found it difficult to answer the phenomenology questions because they require a lot of reflection.

The lecturer seemed to appreciate my strategy, but other students seemed to think that I was asking superficial questions and trying to direct how the participant thinks by using these questions as prompts. From the perspective of a psychologist, it’s good to listen to a participant’s narrative description to identify a problem; however, very often a participant is not aware that something is an issue that affects how she/he feels about a task. That’s why we need to put everything in a structure and see whether the theory applies to individual cases to a certain degree; then we analyse why it does or doesn’t apply, or what’s missing.

In conclusion, I think this was a very interesting exercise and I had a wonderful experience with both interview approaches. I hope I will have the chance to experience more of this in the future.

Book review: Gathering Data for Your PhD

Gathering Data for Your PhD is a small but useful book. It only took me one afternoon to finish. The book gives a brief introduction to the advantages and disadvantages of different data collection approaches: for instance, questionnaires, interviews (structured, semi-structured, unstructured, mobile, face-to-face), focus groups, participant observation, etc. You are encouraged to use or invent an approach that suits your study, for example drawing and writing for young children, storytelling, diaries and so on.

While reading this book, you may want to focus on the parts that cover the research approach you are interested in. There are many useful references for each approach. Since I am conducting social media data analysis, I mainly focused on research ethics, mobile apps and possibly gaming data (I do wish to write a mini game to collect data on attention and self-monitoring skills, because it should be easier to collect data with a game, though the sample could be a bit biased).

Here are some useful resources from the book, selected for my future reference:

  • Research ethics:

Informed consent

Research ethics online training

https://globalhealthtrainingcentre.tghn.org/

Introduction to research ethics

https://globalhealthtrainingcentre.tghn.org/elearning/research-ethics/

Social media and research ethics resources

https://vision2lead.com/what-we-do/ethics/e-research-ethics-resources/

(I might spend some time on the online research ethics training in the coming few days. The training center offers a certificate; I’m not sure how useful it is, but the course doesn’t seem to take long to finish.)

  • Gathering data online raises tricky ethical issues around anonymity and confidentiality, because social media blurs the boundaries between private and public.

APIs

https://www.programmableweb.com/

(thousands of APIs in here!)

Book Review: The Research Companion

I am officially starting my PhD program in September. To prepare for my studies, I read a book called The Research Companion over the past 3 days. I would recommend this book to every student preparing to start a research degree in social science who is still confused about how to read an academic article, write a research report, build academic connections, email a professor who might be able to give guidance on their study, handle research ethics, or even find funding and write a funding application.

This post is a summary of some major points in The Research Companion, along with some useful books and links I selected for my future reference. If you don’t have time to read the book, this blog post should give you some ideas about how to do your research. But I still highly recommend spending a few days reading the book, because it contains many real-life examples that help you understand why you should consider certain issues in your study. I only summarise information that is useful for my own studies, so you might miss some important information in the book if you read this post only.

  • Plan your research

http://www.raulpacheco.org/2015/08/online-resources-to-help-students-summarize-journal-articles-and-write-critical-reviews/

(This is a compiled list of articles about how to write summaries of the articles you read. I forget most of the articles I have read if I don’t have a table to put a summary in. Believe me, you will find the summaries very useful when you are writing up your literature review.)

http://blog.efpsa.org/2013/02/28/how-to-read-and-get-the-most-out-of-a-journal-article/

book:

Pyrczak, F. (2014). Evaluating research in academic journals (6th ed.). Glendale, CA: Pyrczak Publishing.

http://www.raulpacheco.org/2015/08/highlighting-and-note-taking-on-journal-articles-as-engagement/

Gathering data for your PhD (2015)

(This book is highly recommended by many profs, according to The Research Companion. I will write a separate post reviewing it.)

  • Design methods that meet the participants’ needs. If your participants are children who can’t read very well, sending them a questionnaire is not a good approach.

  • Connect with professors who might be able to provide constructive advice on your study. Ask short questions that they can answer, but don’t let your initial approach be a demand on them.

  • As a grant proposer, you can involve the public by making links with charities or other support groups. Involving the public does not mean inviting participants from all walks of life in the community; instead, you should assign them different roles in your project.

  • Draw a map of how you bring people together in your project.

  • Ten ways to get your proposal turned down by funders.

  • Resources for funding

http://www.researchfundingtoolkit.org/

https://www.vitae.ac.uk/researcher-careers/pursuing-an-academic-career/research-funding/where-to-find-sources-of-academic-research-funding

book:

Aldridge, J., & Derrington, A. M. (2012). The Research Funding Toolkit: How to Plan and Write Successful Grant Applications. Sage.

 

Table 2.7 checklist for obtaining funding

  • If things are not working well, talk to your supervisors and colleagues as early as possible. Problems can usually be sorted out if they are discovered early enough.

  • Perceived vs. actual risk. There are situations that researchers consider ‘safe’ in which participants could still be a threat. Most researchers rely on common sense to decide what is safe or dangerous; for instance, research involving young men, sex workers, drug users and so on is seen as potentially dangerous. However, any participant might act in an aggressive manner when feeling ill, in pain or insecure. (I think all social science students should pay attention to this point; you don’t want to put the participants in an uncomfortable situation.)

  • Read the safety policy in your department and university

  • Dress appropriately in the study setting.

  • Be aware of your body language and that of your participants. End the study if they are agitated, paranoid or seem distressed or delusional.

  • Table 6.2 checklist for dealing with distress

  • Construct participant details databases. Keep a record of those who refuse to participate or are unsuitable for the study, so that you can show later that you weren’t biased in who you invited to take part. You can also create different databases with information about the same participants.

  • Some participants respond quickly on whether they would like to join your study, whereas others are hard to get hold of: they say yes to the study but clearly don’t want to take part. You’ll end up chasing them for a reply, and in the end they usually don’t finish your study.

  • Keep a research diary and a diary of your study progress (This is why I started the blog here! I know these book reviews would stay somewhere in my cloud and I would forget about them if I didn’t post them on a blog XD). You might want to record ideas for future work, books and papers you want to read, conference or job information, or people you would like to network with.

  • Table 7.6 ways of keeping qualitative data clean

  • Data protection (loc 4148)

  • Report findings: Poster checklist, conference talk checklist, symposium