Introduction

Data are organised as a data matrix, as shown below:

[Figure: example data matrix]

Each row represents an observation (or case), and each column represents a variable.
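
As a minimal sketch of this layout, a data matrix can be held in Python as a list of rows, with one dictionary per observation. The variable names and values below are invented for illustration and are not taken from the data set discussed in this post; the `column` helper is likewise hypothetical.

```python
# Hypothetical data matrix: each dict is one observation (row),
# each key is one variable (column). All values are invented.
data_matrix = [
    {"country": "A", "requests": 12, "region": "north"},
    {"country": "B", "requests": 7, "region": "south"},
    {"country": "C", "requests": 30, "region": "north"},
]


def column(rows, name):
    """Return all values of one variable, in row order."""
    return [row[name] for row in rows]


n_observations = len(data_matrix)        # number of rows
variables = list(data_matrix[0].keys())  # column names
requests = column(data_matrix, "requests")
```

Reading across one dictionary gives every variable for a single case; calling `column` gives a single variable across all cases.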

There are two types of variables: Numerical and Categorical.

  • Numerical (Quantitative) variables take on numerical values. It is sensible to add, subtract, take averages, etc. with these values.
  • Categorical (Qualitative) variables take on a limited number of distinct categories. The categories can be identified with numbers, but it would not be sensible to do arithmetic operations with these values. The numbers are merely placeholders for the levels of the categorical variable.
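
The distinction can be sketched in a few lines of Python. The values and the numeric coding of hemisphere below are assumptions made up for this example: averaging the heights is meaningful, while the hemisphere codes are only labels, so they are counted rather than averaged.

```python
from collections import Counter
from statistics import mean

# Invented values for illustration only.
heights_cm = [160.0, 172.5, 181.0]  # numerical: arithmetic is meaningful
hemisphere_codes = [1, 2, 2, 1, 2]  # categorical, coded as numbers

average_height = mean(heights_cm)

# The codes are placeholders for levels, so decode and count them
# instead of averaging. This coding scheme is hypothetical.
code_labels = {1: "northern", 2: "southern"}
hemisphere_counts = Counter(code_labels[c] for c in hemisphere_codes)
```

Python would happily compute `mean(hemisphere_codes)`, but the resulting 1.6 would not correspond to any hemisphere, which is why the nature of the variable matters more than the type of the stored values.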

Numerical variables can be further categorized as Continuous or Discrete:

  • Continuous numerical variables are usually measured, and can take on any numerical value. Examples include height and weight. While we tend to round our height or weight when we record it, it is actually measured on a continuous scale.
  • Discrete numerical variables are generally counted, such as the number of cars a household owns, and can take on only whole non-negative numbers.

It is important to think about the nature of the variable, and not just the observed values when determining if a numerical variable is continuous or discrete, as rounding of continuous variables can make them appear to be discrete.
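
A small sketch of this pitfall, using invented height measurements: after rounding, every recorded value is a whole number, even though the underlying variable is continuous.

```python
# Invented heights in centimetres, measured on a continuous scale.
measured_heights = [170.18, 165.42, 170.47, 182.88]

# Recording to the nearest centimetre yields whole numbers only.
recorded_heights = [round(h) for h in measured_heights]

# The recorded values alone look like counts, but height is still
# a continuous variable; only the recording precision changed.
looks_discrete = all(isinstance(h, int) for h in recorded_heights)
```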

Categorical variables can also be further categorized as follows:

  • Categorical variables that have ordered levels are called ordinal. For example: very unsatisfied, unsatisfied, neutral, satisfied, or very satisfied.
  • If the levels do not have an inherent ordering to them, then the variable is simply called categorical. For example: male or female.
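
One way to see what ordering buys you, sketched under the assumption that the satisfaction levels are ranked in the order listed above (the survey responses are invented): ordinal levels can be sorted and compared, which has no meaning for unordered categories.

```python
# Satisfaction levels from the example above, lowest to highest.
LEVELS = [
    "very unsatisfied",
    "unsatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]
RANK = {level: position for position, level in enumerate(LEVELS)}

# Invented survey responses.
responses = ["satisfied", "very unsatisfied", "neutral", "satisfied"]

# Sorting and taking a maximum rely on the inherent order of the levels.
sorted_responses = sorted(responses, key=RANK.get)
highest = max(responses, key=RANK.get)
```

The ranks only encode order. The gap between "neutral" and "satisfied" is not necessarily the same size as the gap between "satisfied" and "very satisfied", so averaging the ranks is still questionable.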


Thus, the variables in the data can be classified as follows:

  • cr_req: Number of content removal requests made to Google -> discrete numerical;
  • cr_comply: Percentage of content removal requests Google complied with -> continuous numerical;
  • ud_req: Number of user data requests as part of a criminal investigation -> discrete numerical;
  • ud_comply: Percentage of user data requests Google complied with -> continuous numerical;
  • hemisphere: Hemisphere that the country is located in -> categorical;
  • hdi: Human development index -> ordinal categorical;
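
The classification above can also be recorded as a simple lookup table, which makes it easy to pull out all variables of one kind. This is only a sketch: the variable names and types come from the list above, while the dictionary layout is an assumption of this example.

```python
# Variable names and types as classified in the list above.
variable_types = {
    "cr_req": "discrete numerical",
    "cr_comply": "continuous numerical",
    "ud_req": "discrete numerical",
    "ud_comply": "continuous numerical",
    "hemisphere": "categorical",
    "hdi": "ordinal categorical",
}

# Group variable names by their top-level type.
numerical = [
    name for name, kind in variable_types.items() if kind.endswith("numerical")
]
categorical = [
    name for name, kind in variable_types.items() if kind.endswith("categorical")
]
```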


Relationships between variables

  • When two variables show some connection with one another, they are called associated, or dependent, variables;
  • The association can be further described as positive or negative;
  • If two variables are not associated, they are said to be independent.

For example, the association between the following two variables appears to be positive.

[Figure: plot of two positively associated variables]
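
The direction of an association can be checked numerically as well as by eye. The sketch below uses the Pearson correlation coefficient, which is a choice made for this example and not something the text above prescribes; it captures only linear association. The data values are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]    # tends to rise with x
y_down = [9, 7, 6, 4, 1]  # tends to fall with x

r_up = pearson_r(x, y_up)      # positive sign: positive association
r_down = pearson_r(x, y_down)  # negative sign: negative association
```

A coefficient near zero suggests no linear association, but it does not by itself establish independence, since two variables can be related in a non-linear way.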
