2021 Curriculum, CFA Program Level I: Quantitative Methods; Portfolio Management and Wealth Planning

Introduction

Data have always been a key input for securities analysis and investment management, but the acceleration in the availability and the quantity of data has also been driving the rapid evolution of the investment industry. With the rise of big data and machine learning techniques, investment practitioners are embracing an era featuring large volume, high velocity, and a wide variety of data. This Refresher Reading (available as a PDF and ePub and eligible for 3.75 PL credits) and a new online learning module based on the content of the reading (eligible for up to 17.5 PL credits) invite you to explore and exploit this abundance of information for your investment strategies.

While this data-rich environment offers potentially tremendous opportunities for investors, turning data into useful information is not so straightforward. Organizing, cleaning, and analyzing data are crucial to the development of successful investment strategies; otherwise, we end up with “garbage in and garbage out” and failed investments. It is often said that 80% of an analyst’s time is spent on finding, organizing, cleaning, and analyzing data, while just 20% of her/his time is taken up by model development. So, the importance of having a properly organized, cleansed, and well-analyzed dataset cannot be over-emphasized. With this essential requirement met, an appropriately executed data analysis can detect important relationships within data, uncover underlying structures, identify outliers, and extract potentially valuable insights. Utilizing both visual tools and quantitative methods, like the ones covered in this reading, is the first step in summarizing and understanding data that will be crucial inputs to an investment strategy.

This reading provides a foundation for understanding important concepts that are an indispensable part of the analytical tool kit needed by investment practitioners, from junior analysts to senior portfolio managers. These basic concepts pave the way for more sophisticated tools that will be developed as the quantitative methods topic unfolds and that are integral to gaining competencies in the investment management techniques and asset classes that are presented later in the CFA curriculum.

Section 2 covers core data types, including continuous and discrete numerical data, nominal and ordinal categorical data, and structured versus unstructured data. Organizing data into arrays and data tables and summarizing data in frequency distributions and contingency tables are discussed in Section 3. Section 4 introduces the important topic of data visualization using a range of charts and graphics to summarize, explore, and better understand data. Section 5 covers the key measures of central tendency, including several variants of mean that are especially useful in investments. Quantiles and their investment applications are the focus of Section 6. Key measures of dispersion are discussed in Section 7. The shape of data distributions (specifically, skewness and kurtosis) is covered in Sections 8 and 9, respectively. Section 10 provides a graphical introduction to covariance and correlation between two variables. The reading concludes with a Summary.

Learning Outcomes

The member should be able to:

  1. Identify and compare data types;

  2. Describe how data are organized for quantitative analysis;

  3. Interpret frequency and related distributions;

  4. Interpret a contingency table;

  5. Describe ways that data may be visualized and evaluate uses of specific visualizations;

  6. Describe how to select among visualization types;

  7. Calculate and interpret measures of central tendency;

  8. Select among alternative definitions of mean to address an investment problem;

  9. Calculate quantiles and interpret related visualizations;

  10. Calculate and interpret measures of dispersion;

  11. Calculate and interpret target downside deviation;

  12. Interpret skewness;

  13. Interpret kurtosis;

  14. Interpret correlation between two variables.

Summary

In this reading, we have presented tools and techniques for organizing, visualizing, and describing data that permit us to convert raw data into useful information for investment analysis.

  • Data can be defined as a collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information.

  • From a statistical perspective, data can be classified as numerical data and categorical data. Numerical data (also called quantitative data) are values that represent measured or counted quantities as a number. Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations and usually take only a limited number of values that are mutually exclusive.

  • Numerical data can be further split into two types: continuous data and discrete data. Continuous data can be measured and can take on any numerical value in a specified range of values. Discrete data are numerical values that result from a counting process and therefore are limited to a finite number of values.

  • Categorical data can be further classified into two types: nominal data and ordinal data. Nominal data are categorical values that are not amenable to being organized in a logical order, while ordinal data are categorical values that can be logically ordered or ranked.

  • Based on how they are collected, data can be categorized into three types: cross-sectional, time series, and panel. Time-series data are a sequence of observations for a single observational unit on a specific variable collected over time and at discrete and typically equally spaced intervals of time. Cross-sectional data are a list of the observations of a specific variable from multiple observational units at a given point in time. Panel data are a mix of time-series and cross-sectional data that consists of observations through time on one or more variables for multiple observational units.

  • Based on whether or not data are in a highly organized form, they can be classified into structured and unstructured types. Structured data are highly organized in a pre-defined manner, usually with repeating patterns. Unstructured data do not follow any conventionally organized forms; they are typically alternative data as they are usually collected from unconventional sources.

  • Raw data are typically organized into either a one-dimensional array or a two-dimensional rectangular array (also called a data table) for quantitative analysis.

  • A frequency distribution is a tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins. Frequency distributions permit us to evaluate how data are distributed.

  • The relative frequency of observations in a bin (interval or bucket) is the number of observations in the bin divided by the total number of observations. The cumulative relative frequency cumulates (adds up) the relative frequencies as we move from the first bin to the last, thus giving the fraction of the observations that are less than the upper limit of each bin.

  • A contingency table is a tabular format that displays the frequency distributions of two or more categorical variables simultaneously. One application of contingency tables is for evaluating the performance of a classification model (using a confusion matrix). Another application of contingency tables is to investigate a potential association between two categorical variables by performing a chi-square test of independence.

  • Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data.

  • A histogram is a bar chart of data that have been grouped into a frequency distribution. A frequency polygon is a graph of frequency distributions obtained by drawing straight lines joining successive midpoints of bars representing the class frequencies.

  • A bar chart is used to plot the frequency distribution of categorical data, with each bar representing a distinct category and the bar’s height (or length) proportional to the frequency of the corresponding category. Grouped bar charts or stacked bar charts can present the frequency distribution of multiple categorical variables simultaneously.

  • A tree-map is a graphical tool to display categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group. Additional dimensions of categorical data can be displayed by nested rectangles.

  • A word cloud is a visual device for representing textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text.

  • A line chart is a type of graph used to visualize ordered observations and often to display the change of data series over time. A bubble line chart is a special type of line chart that uses varying-sized bubbles as data points to represent an additional dimension of data.

  • A scatter plot is a type of graph for visualizing the joint variation in two numerical variables. It is constructed by drawing dots to indicate the values of the two variables plotted against the corresponding axes. A scatter plot matrix organizes scatter plots between pairs of variables into a matrix format to inspect all pairwise relationships between more than two variables in one combined visual.

  • A heat map is a type of graphic that organizes and summarizes data in a tabular format and represents it using a color spectrum. It is often used in displaying frequency distributions or visualizing the degree of correlation among different variables.

  • The key consideration when selecting among chart types is the intended purpose of visualizing data (i.e., whether it is for exploring/presenting distributions or relationships or for making comparisons).

  • A population is defined as all members of a specified group. A sample is a subset of a population.

  • A parameter is any descriptive measure of a population. A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.

  • Sample statistics—such as measures of central tendency, measures of dispersion, skewness, and kurtosis—help with investment analysis, particularly in making probabilistic statements about returns.

  • Measures of central tendency specify where data are centered and include the mean, median, and mode (i.e., the most frequently occurring value).

  • The arithmetic mean is the sum of the observations divided by the number of observations. It is the most frequently used measure of central tendency.

  • The median is the value of the middle item (or the mean of the values of the two middle items) when the items in a set are sorted into ascending or descending order. The median is not influenced by extreme values and is most useful in the case of skewed distributions.

  • The mode is the most frequently observed value and is the only measure of central tendency that can be used with nominal data. A distribution may be unimodal (one mode), bimodal (two modes), trimodal (three modes), or have even more modes.

  • A portfolio’s return is a weighted mean return computed from the returns on the individual assets, where the weight applied to each asset’s return is the fraction of the portfolio invested in that asset.

  • The geometric mean, $\bar{X}_G$, of a set of observations $X_1, X_2, \ldots, X_n$ is $\bar{X}_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n}$, with $X_i \ge 0$ for $i = 1, 2, \ldots, n$. The geometric mean is especially important in reporting compound growth rates for time-series data. The geometric mean will always be less than an arithmetic mean whenever there is variance in the observations.

  • The harmonic mean, $\bar{X}_H$, is a type of weighted mean in which an observation’s weight is inversely proportional to its magnitude.

  • Quantiles—such as the median, quartiles, quintiles, deciles, and percentiles—are location parameters that divide a distribution into halves, quarters, fifths, tenths, and hundredths, respectively.

  • A box and whiskers plot illustrates the interquartile range (the “box”) as well as a range outside of the box that is based on the interquartile range, indicated by the “whiskers.”

  • Dispersion measures—such as the range, mean absolute deviation (MAD), variance, standard deviation, target downside deviation, and coefficient of variation—describe the variability of outcomes around the arithmetic mean.

  • The range is the difference between the maximum value and the minimum value of the dataset. The range has only a limited usefulness because it uses information from only two observations.

  • The MAD for a sample is the average of the absolute deviations of observations from the mean, $\frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$, where $\bar{X}$ is the sample mean and $n$ is the number of observations in the sample.

  • The variance is the average of the squared deviations around the mean, and the standard deviation is the positive square root of variance. In computing sample variance ($s^2$) and sample standard deviation ($s$), the average squared deviation is computed using a divisor equal to the sample size minus 1.

  • The target downside deviation, or target semideviation, is a measure of the risk of being below a given target. It is calculated as the square root of the average squared deviations from the target, but it includes only those observations below the target ($B$), or $\sqrt{\sum_{\text{for all } X_i \le B}^{n} \frac{(X_i - B)^2}{n - 1}}$.

  • The coefficient of variation, CV, is the ratio of the standard deviation of a set of observations to their mean value. By expressing the magnitude of variation among observations relative to their average size, the CV permits direct comparisons of dispersion across different datasets. Reflecting the correction for scale, the CV is a scale-free measure (i.e., it has no units of measurement).

  • Skew or skewness describes the degree to which a distribution is asymmetric about its mean. A return distribution with positive skewness has frequent small losses and a few extreme gains compared to a normal distribution. A return distribution with negative skewness has frequent small gains and a few extreme losses compared to a normal distribution. Zero skewness indicates a symmetric distribution of returns.

  • Kurtosis measures the combined weight of the tails of a distribution relative to the rest of the distribution. A distribution with fatter tails than the normal distribution is referred to as fat-tailed (leptokurtic); a distribution with thinner tails than the normal distribution is referred to as thin-tailed (platykurtic). Excess kurtosis is kurtosis minus 3, since 3 is the value of kurtosis for all normal distributions.

  • The correlation coefficient is a statistic that measures the association between two variables. It is the ratio of covariance to the product of the two variables’ standard deviations. A positive correlation coefficient indicates that the two variables tend to move together, whereas a negative coefficient indicates that the two variables tend to move in opposite directions. Correlation does not imply causation, simply association. Issues that arise in evaluating correlation include the presence of outliers and spurious correlation.
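
To make the frequency-distribution points in the Summary concrete, here is a minimal Python sketch that tallies a numerical variable into ordered bins and then derives the relative and cumulative relative frequencies. The function name `frequency_distribution`, the bin edges, and the sample returns are hypothetical choices for illustration and do not come from the reading. The sketch also assumes that each bin includes its lower limit and excludes its upper limit, except for the last bin, which includes both.

```python
def frequency_distribution(values, bin_edges):
    """Tally values into bins and return one row per bin.

    Each row is (lower, upper, count, relative_freq, cumulative_relative_freq).
    Assumed convention: bins are [lower, upper), except the last bin,
    which is [lower, upper]. Values outside all bins are ignored.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            lower, upper = bin_edges[i], bin_edges[i + 1]
            in_last_bin_top = (i == n_bins - 1) and (v == upper)
            if lower <= v < upper or in_last_bin_top:
                counts[i] += 1
                break

    total = sum(counts)  # number of observations that landed in some bin
    rows = []
    cumulative = 0.0
    for i, count in enumerate(counts):
        relative = count / total   # share of observations in this bin
        cumulative += relative     # running sum of relative frequencies
        rows.append((bin_edges[i], bin_edges[i + 1], count, relative, cumulative))
    return rows


# Hypothetical monthly returns in percent, used only for illustration.
returns_pct = [-4.0, -1.5, 0.5, 1.0, 2.5, 3.0, 3.5, 6.0]
table = frequency_distribution(returns_pct, bin_edges=[-5, 0, 5, 10])

for lower, upper, count, rel, cum in table:
    print(f"{lower:>4} to {upper:<4} count={count} rel={rel:.3f} cum={cum:.3f}")
```

The last row's cumulative relative frequency is 1 whenever every observation falls inside the chosen bins, which is a quick check that the bins cover the data.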
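
The alternative definitions of the mean listed in the Summary can be compared with a short Python sketch. The function names and sample figures below are hypothetical and not from the reading. Computing a compound growth rate by applying the geometric mean to (1 + return) values is one common approach and is assumed here.

```python
import math


def arithmetic_mean(xs):
    # Sum of the observations divided by the number of observations.
    return sum(xs) / len(xs)


def weighted_mean(values, weights):
    # Weights are assumed to sum to 1 (e.g., portfolio weights).
    return sum(w * v for v, w in zip(values, weights))


def geometric_mean(xs):
    # n-th root of the product of the observations; requires X_i >= 0.
    if any(x < 0 for x in xs):
        raise ValueError("geometric mean requires non-negative observations")
    return math.prod(xs) ** (1.0 / len(xs))


def harmonic_mean(xs):
    # n divided by the sum of reciprocals; requires non-zero observations.
    return len(xs) / sum(1.0 / x for x in xs)


def geometric_mean_return(returns):
    # Assumed convention: compound the (1 + r) growth factors, then subtract 1.
    return geometric_mean([1.0 + r for r in returns]) - 1.0


# Hypothetical annual returns, used only for illustration.
annual_returns = [0.10, -0.05, 0.20]
print("arithmetic mean return:", arithmetic_mean(annual_returns))
print("geometric mean return: ", geometric_mean_return(annual_returns))

# Hypothetical two-asset portfolio: 60% in an asset returning 10%, 40% in one returning 4%.
print("portfolio return:", weighted_mean([0.10, 0.04], [0.6, 0.4]))
```

Because the hypothetical returns vary from year to year, the geometric mean return comes out below the arithmetic mean return, as the Summary states.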
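
The dispersion measures in the Summary can be sketched in Python as follows. The function names and the data are hypothetical. For the target downside deviation, the sketch assumes that the divisor is the total sample size minus 1 (not the count of observations at or below the target) and that observations equal to the target are included; both choices follow the formula given in the Summary.

```python
import math


def sample_mean(xs):
    return sum(xs) / len(xs)


def mean_absolute_deviation(xs):
    # Average absolute deviation from the sample mean (divisor n).
    m = sample_mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def sample_variance(xs):
    # Average squared deviation with divisor n - 1.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_std(xs):
    # Positive square root of the sample variance.
    return math.sqrt(sample_variance(xs))


def target_downside_deviation(xs, target):
    # Only observations at or below the target contribute to the sum;
    # the divisor is assumed to be the full sample size minus 1.
    downside = sum((x - target) ** 2 for x in xs if x <= target)
    return math.sqrt(downside / (len(xs) - 1))


def coefficient_of_variation(xs):
    # Standard deviation relative to the mean; scale-free.
    return sample_std(xs) / sample_mean(xs)


# Hypothetical monthly returns in percent, used only for illustration.
returns_pct = [1.0, 2.0, 3.0, 4.0, 5.0]
print("range:", max(returns_pct) - min(returns_pct))
print("MAD:", mean_absolute_deviation(returns_pct))
print("sample variance:", sample_variance(returns_pct))
print("sample std dev:", sample_std(returns_pct))
print("target downside deviation (B = 3):", target_downside_deviation(returns_pct, 3.0))
print("CV:", coefficient_of_variation(returns_pct))
```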
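
The definition of the correlation coefficient as the ratio of covariance to the product of the two standard deviations can be sketched in a few lines of Python. The function names and data are hypothetical, and the sketch assumes the sample versions of covariance and standard deviation, each with divisor n - 1.

```python
import math


def sample_covariance(xs, ys):
    # Average cross-product of deviations from the means, divisor n - 1.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(xs):
    # The sample variance is the covariance of a variable with itself.
    return math.sqrt(sample_covariance(xs, xs))


def correlation(xs, ys):
    # Covariance divided by the product of the two standard deviations.
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical paired observations, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_noisy = [2.0, 1.0, 4.0, 3.0, 5.0]      # tends to move with x
y_opposite = [10.0, 8.0, 6.0, 4.0, 2.0]  # moves exactly opposite to x

print("corr(x, y_noisy):", correlation(x, y_noisy))
print("corr(x, y_opposite):", correlation(x, y_opposite))
```

With these made-up numbers the first pair has a positive coefficient and the second a coefficient of -1, matching the interpretation of the sign given in the Summary. Neither result says anything about causation.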