
Applied ML 2020 - 02 Visualization and matplotlib
Saylor University
Overview
This video introduces the principles and practices of data visualization, emphasizing its importance for both exploring data and communicating findings. It covers foundational concepts from visualization pioneers like Edward Tufte and William Cleveland, discusses the hierarchy of visual channels for encoding data, and delves into the practical application of these principles using Python libraries, primarily Matplotlib. The lecture highlights common pitfalls such as overplotting and poor color map choices, and offers guidance on creating effective, reproducible, and informative visualizations.
Chapters
- Data visualization serves two main purposes: exploring data to understand it better and communicating findings to others.
- For this course, the focus is on data exploration, but the principles apply to communication as well.
- Effective communication of results is crucial in data science and often required in capstone projects and professional settings.
- Edward Tufte's principles emphasize 'showing the data' and maximizing the 'data-ink ratio', the share of a plot's ink that actually encodes data, by minimizing non-data elements.
- William Cleveland highlights that 'tools matter,' stressing the importance of mastering visualization software.
- A critical, often overlooked principle is to always label all elements of a plot (axes, colors, points) and to invest significant time in creating clear visualizations.
- Data can be mapped to various visual channels like length, angle, shape, area, volume, color, and position.
- Human perception follows a hierarchy: position and length are easiest to compare, followed by angle, area, and then color hue.
- Avoid using color hue for quantitative data as it's difficult for humans to accurately compare or calculate ratios based on hue alone.
- Color maps are used to map continuous data values to colors, and their choice significantly impacts readability.
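- There are sequential color maps (light-to-dark, for values that run from low to high), diverging color maps (two hues moving away from a neutral midpoint, for data with a meaningful center such as zero), and qualitative color maps (distinct colors for categories).
- Perceptually uniform color maps (like 'viridis') are designed so that equal steps in the data appear as equal steps in perceived color, with lightness changing steadily from one end to the other. This avoids the artificial bands and edges seen in older maps like 'jet'.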
- Matplotlib is the foundational library, with Pandas plotting built on top of it.
- Seaborn builds on Matplotlib and Pandas, offering more sophisticated statistical plots.
- Bokeh and Altair are newer libraries often used for interactive, web-based visualizations; Altair in particular takes a declarative approach.
- Yellowbrick focuses on model visualization for scikit-learn.
- A Matplotlib visualization consists of a Figure (the overall window/page) and Axes (the actual plotting area with coordinates).
- Matplotlib offers two interfaces: a stateful (pyplot-based) interface with global state, and an object-oriented interface that is more explicit and generally preferred.
- Functions like `plt.subplots()` are useful for creating grids of Axes within a Figure.
- `plot()` can create both line plots and scatter plots by changing marker styles.
- `scatter()` offers more control over individual point colors and sizes, useful for encoding additional variables.
- Histograms are essential for visualizing distributions. Matplotlib's default of 10 bins is rarely the right choice, so set the number of bins deliberately.
- Bar charts are straightforward, but horizontal bar charts are often better for readability with many or long category labels.
- Overplotting, common with large datasets, can obscure data structure; techniques like adjusting alpha (transparency) or using density plots (like hex grids) can help.
- `twinx()` or `twiny()` creates a second Axes that shares the x-axis (or y-axis) with the first but has its own independent scale on the other axis, which is useful for comparing related but differently scaled variables in one plot.
- Aspect ratios and baselines significantly influence data interpretation; these should be chosen consciously, not just accepted as defaults.
- 3D plots can be visually appealing but are often difficult to interpret, especially without interactivity, and should be used sparingly.
- Invest time in visualization: consciously decide on mappings, color maps, aspect ratios, and baselines.
- Always label all plot elements clearly.
- Each plot should aim to answer a specific question or tell a clear story.
- Avoid manual editing of figures (e.g., in Photoshop); always strive for reproducible plots generated directly from code.
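Several of the Matplotlib points above (the object-oriented interface, `plt.subplots()`, transparency and hex-grid density plots for overplotting, and labeling every element) can be sketched together in one short script. This is a minimal illustration, not code from the lecture: the synthetic data, the function name `overplotting_demo`, and the specific values for `alpha`, `gridsize`, and the figure size are all assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def overplotting_demo(n_points=20_000, seed=0):
    """Draw the same synthetic data three ways to compare overplotting fixes."""
    # Synthetic, correlated data (an assumption for illustration only).
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_points)
    y = x + rng.normal(scale=0.5, size=n_points)

    # Object-oriented interface: subplots() returns the Figure and an array of Axes,
    # and every later call is made on an explicit Axes object.
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # 1) Opaque markers: the dense center collapses into a solid blob.
    axes[0].scatter(x, y, s=5)
    axes[0].set_title("Opaque markers (overplotted)")

    # 2) Transparency: overlapping points accumulate, so density becomes visible.
    axes[1].scatter(x, y, s=5, alpha=0.05)
    axes[1].set_title("alpha=0.05")

    # 3) Hex grid: count points per hexagon and map the count to a color.
    hb = axes[2].hexbin(x, y, gridsize=40, cmap="viridis")
    axes[2].set_title("hexbin density")
    fig.colorbar(hb, ax=axes[2], label="points per hexagon")

    # Label every panel, including the axes.
    for ax in axes:
        ax.set_xlabel("x (synthetic)")
        ax.set_ylabel("y (synthetic)")

    fig.tight_layout()
    return fig, axes


fig, axes = overplotting_demo()
```

Calling `fig.savefig("overplotting_demo.png")` afterwards writes the figure to disk, which keeps the plot reproducible from code rather than from manual edits.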
Key takeaways
- Prioritize clarity and accuracy by understanding the perceptual hierarchy of visual channels.
- Choose color maps carefully, opting for perceptually uniform options for continuous data to avoid misleading artifacts.
- Mastering Matplotlib's object-oriented interface provides greater control and leads to more maintainable code.
- Address overplotting in large datasets using density plots or transparency, rather than accepting obscured data.
- Always label plots comprehensively and ensure they are reproducible directly from code.
- Every visualization should serve a purpose, answering a specific question or conveying a clear message.
- Be mindful of how aspect ratios and baselines can influence interpretation and make conscious choices about them.
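The color map takeaway can be checked numerically. The sketch below samples 'viridis' and 'jet' and plots a rough lightness value along each map. It is an illustration under stated assumptions: the helper names `approx_lightness` and `is_monotonic` are made up for this example, and the lightness measure is a simple luma approximation (Rec. 709 weights applied to the sRGB values), not a true perceptual lightness conversion.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def approx_lightness(cmap_name, n=256):
    """Rough lightness proxy for n samples along a color map.

    Uses Rec. 709 luma weights on the sRGB values. This is an approximation
    chosen for brevity, not a true CIELAB L* conversion.
    """
    rgba = plt.get_cmap(cmap_name)(np.linspace(0.0, 1.0, n))
    return rgba[:, :3] @ np.array([0.2126, 0.7152, 0.0722])


def is_monotonic(values):
    """Return True if the sequence never decreases."""
    return bool(np.all(np.diff(values) >= 0))


# Plot approximate lightness against position in the color map for both maps.
fig, ax = plt.subplots(figsize=(6, 4))
positions = np.linspace(0.0, 1.0, 256)
for name in ("viridis", "jet"):
    ax.plot(positions, approx_lightness(name), label=name)

ax.set_xlabel("position in color map (0 = low, 1 = high)")
ax.set_ylabel("approximate lightness (luma)")
ax.set_title("Lightness along viridis vs. jet")
ax.legend(title="color map")
```

Under this approximation, `is_monotonic(approx_lightness("viridis"))` holds while the same check fails for 'jet', whose lightness rises toward cyan and yellow and then falls again toward red. That rise and fall is one source of the artificial bands mentioned above.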
Test your understanding
- Why is understanding the perceptual hierarchy of visual channels crucial when designing a plot?
- What are the potential problems with using older color maps like 'jet' for heatmaps, and how do perceptually uniform maps address this?
- Explain the difference between Matplotlib's stateful and object-oriented interfaces and why the latter is generally preferred.
- What strategies can be employed to mitigate the problem of overplotting in scatter plots with large datasets?
- Why is it important to consciously choose aspect ratios and baselines for plots, rather than relying solely on default settings?