Applied ML 2020 - 02 Visualization and matplotlib
1:07:31

Saylor University

Overview

This video introduces the principles and practices of data visualization, emphasizing its importance for both exploring data and communicating findings. It covers foundational concepts from visualization pioneers like Edward Tufte and William Cleveland, discusses the hierarchy of visual channels for encoding data, and delves into the practical application of these principles using Python libraries, primarily Matplotlib. The lecture highlights common pitfalls such as overplotting and poor color map choices, and offers guidance on creating effective, reproducible, and informative visualizations.

Chapters

  • Data visualization serves two main purposes: exploring data to understand it better and communicating findings to others.
  • For this course, the focus is on data exploration, but the principles apply to communication as well.
  • Effective communication of results is crucial in data science and often required in capstone projects and professional settings.
Understanding why we visualize data helps focus our efforts on creating plots that are both insightful for ourselves and clear for others.
  • Edward Tufte's principles emphasize 'showing the data' and maximizing the 'data-ink ratio' (often called the data-to-ink ratio) by minimizing non-data elements.
  • William Cleveland highlights that 'tools matter,' stressing the importance of mastering visualization software.
  • A critical, often overlooked principle is to always label all elements of a plot (axes, colors, points) and to invest significant time in creating clear visualizations.
Adhering to these fundamental principles ensures that visualizations are clear, informative, and accurately represent the data, avoiding misinterpretation.
The speaker notes that many student reports and papers fail by not labeling axes or explaining what colors/points represent, rendering the plots useless.
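The labeling advice above can be sketched as follows. This is a minimal illustration, not the lecturer's code: the data is synthetic, and the variable names (`hours`, `score`, `attended`) are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)
score = 50 + 4 * hours + rng.normal(0, 5, size=50)
attended = rng.random(50) < 0.5  # hypothetical grouping variable

fig, ax = plt.subplots()
# Each group gets a label so the legend explains what the colors mean.
ax.scatter(hours[attended], score[attended],
           color="tab:blue", label="Attended review session")
ax.scatter(hours[~attended], score[~attended],
           color="tab:orange", label="Did not attend")
# Axis labels carry the quantity and, where relevant, the unit.
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score (points)")
ax.set_title("Exam score vs. hours studied (synthetic data)")
ax.legend()
```

Without the `label=` arguments and the `ax.legend()` call, a reader would have no way to tell what the two colors stand for.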
  • Data can be mapped to various visual channels like length, angle, shape, area, volume, color, and position.
  • Human perception follows a hierarchy: position and length are easiest to compare, followed by angle, area, and then color hue.
  • Avoid using color hue for quantitative data as it's difficult for humans to accurately compare or calculate ratios based on hue alone.
Understanding this hierarchy helps in choosing the most effective visual channels to represent data, ensuring that comparisons and interpretations are accurate and intuitive.
Pie charts are often less effective than bar charts because they rely on comparing angles or areas, which are harder to perceive accurately than lengths.
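To see the difference between comparing angles and comparing lengths, the same values can be drawn both ways. The sketch below uses made-up shares for five hypothetical categories that are deliberately close together.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical category shares (percent), chosen to be close to each other.
shares = {"A": 23, "B": 21, "C": 20, "D": 19, "E": 17}
names = list(shares.keys())
values = list(shares.values())

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 3.5))

# Pie chart: the reader has to compare angles or areas.
ax_pie.pie(values, labels=names)
ax_pie.set_title("Pie: compare angles/areas")

# Bar chart: the reader compares lengths against a common baseline.
ax_bar.bar(names, values)
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Bar: compare lengths")
```

In the bar chart the ordering of the categories is readable at a glance; in the pie chart the slices look nearly identical.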
  • Color maps are used to map continuous data values to colors, and their choice significantly impacts readability.
  • There are sequential (light-to-dark), diverging (midpoint to extremes), and qualitative color maps.
  • Perceptually uniform color maps (like 'viridis') are designed so that equal steps in the data are perceived as equal steps in color, with lightness changing steadily from one end to the other; this avoids the artifacts seen in older maps like 'jet'.
Using perceptually uniform color maps prevents misleading visual artifacts and ensures that the perceived intensity of colors accurately reflects the underlying data values, especially in heatmaps.
The 'jet' color map can create artificial contour lines or rings in heatmaps that do not exist in the actual data, unlike the perceptually uniform 'viridis' map.
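A side-by-side heatmap makes the difference visible. The sketch below draws the same smooth synthetic surface with both maps, then uses a rough brightness proxy (a weighted sum of the RGB channels, which is an approximation and not a proper perceptual lightness model) to show that 'jet' gets brighter and then darker again along its range.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# A smooth synthetic surface with no real edges or rings in it.
x = np.linspace(-3, 3, 200)
xx, yy = np.meshgrid(x, x)
z = np.exp(-(xx**2 + yy**2) / 4)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, name in zip(axes, ["viridis", "jet"]):
    im = ax.imshow(z, cmap=name, origin="lower", extent=(-3, 3, -3, 3))
    ax.set_title(f"cmap='{name}'")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.colorbar(im, ax=ax, label="z")

def approx_brightness(name, n=256):
    """Rough brightness proxy: weighted sum of the R, G, B channels."""
    rgb = plt.get_cmap(name)(np.linspace(0, 1, n))[:, :3]
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

# 'jet' brightens toward the middle and darkens again toward the end,
# so its step-to-step brightness changes have both signs.
jet_steps = np.diff(approx_brightness("jet"))
viridis = approx_brightness("viridis")
```

The reversal in brightness is one reason bands appear in the 'jet' panel even though the underlying surface is smooth.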
  • Matplotlib is the foundational library, with Pandas plotting built on top of it.
  • Seaborn builds on Matplotlib and Pandas, offering more sophisticated statistical plots.
  • Bokeh and Altair are modern libraries often used for interactive web-based visualizations; Altair in particular uses a declarative grammar.
  • Yellowbrick focuses on model visualization for scikit-learn.
Familiarity with different libraries allows you to choose the best tool for the specific visualization task, whether it's basic plotting, statistical analysis, or interactive exploration.
Pandas plotting is convenient for dataframes, while Seaborn provides ready-made plots like violin plots and facet plots that are more complex to build directly in Matplotlib.
  • A Matplotlib visualization consists of a Figure (the overall window/page) and Axes (the actual plotting area with coordinates).
  • Matplotlib offers two interfaces: a stateful (pyplot-based) interface with global state, and an object-oriented interface that is more explicit and generally preferred.
  • Functions like `plt.subplots()` are useful for creating grids of Axes within a Figure.
Understanding the Figure/Axes structure and the two interfaces is fundamental to controlling how and where plots are drawn, enabling more complex and customized visualizations.
Using `plt.subplots(2, 2)` creates a 2x2 grid of Axes objects, allowing you to plot into each subplot individually using the object-oriented approach (e.g., `ax[0, 0].plot(...)`).
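The two interfaces can be contrasted in a short sketch. The curves plotted here are arbitrary placeholders; only the calling style matters.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Stateful interface: pyplot tracks a "current" Figure and Axes,
# and every plt.* call acts on whatever is current.
plt.figure()
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
stateful_ax = plt.gca()  # the Axes that pyplot created implicitly

# Object-oriented interface: hold explicit Figure and Axes objects
# and call methods on the Axes you want to draw into.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("sin(x)")
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("cos(x)")
axes[1, 0].plot(x, np.sin(2 * x))
axes[1, 0].set_title("sin(2x)")
axes[1, 1].plot(x, np.cos(2 * x))
axes[1, 1].set_title("cos(2x)")
for ax in axes[1, :]:
    ax.set_xlabel("x")  # note: set_xlabel on an Axes, plt.xlabel in pyplot
fig.tight_layout()
```

With several subplots, the explicit version leaves no doubt about which Axes each call affects, which is the main argument for preferring it.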
  • `plot()` can create both line plots and scatter plots by changing marker styles.
  • `scatter()` offers more control over individual point colors and sizes, useful for encoding additional variables.
  • Histograms are essential for visualizing distributions, and it's important to adjust the number of bins from the default of 10.
  • Bar charts are straightforward, but horizontal bar charts are often better for readability with many or long category labels.
Knowing the capabilities of different plot types and their customization options allows you to choose the most appropriate visualization for your data and the message you want to convey.
When plotting categorical data with long names, using `plt.barh()` (horizontal bar chart) makes the tick labels much easier to read than a vertical bar chart.
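The four plot types above can be tried out in one figure. All data below is synthetic, and the category names in the bar chart are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = np.sin(x)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# plot(): a line by default; a marker-only format string such as "o"
# gives a scatter-like plot with one style for all points.
axes[0, 0].plot(x, y, label="line")
axes[0, 0].plot(x, y + 0.5, "o", label="markers only")
axes[0, 0].legend()
axes[0, 0].set_title("plot()")

# scatter(): per-point color (c) and size (s) can encode extra variables.
third = rng.random(30)
fourth = rng.uniform(20, 200, size=30)
sc = axes[0, 1].scatter(x, y, c=third, s=fourth, cmap="viridis")
fig.colorbar(sc, ax=axes[0, 1], label="third variable")
axes[0, 1].set_title("scatter()")

# hist(): the default of 10 bins is often too coarse, so set bins explicitly.
samples = rng.normal(size=1000)
counts, edges, _ = axes[1, 0].hist(samples, bins=50)
axes[1, 0].set_title("hist(bins=50)")

# barh(): horizontal bars keep long category labels readable.
names = ["A fairly long category name", "Another long category name",
         "Yet another verbose label", "Short"]
values = [12, 7, 15, 4]
axes[1, 1].barh(names, values)
axes[1, 1].set_title("barh()")

fig.tight_layout()
```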
  • Overplotting, common with large datasets, can obscure data structure; techniques like adjusting alpha (transparency) or using density plots (like hex grids) can help.
  • `twinx()` (or `twiny()`) creates a second Axes that shares the x-axis (or y-axis) with the first, so two datasets with different scales can be drawn in the same plotting area; this is useful for comparing related but differently scaled variables.
  • Aspect ratios and baselines significantly influence data interpretation; these should be chosen consciously, not just accepted as defaults.
  • 3D plots can be visually appealing but are often difficult to interpret, especially without interactivity, and should be used sparingly.
Awareness of these techniques and pitfalls helps avoid common mistakes, leading to more accurate interpretations and more effective communication of complex data relationships.
Comparing math PhDs awarded per year with arcade revenue per year requires `twinx()` because their scales (thousands vs. billions) are vastly different; on a single shared y-axis, the smaller series would appear as a flat line.
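Two of these techniques can be sketched together: handling overplotting with `alpha` and `hexbin()`, and putting two differently scaled series on twin axes. The data is synthetic; the two series in the second figure are placeholders on the order of thousands and billions, not the actual PhD or arcade figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# --- Overplotting: many correlated synthetic points ---
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(x, y, s=5)
axes[0].set_title("Opaque markers: overplotted")
axes[1].scatter(x, y, s=5, alpha=0.05)  # transparency reveals density
axes[1].set_title("alpha=0.05")
hb = axes[2].hexbin(x, y, gridsize=40, cmap="viridis")  # 2D density
axes[2].set_title("hexbin()")
fig.colorbar(hb, ax=axes[2], label="points per hexagon")

# --- Twin axes: two placeholder series on very different scales ---
years = np.arange(2000, 2010)
small = 1000 + 50 * np.arange(10)        # order of thousands
large = 1.0e9 + 5.0e7 * np.arange(10)    # order of billions

fig2, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second Axes sharing the same x-axis
ax_left.plot(years, small, color="tab:blue")
ax_right.plot(years, large, color="tab:red")
ax_left.set_xlabel("Year")
# Color each y-label like its line so readers know which axis to read.
ax_left.set_ylabel("Small-scale series (count)", color="tab:blue")
ax_right.set_ylabel("Large-scale series (dollars)", color="tab:red")
```

Twin axes should be used with care: the visual relationship between the two lines depends entirely on how each y-range is chosen.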
  • Invest time in visualization: consciously decide on mappings, color maps, aspect ratios, and baselines.
  • Always label all plot elements clearly.
  • Each plot should aim to answer a specific question or tell a clear story.
  • Avoid manual editing of figures (e.g., in Photoshop); always strive for reproducible plots generated directly from code.
These guidelines ensure that visualizations are not only informative and accurate but also reproducible and serve a clear purpose, whether for personal exploration or external communication.
Instead of just showing a histogram of age, a good visualization would state the question (e.g., 'Is the age distribution as expected?') and the conclusion (e.g., 'The distribution shows a concentration of young adults, with a long tail').
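A reproducible, question-driven plot might look like the sketch below. The function name `make_age_histogram` and the age data are hypothetical; the ages are drawn from a right-skewed synthetic distribution so that the stated conclusion matches what the plot shows.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def make_age_histogram(seed=0):
    """Build the figure entirely from code, with a fixed seed."""
    rng = np.random.default_rng(seed)
    # Synthetic, right-skewed ages (illustration only, not real data).
    ages = rng.gamma(shape=2.0, scale=8.0, size=2000) + 18

    fig, ax = plt.subplots(figsize=(6, 4))
    counts, edges, _ = ax.hist(ages, bins=40)
    ax.set_xlabel("Age (years)")
    ax.set_ylabel("Number of people")
    # The title states the question and the conclusion drawn from the plot.
    ax.set_title("Is the age distribution as expected?\n"
                 "Mostly young adults, with a long right tail")
    return fig, counts

fig, counts = make_age_histogram()

# Export from code instead of editing by hand; written to memory here,
# but a file path such as "age_hist.png" works the same way.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=150)
```

Because every choice (bins, labels, size, resolution) lives in the script, rerunning it regenerates the same figure after the data or the analysis changes.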

Key takeaways

  1. Prioritize clarity and accuracy by understanding the perceptual hierarchy of visual channels.
  2. Choose color maps carefully, opting for perceptually uniform options for continuous data to avoid misleading artifacts.
  3. Mastering Matplotlib's object-oriented interface provides greater control and leads to more maintainable code.
  4. Address overplotting in large datasets using density plots or transparency, rather than accepting obscured data.
  5. Always label plots comprehensively and ensure they are reproducible directly from code.
  6. Every visualization should serve a purpose, answering a specific question or conveying a clear message.
  7. Be mindful of how aspect ratios and baselines can influence interpretation and make conscious choices about them.

Key terms

Data Visualization, Data Exploration, Data Communication, Data-to-Ink Ratio, Visual Channels, Perceptual Hierarchy, Color Map, Perceptually Uniform, Matplotlib, Figure, Axes, Stateful Interface, Object-Oriented Interface, Overplotting, Hex Grid, Twin Axes, Aspect Ratio, Baseline, Reproducibility

Test your understanding

  1. Why is understanding the perceptual hierarchy of visual channels crucial when designing a plot?
  2. What are the potential problems with using older color maps like 'jet' for heatmaps, and how do perceptually uniform maps address this?
  3. Explain the difference between Matplotlib's stateful and object-oriented interfaces and why the latter is generally preferred.
  4. What strategies can be employed to mitigate the problem of overplotting in scatter plots with large datasets?
  5. Why is it important to consciously choose aspect ratios and baselines for plots, rather than relying solely on default settings?
