Lecture 04: Gathering and Collecting Data

MIT OpenCourseWare

Overview

This lecture explores various methods for obtaining data, categorizing them into three main approaches: utilizing existing data, collecting new data, and leveraging internet-based data sources. It emphasizes the importance of understanding data sources, their potential limitations, and the ethical considerations involved in data collection and usage. The discussion covers public databases, academic repositories, web scraping techniques, and the process of designing and implementing one's own data collection efforts, highlighting the need for careful planning, quality control, and ethical approval.

Chapters

Data can be found in existing libraries, collected by oneself, or accessed from the internet.
Existing data may not always be suitable for specific research needs, necessitating custom data collection.
Internet data, while abundant, often requires processing to be usable.
The lecture will cover finding data, understanding its distribution, and extracting more information from it.

Understanding where and how to access data is the foundational step for any empirical research, enabling the exploration of real-world phenomena.

The lecture mentions that a librarian at MIT can help find and even purchase data, illustrating a resource for accessing existing datasets.

Libraries (like MIT Libraries) and librarians are valuable resources for locating existing datasets.
Government websites (e.g., data.gov) offer a wide range of public data.
IPUMS (Integrated Public Use Microdata Series) provides access to census data and other large-scale surveys like the Current Population Survey and American Time Use Survey.
Academic data repositories like ICPSR (Inter-university Consortium for Political and Social Research) host anonymized and documented datasets from researchers.

Leveraging established repositories saves time and resources, providing access to well-curated data that has often undergone significant cleaning and documentation.

The IPUMS website hosts census data, allowing researchers to access information on a large number of people for free.

The International IPUMS project extends access to census data from numerous countries.
Demographic and Health Surveys (DHS) provide rich data on health and demographic indicators in developing countries.
Living Standards Measurement Surveys (LSMS) from the World Bank offer detailed household data on consumption, income, and education in developing countries.
The RAND Corporation maintains valuable household panel surveys for countries such as Indonesia, Malaysia, and Mexico.

These international datasets are crucial for comparative research and understanding development challenges and successes across different regions.

Although DHS data do not include income, they record ownership of assets such as tables and refrigerators, which can be used to construct a measure of household wealth.
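As a rough illustration of turning asset ownership into a wealth measure, the sketch below scores each household by the share of listed assets it owns and ranks households on that score. The records, the asset list, and the function names are hypothetical, and equal weighting is a simplifying assumption: published wealth indices typically weight assets (for example with principal components) rather than counting them equally.

```python
# Hypothetical household records: 1 = owns the asset, 0 = does not.
households = [
    {"id": "h1", "table": 1, "refrigerator": 1, "radio": 1},
    {"id": "h2", "table": 1, "refrigerator": 0, "radio": 1},
    {"id": "h3", "table": 0, "refrigerator": 0, "radio": 0},
    {"id": "h4", "table": 1, "refrigerator": 0, "radio": 0},
]
ASSETS = ["table", "refrigerator", "radio"]  # assumed asset list


def asset_scores(records, assets):
    """Return {household id: share of the listed assets owned}."""
    return {r["id"]: sum(r[a] for a in assets) / len(assets) for r in records}


def rank_households(scores):
    """Order household ids from lowest to highest score (ties broken by id)."""
    return sorted(scores, key=lambda hid: (scores[hid], hid))


scores = asset_scores(households, ASSETS)
ranking = rank_households(scores)
print(ranking)
```

The ranking, rather than the raw score, is usually what matters here: it lets households be grouped into poorer and richer segments even though no income was recorded.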

Data quality varies; reputable sources often have quality control measures like back-checking.
Identifying information must be removed or restricted to protect privacy, leading to stricter data access requirements.
Access to sensitive data may require special clearance, commitments to data security, and human subject approval.
Conceptual issues, such as recall error or poorly phrased questions, can affect data accuracy even with technical quality checks.

Understanding data limitations and privacy concerns is essential for ethical research and for interpreting findings accurately.
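Back-checking generally means re-interviewing a subsample of respondents and comparing the answers with the original interview. A minimal sketch of that comparison follows, assuming both rounds are keyed by a household id; the field names and data are hypothetical.

```python
def mismatch_rates(original, backcheck, fields):
    """Share of re-interviewed households whose answers differ, per field."""
    shared_ids = sorted(set(original) & set(backcheck))
    if not shared_ids:
        raise ValueError("no households appear in both rounds")
    rates = {}
    for field in fields:
        diffs = sum(original[h][field] != backcheck[h][field] for h in shared_ids)
        rates[field] = diffs / len(shared_ids)
    return rates


# Hypothetical original interviews and a back-checked subsample.
original = {
    "h1": {"household_size": 4, "owns_refrigerator": True},
    "h2": {"household_size": 6, "owns_refrigerator": False},
    "h3": {"household_size": 3, "owns_refrigerator": False},
    "h4": {"household_size": 5, "owns_refrigerator": True},
}
backcheck = {
    "h1": {"household_size": 4, "owns_refrigerator": True},
    "h3": {"household_size": 2, "owns_refrigerator": False},
}

rates = mismatch_rates(original, backcheck, ["household_size", "owns_refrigerator"])
print(rates)
```

A high mismatch rate on a field flags a possible problem, but it cannot say whether the cause is enumerator error, recall error, or a poorly phrased question.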

The aging dataset example illustrates how even seemingly innocuous information like living in a specific state, when combined with other data, could potentially re-identify individuals, necessitating restricted access.
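One way to see the re-identification risk is to count how many records share each combination of seemingly harmless attributes. The sketch below flags combinations held by fewer than k records; the attribute names, the data, and the threshold are assumptions for illustration, not details of the dataset discussed in the lecture.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical de-identified records (no names, but combinable attributes).
records = [
    {"state": "MA", "birth_year": 1930, "sex": "F"},
    {"state": "MA", "birth_year": 1930, "sex": "F"},
    {"state": "MA", "birth_year": 1930, "sex": "F"},
    {"state": "WY", "birth_year": 1921, "sex": "M"},
]

risky = small_cells(records, ["state", "birth_year", "sex"], k=3)
print(risky)
```

A record that is alone in its cell could be matched to an outside source that lists the same attributes, which is why fields like state of residence are often coarsened or placed behind restricted access.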

Web scraping involves extracting data from websites, whether from single pages, entire sites, or forms.
APIs (Application Programming Interfaces) are often provided by platforms like Twitter and Facebook for structured data access.
Tools such as Python with Beautiful Soup or R with `readHTMLTable` (from the XML package) can be used for web scraping.
Web scraping can be time-consuming and may face technical limitations or website restrictions.

Web scraping offers a powerful method to collect data that is publicly visible online but not readily available in a structured format.

Researchers collected prices of used books from abebooks.com by inspecting the website's HTML and using scraping tools to extract title, date, and price information.
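The extraction step can be sketched as follows. To stay runnable without extra installs, this example uses Python's standard-library `html.parser` instead of Beautiful Soup, and it parses a hard-coded snippet instead of fetching a live page. The HTML structure and class names are invented for illustration and are not the markup of any real site.

```python
from html.parser import HTMLParser

# Hypothetical listing markup; a real page must be inspected to find its own tags.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Middlemarch</span>
      <span class="date">1994</span><span class="price">US$ 4.50</span></li>
  <li class="listing"><span class="title">Walden</span>
      <span class="date">1910</span><span class="price">US$ 12.00</span></li>
</ul>
"""
FIELDS = {"title", "date", "price"}


class ListingParser(HTMLParser):
    """Collect the text of title/date/price spans inside each listing item."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._current = None  # dict for the listing being read, if any
        self._field = None    # name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "listing":
            self._current = {}
        elif tag == "span" and css_class in FIELDS and self._current is not None:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.listings.append(self._current)
            self._current = None


parser = ListingParser()
parser.feed(SAMPLE_HTML)

# Convert the scraped strings into typed values.
books = [
    {
        "title": b["title"].strip(),
        "year": int(b["date"].strip()),
        "price": float(b["price"].replace("US$", "").strip()),
    }
    for b in parser.listings
]
print(books)
```

Before running anything like this against a live site, it is worth checking the site's terms of use and any access restrictions it publishes.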

Collecting original data can involve using online survey tools, mobile apps, or organizing data collection teams.
A data management plan is crucial, outlining data security, encryption, and potential sharing strategies.
Human subject approval from an Institutional Review Board (IRB) is necessary to ensure ethical data collection.
Piloting data collection instruments helps identify and resolve issues before full implementation.

When existing data is insufficient, designing and executing one's own data collection is a viable, albeit complex, alternative to answer specific research questions.

An experiment could involve installing apps on participants' phones to track their movement, or asking them to alter their commute methods to observe behavioral changes.
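One small piece of a data management plan can be sketched in code: replacing direct identifiers with keyed pseudonyms before the data are stored or shared. This is pseudonymization, not encryption, and the field names, key handling, and 16-character truncation are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def strip_identifiers(record, id_field, drop_fields, secret_key):
    """Drop direct identifiers and attach a pseudonymous id in their place."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key != id_field and key not in drop_fields
    }
    cleaned["pid"] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned


# Placeholder key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-key-stored-outside-the-dataset"

# Hypothetical raw survey record.
raw = {"phone": "555-0100", "name": "A. Respondent", "commute_minutes": 35}
clean = strip_identifiers(raw, "phone", {"name"}, SECRET_KEY)
print(clean)
```

Because the same identifier always maps to the same pseudonym under a given key, repeated observations of one participant can still be linked, while anyone without the key cannot recompute the mapping from a list of phone numbers.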

The Freedom of Information Act (FOIA) allows public access to government-collected data, with exceptions for confidential or personally identifying information.
FOIA requests can be cumbersome and slow, often involving appeals and legal processes.
Government agencies may deny requests if compiling the data is too costly or if it risks re-identification.
FOIA applies only to government data, not data collected by private companies.

FOIA provides a legal avenue to access potentially valuable government data that might not be publicly advertised, though its practical application can be challenging.

A request for data from the Drug Enforcement Administration via FOIA can be extremely slow and require extensive back-and-forth, illustrating the difficulties in accessing such information.

Key takeaways

1. Data acquisition is a critical first step in research, with diverse sources ranging from public archives to self-collection.
2. Existing data repositories like IPUMS and ICPSR are invaluable resources, offering structured and documented datasets.
3. International datasets (IPUMS International, DHS, LSMS) are essential for global comparative studies.
4. Data quality and privacy are paramount; understanding limitations and adhering to ethical guidelines is crucial.
5. Web scraping and APIs offer methods to access data from the internet, but require technical skills and awareness of website policies.
6. Collecting original data involves rigorous planning, ethical approval (IRB), and data management strategies.
7. The Freedom of Information Act provides a mechanism for accessing government data, though it can be a complex and lengthy process.
8. Researchers must critically evaluate the source, quality, and potential biases of any data they intend to use.

Key terms

Data Repositories
IPUMS
ICPSR
Demographic and Health Surveys (DHS)
Living Standards Measurement Surveys (LSMS)
Data Anonymization
Institutional Review Board (IRB)
Web Scraping
API (Application Programming Interface)
Panel Data
Repeated Cross-Sectional Data
Freedom of Information Act (FOIA)

Test your understanding

1. What are the three primary categories of data sources discussed in the lecture, and what are the main advantages and disadvantages of each?
2. How can researchers leverage existing data repositories like IPUMS or ICPSR for their studies, and what are the potential limitations of using such data?
3. What are the key ethical considerations and practical steps involved in collecting one's own data, including the role of the IRB?
4. Explain the concept of web scraping and its utility, along with the common tools and potential challenges associated with it.
5. Under what circumstances might a researcher utilize the Freedom of Information Act to obtain data, and what are the typical obstacles encountered?