
Lecture 04: Gathering and Collecting Data
MIT OpenCourseWare
Overview
This lecture explores various methods for obtaining data, categorizing them into three main approaches: utilizing existing data, collecting new data, and leveraging internet-based data sources. It emphasizes the importance of understanding data sources, their potential limitations, and the ethical considerations involved in data collection and usage. The discussion covers public databases, academic repositories, web scraping techniques, and the process of designing and implementing one's own data collection efforts, highlighting the need for careful planning, quality control, and ethical approval.
Chapters
- Data can be found in existing libraries, collected by oneself, or accessed from the internet.
- Existing data may not always be suitable for specific research needs, necessitating custom data collection.
- Internet data, while abundant, often requires processing to be usable.
- The lecture will cover finding data, understanding its distribution, and extracting more information from it.
- Libraries (like MIT Libraries) and librarians are valuable resources for locating existing datasets.
- Government websites (e.g., data.gov) offer a wide range of public data.
- IPUMS (Integrated Public Use Microdata Series) provides access to census data and other large-scale surveys like the Current Population Survey and American Time Use Survey.
- Academic data repositories like ICPSR (Inter-university Consortium for Political and Social Research) host anonymized and documented datasets from researchers.
- IPUMS International extends access to census microdata from numerous countries.
- Demographic and Health Surveys (DHS) provide rich data on health and demographic indicators in developing countries.
- The Living Standards Measurement Study (LSMS) surveys from the World Bank offer detailed household data on consumption, income, and education in developing countries.
- The RAND Corporation maintains valuable household panel surveys for countries such as Indonesia, Malaysia, and Mexico.
- Data quality varies; reputable sources often have quality control measures like back-checking.
- Identifying information must be removed or restricted to protect privacy; datasets that retain it come with stricter access requirements.
- Access to sensitive data may require special clearance, commitments to data security, and human subject approval.
- Conceptual issues, such as recall error or poorly phrased questions, can affect data accuracy even with technical quality checks.
- Web scraping involves extracting data from websites, whether from single pages, entire sites, or forms.
- APIs (Application Programming Interfaces) are often provided by platforms like Twitter and Facebook for structured data access.
- Tools like Python with Beautiful Soup or R with `readHTMLTable` (from the XML package) can be used for web scraping.
- Web scraping can be time-consuming and may face technical limitations or website restrictions.
- Collecting original data can involve using online survey tools, mobile apps, or organizing data collection teams.
- A data management plan is crucial, outlining data security, encryption, and potential sharing strategies.
- Human subject approval from an Institutional Review Board (IRB) is necessary to ensure ethical data collection.
- Piloting data collection instruments helps identify and resolve issues before full implementation.
- The Freedom of Information Act (FOIA) allows public access to government-collected data, with exceptions for confidential or personally identifiable information.
- FOIA requests can be cumbersome and slow, often involving appeals and legal processes.
- Government agencies may deny requests if compiling the data is too costly or if releasing it risks re-identifying individuals.
- FOIA applies only to government data, not data collected by private companies.
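The web scraping step described above (pulling a table out of a page's HTML) can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: it uses Python's standard-library `html.parser` rather than Beautiful Soup so that it runs without extra installs, and the page content, the `TableParser` class, and the `scrape_table` helper are all hypothetical names with made-up values. A real scraper would first download the page and should check the site's terms and restrictions.

```python
from html.parser import HTMLParser

# Made-up page content, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
<table id="population">
  <tr><th>Region</th><th>Year</th><th>Population</th></tr>
  <tr><td>Region A</td><td>2010</td><td>1,200</td></tr>
  <tr><td>Region B</td><td>2010</td><td>3,450</td></tr>
</table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read (None outside <tr>)
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
# Scraped values arrive as text; numeric columns usually need cleaning.
for record in records:
    record["Population"] = int(record["Population"].replace(",", ""))

print(records)
```

The final cleaning loop reflects the point that internet data often requires processing before it is usable: the population figures arrive as strings with thousands separators and have to be converted to numbers.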
Key takeaways
- Data acquisition is a critical first step in research, with diverse sources ranging from public archives to self-collection.
- Existing data repositories like IPUMS and ICPSR are invaluable resources, offering structured and documented datasets.
- International datasets (IPUMS International, DHS, LSMS) are essential for global comparative studies.
- Data quality and privacy are paramount; understanding a dataset's limitations and adhering to ethical guidelines are both crucial.
- Web scraping and APIs offer methods to access data from the internet, but require technical skills and awareness of website policies.
- Collecting original data involves rigorous planning, ethical approval (IRB), and data management strategies.
- The Freedom of Information Act provides a mechanism for accessing government data, though it can be a complex and lengthy process.
- Researchers must critically evaluate the source, quality, and potential biases of any data they intend to use.
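As one illustration of the privacy takeaway, the sketch below removes direct identifiers from survey records and replaces them with a keyed pseudonymous ID. It is only a sketch under stated assumptions: the column names, the `pseudonymize` helper, and the sample records are hypothetical, and dropping direct identifiers alone does not rule out re-identification through combinations of the remaining fields.

```python
import hashlib
import hmac

# Hypothetical column names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "phone"}


def pseudonymize(records, secret_key, id_field="name"):
    """Drop direct identifiers and add a stable pseudonymous respondent_id.

    The ID is an HMAC-SHA256 of the identifying field, so the same person
    gets the same ID across files, but the ID cannot be recomputed without
    the secret key (which should be stored separately and securely).
    """
    cleaned = []
    for record in records:
        digest = hmac.new(
            secret_key, record[id_field].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["respondent_id"] = digest[:12]  # shortened for readability
        cleaned.append(row)
    return cleaned


# Made-up survey responses for illustration only.
raw = [
    {"name": "Respondent One", "phone": "555-0100", "age": 34, "income": 42000},
    {"name": "Respondent Two", "phone": "555-0101", "age": 51, "income": 38000},
    {"name": "Respondent One", "phone": "555-0100", "age": 35, "income": 45000},
]

shared = pseudonymize(raw, secret_key=b"example-key-kept-off-the-shared-drive")
for row in shared:
    print(row)
```

Using a keyed hash instead of a plain hash is a design choice: without the key, someone who guesses a respondent's name cannot recompute the ID to check the guess.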
Test your understanding
- What are the three primary categories of data sources discussed in the lecture, and what are the main advantages and disadvantages of each?
- How can researchers leverage existing data repositories like IPUMS or ICPSR for their studies, and what are the potential limitations of using such data?
- What are the key ethical considerations and practical steps involved in collecting one's own data, including the role of the IRB?
- Explain the concept of web scraping and its utility, along with the common tools and potential challenges associated with it.
- Under what circumstances might a researcher utilize the Freedom of Information Act to obtain data, and what are the typical obstacles encountered?