Chapter 6 Data Processing and Validation

Data collected by participants must be processed and validated before it can be usefully analyzed. Processing entails, at a minimum, combining reports into a centralized database; ideally it should also include some mechanism for validation and error-correction. Mosquito Alert relies on a central server running a set of Django/Python-based web applications that handle this process and provide front-end portals through which various types of participants (general public, expert validators, etc.) interact with the system.

6.1 Participant Error-Checking and Revision

One component of error-checking is ensuring that participants can change or even delete reports after sending them. Participants may at times accidentally mark the wrong location or enter some other erroneous information, and the system should enable them to make changes. Mosquito Alert does this by allowing multiple versions of each report; all are stored for future reference but only the most recent is used in analysis and dissemination.
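The versioning rule described above can be illustrated with a minimal sketch. It is not the Mosquito Alert implementation: the `ReportVersion` fields, the integer version counter, and the `deleted` flag are all hypothetical names chosen for illustration, and treating a deletion as a final version is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportVersion:
    report_id: str         # hypothetical ID shared by all versions of one report
    version: int           # hypothetical counter; a higher value is more recent
    lat: float
    lon: float
    deleted: bool = False  # hypothetical flag: this version withdraws the report


def latest_versions(versions):
    """Return the most recent version of each report, skipping withdrawn ones.

    Every version remains in the input (the stored history); only the
    selection used for analysis is narrowed to the newest version.
    """
    latest = {}
    for v in versions:
        current = latest.get(v.report_id)
        if current is None or v.version > current.version:
            latest[v.report_id] = v
    return [v for v in latest.values() if not v.deleted]
```

Keeping the full history and filtering at read time means a participant's correction never destroys the earlier record, which matches the "stored for future reference" behavior described above.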

6.2 Sampling Effort Collection

Another component is the collection and analysis of information about participants’ sampling effort. This information can be used to correct biases resulting from the uneven distribution of sampling activity across space and time. It is important to be able to determine, for example, whether no reports have been received from a particular town because there are no target species there or because there are no participants looking there; conversely, one must determine whether a town with many reports has elevated mosquito presence or simply many participants.

In Mosquito Alert, this is done by collecting a small amount of location information from participants. Unless they opt out of this feature, the Mosquito Alert application uses the network and satellite location services on participants’ mobile devices to detect their locations 5 times per day. The times are randomly selected independently by each device each day during the hours when targeted species are most likely to be biting. As noted above in the ethics section, the device does not share the detected location itself with the central server, but instead shares only the identifier of the pre-defined sampling cell into which it falls. For computational efficiency (to reduce battery drain on the device), the sampling cell grid is defined simply by evenly spaced latitude and longitude divisions (initially 0.05 degrees each; currently 0.025 degrees each).
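The two mechanisms in this paragraph, mapping a location to a grid cell and drawing random sampling times, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the application's actual code: the function names, the (row, column) form of the cell identifier, and the 06:00 to 22:00 sampling window are hypothetical (the text does not specify the hours or the identifier format). Only the 0.025 degree spacing and the five daily samples come from the text.

```python
import math
import random

CELL_SIZE_DEG = 0.025  # current grid spacing stated in the text


def sampling_cell(lat, lon, cell_size=CELL_SIZE_DEG):
    """Return hypothetical (row, col) indices of the cell containing a point.

    Flooring the coordinate divided by the cell size needs only two
    divisions per fix, which keeps the on-device computation cheap.
    """
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def daily_sample_minutes(n=5, start_hour=6, end_hour=22, rng=None):
    """Draw n distinct minutes of the day within an assumed sampling window.

    The window bounds are placeholders; the real window depends on when
    the targeted species are most likely to be biting.
    """
    rng = rng or random.Random()
    return sorted(rng.sample(range(start_hour * 60, end_hour * 60), n))
```

For example, `sampling_cell(41.3851, 2.1734)` returns `(1655, 86)`. Only that pair, not the coordinates, would be shared with the server in a scheme like this one.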

One lesson learned by the Mosquito Alert project is that participant location provides only part of the picture in terms of sampling effort. It is also important to know when participants are actually in a position to observe and report targeted mosquitoes. Our experience is that most people install the application on their device and use it briefly but stop interacting with it after a relatively short period of time. There is also large variation among participants in the amount of time they spend using the application. We therefore model what we call “reporting propensity” as a function of time elapsed since installation of the application and intrinsic motivation. We then adjust sampling effort based on the reporting propensity of each participant. The process is complicated by the fact that we do not link background tracking information with reporting information (for privacy reasons). The approach, however, has proven effective.

6.3 Report Validation

The validation stage is important both as a direct check that the reported mosquitoes are targeted species and as a basis for assessing participants’ proficiency. The latter makes it possible to draw more accurate inferences about the reliability of reports from the same participants that cannot be validated. Many participants will send some reports that include a photograph usable for validation and others that do not (either no photograph or none in which the specimen can be clearly seen). Expert validation of the first category of report facilitates inferences about the second.

There are a number of different ways of carrying out validation, and the choice among them will depend on local circumstances as well as evolving research on what works best. The primary approach of Mosquito Alert is to rely on a team of 9 entomologists who review reports from citizen scientists through a special expert validation portal. Each report is reviewed independently by 3 of these entomologists. In addition to selecting a category indicating their level of confidence in the report being of a targeted species, the entomologists are also able to write internal notes and notes to the citizen scientist. They are also able to flag the report for review by the entomology team’s leader, who can override any final decision.
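The text does not say how the three independent reviews are combined into a final decision. The sketch below shows one plausible rule, purely as an assumption: each expert's confidence category is encoded on a hypothetical integer scale from -2 (definitely not a targeted species) to 2 (definitely a targeted species), the median of the three scores is taken, and a leader override, when present, replaces that result. The function name and the scale are illustrative only.

```python
from statistics import median

VALID_SCORES = range(-2, 3)  # hypothetical confidence scale: -2 .. 2


def combine_reviews(scores, leader_override=None):
    """Combine three expert confidence scores into one final score.

    Assumed rule: take the median of the three scores unless the team
    leader has supplied an override, which always wins.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three independent reviews")
    if any(s not in VALID_SCORES for s in scores):
        raise ValueError("scores must be integers between -2 and 2")
    if leader_override is not None:
        if leader_override not in VALID_SCORES:
            raise ValueError("override must be an integer between -2 and 2")
        return leader_override
    return median(scores)
```

A median is used here because it is not pulled toward a single dissenting reviewer; a mean or a unanimity requirement would be equally defensible choices.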

Another approach to validation is to rely on other citizen scientists to review photographs. Mosquito Alert also uses this approach, sending each photograph to 30 different citizen scientists using the Crowd Crafting platform (which is accessed through the application directly or through a web browser).

Another project that has had success with this type of crowd-based validation approach is iNaturalist. That platform also allows much more interaction between citizen-science validators and the person making the original report (known as an observation) through a threaded conversation at the record level. Validated records arise from agreement among ‘identifiers’, which leads the record to gain a data quality classification. The platform also provides an effective mechanism for helping citizen scientists develop expertise in identifying certain species and thus improve their validation proficiency.