Dataset Details

Title:

Accent Classification Dataset (Ghana)

Details:

The Accent Classification Dataset (Ghana) comprises audio recordings collected from native and non-native English speakers across various regions of Ghana. It captures the linguistic diversity and regional variation of English as spoken in the country. Each participant read the same three predefined scripts, so the samples are directly comparable across speakers. The recordings capture distinct accents and speech patterns and can be used for linguistic analysis, accent classification, and speech recognition studies. The dataset is intended for researchers and developers working on language processing, dialect identification, and regional accent modeling in Ghana and beyond.

Methodology:

The data collection process for the Accent Classification Dataset (Ghana) followed a structured and inclusive approach to ensure quality, authenticity, and diversity. Below are the steps undertaken:

Platform and Tools

The primary platform for data collection was Telegram, whose APIs were used to handle communication with participants and data submission.

Custom Telegram bots and scripts were developed to facilitate participant interaction, script dissemination, and audio file submissions.

Public Engagement

The data collection resource was shared widely with the general public via Telegram channels, groups, and other social media platforms to encourage participation.

Clear instructions were provided to participants, including how to access the scripts, record their voices, and submit their recordings through the Telegram bot.

Identity Verification

Participants were required to verify their identities before their data could be included in the pool. Verification included confirming basic demographic details (age, ethnicity, and region) and validating the authenticity of their submitted audio recordings.

Measures were implemented to ensure that each participant's submission was unique and aligned with the project's requirements.
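
The source does not say how uniqueness was checked. As a minimal sketch of one possible measure, the Python snippet below flags byte-identical re-uploads by hashing each file. The function names and the submission-ID keys are hypothetical, and exact hashing only catches identical files, not a second recording by the same speaker.

```python
import hashlib


def audio_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(submissions: dict[str, bytes]) -> list[tuple[str, str]]:
    """Return (duplicate_id, original_id) pairs for byte-identical files.

    `submissions` maps a hypothetical submission ID to the file contents.
    The first submission with a given digest is treated as the original.
    """
    seen: dict[str, str] = {}  # digest -> first submission ID with that digest
    duplicates: list[tuple[str, str]] = []
    for submission_id, data in submissions.items():
        digest = audio_fingerprint(data)
        if digest in seen:
            duplicates.append((submission_id, seen[digest]))
        else:
            seen[digest] = submission_id
    return duplicates
```

Detecting the same voice across different recordings would need speaker-level comparison, which this sketch does not attempt.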

Script and Recording Process

All participants were provided with the same three predefined scripts to read, ensuring consistency across recordings.

Participants were instructed to record their voices in a quiet environment and submit all three recordings via the Telegram platform.

Data Validation and Registration

Submitted audio files were reviewed for quality, clarity, and adherence to the project guidelines.

Verified submissions, along with the corresponding metadata (e.g., age, ethnicity, and region), were registered into the dataset pool for further processing and analysis.

This methodology was designed to keep the dataset representative, authentic, and of high quality for linguistic and speech-related research.
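
The validation and registration step described above can be illustrated with a short Python sketch. It assumes a hypothetical record layout (the field names, the script numbering 1 to 3, and the `verified` flag are illustrative and not taken from the dataset itself) and registers a participant only when they are verified and all three recordings are present.

```python
from dataclasses import dataclass, field

SCRIPTS_PER_PARTICIPANT = 3  # the dataset description states three predefined scripts


@dataclass
class Participant:
    """Hypothetical metadata record for one contributor."""
    participant_id: str
    age: int
    ethnicity: str
    region: str
    # Maps script number (1..3) to the path of the submitted audio file.
    recordings: dict[int, str] = field(default_factory=dict)


def is_complete(p: Participant) -> bool:
    """True when exactly one recording exists for each predefined script."""
    return set(p.recordings) == set(range(1, SCRIPTS_PER_PARTICIPANT + 1))


def register(pool: list[Participant], p: Participant, verified: bool) -> bool:
    """Add a participant to the pool only if verified and complete."""
    if not verified or not is_complete(p):
        return False
    pool.append(p)
    return True
```

Audio quality and clarity review is left out here, since the source does not describe how it was carried out.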

Provenance:

The Accent Classification Dataset (Ghana) was collected from the general public across various regions of Ghana. Contributors were selected at random, giving a diverse representation of native and non-native English speakers. Each participant read three predefined scripts, and the resulting recordings were submitted for inclusion in the dataset. Care was taken to include individuals of varying ages, ethnicities, and linguistic backgrounds to capture the diversity of accents and speech patterns in the country. This random and inclusive approach supports the dataset's authenticity and broad applicability in linguistic and speech research.