Finding images with spoken tags

The “Spraaktags” project broadly aims to explore dialogue-driven design for mobile speech interfaces. We refer to ‘Spoken Audio Tags’ (Dutch: spraaktags) as ‘Verbals’. The first year of the project consisted of two phases. Phase 1 comprised the design and implementation of open-source mobile application showcases, referred to as the ‘Verbals mobile system’. Phase 2 consisted of six weeks of ethnographic field research and the development of mobile systems based on IBM’s Spoken Web (also known as the World Wide Telecom Web) for education in rural India.

Overview of Phase-1 (Design and Implementation of Verbals Mobile System):
To showcase dialogue-driven design for mobile speech interfaces, phase 1 of the STITPROnounce project set out to design and develop dialogue-driven mobile speech applications that let a smartphone user tag and search images by voice using speech recognition technology. We implemented the ‘Verbals mobile system’, i.e., open-source mobile application showcases on the Android platform (Java) using Google’s speech recognition service, a text-to-speech engine and the Flickr API. Speech recognition technology is still evolving, and managing user expectations is crucial for speech-based interfaces. During the course of the project we extended dialogue-driven design to narrative-driven design, i.e., overlaying dialogues with a narrative structure. Narratives are used to engage users, to manage their expectations, and to give a mobile application a personality as a technology that is ‘not perfect’ but evolving.
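
The idea of tagging and searching images by voice can be illustrated with a minimal Java sketch. It is not taken from the Verbals code base: the class name SpokenTagIndex, its methods and the image identifiers are hypothetical, and the transcripts are assumed to have already been produced by a speech recognizer. The sketch only shows how recognized words could be stored as tags and matched against a spoken query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the data model used by the Verbals apps.
public class SpokenTagIndex {
    // Maps each normalized tag word to the ids of the images carrying that tag.
    private final Map<String, Set<String>> imagesByTag = new HashMap<>();

    // Splits a recognizer transcript (assumed input) into lower-case tag words.
    static List<String> toTags(String transcript) {
        List<String> tags = new ArrayList<>();
        for (String word : transcript.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                tags.add(word);
            }
        }
        return tags;
    }

    // "Speech tag": attach every word of the transcript to the image.
    public void tag(String imageId, String transcript) {
        for (String t : toTags(transcript)) {
            imagesByTag.computeIfAbsent(t, k -> new TreeSet<>()).add(imageId);
        }
    }

    // "Speech search": return the images that carry all words of the query.
    public Set<String> search(String transcript) {
        Set<String> result = null;
        for (String t : toTags(transcript)) {
            Set<String> hits = imagesByTag.getOrDefault(t, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        SpokenTagIndex index = new SpokenTagIndex();
        index.tag("img-001", "Beach sunset");
        index.tag("img-002", "beach volleyball");
        System.out.println(index.search("beach"));        // prints [img-001, img-002]
        System.out.println(index.search("sunset beach")); // prints [img-001]
    }
}
```

Requiring every query word to match keeps the example short; a real speech interface would likely need to tolerate recognition errors, for instance by ranking partial matches, which is one reason managing user expectations matters.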

Overview of Phase-2 (Speech Interfaces For The Education Sector In Rural India):
The rural areas of developing nations represent a challenging but highly significant environment for speech applications. We conducted ethnographic and participatory field research in rural villages of the Mewat district of India in collaboration with IBM Research India and SRF Foundation. Mewat is one of the least developed districts of India but has a high penetration of mobile phones. The field research helped identify usage scenarios for speech technology and narrative structure in mobile applications addressing education in low-literacy environments. We followed a participatory design and rapid prototyping approach to identify and develop two design concepts and application prototypes: ‘Spoken-English Cricket Game’ and ‘Spoken-Web based Data Flow System’.

The project ran from 4 April 2011 to 3 April 2012. It was carried out by Martha Larson and Abhigyan Singh in the Multimedia Information Retrieval Lab (http://dmirlab.tudelft.nl/) at Delft University of Technology (http://www.tudelft.nl/).

Project Results

The code created by the project is freely available on GitHub at:

https://github.com/abhigyan/Verbals-Push
https://github.com/abhigyan/Verbals-Pull

The project results are summarized in two publications:

Abhigyan Singh and Martha Larson. 2013. Narrative-driven Multimedia Tagging and Retrieval: Investigating Design and Practice for Speech-based Mobile Applications. Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013), Marseille, France, August 22-23, 2013, CEUR-WS.org, online http://ceur-ws.org/Vol-1012/papers/paper-16.pdf

Martha Larson, Nitendra Rajput, Abhigyan Singh, and Saurabh Srivastava (alphabetical). 2013. I want to be Sachin Tendulkar!: a spoken English cricket game for rural students. In Proceedings of the 2013 conference on Computer supported cooperative work (CSCW ’13). ACM, New York, NY, USA, 1353-1364.

More details about the project can be found at: http://www.stitpronounce.tudelft.nl/