LTC 2013 Accepted Papers and Demos with Abstracts
Marek Kubis.
Demo: Evaluating wordnets using query languages
Abstract:
The demo shows how to use the query languages of the WQuery system in order to evaluate and repair the data stored in WordNet-like
lexical databases. Various queries useful in this context are discussed and their realizations expressed in the languages of the WQuery
system are presented.
András Beke, Mária Gósy and Viktória Horváth.
Temporal variability in spontaneous Hungarian speech
Abstract:
The aim of this paper is an objective presentation of temporal features
of spontaneous Hungarian narratives, as well as a characterization of
separable portions of spontaneous speech (thematic units and phrases).
An attempt is made at capturing objective temporal properties of
boundary marking. Spontaneous narratives produced by ten speakers taken
from the BEA Hungarian Spontaneous Speech Data Base were analyzed in
terms of hierarchical units of narratives (duration of pauses and of
units, speakers’ rates of articulation, number of words produced, and
the interrelationships of all these). The results confirm the presence
of thematic units and of phrases in narratives by objective numerical
values. We conclude that (i) the majority of speakers organize their
narratives in similar temporal structures, (ii) thematic units can be
identified in terms of certain prosodic criteria, (iii) there are
statistically valid correlations between factors like the duration of
phrases, the word count of phrases, the rate of articulation of
phrases, and pausing characteristics, and (iv) these parameters exhibit
extensive variability both across and within speakers.
Ahmed Haddad, Slim Mesfar and Henda Ben Ghezala.
Assistive Arabic character recognition for android application
Abstract:
This work was integrated within the “Oreillodule” project, a real-time system for the synthesis, recognition and translation of the Arabic language. Within this framework, we set up an automatic generator of an electronic dictionary in a context geared towards Arabic handwriting recognition.
Our approach to the construction of the lexicon is similar to the one advanced for the processing of electronic documents. It is based on the automatic generation of the lexicon from the morphological and derivational characteristics of the Arabic language. Since we aim to port the system to machines with reduced capacities (embedded systems, PDAs, mobile phones, tablet PCs, etc.), we have to take into account computation time constraints and memory footprint. For that, we adopt dictionary models with a minimum of redundancy while guaranteeing a strong degree of reduction and precision.
This modelling procedure is based on a soft representation of words called a “descriptor”. These descriptors are extracted from generic silhouettes of the handwritten words to be recognised, formed by concatenating the fundamental strokes, descending and relatively vertical, of each character and then eliminating non-significant strokes. The same descriptors are used for clustering and indexing the sub-lexicons. These sub-lexicons are then used to compute the similarity between the stored forms and those to be recognised.
During the assistive recognition phase we add some new features based on the silhouette descriptor in order to make the system usable by people with physical impairments. For that, we include a modelling phase in which the system evaluates the user's writing style to enhance recognition.
Shikha Garg and Vishal Goyal.
System for Generating Questions Automatically From
Given Punjabi Text
Abstract:
This paper introduces a system for automatically generating questions
for Punjabi. The system transforms a declarative sentence into its
interrogative counterpart: it accepts sentences as input and produces a
set of possible questions for the given input. Not much work has been
done in the field of Question Generation for Indian languages. The
present paper describes a Question Generation system for the Punjabi
language that generates questions for input given in Gurmukhi script.
For Punjabi, adequate annotated corpora, part-of-speech taggers and
other Natural Language Processing tools are not yet available in the
required measure. Thus, this system relies on a Named Entity
Recognition tool. In addition, various Punjabi-specific rules have been
developed to generate output based on the named entities found in the
given input sentence.
Anu Sharma
Punjabi Sentiment (Opinion) Analyzer
Abstract:
The most active growing part of the web is “social media”. Sentiment
analysis (opinion mining) plays an important role in determining the
attitude of a speaker or a writer with respect to some topic or the
overall contextual polarity of a document. In this paper, a novel
approach is proposed for automatically extracting news reviews from
web pages using basic NLP techniques such as N-grams (unigrams and
bigrams). The author classifies news reviews broadly into two
categories, positive and negative, and experiments with the Naive Bayes
machine learning algorithm. The proposed approach achieved an average
accuracy of 80% on a multi-category dataset.
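To illustrate the kind of classifier this abstract describes, here is a minimal Python sketch of a unigram/bigram Naive Bayes sentiment classifier; the tiny English training set and the add-one smoothing are illustrative assumptions, not the author's actual Punjabi data or implementation.

import math
from collections import Counter, defaultdict

def features(text):
    """Unigram and bigram features, mirroring the abstract's N-gram setup."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.priors = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(features(doc))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, doc):
        best_label, best_score = None, float("-inf")
        for label, prior in self.priors.items():
            total = sum(self.counts[label].values())
            score = math.log(prior / sum(self.priors.values()))
            for f in features(doc):
                score += math.log((self.counts[label][f] + 1) /
                                  (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy usage with invented English reviews; the paper works on Punjabi news reviews.
clf = NaiveBayes().fit(["great insightful report", "poor biased coverage"],
                       ["positive", "negative"])
print(clf.predict("insightful coverage"))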
Mirjam S. Maučec, Gregor Donaj and Zdravko Kačič.
Improving Statistical Machine Translation with
Additional Language Models
Abstract:
This paper proposes n-best list rescoring in order to improve the
translations produced by the statistical machine translation system.
Phrase-based statistical machine translation is used as a baseline. We
have focused on translation between two morphologically rich languages
and have extended phrase-based translation into factored translation.
The factored translation model integrates linguistic knowledge into
translation in terms of linguistic tags, called factors. The factored
translation system is able to produce n-best linguistically annotated
translations for each input sentence. We propose to re-rank them based
on scores produced by additional general-domain language models. Two
types of language models were used: language models of word-forms and
language models of MSD tags. Experiments were performed on the
Serbian–Slovenian language pair. The results were evaluated
in terms of two standard metrics, BLEU and TER scores. Tests of
statistical significance based on approximate randomization were
performed for both metrics. The improvements in terms of both metrics
were statistically significant. This approach is applicable to any pair
of languages, especially to languages with rich morphologies.
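The rescoring step described above can be sketched as follows; the weights and the dummy language model are illustrative assumptions, not the paper's actual models.

def rerank(nbest, extra_lms, weights):
    """Re-rank an n-best list of (hypothesis, baseline_score) pairs by adding
    weighted log-probabilities from additional language models.

    `extra_lms` is a list of callables mapping a hypothesis (e.g. a word-form
    or MSD-tag sequence) to a log-probability; `weights` gives one weight per
    model. Both are placeholders for the paper's general-domain LMs.
    """
    def total(hyp, base):
        return base + sum(w * lm(hyp) for lm, w in zip(extra_lms, weights))
    return sorted(nbest, key=lambda hb: total(*hb), reverse=True)

# Toy usage: two hypotheses, one dummy "LM" that prefers shorter outputs.
nbest = [("translation hypothesis one", -10.2), ("hypothesis two", -10.5)]
print(rerank(nbest, extra_lms=[lambda h: -0.5 * len(h.split())], weights=[1.0])[0])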
Jing Sun, Qing Yu and Yves Lepage.
An Iterative Method to Identify Parallel Sentences
from Non-parallel Corpora
Abstract:
We present a simple yet effective method for identifying bilingual
parallel sentences from Web-crawled non-parallel data using translation
scores obtained with the EM algorithm. First, a modified IBM Model-1 is
used to estimate the word-to-word translation probabilities. Then,
these translation probabilities are used to compute translation scores
between non-parallel sentence pairs. The above two steps are run
iteratively and sentence pairs with higher translation scores are
extracted from the non-parallel data. Our approach differs from
previous research in that we take into account the information in the
non-parallel data as well. According to our experimental results, such
a method is promising for constructing training corpora for data-driven
tasks such as Statistical Machine Translation.
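A minimal sketch of the translation-score computation described above, assuming a word-to-word probability table already estimated with EM (the table and the toy sentence pair are invented for illustration):

import math

def model1_score(src_words, tgt_words, t_prob, epsilon=1e-9):
    """IBM Model-1 style translation score of a target sentence given a source
    sentence: for each target word, average the lexical translation
    probabilities over all source positions (plus NULL), then combine in log
    space and normalise by target length."""
    src = ["NULL"] + list(src_words)
    log_score = 0.0
    for t in tgt_words:
        p = sum(t_prob.get((s, t), 0.0) for s in src) / len(src)
        log_score += math.log(max(p, epsilon))
    return log_score / max(len(tgt_words), 1)

# Toy usage with an invented probability table (source, target) -> p.
t_prob = {("maison", "house"): 0.8, ("la", "the"): 0.7}
print(model1_score(["la", "maison"], ["the", "house"], t_prob))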
Brigitte Bigi.
A phonetization approach for the forced-alignment task
Abstract:
The phonetization of text corpora requires a sequence of processing
steps and resources in order to convert a normalized text into its
constituent phones so that it can be directly exploited by a given
application.
This paper presents a generic approach to text phonetization and
concentrates on the aspects of phonetizing unknown words, which serve
to develop a phonetizer in the context of forced-alignment
applications. It is a dictionary-based approach which is as
language-independent as possible: the approach is applied to French,
Italian, English, Vietnamese, Khmer, Chinese and Pinyin for Taiwanese.
The tool and the linked resources are distributed under the terms of
the GPL license.
Nicolas Neubauer, Nils Haldenwang and Oliver Vornberger.
Differences in Semantic Relatedness as Judged by
Humans and Algorithms
Abstract:
Quantifying the semantic relatedness of two terms is a field with a
vast amount of research, as the knowledge it provides has applications
for many Natural Language Processing problems. While many algorithmic
measures have been proposed, it is often hard to say whether one
measure outperforms another, since their evaluation often lacks
meaningful comparisons to human judgement of semantic relatedness. In
this paper we present a study using the BLESS data set to compare the
preferences of humans regarding semantic relatedness to popular
algorithmic baselines, PMI and NGD, which shows that significant
differences in relationship-type preferences between humans and
algorithms exist.
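The two algorithmic baselines mentioned in the abstract have standard definitions; a small Python sketch computes both from co-occurrence counts (the counts themselves are invented here, in practice they would come from a corpus or web hit counts):

import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise Mutual Information from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

def ngd(count_x, count_y, count_xy, n_docs):
    """Normalized Google Distance over document frequencies
    (0 = closely related, larger = less related)."""
    fx, fy, fxy = math.log2(count_x), math.log2(count_y), math.log2(count_xy)
    return (max(fx, fy) - fxy) / (math.log2(n_docs) - min(fx, fy))

# Toy counts for a term pair in a hypothetical 1,000,000-document collection.
print(pmi(120, 5000, 3000, 1_000_000))
print(ngd(5000, 3000, 120, 1_000_000))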
Nicolas Neubauer, Nils Haldenwang and Oliver Vornberger.
The Bidirectional Co-occurrence Measure: A Look at
Corpus-based Co-occurrence Statistics and Current Test Sets
Abstract:
Determining the meaning of or understanding a given input in natural
language is a challenging task for a computer system. This paper
discusses a well-known technique for determining the semantic
relatedness of terms: corpus-based co-occurrence statistics. Aside from
presenting a new approach to this technique, the Bidirectional
Co-occurrence Measure, we also compare it to two well-established
measures, Pointwise Mutual Information and the Normalized Google
Distance. Taking these as a basis, we discuss a multitude of popular
test sets, their test methodology, strengths and weaknesses, while also
providing experimental evaluation results.
Elzbieta Hajnicz.
Actualising lexico-semantic annotation of Składnica
Treebank to modified versions of source resources
Abstract:
In this paper a method of automatic update of the lexico-semantic
annotation of Składnica treebank by means of PlWN wordnet senses is
described. Both resources undergo intensive development. The method is
based on information which is considered invariant between subsequent
versions of a resource.
Senem Kumova Metin, Bahar Karaoglan and Tarik Kişla.
Using IR Representation in Text Similarity Measurement
Abstract:
In text similarity detection, expanding texts with knowledge resources
has been tried by previous researchers to overcome the syntactic
heterogeneity in expressing similar meanings. In this work, we present
a novel method that compares the retrieved data, which we call the IR
(Information Retrieval) representation of the texts, rather than the
original texts themselves. The abstracts and the titles of the texts
are provided as queries to the IR system and, among the different
alternatives in the retrieved information (such as documents and
snippets), the URL address lists are considered as the contributor to
text similarity detection. The IR representations of the texts are
subjected to pair-wise distance measurements. Two important results
obtained from the experiments are: 1) IR representations obtained by
querying the titles outperform the representations obtained by using
the abstracts in text similarity detection; 2) querying abstracts does
not improve the text similarity measurements.
Krzysztof Jassem.
PSI-Toolkit - an Open Architecture Set of NLP Tools
Abstract:
The paper describes PSI-Toolkit, a set of NLP tools designed within a
grant of the Polish Ministry of Science and Higher Education. The main
feature of the toolkit is its open architecture, which allows for the
integration of NLP tools designed in independent research centres. The
toolkit processes texts in various natural languages. The annotators
may be run with several options to differentiate the format of input
and output. The rules used by PSI-Tools may be supplemented or replaced
by the user. External tools may be incorporated into the PSI-Toolkit
environment. Corpora annotated by other toolkits may be read into the
PSI-Toolkit data structure and further processed.
Jacek Malyszko and Agata Filipowska.
Crowdsourcing in creation of resources for the needs
of sentiment analysis
Abstract:
One of the biggest challenges in the field of information extraction
is the creation of high-quality linguistic resources that make it
possible to develop or validate the created methods. These resources
are scarce, as the cost of their creation is high and they are usually
designed to suit specific needs and as such are hardly reusable.
Therefore, the community is researching new methods for the creation of
linguistic resources. This paper is in line with this research, as its
goal is to indicate how the opinion of the crowd may influence the
process of creating linguistic resources. The authors propose methods
for building resources for the needs of sentiment analysis, together
with a validation of the proposed approaches.
Thao Phan.
Pre-processing as a Solution to Improve
French-Vietnamese Named Entity Machine Translation
Abstract:
The translation of proper names (PNs) and noun phrases containing PNs
from a source language into a target language poses a great number of
difficulties, both for human translators and for current machine
translation systems. This paper presents some problems relating to
named entity machine translation, namely proper name machine
translation (PNMT) from French into Vietnamese. To deal with these
difficulties, we propose some pre-processing solutions for reducing
certain PNMT errors made by the current French-Vietnamese MT systems in
Vietnam. The pre-processing program, built and tested with a
French-Vietnamese parallel corpus of PNs, achieves a significant
improvement in MT quality.
Zang Yuling, Jin Matsuoka and Yves Lepage.
Extracting Semantically Analogical Word Pairs from
Language Resources
Abstract:
A semantic analogy (e.g., sun is to planet as nucleus is to electron)
is a pair of word pairs which have similar semantic relations. Semantic
analogies may be useful in Natural Language Processing (NLP) tasks and
may be applied in several ways; they have great potential in sentence
rewriting tasks, reasoning systems and machine translation. In this
paper, we combine three methods to extract analogical word pairs: using
patterns, clustering word pairs, and measuring semantic similarity
using a vector space model. We show how to produce a large number of
clusters of varying quality, from good to poor, with a reasonable
number of clusters of good quality.
Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary and Magdalena Zawisławska.
Polish Coreference Corpus
Abstract:
This article describes the composition, annotation process and
availability of the newly constructed Polish Coreference Corpus – the
first substantial Polish corpus of general nominal coreference. The
tools used in the process and final linguistic representation formats
are also presented.
Stefan Daniel Dumitrescu and Tiberiu Boroş.
A unified corpora-based approach to Diacritic
Restoration and Word Casing
Abstract:
We propose a surface processing system that attempts to solve two
important problems: diacritic restoration and word casing. If solved
with sufficient accuracy, Internet-extracted text corpora would benefit
from a sensible quality increase, leading to better performance for
every task using such text corpora. At its core, the system uses a
Viterbi algorithm to select the optimal state sequence from a variable
number of possible options for each word in a sentence. The system is
language independent; as external resources, the system uses a
language model and a lexicon. The language model is used to estimate
transition scores for the sequential alternatives generated from the
lexicon. We obtain targeted Word Accuracy Rates (WAR) of over 92% and
an all-word WAR of over 99%.
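A minimal sketch of the Viterbi selection step described above; the candidate lexicon and the bigram scorer are placeholders for the paper's actual resources:

import math

def viterbi(words, alternatives, bigram_logp):
    """Pick one alternative per word (e.g. diacritized / cased variants from a
    lexicon) maximising the sum of bigram log-probabilities.

    `alternatives(word)` returns candidate forms; `bigram_logp(prev, cur)`
    returns a log-probability. Both are placeholders for the paper's lexicon
    and language model."""
    best = {"<s>": (0.0, [])}                      # state -> (score, path)
    for word in words:
        new_best = {}
        for cand in alternatives(word):
            score, path = max(
                ((s + bigram_logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda sp: sp[0])
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy usage: restore diacritics in Romanian-like text with a dummy LM.
lexicon = {"pana": ["pana", "până", "pană"], "cand": ["când"]}
alts = lambda w: lexicon.get(w, [w])
logp = lambda prev, cur: 0.0 if (prev, cur) == ("până", "când") else -1.0
print(viterbi(["pana", "cand"], alts, logp))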
Joro Ny Aina Ranaivoarison, Eric Laporte and Baholisoa Simone Ralalaoherivony.
Formalization of Malagasy conjugation
Abstract:
This paper reports the core linguistic work performed to construct a
dictionary-based morphological analyser for Malagasy simple verbs. It
uses the Unitex platform (Paumier, 2003) and comprised the construction
of an electronic dictionary for Malagasy simple verbs. The data is
encoded on the basis of morphological features. The morphological
variations of verb stems and their combination with affixes are
formalized in finite-state transducers represented by editable graphs.
78 transducers allow Unitex to generate a dictionary of allomorphs of
stems. 271 transducers are used by the morphological analyser of Unitex
to recognize the stem and the affixes in conjugated verbs. The design
of the dictionary and transducers prioritizes readability, so that they
can be extended and updated by linguists.
Juan Luo, Aurélien Max and Yves Lepage.
Using the Productivity of Language is Rewarding for
Small Data: Populating SMT Phrase Table by Analogy
Abstract:
This paper is a partial report of the work on integrating proportional
analogy into statistical machine translation systems. Here we present a
preliminary investigation of the application of proportional analogy to
generate translations of unseen n-grams from the phrase table. We conduct
experiments with different sizes of data and implement two methods to
integrate n-gram pairs produced by proportional analogy into the
state-of-the-art machine translation system Moses. The evaluation
results show that n-grams generated by proportional analogy are
rewarding for machine translation systems with small data.
Tina Hildenbrandt, Friedrich Neubarth and Sylvia Moosmüller.
Orthographic encoding of the Viennese dialect for
machine translation
Abstract:
Language technology concerned with dialects is confronted with a
situation where the target language variety is generally used in spoken
form, and, due to a lack of standardization initiatives, educational
reinforcement and usage in printed media, written texts often follow an
impromptu orthography, resulting in great variation of spelling. A
standardized orthographic encoding is, however, a necessary
precondition in order to apply methods of language technology, most
prominently machine translation. Writing a dialect usually mediates
between similarity to a given standard orthography and precision in
representing the phonology of the dialect. The generation of uniform
resources for language processing necessitates considering additional
requirements, such as lexical unambiguousness, consistency,
morphological and phonological transparency, which are of higher
importance than readability. In the current contribution we propose a
set of orthographic conventions for the dialect/sociolect spoken in
Vienna. This orthography is mainly based on a thorough phonological
analysis, whereas deviations from this principle can be attributed to
disambiguation of otherwise homographic forms.
Jędrzej Osiński.
Resolving relative spatial relations using common
human competences
Abstract:
Spatial reasoning is an important field of artificial intelligence
with many applications in natural language processing. We present a
technique for resolving relative spatial relations such as in front of
or behind. Its main idea is based on questionnaire experiments
performed via the Internet. The presented solution was successfully
implemented in a system designed for improving the quality of
monitoring of complex situations and can be treated as a general method
applicable in many areas.
Fumiyo Fukumoto, Yoshimi Suzuki and Atsuhiro Takasu.
Multi-Document Summarization based on Event and Topic
Detection
Abstract:
This paper focuses on continuous news documents and presents a method
for extractive multi-document summarization. Our hypothesis
about salient, key sentences in news documents is that they include
words related to the target event and topic of a document. Here, an
event and a topic are defined as in the TDT project: an event is
something that occurs at a specific place and time along with all
necessary preconditions and unavoidable consequences, and a topic is
defined to be “a seminal event or activity along with all directly
related events and activities.” The difficulty in finding topics is
that they have varying word distributions, i.e., a topic word sometimes
appears frequently in the target documents and sometimes does not. In
addition to the TF-IDF term weighting method used to extract event
words, we identified topics by using two models: Moving Average
Convergence Divergence (MACD) for words with high frequencies, and
Latent Dirichlet Allocation (LDA) for low-frequency words. The method
was tested on two datasets, NTCIR-3 Japanese news documents and DUC
data, and the results showed the effectiveness of the method.
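MACD, borrowed from stock-market analysis, is the difference between a short-term and a long-term exponential moving average; a small sketch of applying it to a word's frequency series over chronologically ordered documents (window sizes and data are illustrative):

def ema(series, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha, out = 2.0 / (span + 1), []
    for x in series:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out

def macd(freq_series, short=4, long=8):
    """MACD of a word-frequency time series: EMA(short) - EMA(long).
    Large positive values flag a burst of the word, the kind of signal the
    paper uses to spot high-frequency topic words in a document stream."""
    return [s - l for s, l in zip(ema(freq_series, short), ema(freq_series, long))]

# Toy usage: a word that suddenly becomes frequent around document 5.
print(macd([1, 0, 1, 0, 6, 7, 8, 2, 1, 0]))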
Imane Nkairi, Irina Illina, Georges Linares and Dominique Fohr.
Exploring temporal context in diachronic text
documents for automatic OOV proper name retrieval
Abstract:
Proper name recognition is a challenging task in information retrieval
in large audio/video databases. Proper names are semantically rich and
usually are a key to understand the information contained in a
document. Our work focuses on increasing the vocabulary coverage of a
speech transcription system by automatically retrieving proper names
from diachronic contemporary text documents. We proposed methods that
dynamically augment the automatic speech recognition system vocabulary,
using lexical and temporal features in diachronic documents. We also
studied different metrics for proper name selection in order to limit
the vocabulary increase and therefore the impact on ASR performance.
Recognition results show a significant reduction of the word error rate
using the augmented vocabulary.
Nives Mikelic Preradovic and Damir Boras.
Knowledge-Driven Multilingual Event Detection Using
Cross-Lingual Subcategorization Frames
Abstract:
In this paper we present a knowledge-driven approach to multilingual
event detection using a combination of lexico-syntactic patterns and
semantic roles to build cross-lingual event frames, based on the
meaning of the verb prefix and the relation of the prefixed verb to its
syntactic arguments. We concentrate on SOURCE and GOAL roles, encoding
adlative, ablative, intralocative and extralocative meaning in the
event frames of five Indo-European languages: Croatian, English,
German, Italian and Czech. The lexico-syntactic pattern for each frame
was manually developed, covering directional meaning of Croatian
prefixes OD-, DO-, IZ- and U-. Apart from the possibility to detect
spatial (directional) information in five languages, results presented
here suggest a possible approach to improve foreign language learning
using a set of cross-lingual verb valency frames.
Frederique Segond, Marie Dupuch, André Bittar, Luca Dini, Lina Soualmia, Stefan Darmoni, Quentin Gicquel and Marie Helene Metzger.
Separate the grain from the chaff: make the best use
of language and knowledge technologies to model textual medical data
extracted from electronic health records
Abstract:
Electronic Health Records (EHRs) contain information that is crucial
for biomedical research studies. In recent years, there has been an
exponential increase in scientific publications about using textual
processing of medical data in fields as diverse as medical decision
support, epidemiological studies and data and semantic mining. While
the use of semantic technologies in this context demonstrates
promising results, a first experience with such an approach shed light
on some challenges, among which the need for smooth integration of
specific terminologies and ontologies into the linguistic processing
modules, as well as the independence of linguistic and expert
knowledge. Our work lies at the crossroads of natural language
processing and knowledge representation and reasoning, and aims at
providing a truly generic system to support the extraction and
structuring of medical information contained in EHRs. This paper
presents an approach which combines sophisticated linguistic processing
with a multi-terminology server and an expert knowledge server,
focusing on the independence of linguistic and expert rules.
Daniele Falavigna, Fabio Brugnara, Roberto Gretter and Diego Giuliani.
Dynamic Language Model Focusing for Automatic
Transcription of Talk-Show TV Programs
Abstract:
In this paper, an approach for unsupervised dynamic adaptation of the
language model used in an automatic transcription task is proposed. The
approach aims to build language models “focused” on the linguistic
content and speaking style of the audio documents to be transcribed, by
adapting a general-purpose language model on a running window of text
derived from automatic recognition hypotheses. The text in each window is used to
automatically select documents from the same corpus utilized for
training the general purpose language model. In particular, a fast
selection approach has been developed and compared with a more
traditional one used in the information retrieval area. The new
proposed approach allows for a real time selection of documents and,
hence, for a frequent language model adaptation on a short (less than
100 words) window of text. Experiments have been carried out on six
episodes of two Italian TV talk-show programs, by varying the size and
advancement step of the running window and the corresponding number of
words selected for focusing language models. A relative reduction in
word error rate of about 5.0% has been obtained using the running
window for focusing the language models, to be compared with a
corresponding relative reduction of 3.5% achieved using the whole
automatic transcription of each talk-show episode for focusing the
language models.
Adrien Barbaresi.
A one-pass valency-oriented chunker for German
Abstract:
The transducer described here consists of a pattern-based matching
operation over POS tags using regular expressions that takes advantage of
the characteristics of German grammar. The process aims at finding
linguistically relevant phrases with a good precision, which enables in
turn an estimation of the actual valency of a given verb. This
finite-state chunking approach does not return a tree structure, but
rather yields various kinds of linguistic information useful to the
language researcher: possible applications include simulation of text
comprehension on the syntactical level, creation of selective
benchmarks and failure analysis. It is relatively fast, as it reads its
input exactly once instead of using cascades, which greatly benefits
computational efficiency.
Bartłomiej Nitoń.
Evaluation of Uryupina’s coreference resolution
features for Polish
Abstract:
Automatic coreference resolution is an extremely difficult and complex
task. It can be approached in two different ways: using rule-based
tools or machine learning. This article describes an evaluation of a
set of surface, syntactic and anaphoric features proposed in Uryupina
2007 and their usefulness for coreference resolution in Polish texts.
John Lee.
Toward a Digital Library with Search and Visualization
Tools
Abstract:
We present a digital library prototype with search and visualization
capabilities, designed to support both language learning and textual
analysis. As in other existing libraries, users can browse texts with a
variety of reading aids; in our library, they can also search for
complex patterns in syntactically annotated and multilingual corpora,
and visualize search results over large corpora. With a web interface
that assumes no linguistic or computing background on the part of the
user, our library has been deployed for pedagogical and research
purposes in diverse languages.
Ralf Kompe, Marion Fischer, Diane Hirschfeld, Uwe Koloska, Ivan Kraljevski, Franziska Kurnot and Mathias Rudolph.
AzARc – A Community-based Pronunciation Trainer
Abstract:
In this paper a system for automatic pronunciation training is
presented. The users perform exercises where their utterances are
recorded and pronunciation quality is evaluated. The quality scores are
estimated on phonemic, prosodic, and phonation level by comparison to
the voice quality parameters of a reference speaker and presented back
to the users. They can also directly listen to the voice of a reference
speaker and try to repeat it several times until their own
pronunciation quality improves. The system is designed in a flexible
way such that different user groups (disabled people, children, healthy
adults, …) can be supported by specific types of exercises and a
variable degree of complexity on the exercise and GUI output side. A
focus is on a client-server based implementation, where the server-side
database stores exercises as well as results. Therefore, therapists
(language teachers) can monitor the progress of their patients (pupils)
while they train by themselves at home; moreover, anybody can introduce
their own set of exercises into the community-based database and is
free to make them available to other people according to a defined
sharing policy.
Lars Hellan and Dorothee Beermann.
A multilingual valence database for less resourced
languages
Abstract:
We hypothesize that a multilingual aligned valence database can be
useful in the development of language resources for less resourced
languages (LRL), and present aspects of a methodology that can
accomplish the construction of such a database. We define basic
desiderata for the content of the database, we describe an implemented
example of a database along with a web-demo which illustrates the
intended functionalities, and we mention a collection of tools and
resources which can be orchestrated to help in the construction of a
more encompassing database for LRLs following the same design.
Moses Ekpenyong and Ememobong Udoh.
Intelligent Prosody Modelling: A Framework for Tone
Language Synthesis
Abstract:
The production of high quality synthetic voices largely depends on the
accuracy of the language’s prosodic model. We present in this paper an
intelligent framework for modelling prosody in tone language
Text-to-Speech (TTS). The proposed framework is fuzzy logic-based
(FL-B) and is adopted to offer a flexible, human-reasoning approach to
the imprecise and complex nature of prosody prediction.
Preliminary results obtained from modelling the tonal aspect of Ibibio
(ISO 639-3, Ethnologue IBB) prosody prove the feasibility of our FL-B
framework at precisely predicting the degree of certainty of
fundamental frequency (F0) contour patterns in a set of recorded and
synthesised voices. Other aspects of prosody not yet incorporated into
this design are currently being investigated.
Andrzej Jarynowski and Amir Rostami.
Reading Stockholm Riots 2013 using Internet media
Abstract:
The riots in Stockholm in May 2013 were an event that reverberated in
the world media for the scale of the violence that spread through the
Swedish capital. In this study we have investigated the role of social
media in creating media phenomena via text mining and natural language
processing. We have focused on two channels of communication for our
analysis: Twitter and Poloniainfo.se (a forum of the Polish community
in Sweden). Our preliminary results, based on counting word usage, show
some hot topics driving the discussion, related mostly to the Swedish
police and Swedish politics. Typical features of media intervention are
presented. We have built networks of the most popular phrases,
clustered by categories (geography, media institution, etc.). Sentiment
analysis shows a negative connotation associated with the police. The
aim of this preliminary exploratory quantitative study was to generate
questions and hypotheses, which we can then follow up carefully with
deeper, more qualitative methods.
Wondwossen Mulugeta, Michael Gasser and Baye Yimam.
Automatic Morpheme Slot Identification
Abstract:
We introduce an approach to the grouping of morphemes into suffix slots
in morphologically complex languages using a genetic algorithm. The
method is applied to verbs in Amharic, a morphologically rich Semitic
language. We start with a limited set of segmented verbs and the set
of suffixes themselves, extracted on the basis of our previous work.
Each member of the population for the genetic algorithm is an
assignment of the morphemes to one of the set of possible slots.
The fitness function combines scores for exact slot position and
correct ordering of morphemes. We use mutation but no crossover
operator with various combinations of population size, mutation rate,
and maximum number of generations, and populations evolve to yield
promising morpheme classification results. We evaluate the fittest
individuals on the basis of the known morpheme classes for Amharic.
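A minimal sketch of a mutation-only genetic algorithm of the kind described above; the fitness function, slot count and toy suffix data are illustrative assumptions rather than the authors' actual setup:

import random

def fitness(assignment, segmented_verbs):
    """Count suffix pairs whose assigned slots respect the order in which the
    suffixes actually appear in the segmented training verbs."""
    good = 0
    for suffixes in segmented_verbs:                  # e.g. ['-h', '-al']
        good += sum(1 for a, b in zip(suffixes, suffixes[1:])
                    if assignment[a] < assignment[b])
    return good

def evolve(suffixes, segmented_verbs, n_slots=4, pop=30, rate=0.2, gens=200):
    """Mutation-only genetic algorithm (no crossover, as in the paper):
    an individual maps each suffix to one of `n_slots` slot positions."""
    population = [{s: random.randrange(n_slots) for s in suffixes}
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda ind: fitness(ind, segmented_verbs), reverse=True)
        survivors = population[:pop // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            for s in suffixes:                        # point mutations
                if random.random() < rate:
                    child[s] = random.randrange(n_slots)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, segmented_verbs))

# Toy usage with invented suffix orderings (not real Amharic data).
data = [["-h", "-al"], ["-achch", "-al"], ["-h", "-ewa"]]
print(evolve(["-h", "-achch", "-al", "-ewa"], data))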
Marcin Karwiński.
Living Language Model Optimisation for NL Syntax-based
Search Engine
Abstract:
This paper summarises the results of a 4-year study on search engine
optimisation based on a simile between natural languages and living
organisms. Though many NLP methods have already been used in IR tasks,
language seems to be treated as immutable. Juxtaposed with the
linguists’ theory of a living language, the model used in text syntax
analyses was augmented with evolutionary methods. The theoretical
concepts of evolutionary changes were discussed and examined in
controlled environments to verify the plausibility of using the
aforementioned simile paradigm in search engines for IR quality
improvement. The TREC and OHSUMED document collections, along with a
real-life production set, were used to assess the approach. The
influences and limitations of language features on the model’s
evolutionary changes were observed.
Prashant Verma, Somnath Chandra and Swaran Lata.
Indian languages requirements in ITS
(Internationalization Tag set) 2.0
Abstract:
ITS 2.0 is a localization framework defined by the W3C to add metadata
to Web content for the benefit of localization, language technologies,
and internationalization. This paper aims to evolve the ITS 2.0
requirements with respect to Indian languages. It also defines the
variation of named entities in Indian languages, their requirements in
ITS 2.0, and suggestions for implementation in ITS 2.0. The challenges
with respect to Indian languages in the data categories of ITS 2.0, and
a possible approach to overcome the specific challenges for Indian
languages, are also addressed in the paper.
Ritesh Kumar.
Towards automatic identification of linguistic
politeness in Hindi texts
Abstract:
In this paper I present a classifier for automatic identification of
linguistic politeness in Hindi texts. I have used the manually
annotated corpus of over 25,000 blog comments to train an SVM which
classifies any Hindi text into one of the four classes – appropriate,
polite, neutral and impolite. Making use of the discursive and
interactional approaches to politeness the paper gives an exposition of
the normative, conventionalised politeness structures of Hindi. It is
seen that using these manually recognised structures as features in
training the SVM significantly improves the performance of the
classifier on the test set. The trained system gives a significantly
high accuracy of over 77% which is within 2% of the human accuracy of
around 79%
Ritesh Kumar.
Demo of the corpus of computer-mediated communication
in Hindi (CO3H)
Abstract:
Corpus of computer-mediated communication in Hindi is one of the
largest CMC corpora in any Indian language. The data in this corpus is
taken from both asynchronous and synchronous CMC. The asynchronous CMC
data is taken from:
Blogs
Web Portals
Youtube Comments
emails
The synchronous data is taken from
Public Chats
Private Chats
The corpus currently contains over 240 million words. Most of the
corpus is in both Roman and Devanagari script, since most of the
original data is in Roman script and its transliteration is also
included in the corpus. A part of the corpus is manually annotated for
politeness classes. The corpus is currently maintained in XML format.
The corpus comes with web-based interfaces for three functions -
a) Searching the Corpus
b) Browsing the Corpus
c) Annotating the Corpus [available online on
http://sanskrit.jnu.ac.in/tagit/]
There is also a Java-based API for accessing, using, modifying and also
extending the corpus.
The whole corpus is not yet available for public access. Currently
only the annotation interface is available online, and it is accessible
only through a username and password. The corpus, along with all the
interfaces for interacting with it, will, however, be released under a
Creative Commons Share Alike License by next year so that it can be
used freely for research. It will also be made available for free
download in XML format for offline use.
Daniel Hladek, Ján Staš and Jozef Juhár.
Correcting Diacritics in the Slovak Texts Using Hidden
Markov Model
Abstract:
This paper presents a fast and accurate method for correcting
diacritical markings and guessing the original meaning of a word from
its context, based on a hidden Markov model and the Viterbi algorithm.
The proposed algorithm might find use in any area where erroneous text
might appear, such as web search engines, e-mail messages, office
suites, optical character recognition, or typing on small mobile device
keyboards.
Riccardo Del Gratta and Francesca Frontini.
Linking the Geonames ontology to WordNet
Abstract:
This paper illustrates the transformation of the GeoNames ontology
concepts, with their English labels and glosses, into a GeoDomain
WordNet-like resource in English, its translation into Italian, and its
linking to the existing generic WordNets of both languages. The paper
describes the criteria used for linking the domain synsets to each
other and to the generic ones, and presents the published resource in
RDF according to W3C standards and the lemon schema.
Angelina Gašpar.
Multiterm Database Quality Assessment
Abstract:
Terminology and translation consistency are the main objectives
expected to be met in the area of legal translation, which is known to
be linguistically characterized by recurrent and standard expressions,
precision and clarity. This paper describes a linguistic and
statistical approach to MultiTerm database quality assessment.
Contrastive analysis reveals the existence of terminological variants
and hence a lack of terminological consistency, which is taken as a
measure of quality and is calculated with the Herfindahl–Hirschman
Index. Database quality is verified through quality assessment
statistics obtained with SDL MultiTerm Extract and through comparison
with the reference language resources. This research raises concerns
regarding the content, application, benefits and drawbacks of the
reference language resources, and highlights the necessity for
Computer-Assisted Translation (CAT) tools and the sharing of reliable
and standardized linguistic resources.
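The Herfindahl-Hirschman Index used as the consistency measure is simply the sum of squared usage shares; a short sketch with invented variant counts:

def hhi(variant_counts):
    """Herfindahl-Hirschman Index over the usage shares of competing term
    variants for one concept: 1.0 means a single variant is used consistently,
    values near 1/n mean the n variants are used interchangeably."""
    total = sum(variant_counts)
    return sum((c / total) ** 2 for c in variant_counts)

# Toy usage: one concept rendered by three competing target-language variants.
print(hhi([80, 15, 5]))   # fairly consistent (~0.665)
print(hhi([34, 33, 33]))  # inconsistent (~0.334)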
Arseniy Gorin and Denis Jouvet.
Efficient constrained parametrization of GMM with
class-based mixture weights for Automatic Speech Recognition
Abstract:
Acoustic modeling techniques, based on clustering of the training data,
have become essential in large vocabulary continuous speech recognition
(LVCSR) systems. Clustered data (supervised or unsupervised) is
typically used to estimate the sets of parameters by adapting the
speaker-independent model on each subset. For Hidden Markov Models with
Gaussian mixture observation densities (HMM-GMM), most of the
adaptation techniques focus on re-estimating the mean vectors, whereas
the mixture weights are typically distributed almost uniformly.
In this work we propose a way of specifying the subspaces of the GMM by
associating the sets of Gaussian mixture weights with the speaker
classes and sharing the Gaussian parameters across speaker classes. The
method allows us to better parametrize the GMMs without significantly
increasing the number of model parameters. Our experiments on French
radio broadcast data demonstrate an improvement in accuracy with such a
parametrization compared to models with a similar, or even larger,
number of parameters.
Arianna Bisazza and Roberto Gretter.
Building an Arabic news transcription system with
web-crawled resources
Abstract:
This paper describes our efforts to build an Arabic ASR system with
web-crawled resources. We first describe the processing done to
handle Arabic text in general, and more particularly to cope with the
high number of different phonetic transcriptions associated with a
typical Arabic word. Then, we present our experiments to build acoustic
models using only audio data found on the web, in particular on the
Euronews portal. To transcribe the downloaded audio we compare
two approaches: the first uses a baseline trained on manually
transcribed Arabic corpora, while the second uses a universal ASR
system trained on automatically transcribed speech data of 8 languages
(not including Arabic). We demonstrate that with this approach we are
able to obtain recognition performances comparable to the ones
obtained with a fully supervised Arabic baseline.
Cvetana Krstev, Anđelka Zečević, Dusko Vitas and Tita Kyriacopoulou.
NERosetta – an Insight into Named Entity Tagging
Abstract:
Named Entity Recognition has been a hot topic in Natural Language
Processing for more than fifteen years. A number of systems for various
languages were developed using different approaches and based on
different named entity schemes and tagging strategies. We present
NERosetta, a web application that can be used to compare these various
approaches applied to aligned texts (bitexts). In order to
illustrate its functionalities we have used one literary text, its 5
bitexts involving 5 languages and 5 different NER systems. We present
some preliminary results and give guidelines for further development.
Abdualfatah Gendila and Abduelbaset Goweder.
The Pseudo Relevance Feedback for Expanding Arabic
Queries
Abstract:
With the explosive growth of the World Wide Web, Information Retrieval
Systems (IRS) have recently become a focus of research. Query expansion
is defined as the process of supplementing additional terms or phrases
to the original query to improve the information retrieval performance.
Arabic is a highly inflectional and derivational language, which makes
the query expansion process a hard task.
In this paper, the well-known Pseudo Relevance Feedback (PRF) approach
is adopted and applied to Arabic. Prior to applying PRF, the datasets
(three collections of Arabic documents) are pre-processed to create the
documents' inverted index vocabularies, and then the normal indexing
process is carried out. PRF is applied to create a modified (expanded)
version of the original query, and the target collection is indexed
once more. To judge the enhancement of the retrieval process, the
results of normal indexing and those of applying PRF are evaluated
against each other using precision and recall measures. The results
show that the PRF method significantly enhances the performance of the
Arabic Information Retrieval (AIR) system. Performance improves as the
number of expansion terms increases, up to a certain limit (35 terms);
beyond this limit, performance is unaffected or grows only
insignificantly.
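A minimal sketch of the pseudo relevance feedback step described above; the tf-idf term scoring and the toy documents are illustrative assumptions, only the 35-term cap comes from the abstract:

import math
from collections import Counter

def expand_query(query_terms, top_docs, all_docs, n_expansion=35):
    """Pseudo Relevance Feedback: treat the top-ranked documents as relevant,
    rank their terms by a simple tf-idf score, and append the best terms
    (up to `n_expansion`, the limit found effective in the paper) to the
    original query."""
    tf = Counter(term for doc in top_docs for term in doc)
    df = Counter(term for doc in all_docs for term in set(doc))
    scores = {t: tf[t] * math.log(len(all_docs) / df[t])
              for t in tf if t not in query_terms}
    expansion = sorted(scores, key=scores.get, reverse=True)[:n_expansion]
    return list(query_terms) + expansion

# Toy usage with tokenised pseudo-documents.
docs = [["economy", "growth", "bank"], ["bank", "loan"], ["sports", "match"]]
print(expand_query(["bank"], top_docs=docs[:2], all_docs=docs, n_expansion=2))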
Fadoua Ataa Allah and Jamal Frain.
Amazigh Converter based on WordprocessingML
Abstract:
Since the creation of the Royal Institute of Amazigh Culture, the
Amazigh language has been undergoing a process of standardization and
integration into information and communication technologies. This
process passes through several stages: after the stabilization of the
writing system, the encoding stage and the development of appropriate
standards for the keyboard layout, the computational linguistics stage
is now under way. Thus, with the aim of preserving the Amazigh cultural
heritage, many converters allowing the Tifinaghe ANSI-Unicode
transition and Arabic-Latin-Tifinaghe transliteration have been
developed. However, these converters cannot preserve the file layout or
process all parts of a document. To overcome these limitations, the new
WordprocessingML technology has been used.
Wei Yang, Hao Wang and Yves Lepage.
Automatic Acquisition of Rewriting Models for the
Generation of Quasi-parallel corpus
Abstract:
Bilingual or multilingual parallel corpora are an extremely important
resource as they are typically used in data-driven machine translation
systems. They make it possible to improve machine translation systems
continuously. There exist many freely available bilingual or
multilingual parallel corpora for language pairs that involve English,
especially for European languages. However, the constitution of large
collections of aligned sentences is a problem for less documented
language pairs, such as Chinese and Japanese. In this paper, we show
how to construct a free Chinese-Japanese quasi-parallel corpus by using
analogical associations based on short sentential resources collected
from the Web. We generate numerous new candidate sentences by analogy
and filter them with the attested N-sequences method to enforce fluency of
expression and adequacy of meaning. Finally, we construct a
Chinese-Japanese quasi-parallel corpus by computing similarities.
Satoshi Fukuda, Hidetsugu Nanba, Toshiyuki Takezawa and Akiko Aizawa.
Classification of Research Papers Focusing on
Elemental Technologies and Their Effects
Abstract:
We propose a method for the automatic classification of research papers
in the CiNii article database in terms of the KAKEN classification
index. This index was originally devised for classifying reports for
the KAKEN research fund in Japan, and it comprises three hierarchical
levels: Area, Discipline, and Research Field.
Traditionally, research papers have been classified using machine
learning algorithms, using the content words in each research paper as
features. In addition to these content words, we focus on elemental
technologies and their effects, as discussed in each research paper.
Examining the use of elemental technology terms used in each research
paper and their effects is important for characterizing the research
field to which a given research paper belongs. To investigate the
effectiveness of our method, we conducted an experiment using KAKEN
data. From the results, we obtained average recall scores of 0.6220,
0.7205, and 0.8530 at the Research Field, Discipline, and Area levels,
respectively.
Roman Grundkiewicz.
ErrAno: a Tool for Semi-Automatic Annotation of
Language Errors
Abstract:
Error corpora containing annotated naturally-occurring errors are
desirable resources for the development of automatic spelling and
grammar correction techniques. Studies on creating error corpora often
concern error tagging systems, but rarely the computer tools aimed at
improving manual error annotation, which is a very tedious and costly
task. In this paper we describe the development of the ErrAno tool for
semi-automatic annotation of language errors in texts composed in
Polish. The annotator's work is supported on several levels: by
providing access to the text edition history, by assisting the error
detection process, and by proposing attributes that describe the errors
found. We present in detail the rule-based system for automatic
attribute assignment, which contributes to a considerable speed-up of
the annotation process. We also propose a new error taxonomy suitable
from the point of view of automatic text correction, together with a
clear multi-level tagging model.
Abraham Hailu and Yaregal Assabie.
Itemsets-Based Amharic Document Categorization Using
an Extended A Priori Algorithm
Abstract:
Document categorization is gaining importance due to the large volume
of electronic information which requires automatic organization and
pattern identification. Due to the morphological complexity of the
language, automatic categorization of Amharic documents has become a
difficult task to carry out. This paper presents a system that
categorizes Amharic documents based on the frequency of itemsets
obtained after analyzing the morphology of the language. We selected
seven categories into which a given document is to be classified. The
task of categorization is achieved by employing an extended version of
the a priori algorithm, which has traditionally been used for the
purpose of knowledge mining in the form of association rules. The
system is tested
with a corpus containing Amharic news documents and experimental
results are reported.
Yoshimi Suzuki and Fumiyo Fukumoto.
Newspaper Article Classification using Topic Sentence
Detection and Integration of Classifiers
Abstract:
This paper presents a method for newspaper article classification
using topic sentence detection. In particular, we focus on topic
sentences and detect them using Support Vector Machines (SVMs) based on
features related to the topic of the articles. Then we classify
articles using term weighting of the words that appear in the detected
sentences. With this method, classification accuracy becomes higher
than for text classification based on word frequency alone.
We also integrate the classification results of the five classifiers.
To evaluate the proposed method, we performed text classification
experiments using Japanese newspaper articles. The classification
results were compared with the results of conventional classifiers. The
experiments showed that the proposed method is effective for newspaper
article classification.
Lidija Tepeš Golubić, Damir Boras and Nives Mikelić Preradović.
Semi-automatic detection of germanisms in Croatian
newspaper texts
Abstract:
In this paper we aimed to discover to what extent Germanisms
participate in the language of Croatian daily newspapers (more
precisely, newspaper headlines). In order to determine Germanisms in
the Croatian language, we used digitized copies of a daily newspaper,
Večernji list. According to our results, 114 Germanisms were found in
217 headlines. Compared to the 3088 headlines of the one-month corpus,
whose total number of words was 21,554, Germanisms appear in only
1.006% of cases.
Rajeev R R, Jisha P Jayan and Elizhabeth Sherly.
Interlingua Data Structure for Malayalam
Abstract:
The complexity of the Malayalam language poses special challenges to
computational natural language processing. The rich morphology, with
highly agglutinative word formation and inflection, the strong
grammatical behaviour of the language, and the formation of words with
multiple suffixes, all make research in Malayalam language technology
more challenging. In machine translation, the transfer grammar
component is essential, as it acts as a bridge between the source
language and the target language. A grammar of a language is a set of
rules which says how parts of speech can be put together to make
grammatical, or 'well-formed', sentences. This paper describes a hybrid
architecture for building a transfer grammar data structure for
Malayalam, by identifying the subject in a sentence with its syntactic
and semantic representation while tagging.
Łukasz Degórski.
Fine-tuning Chinese Whispers algorithm for a Slavonic
language POS tagging task and its evaluation
Abstract:
Chris Biemann's robust Chinese Whispers graph clustering algorithm,
working in the Structure Discovery paradigm, has been proven to perform
well enough to be used in many applications for Germanic languages. The
article presents its application to a Slavonic language (Polish),
focusing on fine-tuning the parameters and finding an evaluation method
for a POS tagging application aimed at obtaining a very small
(coarse-grained) tagset.
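For reference, the basic Chinese Whispers procedure is only a few lines; the toy word graph below is invented and the fine-tuning discussed in the paper is not reproduced:

import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20):
    """Basic Chinese Whispers graph clustering: each node starts in its own
    class and, visiting nodes in random order, adopts the class with the
    highest summed edge weight among its neighbours."""
    labels = {n: n for n in nodes}
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))
    for _ in range(iterations):
        order = list(nodes)
        random.shuffle(order)
        for n in order:
            if not neighbours[n]:
                continue
            weight = defaultdict(float)
            for m, w in neighbours[n]:
                weight[labels[m]] += w
            labels[n] = max(weight, key=weight.get)
    return labels

# Toy word graph: two loose clusters of distributionally similar words.
nodes = ["dog", "cat", "horse", "run", "walk"]
edges = [("dog", "cat", 3), ("cat", "horse", 2), ("dog", "horse", 2),
         ("run", "walk", 3), ("horse", "run", 0.5)]
print(chinese_whispers(nodes, edges))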
Mohamed Elmahdy, Mark Hasegawa-Johnson and Eiman Mustafawi.
A Transfer Learning Approach for Under-Resourced
Arabic Dialects Speech Recognition
Abstract:
A major problem with dialectal Arabic speech recognition is the
sparsity of speech resources. In this paper, we propose a transfer
learning framework to jointly use a large amount of Modern Standard
Arabic (MSA) data and a small amount of dialectal Arabic data to improve
acoustic and language modeling. We have chosen the Qatari Arabic (QA)
dialect as a typical example for an under-resourced Arabic dialect. A
wide-band speech corpus has been collected and transcribed from several
Qatari TV series and talk-show programs. A large vocabulary speech
recognition baseline system was built using the QA corpus. The proposed
MSA-based transfer learning technique was performed by applying
orthographic normalization, phone mapping, data pooling, and acoustic
model adaptation. The proposed approach achieves more than 28% relative
reduction in WER.
Mohamed Elmahdy, Mark Hasegawa-Johnson and Eiman Mustafawi.
A Framework for Conversational Arabic Speech Long
Audio Alignment
Abstract:
We propose a framework for long audio alignment for conversational
Arabic speech. Accurate alignments help in many speech processing
tasks such as audio indexing, speech recognizer acoustic model (AM)
training, audio summarizing and retrieving, etc. In this work, we have
collected more than 1400 hours of conversational Arabic besides the
corresponding non-aligned text transcriptions. For each episode,
automatic segmentation is applied using a split and merge approach. A
biased language model (LM) is trained using the corresponding text
after a pre-processing stage. A graphemic pronunciation model (PM) is
utilized because of the dominance of non-standard Arabic in
conversational speech. Unsupervised acoustic model adaptation is
applied to a generic standard Arabic AM. The adapted AM, along with the
biased LM and the graphemic PM, is used in a fast speech recognition
pass applied to the current podcast's segments. The recognizer output
is aligned with the processed transcriptions using the Levenshtein
distance algorithm. The proposed approach resulted in an alignment
accuracy of
97% on the evaluation set. A confidence scoring metric is proposed to
accept/reject aligner output. Using confidence scores, it was possible
to reject the majority of mis-aligned segments resulting in 99%
alignment accuracy.
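A minimal sketch of the word-level Levenshtein alignment used in the final step; the confidence-scoring metric of the paper is not reproduced and the example sequences are invented:

def align(hyp, ref):
    """Word-level Levenshtein alignment with backtrace, returning pairs of
    (hypothesis word or None, reference word or None)."""
    n, m = len(hyp), len(ref)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # match/substitution
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append((hyp[i - 1], ref[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None)); i -= 1
        else:
            pairs.append((None, ref[j - 1])); j -= 1
    return pairs[::-1]

# Toy usage: ASR output words versus reference transcription words.
print(align("the cat sat".split(), "the black cat sat".split()))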
Milan Rusko, Jozef Juhar, Marian Trnka, Jan Stas, Sakhia Darjaa, Daniel Hladek, Robert Sabo, Matus Pleva, Marian Ritomsky and Stanislav Ondas.
Recent Advances in the Slovak Dictation System for
Judicial Domain
Abstract:
The acceptance of speech recognition technology depends on user
friendly applications evaluated by professionals in the target field.
This paper describes the evaluation and recent advances in application
of speech recognition for the judicial domain. The evaluated dictation
system enables Slovak speech recognition through a plugin for a widely
used office word processor, and it was introduced recently after the
first evaluation in the Slovak courts. The system was improved
significantly using more acoustic databases for testing and acoustic
modeling. The textual language resources were extended in the meantime
and the language modeling techniques were improved, as described in the
paper. An end-user questionnaire on the user interface was also
evaluated and new functionalities were introduced in the final version.
According to the available feedback, it can be concluded that the final
dictation system can speed up court proceedings significantly for
experienced users willing to cooperate with this new technology
(acoustic model adaptation, insertion of proper names into documents,
etc.).
Imen Elleuch, Bilel Gargouri and Abdelmajid Ben Hamadou.
Syntactic enrichment of Arabic dictionaries normalized
LMF using corpora
Abstract:
In this paper, we deal with the representation of syntactic knowledge,
particularly syntactic behavior of Arabic verbs. In this context, we
propose an approach to identify the syntactic behavior from corpora in
order to enrich the syntactic extension of an LMF normalized Arabic
dictionary. Our approach is composed of the following steps: (i)
Identification of syntactic patterns, (ii) Construction of a grammar
suitable for each syntactic pattern, (iii) Application of the grammar
on corpora and (iv) Enrichment of the LMF Arabic dictionary. To
validate this approach, we carried out an experiment that focused on
the syntactic behavior of Arabic verbs. We used the NOOJ linguistic
platform, an Arabic corpus containing about 20500 vowelized words and
an LMF Arabic dictionary, available in our Laboratory, that contains
37000 entries and 10700 verbs. The results obtained for more than 5000
treated verbs show 76% precision and 83% recall.
Patrizia Grifoni, Maria Chiara Caschera, Arianna D'Ulizia and Fernando Ferri.
Dynamic Building of Multimodal Corpora Model
Abstract:
Designing a multimodal interaction environment, combining information
from different modalities, and providing correct interpretations of the
user's input are crucial issues in defining effective communication.
During both the information combination and the interpretation phases,
the use of corpora of multimodal sentences is very important because
they allow the integration of properties and linguistic knowledge which
are not formalised in the grammar. This paper presents the dynamic
generation of a multimodal corpus by example, which allows
human-computer dialogue to be improved because it consists of examples
that are able to generate grammatical rules and to train the model for
correctly interpreting the user's input.
Munshi Asadullah, Patrick Paroubek and Anne Vilnat
.
Converting from the French Treebank Dependencies into
PASSAGE syntactic annotations
Abstract:
We present here a converter for transforming the French Treebank
Dependency (FTB-DEP) annotations into the PASSAGE format. FTB-DEP is
the representation used by several freely available parsers and the
PASSAGE annotation was used to hand-annotate a relatively large sized
corpus, used as gold-standard in the PASSAGE evaluation campaigns. Our
converter will give the means to evaluate these parsers on the PASSAGE
corpus. We shall illustrate the mapping of important syntactic
phenomena using the corpus made of the examples of the FTB-DEP
annotation guidelines, which we have hand-annotated with PASSAGE
annotations and used to compute quantitative performance measures on
the FTB-DEP guidelines.
Quentin Pradet, Gaël de Chalendar and Guilhem Pujol
.
Revisiting knowledge-based semantic role labeling
Abstract:
Semantic role labeling has seen tremendous progress in recent years,
both for supervised and unsupervised approaches. Knowledge-based
approaches have been neglected even though they have been shown to bring
the best results in the related word sense disambiguation task. We
contribute a simple knowledge-based approach with an easy-to-reproduce
specification. We also present a novel approach to handling the passive
voice in the context of semantic role labeling that reduces the error
rate in F1 by 15.7%, showing that significant improvements can be
brought to knowledge-based approaches while retaining their key
advantages: a simple approach which facilitates the analysis of
individual errors, does not need any hand-annotated corpora and is not
domain-specific.
Abeba Ibrahim and Yaregal Assabie
.
Hierarchical Amharic Base Phrase Chunking Using HMM
With Error Pruning
Abstract:
Segmentation of a text into non-overlapping syntactic units (chunks)
has become an essential component of many applications of natural
language processing. This paper presents an Amharic base phrase chunker
that groups syntactically correlated words at different levels using an
HMM. Rules are used to correct phrases incorrectly chunked by the
HMM. For the identification of phrase boundaries, the IOB2 chunk
specification is selected and used in this work. To test the
performance of the system, a corpus was collected from Amharic news
outlets and books. The training and testing datasets were prepared
using the 10-fold cross-validation technique. Test results on the
corpus showed an average accuracy of 85.31% before applying the rules
for error correction and an average accuracy of 93.75% after applying
them.
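A small Python sketch of the IOB2 chunk specification mentioned above, decoding tag sequences into chunks; the tags and tokens are invented and the HMM itself is not shown:

def iob2_chunks(tokens, tags):
    """Return (chunk_type, chunk_tokens) pairs decoded from IOB2 tags."""
    chunks, start, kind = [], None, None
    for i, tag in enumerate(tags):
        # A chunk starts at B-X, or at I-X whose type differs from the open chunk.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and kind != tag[2:])
        if starts_new or tag == "O":
            if kind is not None:
                chunks.append((kind, tokens[start:i]))
            kind, start = (tag[2:], i) if tag != "O" else (None, None)
    if kind is not None:
        chunks.append((kind, tokens[start:]))
    return chunks

if __name__ == "__main__":
    toks = ["the", "big", "dog", "barked"]       # toy tokens
    tags = ["B-NP", "I-NP", "I-NP", "B-VP"]      # toy IOB2 tags
    print(iob2_chunks(toks, tags))               # [('NP', [...]), ('VP', [...])]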
Tugba Yildiz, Banu Diri and Savas Yildirim
.
Analysis of Lexico-syntactic Patterns for Meronym
Extraction from a Turkish Corpus
Abstract:
In this paper, we apply lexico-syntactic patterns to extract the
meronymy relation from a huge corpus of Turkish. Once the system takes
a huge raw corpus and extracts the matched cases for a given pattern, it
proposes a list of whole-part pairs ranked by their co-occurrence
frequency. For this purpose, we exploited and compared a list of pattern
clusters. The clusters fall into three types: general patterns,
dictionary-based patterns, and bootstrapped patterns. We examined how
these patterns improve the system performance, especially within a
corpus-based approach using distributional features of words. Finally, we
discuss all the experiments in a comparative analysis and show the
advantages and disadvantages of the approaches, with promising results.
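A toy Python sketch of the general recipe of ranking whole-part candidates by pattern-match frequency; the English-like regular expression is a stand-in, not one of the Turkish lexico-syntactic patterns used in the paper:

import re
from collections import Counter

# Stand-in pattern, e.g. "the wheel of the car" -> (part="wheel", whole="car").
PATTERN = re.compile(r"the (\w+) of the (\w+)")

def extract_pairs(corpus_lines):
    """Count whole-part candidates proposed by the pattern."""
    counts = Counter()
    for line in corpus_lines:
        for part, whole in PATTERN.findall(line.lower()):
            counts[(whole, part)] += 1
    return counts

if __name__ == "__main__":
    corpus = [
        "The wheel of the car was damaged.",
        "He painted the door of the house.",
        "The wheel of the car spun freely.",
    ]
    for (whole, part), freq in extract_pairs(corpus).most_common():
        print(f"{whole} -> {part}: {freq}")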
Sobha Lalitha Devi.
Word Boundary Identifier as a Catalyzer and
Performance Booster for Tamil Morphological Analyzer
Abstract:
In this paper we present an effective and novel approach for the
morphological analysis of Tamil using word-boundary identification
process. Our main aim is to primarily segment the agglutinated and
compound words into a set of simpler words by identifying the
constituent words' word-boundaries. Conditional Random Fields, a
machine learning technique, is used for word-boundary detection, while
the morphological analysis is performed by a rule-based morphological
analyzer developed using finite-state automata and paradigm-based
approaches. The whole process is essentially a three-step architecture.
Initially we use CRFs for word-boundary identification and
segmentation. Secondly, we perform Sandhi correction using word-level
contextual linguistic rules. Finally, the resultant words are analyzed
by the morphological analyzer engine. The main advantage of the
proposed architecture is that it completely avoids the failure of word
analysis due to compounding, agglutination, the lack of complex
orthographic rules and dialectal variation of words. On testing
the efficiency of this approach with randomly collected online data
of 64K words, the analyzer achieved an appreciable accuracy of 98.5%
while maintaining an analysis time of at most 3 ms for any word.
Pinkey Nainwani, Esha Banerjee and Shiv Kaushik
.
Implementing a Rule-Based Approach for Agreement
Checking in Hindi
Abstract:
The aim of this paper is to describe the issues and challenges while
handling the agreement features of Hindi from the perspective of
natural language processing in order to build a robust grammar checker
which can be scaled to other Indian languages. Grammar is an integral
part of any language, where each element of the language depends on the
others to generate the correct output; hence the notion of "agreement".
In Hindi, intra-sentential agreement happens at
various levels: a) demonstrative and noun, b) adjective and noun, c)
noun and verb, d) noun and other particles, and e) verb and other
particles. The approach here is to develop a grammar checker with
rule-based methods, which have been proven to provide qualitative results
for Indian languages, as they are inflecting in nature and have relatively
free word order. At present, the work is limited to syntactic
structures and does not deal with semantic aspects of the language.
Friedrich Neubarth, Barry Haddow, Adolfo Hernández Huerta and Harald Trost
.
A hybrid approach to statistical machine translation
between standard and dialectal varieties
Abstract:
Using statistical machine translation (SMT) for dialectal varieties
usually suffers from data sparsity, but combining word-level and
character-level models can yield good results even with small training
data by exploiting the relative proximity between the two varieties. In
this paper, we describe a specific problem and its solution, arising
with the translation between standard Austrian German and Viennese
dialect. In a phrase-based approach to SMT, complex lexical
transformations and syntactic reordering cannot be dealt with. These
are typical cases where rule-based preprocessing of the source data is
the preferable option, hence the hybrid character of the resulting
system. One such case is the transformation of imperfect verb
forms into perfect tense, which involves detection of clause boundaries
and identification of clause type. We present an approach that utilizes
a full parse of the source sentences and discuss the problems that
arise with such an approach. Within the developed SMT system, the
models trained on preprocessed data unsurprisingly fare better than
those trained on the original data, but also unchanged sentences gain
slightly better scores. This shows that including a rule-based layer
dealing with systematic non-local transformations increases the overall
performance of the system, most probably due to a higher accuracy in
the alignment.
Aleksander Pohl and Bartosz Ziółko
.
A Comparison of Polish Taggers
in the Application for Automatic Speech Recognition
Abstract:
In this paper we investigate the performance of Polish taggers in the
context of automatic speech recognition (ASR). We use a morphosyntactic
language model to improve speech recognition in an ASR system and seek
the best Polish tagger for our needs. Polish is an inflectional
language, and an n-gram model using morphosyntactic features, which
reduces data sparsity, seems to be a good choice. We investigate the
differences between the morphosyntactic taggers in that context. We
compare the results of tagging with respect to the reduction of word
error rate as well as the speed of tagging. As it turns out, at present the
taggers using conditional random field (CRF) models perform best
in the context of ASR. A broader audience might also be interested in
the other features of the taggers discussed here, such as ease of
installation and usage, which are usually not covered in the papers
describing such systems.
Pawel Dybala, Rafal Rzepka, Kenji Araki and Kohichi Sayama
.
Detecting false metaphors in Japanese
Abstract:
In this paper we propose a method to automatically distinguish between two
types of formally identical expressions in Japanese: metaphorical
similes and metonymical comparisons. An expression like "Kujira no you na
chiisai me" can be translated into English as "Eye small as whale's",
while in Japanese, due to the lack of a possessive case, it literally
reads as "Eye small as whale" (no apostrophe). This makes it
impossible to formally distinguish between expressions like this and
actual metaphorical similes, as both use the same template. In this
work we present a system able to distinguish between these two types of
expressions. The system takes Japanese expressions of simile-like forms
as input and uses the Internet to check possessive relations between
the elements constituting the expression. We propose a method of
calculating a score based on the co-occurrence of source and target pairs
in Google (e.g. "whale's eye"). An experimentally set threshold allowed
the system to distinguish between metaphors and non-metaphors with an
accuracy of 75%. We discuss the results and give some ideas for
future work.
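An illustrative Python sketch of the thresholding idea described above; the counts and the scoring formula are made-up stand-ins for the web-derived co-occurrence counts used by the authors:

def possessive_score(source, target, counts):
    """Normalised co-occurrence of the possessive pair, e.g. ("whale", "eye")."""
    pair = counts.get((source, target), 0)
    return pair / (counts.get(source, 1) * counts.get(target, 1)) ** 0.5

def is_metaphorical(source, target, counts, threshold=1e-4):
    """Low possessive co-occurrence -> treat the simile-like form as a metaphor."""
    return possessive_score(source, target, counts) < threshold

if __name__ == "__main__":
    # Toy counts standing in for search-engine hit counts.
    toy_counts = {("whale", "eye"): 120, "whale": 50000, "eye": 900000,
                  ("rock", "patience"): 0, "rock": 400000, "patience": 200000}
    print(is_metaphorical("whale", "eye", toy_counts))       # False: literal possessive
    print(is_metaphorical("rock", "patience", toy_counts))   # True: metaphorical simile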
Akanksha Bansal, Esha Banerjee and Girish Nath Jha
.
Corpora creation for Indian Language Technologies –
the ILCI project
Abstract:
This paper presents an overview of corpus classification and development in
electronic format for 16 language pairs, with Hindi as the source
language. In a multi-lingual country like India, the major thrust in
language technology lies in providing inter-communication services and
direct information access in one’s own language. As a result, language
technology in India has seen major developments over the last decade in
terms of machine translation and speech synthesis systems. As deeper
research advances, the need for high-quality standardised corpora is
seen as a primary challenge. To address these needs, the
government of India has initiated a mega project called the Indian
Languages Corpora Initiative (ILCI) to collect parallel annotated
corpora in 17 scheduled languages of the Indian constitution. The
project is currently in its second phase, within which it aims to
collect 850,000 parallel annotated sentences in 17 Indian languages in
the domains of Entertainment and Agriculture. Together with the
600,000 parallel sentences collected in Phase 1 in the domains of
Health and Tourism (Choudhary & Jha, 2011), the corpus being
developed is one of the largest known parallel annotated corpora for
any Indian language to date. This phase will ultimately also see the
development of chunking standards for processing the annotated corpus.
Kumar Nripendra Pathak and Girish Nath Jha
.
A Generic Search for Heritage Texts
Abstract:
At present, quick access to any information is desired in every field
with the help of IT research and development. Keeping this in mind,
this paper deals with the ongoing research on linguistic resources and
a generic search in the Special Centre for Sanskrit Studies, Jawaharlal
Nehru University, New Delhi, India. The system is based on lexical
resource databases with idiosyncratic structures accessed by
configuration files. The search results can be converted into a number
of Indian and Roman scripts using a converter. The paper will present
the system (http://sanskrit.jnu.ac.in/) and how it can be used to
quickly create searchable resources for heritage or other texts.
Vijay Sundar Ram and Sobha Lalitha Devi
.
Pronominal Resolution in Tamil using Tree CRFs
Abstract:
We describe our work on pronominal resolution in Tamil using Tree CRFs.
Pronominal resolution is the task of identifying the referent of a
pronominal. In this work we have studied third person pronouns in Tamil
such as ‘avan’, ‘aval’, ‘athu’ and ‘avar’ (he, she, it and they,
respectively). Tamil is a Dravidian language that is morphologically
rich and highly agglutinative. The features for learning are
developed using the morphological features of the language and
the dependency-parsed output. By doing this, we can learn the features
used in the salience factor approach and the constraints mentioned in
the structural analysis of anaphora resolution. The work is carried out
on tourism domain data from the web. We have obtained 70.8% precision
and 66.5% recall. The results are encouraging.
Paweł Skórzewski.
Gobio and PSI-Toolkit: Adapting a deep parser to an
NLP toolkit
Abstract:
The paper shows an example of how an existing stand-alone linguistic tool
may be adapted to an NLP toolkit that operates on different data
structures and a different POS tagset. PSI-Toolkit is an open-source
set of natural language processing tools. One of its main features is
the possibility of incorporating various independent processors. Gobio
is a deep natural language parser used in the Translatica
machine translation system. The paper describes the process of adapting
Gobio to PSI-Toolkit, namely the conversion of Gobio’s data structures
into the PSI-Toolkit lattice and the opening of Gobio’s rule files for
editing by a PSI-Toolkit user. The paper also covers the technical issues
of substituting Gobio’s tokenizer and lemmatizer with processors used in
PSI-Toolkit, e.g. Morfologik.
Jerid Francom and Mans Hulden
.
Diacritic error detection and restoration via
part-of-speech tags
Abstract:
In this paper we address the problem of diacritic error detection and
restoration—the task of identifying and correcting missing accents in
text. In particular, we evaluate the performance of a simple
part-of-speech tagger-based technique, comparing it to other
well-established methods for error detection/restoration: unigram
frequency, decision lists and grapheme-based approaches. In languages
such as Spanish, the current focus, diacritics play a key role in
disambiguation, and the results show that a straightforward modification to
an n-gram tagger can be used to achieve good performance in diacritic
error identification without resorting to any specialized machinery.
Our method should be applicable to any language where diacritics
distribute comparably and perform similar roles of disambiguation.
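For comparison, a minimal Python sketch of the unigram-frequency baseline mentioned above (not the tagger-based method itself); the training tokens are invented:

import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    """Remove combining accent marks, e.g. 'está' -> 'esta'."""
    norm = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in norm if not unicodedata.combining(ch))

def build_restoration_table(training_tokens):
    """Map each bare form to its most frequent accented variant."""
    table = defaultdict(Counter)
    for tok in training_tokens:
        table[strip_diacritics(tok)][tok] += 1
    return {bare: variants.most_common(1)[0][0] for bare, variants in table.items()}

def restore(tokens, table):
    return [table.get(strip_diacritics(t), t) for t in tokens]

if __name__ == "__main__":
    train = ["está", "esta", "está", "canción", "cancion", "canción"]  # toy data
    table = build_restoration_table(train)
    print(restore(["esta", "cancion", "nueva"], table))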
Kadri Muischnek, Kaili Müürisep and Tiina Puolakainen
.
Estonian Particle Verbs And Their Syntactic Analysis
Abstract:
This article investigates the role of particle verbs in Estonian
computational syntax. Estonian is a heavily inflecting language with
free word order. The authors look at two problems related to
particle verbs in the syntactic processing of Estonian: first, recognizing
these multi-word expressions in texts and, second, taking into account
the valency patterns of the particle verbs compared to the valency
patterns of the respective simplex verbs. A lexicon-based and rule-based
strategy in the Constraint Grammar framework is used for recognizing
particle verbs and for exploiting the knowledge of their valency patterns
in the subsequent dependency analysis.
Piotr Malak.
Information searching over Cultural Heritage objects,
and press news
Abstract:
This paper presents results and conclusions from a subtask of the CHiC
(Cultural Heritage in CLEF) 2013 campaign and from the realization of a
Sciex-NMS grant. Within those projects, press news as well as cultural
heritage (CH) object descriptions have been processed for IR,
and particularly for information searching. Problematic issues that
occurred during the automatic text processing of press news are
discussed. The paper presents an analysis of the results and comments on
the IR techniques used and their adequacy for information retrieval for Polish.
Filip Graliński.
Polish digital libraries as a text corpus
Abstract:
A large (71 GB), diachronic Polish text corpus extracted from digital
libraries is described in this paper. The corpus is of interest to
linguists, historians, sociologists as well as to NLP practitioners.
The sources of noise in the corpus are described and assessed.
Elena Yagunova, Anna Savina, Anna Chizhik and Ekaterina Pronoza
.
Linguistic and technical constituents of the
interdisciplinary domain "Intellectual technologies and computational
linguistics"
Abstract:
We aim to determine the specific features of the interdisciplinary
domain “Intellectual technologies and computational linguistics”
(IT&CL). Our objective is to identify keyword features as the most
informative structural elements describing the scientific domain of
the corpus. The data for investigating this domain are four corpora
based on the proceedings of the four most representative Russian
international conferences. The goal of this research is to determine
subdomain terminology, linguistic and technical constituents, and so on.
We assume that each corpus represents its own subdomain of the IT&CL
topic: the Dialog and CL corpora represent more linguistic topics, while
the CAI and RCDL corpora represent more technical topics. There are two
keyword sets (KS): KW1 is based on TF-IDF, and KW2 on a comparison of
local and global frequency (the term frequency in a particular corpus
versus the term frequency in all the corpora), weighted using the
weirdness measure. Our methodology places the main emphasis on evaluation
with assessors: the main evaluation step verifies the hypothesized
division of the conferences into “more linguistic” and “more technical”
and the hypothesized domain structure. We consider clustering a further
step of evaluation, and it is quite important to find suitable features
(keywords, ratios and so on).
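A small Python sketch of a weirdness-style keyword score of the kind referred to above (relative frequency in one corpus divided by relative frequency over all corpora); the toy corpora and the smoothing are illustrative assumptions:

from collections import Counter

def weirdness_scores(target_tokens, all_tokens):
    """Score each target term by its relative frequency ratio to the full collection."""
    tgt, glob = Counter(target_tokens), Counter(all_tokens)
    n_tgt, n_glob = len(target_tokens), len(all_tokens)
    return {t: (c / n_tgt) / ((glob[t] + 1) / (n_glob + 1))   # add-one smoothing
            for t, c in tgt.items()}

if __name__ == "__main__":
    dialog = "corpus annotation corpus syntax discourse corpus".split()
    rcdl = "retrieval index digital library retrieval metadata".split()
    scores = weirdness_scores(dialog, dialog + rcdl)
    for term, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term:12s} {s:.2f}")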
Marcin Woliński and Dominika Rogozińska
.
First Experiments in PCFG-like Disambiguation of
Constituency Parse Forests for Polish
Abstract:
The work presented here is the first attempt at creating a
probabilistic constituency parser for Polish. The described algorithm
disambiguates
parse forests obtained from the Świgra parser in a manner close to
Probabilistic Context Free Grammars. The experiment was carried out
and evaluated on the Składnica treebank. The idea behind the experiment
was to check what can be achieved with this well-known method.
The results are promising: the presented approach achieves up to 94.1% PARSEVAL
F-measure and 92.1% ULAS. The PCFG-like algorithm can be
evaluated against an existing Polish dependency parser, which achieves
92.2% ULAS.
Krešimir Šojat, Matea Srebačić, Tin Pavelić and Marko Tadić
.
From Morphology to Lexical Hierarchies
Abstract:
This paper deals with language resources for Croatian and discusses the
possibilities of their combining in order to improve their coverage and
density of structure. Two resources in focus are Croatian WordNet
(CroWN) and CroDeriV – a large database of Croatian verbs with
morphological and derivational data. The data from CroDeriV is used for
enlargement of CroWN and the enrichment of its lexical hierarchies. It
is argued that the derivational relatedness of Croatian verbs plays a
crucial role in establishing morphosemantic relations and an important
role in detecting semantic relations.
Michał Marcińczuk, Marcin Ptak, Adam Radziszewski and Maciej Piasecki
.
Open dataset for development of Polish Question
Answering systems
Abstract:
In this paper we discuss recent research on an open-domain question
answering system for Polish. We present an open dataset of questions
and relevant documents called CzyWiesz. The dataset consists of 4721
questions and assigned Wikipedia articles from the Czy wiesz (Do you
know) Wikipedia project. This dataset was used to evaluate four methods
for reranking a list of answer candidates: Tf.Idf on the base forms,
Tf.Idf on the base forms plus information from dependency analysis,
Minimal Span Weighting and modified Minimal Span Weighting. The
reranking methods improved the accuracy and MRR scores for both
datasets: the development and the CzyWiesz datasets. The best
improvement was obtained for Tf.Idf on base forms with dependency
information: nearly 5 percentage points for the development dataset and
nearly 7 percentage points for the CzyWiesz dataset.
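A rough Python sketch of Tf.Idf reranking over base forms, one of the methods listed above; lemmatization is assumed to have been done already and the data are toy inputs, not the CzyWiesz dataset:

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(question_lemmas, candidates):
    """Order candidate passages by tf-idf cosine similarity to the question."""
    docs = [c.split() for c in candidates]
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d))
           for d in docs for t in d}
    qv = tfidf_vector(question_lemmas, idf)
    scored = [(cosine(qv, tfidf_vector(d, idf)), c) for d, c in zip(docs, candidates)]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    question = "who invent telephone".split()          # already lemmatized (toy)
    candidates = ["bell invent telephone in 1876",
                  "telephone network grow fast",
                  "bell be a scientist"]
    for score, cand in rerank(question, candidates):
        print(f"{score:.3f}  {cand}")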
Aimilios Chalamandaris, Pirros Tsiakoulis, Sotiris Karabetsos and Spyros Raptis
.
An Automated Method for Creating New Synthetic Voices
from Audiobooks
Abstract:
Creating new voices for a TTS system often requires a costly procedure
of designing and recording an audio corpus, a time- and effort-consuming
task. Using publicly available audiobooks as the raw material of an
audio corpus for such systems creates new perspectives regarding the
possibility of creating new synthetic voices. This paper addresses the
issue of creating new synthetic voices based on audiobooks in a fully
automated manner. As an audiobook includes several types of speech,
such as narration, character playing, etc., special care is given to
identifying the audiobook subset that leads to a neutral and
general-purpose synthetic voice. Part of the work described in this paper was
performed during the participation of our TTS system in the Blizzard
Challenge 2013. Subjective experiments (MOS and SUS) were carried out
in the framework of the Blizzard Challenge itself and selected results
of the listening tests are also provided. The results indicate that the
synthetic speech is of high quality for both domain specific and
generic speech. Further plans for exploiting the diversity of the
speech incorporated in an audiobook are also described in the final
section where the conclusions are discussed.
Maciej Piasecki, Łukasz Burdka and Marek Maziarz
.
Wordnet Diagnostics in Development
Abstract:
The larger a wordnet is, the more difficult keeping it error free
becomes. Thus the need for sophisticated diagnostic tests emerges. In
this paper we present a set of diagnostic tests and diagnostic tools
dedicated to Polish WordNet, plWordNet. The wordnet has been in steady
development for seven years and has recently reached 120k synsets. We
propose a typology of the diagnostic levels, describe formal,
structural and semantic rules for seeking errors within plWordNet, and
present a new method of automated induction of the diagnostic rules.
Finally, we discuss the results and benefits of the approach.
Jacek Marciniak
.
Building wordnet based ontologies with expert knowledge
Abstract:
The article presents the principles of creating wordnet based ontologies which contain general knowledge about the world as well as specialist expert knowledge. Ontologies of this type are a new method of organizing lexical resources. They possess a wordnet structure expanded by domain relations and synsets ascribed to general domains and local-context taxonomies. Ontologies of this type are handy tools for indexers and searchers working on massive content resources such as internet services, repositories of digital images or e-learning repositories.
Ladan Baghai-Ravary.
Identifying and Discriminating between Coarticulatory
Effects
Abstract:
This paper uses a data-driven approach to characterise the degree of
similarity between realised instances of the same nominal phoneme in
different contexts. The results indicate the circumstances that produce
consistent deviations from a more generic pronunciation. We utilise
information local to the phonemes of interest, independent of the
phonetic or acoustic context, cluster the instances, and then identify
correlations between the clusters and the phonetic contexts and their
acoustic features. The results identify a number of phonetic contexts
that have a detectable effect on the pronunciation of their neighbours.
The methods employed make minimal use of prior assumptions or phonetic
theory. We demonstrate that coarticulation often makes the acoustic
properties of vowels more consistent across utterances.
Milos Jakubicek, Vojtech Kovar and Marek Medved
.
Towards taggers and parsers for Slovak
Abstract:
In this paper we present tools prepared for the morphological and
syntactic processing of Slovak: a model trained for tagging with
RFTagger and two syntactic analyzers, Synt and SET, whose Czech
grammars we adapted for Slovak. We describe the training process of
RFTagger using the r-mak corpus and the modifications of both parsers,
which have been performed partially in the lexical analysis and mainly
in the formal grammars used in both systems. Finally we provide an
evaluation of both tagging and parsing, the latter on two datasets: a
phrasal and a dependency treebank of Slovak.
Sardar Jaf.
The Hybridisation of a Data-driven parser for Natural
Languages
Abstract:
Identifying and establishing structural relations between words in
natural language sentences is called parsing. Ambiguities in natural
languages make parsing a difficult task. Parsing is more difficult when
dealing with a structurally complex natural language such as Arabic,
which contains a number of properties that make it particularly
difficult to handle. In this paper, we briefly highlight some of the
complex structures of Arabic, identify different parsing
approaches (grammar-driven and data-driven) and briefly
discuss their limitations. Our main goal is to combine different
parsing approaches and produce a hybrid parser, which retains the
advantages of data-driven approaches but is guided by grammatical rules
to produce more accurate results. We describe a novel technique for
directly combining different parsing approaches. Results of the initial
experiments that we have conducted in this work, as well as our plans for
future work, are also presented.
Ines Boujelben, Salma Jamoussi and Ben Hamadou Abdelmajid
.
Genetic algorithm for extracting relations between
named entities
Abstract:
In this paper, we tackle the problem of extracting relations that hold
between Arabic named entities. The main objective of our work is to
automatically extract interesting rules based on genetic algorithms.
We first annotate our training corpus using a set of
linguistic tools. Then, a set of rules is generated using association
rule mining methods. Finally, a genetic process is applied to
discover the most interesting rules from a given data set. The
experimental results prove the effectiveness of our process in
discovering the most interesting rules.
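A generic genetic-algorithm skeleton in Python of the kind the abstract refers to; the bit-mask rule representation and the fitness function are placeholders, not the paper's actual rule encoding:

import random

random.seed(0)

def fitness(mask, rule_scores):
    # Placeholder fitness: reward high-scoring rules, penalise large rule sets.
    return sum(s for bit, s in zip(mask, rule_scores) if bit) - 0.2 * sum(mask)

def evolve(rule_scores, pop_size=20, generations=50, mutation_rate=0.05):
    """Evolve bit masks selecting a subset of candidate rules."""
    n = len(rule_scores)
    population = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, rule_scores), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)                 # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(m, rule_scores))

if __name__ == "__main__":
    scores = [0.9, 0.1, 0.7, 0.05, 0.6, 0.3]   # toy "interestingness" of 6 rules
    best = evolve(scores)
    print("selected rules:", [i for i, bit in enumerate(best) if bit])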
Kais Dukes.
Semantic Annotation of Robotic Spatial Commands
Abstract:
The Robot Commands Treebank (www.TrainRobots.com) provides semantic
annotation for 3,394 spatial commands (41,158 words), for manipulating
colored blocks on a simulated discrete 3-dimensional board. The
treebank maps sentences collected through an online game-with-a-purpose
into a formal Robot Control Language (RCL). We describe the semantic
representation used in the treebank, which utilizes semantic categories
including qualitative spatial relations and named entities relevant to
the domain. Our annotation methodology for constructing RCL statements
from natural language constructions models compositional syntax,
multiword spatial expressions, anaphoric references, ellipsis,
expletives and negation. As a validation step, annotated RCL statements
are executed by a spatial planning component to ensure they are
semantically correct within the spatial context of the board.
Adam Radziszewski.
Evaluation of lemmatisation accuracy of four Polish
taggers
Abstract:
The last three years have seen an impressive rise of interest in the
morphosyntactic tagging of Polish. Four new taggers have
been developed, evaluated and made available. Although all of them are
able to perform lemmatisation along with tagging (and they are often used
for that purpose), it is hard to find any report of their lemmatisation
accuracy.
This paper discusses practical issues related to the assessment of
lemmatisation accuracy and reports the results of a detailed evaluation
of the lemmatisation capabilities of four Polish taggers.
Jakub Dutkiewicz and Czeslaw Jedrzejek
.
Ontology-based event extraction for the Polish
language
Abstract:
The paper presents an information extraction methodology that uses
shallow parsing structures and an ontology. The methods used in this
paper are designed for Polish. The methodology creates mappings from
phrases produced by shallow parsing into thematic roles. Thematic
roles are parts of an ontological semantic model of certain events.
Besides the methodology itself, the paper includes an overview of the
shallow grammar used to produce phrases, example results and a comparison
of methods used for English with the methods used in this paper.
Łukasz Kobyliński.
Improving the Accuracy of Polish POS Tagging by Using
Voting Ensembles
Abstract:
Recently, several new part-of-speech (POS) taggers for Polish have been
presented. This is highly desired, as the quality of morphosyntactic
annotation of textual resources (especially reference text corpora) has
a direct impact on the accuracy of many other language-related tasks in
linguistic engineering. For example, in the case of corpus annotation,
most automated methods of producing higher levels of linguistic
annotation expect an already POS-analyzed text on their input. In spite
of the improvement of Polish tagging quality, the accuracy of even the
best-performing taggers is still well below 100% and the mistakes made
in POS tagging propagate to higher layers of annotation. One possible
approach to further improving the tagging accuracy is to take advantage
of the fact that there are now quite a few taggers available and they
are based on different principles of operation. In this paper we
investigate this approach experimentally and show improved results of
POS tagging accuracy, achieved by combining the output of several
state-of-the-art methods.
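A minimal Python sketch of the voting idea described above: per-token majority voting over several taggers, with a fallback tagger for ties; the tagger outputs are invented and no real Polish tagger is called:

from collections import Counter

def vote(tag_sequences, fallback_index=0):
    """tag_sequences: one tag list per tagger, all over the same tokens."""
    voted = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            voted.append(position_tags[fallback_index])   # tie -> trust fallback tagger
        else:
            voted.append(counts[0][0])
    return voted

if __name__ == "__main__":
    tagger_a = ["subst:sg:nom", "fin:sg:ter", "adj:sg:acc"]    # invented outputs
    tagger_b = ["subst:sg:nom", "fin:sg:ter", "subst:sg:acc"]
    tagger_c = ["subst:sg:acc", "fin:sg:ter", "subst:sg:acc"]
    print(vote([tagger_a, tagger_b, tagger_c]))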
Esha Banerjee, Shiv Kaushik, Pinkey Nainwani, Akanksha Bansal
and Girish Nath Jha
.
Linking and Referencing Multi-lingual corpora in
Indian languages
Abstract:
This paper is an attempt to present the ILCI-MULRET (Indian Languages
Corpora Initiative – Multilingual Linked Resources Tool) which links
annotated data and linguistic resources in Indian languages from the
Web for use in NLP. The paper discusses the ILCI project, under which
parallel corpora are being created in 17 languages of India, including
English. This corpus contains parallel data in 4 domains – Health,
Tourism, Agriculture and Entertainment – and monolingual data in more
than 10 other domains. The MULRET tool acts as a visualizer for
word-level and phrase-level information retrieval from the ILCI corpus
in multiple languages and links the search query to other available
language resources on the Internet.
Adrien Barbaresi.
Challenges in web corpus construction for low-resource
languages in a post-BootCat world
Abstract:
The state-of-the-art tools of the ’Web as Corpus’ framework rely
heavily on URLs obtained from search engines. Recently, this querying
process became very slow or impossible to perform on a low budget.
Trying to find reliable data sources for Indonesian, we perform a case
study of different kinds of URL sources and crawling strategies. First,
we classify URLs extracted from the Open Directory Project and
Wikipedia for Indonesian, Malay, Danish and Swedish in order to enable
comparisons. Then we perform web crawls focusing on Indonesian and
using the mentioned sources as start URLs. Our scouting approach using
open-source software leads to a URL database with metadata.
Maarten Janssen.
POS Tags and Less Resources Languages - The CorpusWiki
Project
Abstract:
CorpusWiki is an online system for building POS resources for any
language. For many less-described languages, part of the problem in
creating a POS-annotated corpus is that it is not always clear
beforehand what the tagset should be. By using explicit feature/value
pairs, CorpusWiki attempts to provide the kind of flexibility
that is needed for defining the tagset in the process of annotating the
corpus.
Marianne Vergez-Couret.
Tagging Occitan using French and Spanish Tree Tagger
Abstract:
Part-of-speech (POS) tagging is the first step in any Natural Language
Processing chain. It usually requires substantial effort to annotate
corpora and produce lexicons. However, when these language resources
are missing, as in Occitan, rather than concentrating the effort on
creating them, methods can be devised to adapt taggers for existing
richly resourced languages. For this to work, these methods exploit the
etymological proximity between the under-resourced language and a
richly resourced language. In this article, we focus on Occitan, which
shares similarities with several Romance languages, including French and
Spanish. The method consists in running existing morpho-syntactic
tools, here TreeTagger, on Occitan texts after a
pre-transposition of the frequent words into a richly resourced language.
We performed two distinct experiments, one exploiting similarities
between Occitan and French and the second exploiting similarities
between Occitan and Spanish. This method only requires listing
the 300 most frequent words (based on a corpus) to construct two
bilingual lexicons (Occitan/French and Occitan/Spanish). Our results
are better than those obtained with the Apertium tagger using a larger
lexicon.
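A Python sketch of the pre-transposition step described above: frequent Occitan forms are replaced by French equivalents before the text is passed to a French tagger; the tiny lexicon and the example sentence are rough illustrations, and tag_with_french_treetagger is a stand-in rather than a real TreeTagger wrapper:

# Rough illustrative entries; the paper's lexicons cover the 300 most frequent words.
OCC_TO_FR = {"lo": "le", "la": "la", "e": "et", "es": "est", "sus": "sur"}

def pre_transpose(tokens, lexicon=OCC_TO_FR):
    """Swap only the listed frequent words; leave the rest untouched."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]

def tag_with_french_treetagger(tokens):
    # Stand-in for a call to TreeTagger with a French parameter file.
    return [(tok, "UNK") for tok in tokens]

if __name__ == "__main__":
    occitan_sentence = "Lo gat es sus la taula".split()
    transposed = pre_transpose(occitan_sentence)
    print(transposed)                        # frequent words now look French
    print(tag_with_french_treetagger(transposed))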
Carlo Zoli and Silvia Randaccio
.
Smallcodes and LinMiTech: two faces of the same new
business model for the development of LRTs for LRLs
Abstract:
LinMiTech and Smallcodes represent a new business model for language
resources and tools development with particular focus on endangered and
minority languages. The key idea is that small language communities
should not be regarded as mere users of NLP tools, but as stakeholders
and true co-developers of the tools, enforcing a virtuous cycle that
enables minorities and language activists to share and boost the scope
of the LRTs. In a word, Smallcodes and LinMiTech have made real and
financially sustainable the mantra everybody repeats: "best practices
should be spread and shared among researchers and users".
Ekaterina Pronoza, Elena Yagunova and Andrey Lyashin
.
Restaurant Information Extraction for the
Recommendation System
Abstract:
In this paper a method for the analysis of a corpus of Russian reviews
(as part of information extraction) is proposed. It is aimed at the
future development of an information extraction system. This system is
intended to be a module of a recommendation system, and it is to gather
restaurant parameters from users’ reviews, structure them and feed the
recommendation module with these data. The frames analyzed are service
and food quality, cuisine, price level, noise level, etc. In this paper
service quality, cuisine type and food quality are considered. The
authors’ aim is to develop patterns and to estimate their appropriateness
for information extraction.
Velislava Stoykova.
Representation of Lexical Information for Related
Languages with Universal Networking Language
Abstract:
The paper analyses related approaches used to develop a lexical database
in the Universal Networking Language framework for representing the
Bulgarian and Slovak languages. The idea is to use a specific combination
of grammar knowledge and techniques for representing lexical semantic
relations to model a lexical information representation scheme for
related languages. The formal representation is outlined with respect
to its multilingual application for machine translation.
Tomasz Obrebski.
Deterministic Dependency Parsing with Head Assignment
Revision
Abstract:
The paper presents a technique for deterministic dependency parsing
which allows revising already-made head-assignment
decisions. Traditional deterministic algorithms, in both the phrase
structure and dependency paradigms, are incremental and base their
decisions essentially on two kinds of information: the parse history and
look-ahead tokens (the as-yet-unparsed part of the input). Typical
examples are the well-known LR(k) algorithms for a subset of
context-free grammars or Nivre's memory-based algorithms for dependency
grammars. Our idea is to defer the ultimate decision until more input
is read and parsed. While processing the i-th token, the choice is
considered preliminary and may be changed as soon as a better option
appears. This strategy may be incorporated into other algorithms.
Robert Susmaga and Izabela Szczęch
.
Visualization of Interestingness Measures
Abstract:
The paper presents a visualization tool for interestingness measures,
which provides useful insights into different domain areas of the
visualized measure and thus effectively assists measure comprehension
and their selection for KDD methods. Assuming a common, 4-dimensional
domain form of the measures, the system generates a synthetic set of
contingency tables and visualizes them in three dimensions using a
tetrahedron-based barycentric coordinate system. At the same time, an
additional, scalar function of the data (referred to as the operational
function, e.g. any interestingness measure) is rendered using colour.
Throughout the paper a particular group of interestingness measures,
known as confirmation measures, is used to demonstrate the capabilities
of the visualization tool, which range from the determination of specific
values (extremes, zeros, etc.) of a single measure, to the localization
of pre-defined regions of interest, e.g. such domain areas for which
two or more measures do not differ at all or differ the most.
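A Python sketch of the tetrahedron-based barycentric mapping described above: a 2x2 contingency table is normalised and used as barycentric weights of four vertices, with a confirmation measure as an example operational function; the vertex placement and the table convention are assumptions, not necessarily those of the authors' tool:

import math

# Vertices of a regular tetrahedron centred at the origin (one possible choice).
VERTICES = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def barycentric_point(a, b, c, d):
    """Map a contingency table (a, b, c, d) to a 3-D point inside the tetrahedron."""
    total = a + b + c + d
    weights = [a / total, b / total, c / total, d / total]
    return tuple(sum(w * v[k] for w, v in zip(weights, VERTICES)) for k in range(3))

def confirmation_S(a, b, c, d):
    """Example operational function: confirmation measure S = P(H|E) - P(H|not E).
    Table convention assumed here: a=|E,H|, b=|E,not H|, c=|not E,H|, d=|not E,not H|."""
    return a / (a + b) - c / (c + d)

if __name__ == "__main__":
    table = (40, 10, 5, 45)   # toy counts
    print(barycentric_point(*table), round(confirmation_S(*table), 3))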
Mercedes García Martínez, Michael Carl and Bartolomé Mesa-Lao
.
Demo of the CASMACAT post-editing workbench -
Prototype-II: A research tool to investigate human translation
processes for Advanced Computer Aided Translation
Abstract:
The CASMACAT workbench is a new browser-based computer-assisted
translation workbench for post-editing machine translation outputs. It
builds on experience from (i) Translog-II
(http://bridge.cbs.dk/platform/?q=Translog-II), a tool written in C#,
designed for studying human reading and writing processes, and (ii)
Caitra (http://www.caitra.org/), a Computer Assisted Translation (CAT)
tool based on AJAX Web 2.0 technologies and the Moses decoder. The
CASMACAT workbench extends Translog's key-logging and eye-tracking
abilities with a browser-based front-end and an MT server in the
back-end.
The main features of this new workbench are:
1. Web-based technology, which allows for easier portability across
different machine platforms and versions.
2. Interactive translation prediction, suggesting to the human
translator how to complete the translation.
3. Interactive editing, providing additional information about the
confidence of its assistance.
4. Adaptive translation models, updating and adapting its models
instantly based on the translation choices of the user.
The translation field can be pre-filled by machine translation through
a server connection and also automatically updated online from an
interactive machine translation server. Shortcut keys are used for
functions such as navigating between segments.
The main innovation of the CASMACAT workbench is its exhaustive logging
function. This allows for completely new possibilities of analyzing
translators' behavior, both in a qualitative and a quantitative manner.
The extensive log file contains all kinds of events: keystrokes, mouse and
cursor navigation, as well as gaze information (if an eye-tracker is
connected) recorded during the translation session. The logged data can
also be replayed to visualize the moves and choices made by the
translator during the post-editing process.
Fernando Ferri, Patrizia Grifoni and Adam Wojciechowski
.
SNEP: Social Network Exchange Problem
Abstract:
This paper discusses the new frontiers of using the Internet and Social
Networks for exchanging goods by swapping, and the knowledge and
optimization algorithm proposed for this emerging business approach.
The approach is based on the use of Social Networks (SN) for sharing
knowledge and services and for exchanging goods. Exchanges are managed
according to knowledge about the community members, the domain knowledge
(the knowledge of products and goods) and the operational knowledge.
The exchange problem involves Social Networks, people and the goods
involved, defined in the form of a graph. The algorithm for optimization
is presented.
Waldir Edison Farfan Caro, Jose Martin Lozano Aparicio and Juan
Cruz
.
Syntactic Analyser for Quechua Language
Abstract:
This demo presents a morphological analyzer for Quechua which makes use
of a dynamic programming technique with a context-free grammar. The
construction of grammars for native languages is very important in order
to keep these languages alive. We focus on Quechua, a less-resourced
Native American language.
Brigitte Bigi.
SPPAS - DEMO
Abstract:
SPPAS is a tool to produce automatic annotations which include
utterance, word, syllabic and phonemic segmentation from a recorded
speech sound and its transcription. Main supported languages are:
French, English, Italian, Spanish, Chinese and Taiwanese. The resulting
alignments are a set of TextGrid files, the native file format of the
Praat software which has become the most popular tool for phoneticians
today. An important point for software which is intended to be widely
distributed is its licensing conditions. SPPAS uses only resources and
tools which can be distributed under the terms of the GNU General Public
License.
Aitor García-Pablos, Montse Cuadros and German Rigau.
OpeNER demo: Open Polarity Enhanced Named Entity Recognition
Abstract:
OpeNER is a project funded by the European Commission under the 7th Framework Programme. Its acronym stands for Open Polarity Enhanced Named Entity Recognition. OpeNER's main goal is to provide a set of open and ready-to-use tools to perform several NLP tasks in six languages: English, Spanish, Italian, Dutch, German and French. In order to display the OpeNER analysis output in a format suitable for a non-expert human reader, we have developed a Web application that displays this content in different ways. This Web application should serve as a demonstration of some of the OpeNER modules' capabilities after the first year of development.
Włodzimierz Gruszczyński, Bartosz Broda, Bartłomiej Nitoń and Maciej Ogrodniczuk.
Jasnopis: a new application for measuring readability of Polish texts
Abstract:
In the demo session we present a new application for the automatic measurement of the readability of Polish texts, making use of the two most common approaches to the topic, the Gunning FOG index and the Flesch-based Pisarek method, and two novel methods: measuring the distributional lexical similarity of a target text by comparing it to reference texts, and using statistical language modeling to automate a Taylor test.
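A rough Python sketch of the classic (English) Gunning FOG recipe mentioned above, with hard words approximated by vowel-group counting; the Polish adaptations used in Jasnopis count hard words differently, so this is only an illustration of the general formula:

import re

VOWELS = re.compile(r"[aeiouyąęó]+", re.IGNORECASE)

def syllable_estimate(word):
    """Crude syllable count: number of vowel groups, at least one."""
    return max(1, len(VOWELS.findall(word)))

def gunning_fog(text):
    """FOG = 0.4 * (words per sentence + 100 * hard words per word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    hard = [w for w in words if syllable_estimate(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(hard) / len(words))

if __name__ == "__main__":
    sample = ("Readability formulas estimate how difficult a text is. "
              "They combine sentence length with the proportion of long words.")
    print(round(gunning_fog(sample), 1))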
Shakila Shayan, Andre Moreira, Alexander Koenig, Sebastian Drude and Menzo Windhouwer.
Lexus, an online encyclopedic lexicon tool
Abstract:
Lexus is a flexible web-based lexicon tool, which can be used on any platform that provides a web browser and can run Flash Player. Lexus does not enforce any prescribed schema structure on the user; however, it supports the creation of lexica with the structure of the ISO LMF standard and promotes the usage of concept names and conventions proposed by the ISO data categories.
We will give a demo presentation of Lexus’ main features, which include embedding local and archived multimedia (image, audio, video) within different parts of a lexical entry, customizable visualization of the data, the possibility of creating data elements by accessing the ISOcat Data Category Registry or an MDF database (Toolbox/Shoebox), various within- and cross-lexica search options, and shared-access collaboration.
We will show how these features, which allow for an online shared-access multimedia encyclopedic lexicon, make Lexus a promising and suitable tool for language documentation projects.
Jakub Dutkiewicz and Czesław Jedrzejek.
A demo of ontology-based event extraction for the Polish and English languages
Abstract:
This paper describes the details of a demonstration of ontology-based event extraction for the Polish and English languages. The steps comprise the creation of a conceptual, ontological view of the event, the design of an extraction scheme, and the extraction itself, together with a presentation of the results.
Marek Maziarz, Maciej Piasecki, Ewa Rudnicka and Stanisław Szpakowicz
.
PlWordNet 2.1: DEMO
Abstract:
The first ever wordnet (WordNet) was built in the late 1980s at Princeton University. In the past two decades, hundreds of research teams followed in the footsteps of WordNet’s creators, including our team. Notably, plWordNet is one of the few such resources built not by translating WordNet, but from the ground up, in a joint effort of lexicographers and computer scientists. In 2009 the first version, with some 27000 lexical units, was made available on the Internet. Today plWordNet describes 108000 nouns, verbs and adjectives, contains nearly 162000 unique senses and 450000 relation instances. It is by far the largest wordnet for Polish, and one of the largest in the world.
PlWordNet is a semantic network which reflects the Polish lexical system. The nodes in plWordNet are lexical units (LUs, words with their senses), variously interconnected by semantic relations from a well-defined relation set. For instance, the synonymous LUs "kot 2" and "kot domowy 1" ‘cat, Felis domesticus’ have a hypernym "kot 1" ‘feline mammal, any member of the family Felidae’ and hyponyms such as "dachowiec 1" ‘alley cat’ or "angora turecka 1" ‘Turkish Angora’. Any lexical unit acquires its meaning from its relatedness to other lexical units within the system; we can reason about it by considering the relations in which it participates. Thus "kot 2" is defined as a kind of animal from the family Felidae, and "dachowiec 1" and "angora turecka 1" are kinds of Felis domesticus. Lexical units which enter the same lexicosemantic relations (but not the same derivational relations) are treated as synonyms and linked into synsets, that is, synonym sets (this is the case of "kot 2" and "kot domowy 1").
The continued growth of plWordNet has been made possible by grants from the Polish Ministry of Science and Higher Education and from the European Union. Now we
work on it in the Clarin Poland Project. We aim to build a conceptual dictionary fully representative of contemporary Polish, comparable with the largest wordnets in the world. We have made an effort to ensure that version 2.1 has the same high quality as the best wordnets out there – Princeton WordNet, EuroWordNet (a joint initiative of a dozen or so members of the European Union) or GermaNet from
Tübingen University. PlWordNet is available free of charge for any application (including commercial applications), under a licence modelled on that of Princeton WordNet.
Piotr Bański, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Piotr Pęzik, Carsten Schnober and Andreas Witt
.
KorAP: the new corpus analysis platform at IDS Mannheim
Abstract:
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening up perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. The proposed demo will present the project deliverables at the current stage, as we switch from the creation phase to the testing and fine-tuning phase.
Maarten Janssen
.
CorpusWiki: an online, language independent POS tag corpus builder
Abstract:
In this demonstration we will present the key aspects of CorpusWiki. CorpusWiki is an online environment for the creation of morphosyntactically annotated corpora, and can be used for any written language. The system attempts to keep the required (computational) knowledge on the part of the users to a bare minimum, and to maximize the output of the efforts put in by the users. Given its language-independent design, CorpusWiki primarily aims at allowing users to create basic resources for less-resourced languages (and dialects).
Zygmunt Vetulani, Marek Kubis and Bartłomiej Kochanowski
.
Collocations in PolNet 2.0. New release of "PolNet - Polish Wordnet"
Abstract:
The recent evolution from PolNet 1.0 to PolNet 2.0 consisted mainly in the further development of the verbal component, with the inclusion of concepts (synsets) represented (in many cases uniquely) by compound constructions in the form of verb-noun collocations. This extension brings new synsets to PolNet, some of which are closely related to the already existing verb synsets of PolNet 1.0. Adding collocations to PolNet turned out to be non-trivial because of specific syntactic phenomena related to collocations in Polish.
Massimo Moneglia
.
IMAGACT. A Multilingual Ontology of Action based on visual representations (DEMO)
Abstract:
Automatic translation systems often have difficulty choosing the appropriate verb when the translation of a simple sentence is required, since one verb can refer, in its own meaning, to many different actions and there is no certainty that the same set of alternatives is allowed in another language. The problem is a significant one, because reference to action is very frequent in ordinary communication and high-frequency verbs are general; i.e. they refer to different action types. IMAGACT has delivered a cross-linguistic ontology of action. Using spoken corpora, we have identified 1010 high-frequency action concepts and visually represented them with prototypical scenes. The ontology allows the definition of cross-linguistic correspondences between verbs and actions in English, Italian, Chinese and Spanish. Thanks to the visual representation it can potentially be extended to any language.
Andrea Marchetti, Maurizio Tesconi, Stefano Abbate, Angelica Lo Duca, Andrea D'Errico, Francesca Frontini and Monica Monachini
.
Tour-pedia: a web application for the analysis and visualization of opinions for tourism domain
Abstract:
We present Tour-pedia, an interactive web application that extracts opinions from reviews of accommodation gathered from different sources available online. Polarity markers display the different opinions on a map. This tool is intended to help business operators manage their online reputation.
Ewa Lukasik and Magdalena Sroczan
.
Innovation of Technology and Innovation of Meaning:
Assessing Websites of Companies
Abstract:
The research reported in this paper has been inspired by the concept of
design-driven innovation introduced by Roberto Verganti. He claimed
that the rules of innovation should be changed by radically changing
the meaning of things. The word design is etymologically derived from
a Latin expression that relates to distinguishing things by signs (de
signum). Therefore the sign and the language of signs play an important
role in catching the user’s interest in the product. The Internet is one
of the most important marketing media for a firm; therefore the meaning
of its web image, a website, should be a matter of great concern for a
company wishing to attract clients. The paper tries to answer the question
whether the websites of firms are innovative or only correctly designed
according to user-centered rules. The design-driven innovation of
the websites of selected firms has been assessed by student testers. Three
groups of Internet portals were taken into account: electricity
suppliers, banks and portals of cities. The test results showed that
the design of most of the assessed websites falls into the first
quadrant of Verganti’s technology-meaning space, i.e. neither radical
innovation of technology nor innovation of meaning was observed.
Rajaram Belakawadi, Shiva Kumar H R and Ramakrishnan A G
.
An Accessible Translation System between Simple
Kannada and Tamil Sentences
Abstract:
A first-level, rule-based machine translation system is designed and
developed for words and simple sentences of the language pair Kannada –
Tamil. These languages are Dravidian languages and have the status of
classical languages. Both grammatical and colloquial translations are
made available to the user. One can also give English words and
sentences as input and the system returns both the Kannada and Tamil
equivalents. With accessibility as the key focus, the system has an
integrated Text-To-Speech system and gives transliterated output in
Roman script for both Kannada and Tamil. This makes the tool
accessible to the visually or hearing challenged. The system has been
tested by 5 native users each of Tamil and Kannada on isolated words
and sentences of up to three words in length and was found to be
user-friendly and acceptable. The system handles sentences of the
following types: greetings, introductions, enquiries, directions and
other general ones useful for a newcomer.
Adam Wojciechowski and Krzysztof Gorzynski
.
A Method for Measuring Similarity of Books:
a Step Towards an Objective Recommender System for Readers
Abstract:
In this paper we propose a method for book comparison based on the intersection area of graphical radar charts. The method was designed as a universal tool and its most important parameter is the document feature vector (DFV), which defines a set of text descriptors used to measure particular properties of the analyzed text. The numerical values of the DFV that define a book's characteristics are plotted on a radar chart, and the intersection area drawn for two books is interpreted as a measure of bilateral similarity in the sense of the defined DFV. An experiment conducted on a relatively simple definition of the DFV gave promising results in recognizing books’ similarity (in the sense of author and literary domain). Such an approach may be used for building a recommender system for readers willing to select a book matching their preferences, recognized through the objective properties of a reference book.
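A Python sketch of the radar-chart idea described above: two DFVs are plotted on evenly spaced spokes and similarity is taken as shared area over union area; taking the pointwise minimum and maximum per spoke is a simplification of exact polygon intersection, and the feature values are invented:

import math

def radar_area(values):
    """Area of the polygon spanned by radii `values` on evenly spaced spokes."""
    n = len(values)
    angle = 2 * math.pi / n
    return sum(0.5 * values[i] * values[(i + 1) % n] * math.sin(angle)
               for i in range(n))

def similarity(dfv_a, dfv_b):
    """Shared radar area divided by union radar area (per-spoke min/max simplification)."""
    inter = radar_area([min(a, b) for a, b in zip(dfv_a, dfv_b)])
    union = radar_area([max(a, b) for a, b in zip(dfv_a, dfv_b)])
    return inter / union if union else 0.0

if __name__ == "__main__":
    # Hypothetical normalised DFVs: e.g. sentence length, type/token ratio, dialogue share...
    book_a = [0.70, 0.55, 0.30, 0.80, 0.40]
    book_b = [0.65, 0.60, 0.35, 0.75, 0.50]
    print(round(similarity(book_a, book_b), 3))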