LTC 2013 Accepted Papers and Demos with Abstracts
Marek Kubis.
Demo: Evaluating wordnets using query languages
Abstract:
The demo shows how to use the query languages of the WQuery system in order to evaluate and repair the data stored in WordNet-like
lexical databases. Various queries useful in this context are discussed and their realizations expressed in the languages of the WQuery
system are presented.
András Beke, Mária Gósy and Viktória Horváth.
Temporal variability in spontaneous Hungarian speech
Abstract:
The aim of this paper is an objective presentation of temporal features
of spontaneous Hungarian narratives, as well as a characterization of
separable portions of spontaneous speech (thematic units and phrases).
An attempt is made at capturing objective temporal properties of
boundary marking. Spontaneous narratives produced by ten speakers taken
from the BEA Hungarian Spontaneous Speech Data Base were analyzed in
terms of hierarchical units of narratives (duration of pauses and of
units, speakers’ rates of articulation, number of words produced, and
the interrelationships of all these). The results confirm the presence
of thematic units and of phrases in narratives by objective numerical
values. We conclude that (i) the majority of speakers organize their
narratives in similar temporal structures, (ii) thematic units can be
identified in terms of certain prosodic criteria, (iii) there are
statistically valid correlations between factors like the duration of
phrases, the word count of phrases, the rate of articulation of
phrases, and pausing characteristics, and (iv) these parameters exhibit
extensive variability both across and within speakers.
Ahmed Haddad, Slim Mesfar and Henda Ben Ghezala.
Assistive Arabic character recognition for android application
Abstract:
This work was integrated within the “Oreillodule” project, a real-time system for the synthesis, recognition and translation of the Arabic language. Within this framework, we set up an automatic generator of an electronic dictionary in a context geared towards Arabic handwriting recognition.
Our approach to the construction of the lexicon is similar to the one advanced for the processing of electronic documents. It is based on the automatic generation of the lexicon from the morphological and derivational characteristics of the Arabic language. Since we aim to port the system to machines with reduced capacities (embedded systems, PDAs, mobile phones, tablet PCs, etc.), we have to take into account computation time constraints and memory footprint. For that, we adopt dictionary models with a minimum of redundancy while guaranteeing a strong degree of reduction and precision.
This modelling procedure is based on a soft representation of words called a “descriptor”. These descriptors are extracted from generic silhouettes of the handwritten words to be recognised, formed by concatenating the fundamental strokes, descending and relatively vertical, of each character and then eliminating non-significant strokes. The same descriptors are used for clustering and indexing the sub-lexicons. These sub-lexicons are then used to compute the similarity between the stored forms and those to be recognised.
During the assistive recognition phase we add some new features based on the silhouette descriptor in order to make the system usable by people with physical impairments. For that, we include a modelling phase in which the system evaluates the user's writing style to enhance recognition.
Shikha Garg and Vishal Goyal.
System for Generating Questions Automatically From
Given Punjabi Text
Abstract:
This paper introduces a system for automatically generating questions
for Punjabi. The system transforms a declarative sentence into its
interrogative counterpart: it accepts sentences as input and produces a
set of possible questions for the given input. Not much work has been
done in the field of Question Generation for Indian languages. The
present paper describes a Question Generation system for the Punjabi
language that generates questions for input given in Gurmukhi script.
For Punjabi, adequate annotated corpora, part-of-speech taggers and
other Natural Language Processing tools are not yet available in the
required measure. Thus, this system relies on a Named Entity
Recognition tool. In addition, various Punjabi-specific rules have been
developed to generate output based on the named entities found in the
given input sentence.
Anu Sharma
Punjabi Sentiment (Opinion) Analyzer
Abstract:
The most active growing part of the web is “social media”. Sentiment
analysis (opinion mining) plays an important role in determining the
attitude of a speaker or a writer with respect to some topic or the
overall contextual polarity of a document. In this paper, a novel
approach is proposed for automatically extracting news reviews from
web pages using basic NLP techniques such as N-grams (unigrams and
bigrams). The author classifies news reviews broadly into two
categories, positive and negative, and experiments with the Naive Bayes
machine learning algorithm. The proposed approach achieved an average
accuracy of 80% on a multi-category dataset.
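To illustrate the kind of classifier this abstract describes, here is a minimal Python sketch of a unigram/bigram Naive Bayes sentiment classifier; the tiny English training set and the add-one smoothing are illustrative assumptions, not the author's actual Punjabi data or implementation.

import math
from collections import Counter, defaultdict

def features(text):
    """Unigram and bigram features, mirroring the abstract's N-gram setup."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.priors = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(features(doc))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, doc):
        best_label, best_score = None, float("-inf")
        for label, prior in self.priors.items():
            total = sum(self.counts[label].values())
            score = math.log(prior / sum(self.priors.values()))
            for f in features(doc):
                score += math.log((self.counts[label][f] + 1) /
                                  (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy usage with invented English reviews; the paper works on Punjabi news reviews.
clf = NaiveBayes().fit(["great insightful report", "poor biased coverage"],
                       ["positive", "negative"])
print(clf.predict("insightful coverage"))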
Mirjam S. Maučec, Gregor Donaj and Zdravko Kačič.
Improving Statistical Machine Translation with
Additional Language Models
Abstract:
This paper proposes n-best list rescoring in order to improve the
translations produced by the statistical machine translation system.
Phrase-based statistical machine translation is used as a baseline. We
have focused on translation between two morphologically rich languages
and have extended phrase-based translation into factored translation.
The factored translation model integrates linguistic knowledge into
translation in terms of linguistic tags, called factors. The factored
translation system is able to produce n-best linguistically annotated
translations for each input sentence. We propose to re-rank them based
on scores produced by additional general-domain language models. Two
types of language models were used: language models of word-forms and
language models of MSD tags. Experiments were performed on the
Serbian–Slovenian language pair. The results were evaluated
in terms of two standard metrics, BLEU and TER scores. Tests of
statistical significance based on approximate randomization were
performed for both metrics. The improvements in terms of both metrics
were statistically significant. This approach is applicable to any pair
of languages, especially to languages with rich morphologies.
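The rescoring step described above can be sketched as follows; the weights and the dummy language model are illustrative assumptions, not the paper's actual models.

def rerank(nbest, extra_lms, weights):
    """Re-rank an n-best list of (hypothesis, baseline_score) pairs by adding
    weighted log-probabilities from additional language models.

    `extra_lms` is a list of callables mapping a hypothesis (e.g. a word-form
    or MSD-tag sequence) to a log-probability; `weights` gives one weight per
    model. Both are placeholders for the paper's general-domain LMs.
    """
    def total(hyp, base):
        return base + sum(w * lm(hyp) for lm, w in zip(extra_lms, weights))
    return sorted(nbest, key=lambda hb: total(*hb), reverse=True)

# Toy usage: two hypotheses, one dummy "LM" that prefers shorter outputs.
nbest = [("translation hypothesis one", -10.2), ("hypothesis two", -10.5)]
print(rerank(nbest, extra_lms=[lambda h: -0.5 * len(h.split())], weights=[1.0])[0])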
Jing Sun, Qing Yu and Yves Lepage.
An Iterative Method to Identify Parallel Sentences
from Non-parallel Corpora
Abstract:
We present a simple yet effective method for identifying bilingual
parallel sentences from Web-crawled non-parallel data using translation
scores obtained with the EM algorithm. First, a modified IBM Model-1 is
used to estimate the word-to-word translation probabilities. Then,
these translation probabilities are used to compute translation scores
between non-parallel sentence pairs. The above two steps are run
iteratively and sentence pairs with higher translation scores are
extracted from the non-parallel data. Our approach differs from
previous research in that we take into account the information in the
non-parallel data as well. According to our experimental results, such
a method is promising for constructing training corpora for data-driven
tasks such as Statistical Machine Translation.
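A minimal sketch of the translation-score computation described above, assuming a word-to-word probability table already estimated with EM (the table and the toy sentence pair are invented for illustration):

import math

def model1_score(src_words, tgt_words, t_prob, epsilon=1e-9):
    """IBM Model-1 style translation score of a target sentence given a source
    sentence: for each target word, average the lexical translation
    probabilities over all source positions (plus NULL), then combine in log
    space and normalise by target length."""
    src = ["NULL"] + list(src_words)
    log_score = 0.0
    for t in tgt_words:
        p = sum(t_prob.get((s, t), 0.0) for s in src) / len(src)
        log_score += math.log(max(p, epsilon))
    return log_score / max(len(tgt_words), 1)

# Toy usage with an invented probability table (source, target) -> p.
t_prob = {("maison", "house"): 0.8, ("la", "the"): 0.7}
print(model1_score(["la", "maison"], ["the", "house"], t_prob))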
Brigitte Bigi.
A phonetization approach for the forced-alignment task
Abstract:
The phonetization of text corpora requires a sequence of processing
steps and resources in order to convert a normalized text into its
constituent phones so that it can be directly exploited by a given
application.
This paper presents a generic approach to text phonetization and
concentrates on the aspects of phonetizing unknown words, which serve
to develop a phonetizer in the context of forced-alignment
applications. It is a dictionary-based approach which is as
language-independent as possible: the approach is applied to French,
Italian, English, Vietnamese, Khmer, Chinese and Pinyin for Taiwanese.
The tool and the linked resources are distributed under the terms of
the GPL license.
Nicolas Neubauer, Nils Haldenwang and Oliver Vornberger.
Differences in Semantic Relatedness as Judged by
Humans and Algorithms
Abstract:
Quantifying the semantic relatedness of two terms is a field with a
vast amount of research, as the knowledge it provides has applications
for many Natural Language Processing problems. While many algorithmic
measures have been proposed, it is often hard to say whether one
measure outperforms another, since their evaluation often lacks
meaningful comparisons to human judgement of semantic relatedness. In
this paper we present a study using the BLESS data set to compare the
preferences of humans regarding semantic relatedness to popular
algorithmic baselines, PMI and NGD, which shows that significant
differences in relationship-type preferences between humans and
algorithms exist.
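The two algorithmic baselines mentioned in the abstract have standard definitions; a small Python sketch computes both from co-occurrence counts (the counts themselves are invented here, in practice they would come from a corpus or web hit counts):

import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise Mutual Information from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

def ngd(count_x, count_y, count_xy, n_docs):
    """Normalized Google Distance over document frequencies
    (0 = closely related, larger = less related)."""
    fx, fy, fxy = math.log2(count_x), math.log2(count_y), math.log2(count_xy)
    return (max(fx, fy) - fxy) / (math.log2(n_docs) - min(fx, fy))

# Toy counts for a term pair in a hypothetical 1,000,000-document collection.
print(pmi(120, 5000, 3000, 1_000_000))
print(ngd(5000, 3000, 120, 1_000_000))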
Nicolas Neubauer, Nils Haldenwang and Oliver Vornberger.
The Bidirectional Co-occurrence Measure: A Look at
Corpus-based Co-occurrence Statistics and Current Test Sets
Abstract:
Determining the meaning of or understanding a given input in natural
language is a challenging task for a computer system. This paper
discusses a well-known technique for determining the semantic
relatedness of terms: corpus-based co-occurrence statistics. Aside from
presenting a new approach to this technique, the Bidirectional
Co-occurrence Measure, we also compare it to two well-established
measures, Pointwise Mutual Information and the Normalized Google
Distance. Taking these as a basis, we discuss a multitude of popular
test sets, their test methodology, strengths and weaknesses, while also
providing experimental evaluation results.
Elzbieta Hajnicz.
Actualising lexico-semantic annotation of Składnica
Treebank to modified versions of source resources
Abstract:
In this paper a method of automatic update of the lexico-semantic
annotation of Składnica treebank by means of PlWN wordnet senses is
described. Both resources undergo intensive development. The method is
based on information which is considered invariant between subsequent
versions of a resource.
Senem Kumova Metin, Bahar Karaoglan and Tarik Kişla.
Using IR Representation in Text Similarity Measurement
Abstract:
In text similarity detection, expanding texts with knowledge resources
has been tried by previous researchers to overcome the syntactic
heterogeneity in expressing similar meanings. In this work, we present
a novel method that compares the retrieved data, which we call the IR
(Information Retrieval) representation of the texts, rather than the
original texts themselves. The abstracts and the titles of the texts
are provided as queries to the IR system and, among the different
alternatives in the retrieved information (such as documents and
snippets), the URL address lists are considered as the contributor to
text similarity detection. The IR representations of the texts are
subjected to pair-wise distance measurements. Two important results
obtained from the experiments are: 1) IR representations obtained by
querying the titles outperform the representations obtained by using
the abstracts in text similarity detection; 2) querying abstracts does
not improve the text similarity measurements.
Krzysztof Jassem.
PSI-Toolkit - an Open Architecture Set of NLP Tools
Abstract:
The paper describes PSI-Toolkit, a set of NLP tools designed within a
grant of the Polish Ministry of Science and Higher Education. The main
feature of the toolkit is its open architecture, which allows for the
integration of NLP tools designed in independent research centres. The
toolkit processes texts in various natural languages. The annotators
may be run with several options to differentiate the format of input
and output. The rules used by PSI-Tools may be supplemented or replaced
by the user. External tools may be incorporated into the PSI-Toolkit
environment. Corpora annotated by other toolkits may be read into the
PSI-Toolkit data structure and further processed.
Jacek Malyszko and Agata Filipowska.
Crowdsourcing in creation of resources for the needs
of sentiment analysis
Abstract:
One of the biggest challenges in the field of information extraction
is the creation of high-quality linguistic resources that make it
possible to develop or validate the created methods. These resources
are scarce, as the cost of their creation is high and they are usually
designed to suit specific needs and as such are hardly reusable.
Therefore, the community is researching new methods for the creation of
linguistic resources. This paper is in line with this research, as its
goal is to indicate how the opinion of the crowd may influence the
process of creating linguistic resources. The authors propose methods
for building resources for the needs of sentiment analysis, together
with a validation of the proposed approaches.
Thao Phan.
Pre-processing as a Solution to Improve
French-Vietnamese Named Entity Machine Translation
Abstract:
The translation of proper names (PNs) and noun phrases containing PNs
from a source language into a target language poses a great number of
difficulties, both for human translators and for current machine
translation systems. This paper presents some problems relating to
named entity machine translation, namely proper name machine
translation (PNMT) from French into Vietnamese. To deal with these
difficulties, we propose some pre-processing solutions for reducing
certain PNMT errors made by the current French-Vietnamese MT systems in
Vietnam. The pre-processing program, built and tested with a
French-Vietnamese parallel corpus of PNs, achieves a significant
improvement in MT quality.
Zang Yuling, Jin Matsuoka and Yves Lepage.
Extracting Semantically Analogical Word Pairs from
Language Resources
Abstract:
A semantic analogy (e.g., sun is to planet as nucleus is to electron)
is a pair of word pairs which have similar semantic relations. Semantic
analogies may be useful in Natural Language Processing (NLP) tasks and
may be applied in several ways; they have great potential in sentence
rewriting tasks, reasoning systems and machine translation. In this
paper, we combine three methods to extract analogical word pairs: using
patterns, clustering word pairs, and measuring semantic similarity
using a vector space model. We show how to produce a large number of
clusters of varying quality, from good to poor, with a reasonable
number of clusters of good quality.
Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary and Magdalena Zawisławska.
Polish Coreference Corpus
Abstract:
This article describes the composition, annotation process and
availability of the newly constructed Polish Coreference Corpus – the
first substantial Polish corpus of general nominal coreference. The
tools used in the process and final linguistic representation formats
are also presented.
Stefan Daniel Dumitrescu and Tiberiu Boroş.
A unified corpora-based approach to Diacritic
Restoration and Word Casing
Abstract:
We propose a surface processing system that attempts to solve two
important problems: diacritic restoration and word casing. If solved
with sufficient accuracy, Internet-extracted text corpora would benefit
from a sensible quality increase, leading to better performance for
every task using such text corpora. At its core, the system uses a
Viterbi algorithm to select the optimal state sequence from a variable
number of possible options for each word in a sentence. The system is
language independent; as external resources, the system uses a
language model and a lexicon. The language model is used to estimate
transition scores for the sequential alternatives generated from the
lexicon. We obtain targeted Word Accuracy Rates (WAR) of over 92% and
an all-word WAR of over 99%.
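A minimal sketch of the Viterbi selection step described above; the candidate lexicon and the bigram scorer are placeholders for the paper's actual resources:

import math

def viterbi(words, alternatives, bigram_logp):
    """Pick one alternative per word (e.g. diacritized / cased variants from a
    lexicon) maximising the sum of bigram log-probabilities.

    `alternatives(word)` returns candidate forms; `bigram_logp(prev, cur)`
    returns a log-probability. Both are placeholders for the paper's lexicon
    and language model."""
    best = {"<s>": (0.0, [])}                      # state -> (score, path)
    for word in words:
        new_best = {}
        for cand in alternatives(word):
            score, path = max(
                ((s + bigram_logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda sp: sp[0])
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy usage: restore diacritics in Romanian-like text with a dummy LM.
lexicon = {"pana": ["pana", "până", "pană"], "cand": ["când"]}
alts = lambda w: lexicon.get(w, [w])
logp = lambda prev, cur: 0.0 if (prev, cur) == ("până", "când") else -1.0
print(viterbi(["pana", "cand"], alts, logp))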
Joro Ny Aina Ranaivoarison, Eric Laporte and Baholisoa Simone Ralalaoherivony.
Formalization of Malagasy conjugation
Abstract:
This paper reports the core linguistic work performed to construct a
dictionary-based morphological analyser for Malagasy simple verbs. It
uses the Unitex platform (Paumier, 2003) and comprised the construction
of an electronic dictionary for Malagasy simple verbs. The data is
encoded on the basis of morphological features. The morphological
variations of verb stems and their combination with affixes are
formalized in finite-state transducers represented by editable graphs.
78 transducers allow Unitex to generate a dictionary of allomorphs of
stems. 271 transducers are used by the morphological analyser of Unitex
to recognize the stem and the affixes in conjugated verbs. The design
of the dictionary and transducers prioritizes readability, so that they
can be extended and updated by linguists.
Juan Luo, Aurélien Max and Yves Lepage.
Using the Productivity of Language is Rewarding for
Small Data: Populating SMT Phrase Table by Analogy
Abstract:
This paper is a partial report of the work on integrating proportional
analogy into statistical machine translation systems. Here we present a
preliminary investigation of the application of proportional analogy to
generate translations of unseen n-grams from the phrase table. We conduct
experiments with different sizes of data and implement two methods to
integrate n-gram pairs produced by proportional analogy into the
state-of-the-art machine translation system Moses. The evaluation
results show that n-grams generated by proportional analogy are
rewarding for machine translation systems with small data.
Tina Hildenbrandt, Friedrich Neubarth and Sylvia Moosmüller.
Orthographic encoding of the Viennese dialect for
machine translation
Abstract:
Language technology concerned with dialects is confronted with a
situation where the target language variety is generally used in spoken
form, and, due to a lack of standardization initiatives, educational
reinforcement and usage in printed media, written texts often follow an
impromptu orthography, resulting in great variation of spelling. A
standardized orthographic encoding is, however, a necessary
precondition in order to apply methods of language technology, most
prominently machine translation. Writing a dialect usually mediates
between similarity to a given standard orthography and precision in
representing the phonology of the dialect. The generation of uniform
resources for language processing necessitates considering additional
requirements, such as lexical unambiguousness, consistency,
morphological and phonological transparency, which are of higher
importance than readability. In the current contribution we propose a
set of orthographic conventions for the dialect/sociolect spoken in
Vienna. This orthography is mainly based on a thorough phonological
analysis, whereas deviations from this principle can be attributed to
disambiguation of otherwise homographic forms.
Jędrzej Osiński.
Resolving relative spatial relations using common
human competences
Abstract:
Spatial reasoning is an important field of artificial intelligence
with many applications in natural language processing. We present a
technique for resolving relative spatial relations such as in front of
or behind. Its main idea is based on questionnaire experiments
performed via the Internet. The presented solution was successfully
implemented in a system designed for improving the quality of
monitoring of complex situations and can be treated as a general method
applicable in many areas.
Fumiyo Fukumoto, Yoshimi Suzuki and Atsuhiro Takasu.
Multi-Document Summarization based on Event and Topic
Detection
Abstract:
This paper focuses on continuous news documents and presents a method
for extractive multi-document summarization. Our hypothesis
about salient, key sentences in news documents is that they include
words related to the target event and topic of a document. Here, an
event and a topic are defined as in the TDT project: an event is
something that occurs at a specific place and time along with all
necessary preconditions and unavoidable consequences, and a topic is
defined to be “a seminal event or activity along with all directly
related events and activities.” The difficulty in finding topics is
that they have varying word distributions, i.e., a topic word sometimes
appears frequently in the target documents and sometimes does not. In
addition to the TF-IDF term weighting method used to extract event
words, we identified topics by using two models: Moving Average
Convergence Divergence (MACD) for words with high frequencies, and
Latent Dirichlet Allocation (LDA) for low-frequency words. The method
was tested on two datasets, NTCIR-3 Japanese news documents and DUC
data, and the results showed the effectiveness of the method.
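MACD, borrowed from stock-market analysis, is the difference between a short-term and a long-term exponential moving average; a small sketch of applying it to a word's frequency series over chronologically ordered documents (window sizes and data are illustrative):

def ema(series, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha, out = 2.0 / (span + 1), []
    for x in series:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out

def macd(freq_series, short=4, long=8):
    """MACD of a word-frequency time series: EMA(short) - EMA(long).
    Large positive values flag a burst of the word, the kind of signal the
    paper uses to spot high-frequency topic words in a document stream."""
    return [s - l for s, l in zip(ema(freq_series, short), ema(freq_series, long))]

# Toy usage: a word that suddenly becomes frequent around document 5.
print(macd([1, 0, 1, 0, 6, 7, 8, 2, 1, 0]))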
Imane Nkairi, Irina Illina, Georges Linares and Dominique Fohr.
Exploring temporal context in diachronic text
documents for automatic OOV proper name retrieval
Abstract:
Proper name recognition is a challenging task in information retrieval
in large audio/video databases. Proper names are semantically rich and
usually are a key to understand the information contained in a
document. Our work focuses on increasing the vocabulary coverage of a
speech transcription system by automatically retrieving proper names
from diachronic contemporary text documents. We proposed methods that
dynamically augment the automatic speech recognition system vocabulary,
using lexical and temporal features in diachronic documents. We also
studied different metrics for proper name selection in order to limit
the vocabulary increase and therefore the impact on ASR performance.
Recognition results show a significant reduction of the word error rate
using the augmented vocabulary.
Nives Mikelic Preradovic and Damir Boras.
Knowledge-Driven Multilingual Event Detection Using
Cross-Lingual Subcategorization Frames
Abstract:
In this paper we present a knowledge-driven approach to multilingual
event detection using a combination of lexico-syntactic patterns and
semantic roles to build cross-lingual event frames, based on the
meaning of the verb prefix and the relation of the prefixed verb to its
syntactic arguments. We concentrate on SOURCE and GOAL roles, encoding
adlative, ablative, intralocative and extralocative meaning in the
event frames of five Indo-European languages: Croatian, English,
German, Italian and Czech. The lexico-syntactic pattern for each frame
was manually developed, covering directional meaning of Croatian
prefixes OD-, DO-, IZ- and U-. Apart from the possibility to detect
spatial (directional) information in five languages, results presented
here suggest a possible approach to improve foreign language learning
using a set of cross-lingual verb valency frames.
Frederique Segond, Marie Dupuch, André Bittar, Luca Dini, Lina Soualmia, Stefan Darmoni, Quentin Gicquel and Marie Helene Metzger.
Separate the grain from the chaff: make the best use
of language and knowledge technologies to model textual medical data
extracted from electronic health records
Abstract:
Electronic Health Records (EHRs) contain information that is crucial
for biomedical research studies. In recent years, there has been an
exponential increase in scientific publications about using textual
processing of medical data in fields as diverse as medical decision
support, epidemiological studies and data and semantic mining. While
the use of semantic technologies in this context demonstrates
promising results, a first experience with such an approach shed light
on some challenges, among which the need for smooth integration of
specific terminologies and ontologies into the linguistic processing
modules, as well as the independence of linguistic and expert
knowledge. Our work lies at the crossroads of natural language
processing and knowledge representation and reasoning, and aims at
providing a truly generic system to support the extraction and
structuring of medical information contained in EHRs. This paper
presents an approach which combines sophisticated linguistic processing
with a multi-terminology server and an expert knowledge server,
focusing on the independence of linguistic and expert rules.
Daniele Falavigna, Fabio Brugnara, Roberto Gretter and Diego Giuliani.
Dynamic Language Model Focusing for Automatic
Transcription of Talk-Show TV Programs
Abstract:
In this paper, an approach for unsupervised dynamic adaptation of the
language model used in an automatic transcription task is proposed. The
approach aims to build language models “focused” on the linguistic
content and speaking style of the audio documents to be transcribed, by
adapting a general-purpose language model on a running window of text
derived from automatic recognition hypotheses. The text in each window is used to
automatically select documents from the same corpus utilized for
training the general purpose language model. In particular, a fast
selection approach has been developed and compared with a more
traditional one used in the information retrieval area. The new
proposed approach allows for a real time selection of documents and,
hence, for a frequent language model adaptation on a short (less than
100 words) window of text. Experiments have been carried out on six
episodes of two Italian TV talk-show programs, by varying the size and
advancement step of the running window and the corresponding number of
words selected for focusing language models. A relative reduction in
word error rate of about 5.0% has been obtained using the running
window for focusing the language models, to be compared with a
corresponding relative reduction of 3.5% achieved using the whole
automatic transcription of each talk-show episode for focusing the
language models.
Adrien Barbaresi.
A one-pass valency-oriented chunker for German
Abstract:
The transducer described here consists of a pattern-based matching
operation over POS tags using regular expressions that takes advantage of
the characteristics of German grammar. The process aims at finding
linguistically relevant phrases with a good precision, which enables in
turn an estimation of the actual valency of a given verb. This
finite-state chunking approach does not return a tree structure, but
rather yields various kinds of linguistic information useful to the
language researcher: possible applications include simulation of text
comprehension on the syntactical level, creation of selective
benchmarks and failure analysis. It is relatively fast, as it reads its
input exactly once instead of using cascades, which greatly benefits
computational efficiency.
Bartłomiej Nitoń.
Evaluation of Uryupina’s coreference resolution
features for Polish
Abstract:
Automatic coreference resolution is an extremely difficult and complex
task. It can be approached in two different ways: using rule-based
tools or machine learning. This article describes an evaluation of a
set of surface, syntactic and anaphoric features proposed in Uryupina
2007 and their usefulness for coreference resolution in Polish texts.
John Lee.
Toward a Digital Library with Search and Visualization
Tools
Abstract:
We present a digital library prototype with search and visualization
capabilities, designed to support both language learning and textual
analysis. As in other existing libraries, users can browse texts with a
variety of reading aids; in our library, they can also search for
complex patterns in syntactically annotated and multilingual corpora,
and visualize search results over large corpora. With a web interface
that assumes no linguistic or computing background on the part of the
user, our library has been deployed for pedagogical and research
purposes in diverse languages.
Ralf Kompe, Marion Fischer, Diane Hirschfeld, Uwe Koloska, Ivan Kraljevski, Franziska Kurnot and Mathias Rudolph.
AzARc – A Community-based Pronunciation Trainer
Abstract:
In this paper a system for automatic pronunciation training is
presented. The users perform exercises where their utterances are
recorded and pronunciation quality is evaluated. The quality scores are
estimated on phonemic, prosodic, and phonation level by comparison to
the voice quality parameters of a reference speaker and presented back
to the users. They can also directly listen to the voice of a reference
speaker and try to repeat it several times until their own
pronunciation quality improves. The system is designed in a flexible
way such that different user groups (disabled people, children, healthy
adults, …) can be supported by specific types of exercises and a
variable degree of complexity on the exercise and GUI output side. A
focus is on a client-server based implementation, where the server-side
database stores exercises as well as results. Therefore, therapists
(language teachers) can monitor the progress of their patients (pupils)
while they train by themselves at home; moreover, anybody can introduce
their own set of exercises into the community-based database and is
free to make them available to other people according to a defined
sharing policy.
Lars Hellan and Dorothee Beermann.
A multilingual valence database for less resourced
languages
Abstract:
We hypothesize that a multilingual aligned valence database can be
useful in the development of language resources for less resourced
languages (LRL), and present aspects of a methodology that can
accomplish the construction of such a database. We define basic
desiderata for the content of the database, we describe an implemented
example of a database along with a web-demo which illustrates the
intended functionalities, and we mention a collection of tools and
resources which can be orchestrated to help in the construction of a
more encompassing database for LRLs following the same design.
Moses Ekpenyong and Ememobong Udoh.
Intelligent Prosody Modelling: A Framework for Tone
Language Synthesis
Abstract:
The production of high quality synthetic voices largely depends on the
accuracy of the language’s prosodic model. We present in this paper an
intelligent framework for modelling prosody in tone language
Text-to-Speech (TTS). The proposed framework is fuzzy logic-based
(FL-B) and is adopted to offer a flexible, human-reasoning approach to
the imprecise and complex nature of prosody prediction.
Preliminary results obtained from modelling the tonal aspect of Ibibio
(ISO 639-3, Ethnologue IBB) prosody prove the feasibility of our FL-B
framework at precisely predicting the degree of certainty of
fundamental frequency (F0) contour patterns in a set of recorded and
synthesised voices. Other aspects of prosody not yet incorporated into
this design are currently being investigated.
Andrzej Jarynowski and Amir Rostami.
Reading Stockholm Riots 2013 using Internet media
Abstract:
The riots in Stockholm in May 2013 were an event that reverberated in
the world media for the scale of the violence that spread through the
Swedish capital. In this study we have investigated the role of social
media in creating media phenomena via text mining and natural language
processing. We have focused on two channels of communication for our
analysis: Twitter and Poloniainfo.se (a forum of the Polish community
in Sweden). Our preliminary results, based on counting word usage, show
some hot topics driving the discussion, related mostly to the Swedish
police and Swedish politics. Typical features of media intervention are
presented. We have built networks of the most popular phrases,
clustered by categories (geography, media institution, etc.). Sentiment
analysis shows a negative connotation associated with the police. The
aim of this preliminary exploratory quantitative study was to generate
questions and hypotheses, which we can then follow up carefully with
deeper, more qualitative methods.
Wondwossen Mulugeta, Michael Gasser and Baye Yimam.
Automatic Morpheme Slot Identification
Abstract:
We introduce an approach to the grouping of morphemes into suffix slots
in morphologically complex languages using a genetic algorithm. The
method is applied to verbs in Amharic, a morphologically rich Semitic
language. We start with a limited set of segmented verbs and the set
of suffixes themselves, extracted on the basis of our previous work.
Each member of the population for the genetic algorithm is an
assignment of the morphemes to one of the set of possible slots.
The fitness function combines scores for exact slot position and
correct ordering of morphemes. We use mutation but no crossover
operator with various combinations of population size, mutation rate,
and maximum number of generations, and populations evolve to yield
promising morpheme classification results. We evaluate the fittest
individuals on the basis of the known morpheme classes for Amharic.
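A minimal sketch of a mutation-only genetic algorithm of the kind described above; the fitness function, slot count and toy suffix data are illustrative assumptions rather than the authors' actual setup:

import random

def fitness(assignment, segmented_verbs):
    """Count suffix pairs whose assigned slots respect the order in which the
    suffixes actually appear in the segmented training verbs."""
    good = 0
    for suffixes in segmented_verbs:                  # e.g. ['-h', '-al']
        good += sum(1 for a, b in zip(suffixes, suffixes[1:])
                    if assignment[a] < assignment[b])
    return good

def evolve(suffixes, segmented_verbs, n_slots=4, pop=30, rate=0.2, gens=200):
    """Mutation-only genetic algorithm (no crossover, as in the paper):
    an individual maps each suffix to one of `n_slots` slot positions."""
    population = [{s: random.randrange(n_slots) for s in suffixes}
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda ind: fitness(ind, segmented_verbs), reverse=True)
        survivors = population[:pop // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            for s in suffixes:                        # point mutations
                if random.random() < rate:
                    child[s] = random.randrange(n_slots)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, segmented_verbs))

# Toy usage with invented suffix orderings (not real Amharic data).
data = [["-h", "-al"], ["-achch", "-al"], ["-h", "-ewa"]]
print(evolve(["-h", "-achch", "-al", "-ewa"], data))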
Marcin Karwiński.
Living Language Model Optimisation for NL Syntax-based
Search Engine
Abstract:
This paper summarises the results of a 4-year study on search engine
optimisation based on a simile between natural languages and living
organisms. Though many NLP methods have already been used in IR tasks,
language seems to be treated as immutable. Juxtaposed with the
linguists’ theory of a living language, the model used in text syntax
analyses was augmented with evolutionary methods. The theoretical
concepts of evolutionary changes were discussed and examined in
controlled environments to verify the plausibility of using the
aforementioned simile paradigm in search engines for IR quality
improvement. The TREC and OHSUMED document collections, along with a
real-life production set, were used to assess the approach. The
influences and limitations of language features on the model’s
evolutionary changes were observed.
Prashant Verma, Somnath Chandra and Swaran Lata.
Indian languages requirements in ITS
(Internationalization Tag set) 2.0
Abstract:
ITS 2.0 is a localization framework defined by the W3C to add metadata
to Web content for the benefit of localization, language technologies,
and internationalization. This paper aims to evolve the ITS 2.0
requirements with respect to Indian languages. It also defines the
variation of named entities in Indian languages, their requirements in
ITS 2.0, and suggestions for implementation in ITS 2.0. The challenges
with respect to Indian languages in the data categories of ITS 2.0, and
a possible approach to overcome the specific challenges for Indian
languages, are also addressed in the paper.
Ritesh Kumar.
Towards automatic identification of linguistic
politeness in Hindi texts
Abstract:
In this paper I present a classifier for automatic identification of
linguistic politeness in Hindi texts. I have used the manually
annotated corpus of over 25,000 blog comments to train an SVM which
classifies any Hindi text into one of the four classes – appropriate,
polite, neutral and impolite. Making use of the discursive and
interactional approaches to politeness the paper gives an exposition of
the normative, conventionalised politeness structures of Hindi. It is
seen that using these manually recognised structures as features in
training the SVM significantly improves the performance of the
classifier on the test set. The trained system gives a significantly
high accuracy of over 77% which is within 2% of the human accuracy of
around 79%
Ritesh Kumar.
Demo of the corpus of computer-mediated communication
in Hindi (CO3H)
Abstract:
Corpus of computer-mediated communication in Hindi is one of the
largest CMC corpora in any Indian language. The data in this corpus is
taken from both asynchronous and synchronous CMC. The asynchronous CMC
data is taken from:
Blogs
Web Portals
Youtube Comments
emails
The synchronous data is taken from
Public Chats
Private Chats
The corpus currently contains over 240 million words. Most of the
corpus is in both Roman and Devanagari script, since most of the
original data is in Roman script and its transliteration is also
included in the corpus. A part of the corpus is manually annotated for
politeness classes. The corpus is currently maintained in XML format.
The corpus comes with web-based interfaces for three functions -
a) Searching the Corpus
b) Browsing the Corpus
c) Annotating the Corpus [available online on
http://sanskrit.jnu.ac.in/tagit/]
There is also a Java-based API for accessing, using, modifying and also
extending the corpus.
The whole corpus is not yet available for public access. Currently
only the annotation interface is available online, and it is accessible
only through a username and password. The corpus, along with all the
interfaces for interacting with it, will, however, be released under a
Creative Commons Share Alike License by next year so that it can be
used freely for research. It will also be made available for free
download in XML format for offline use.
Daniel Hladek, Ján Staš and Jozef Juhár.
Correcting Diacritics in the Slovak Texts Using Hidden
Markov Model
Abstract:
This paper presents a fast and accurate method for correcting
diacritical markings and guessing the original meaning of a word from
its context, based on a hidden Markov model and the Viterbi algorithm.
The proposed algorithm might find use in any area where erroneous text
might appear, such as web search engines, e-mail messages, office
suites, optical character recognition, or typing on small mobile device
keyboards.
Riccardo Del Gratta and Francesca Frontini.
Linking the Geonames ontology to WordNet
Abstract:
This paper illustrates the transformation of the GeoNames ontology
concepts, with their English labels and glosses, into a GeoDomain
WordNet-like resource in English, its translation into Italian, and its
linking to the existing generic WordNets of both languages. The paper
describes the criteria used for linking the domain synsets to each
other and to the generic ones, and presents the published resource in
RDF according to W3C standards and the lemon schema.
Angelina Gašpar.
Multiterm Database Quality Assessment
Abstract:
Terminology and translation consistency are the main objectives
expected to be met in the area of legal translation, which is known to
be linguistically characterized by recurrent and standard expressions,
precision and clarity. This paper describes a linguistic and
statistical approach to MultiTerm database quality assessment.
Contrastive analysis reveals the existence of terminological variants
and hence a lack of terminological consistency, which is taken as a
measure of quality and is calculated with the Herfindahl–Hirschman
Index. Database quality is verified through quality assessment
statistics obtained with SDL MultiTerm Extract and through comparison
with the reference language resources. This research raises concerns
regarding the content, application, benefits and drawbacks of the
reference language resources, and highlights the necessity for
Computer-Assisted Translation (CAT) tools and the sharing of reliable
and standardized linguistic resources.
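The Herfindahl-Hirschman Index used as the consistency measure is simply the sum of squared usage shares; a short sketch with invented variant counts:

def hhi(variant_counts):
    """Herfindahl-Hirschman Index over the usage shares of competing term
    variants for one concept: 1.0 means a single variant is used consistently,
    values near 1/n mean the n variants are used interchangeably."""
    total = sum(variant_counts)
    return sum((c / total) ** 2 for c in variant_counts)

# Toy usage: one concept rendered by three competing target-language variants.
print(hhi([80, 15, 5]))   # fairly consistent (~0.665)
print(hhi([34, 33, 33]))  # inconsistent (~0.334)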
Arseniy Gorin and Denis Jouvet.
Efficient constrained parametrization of GMM with
class-based mixture weights for Automatic Speech Recognition
Abstract:
Acoustic modeling techniques, based on clustering of the training data,
have become essential in large vocabulary continuous speech recognition
(LVCSR) systems. Clustered data (supervised or unsupervised) is
typically used to estimate the sets of parameters by adapting the
speaker-independent model on each subset. For Hidden Markov Models with
Gaussian mixture observation densities (HMM-GMM), most of the
adaptation techniques focus on re-estimating the mean vectors, whereas
the mixture weights are typically distributed almost uniformly.
In this work we propose a way of specifying the subspaces of the GMM by
associating the sets of Gaussian mixture weights with the speaker
classes and sharing the Gaussian parameters across speaker classes. The
method allows us to better parametrize the GMMs without significantly
increasing the number of model parameters. Our experiments on French
radio broadcast data demonstrate an improvement in accuracy with such a
parametrization compared to models with a similar, or even larger,
number of parameters.
Arianna Bisazza and Roberto Gretter.
Building an Arabic news transcription system with
web-crawled resources
Abstract:
This paper describes our efforts to build an Arabic ASR system with
web-crawled resources. We first describe the processing done to
handle Arabic text in general, and more particularly to cope with the
high number of different phonetic transcriptions associated with a
typical Arabic word. Then, we present our experiments to build acoustic
models using only audio data found on the web, in particular on the
Euronews portal. To transcribe the downloaded audio we compare
two approaches: the first uses a baseline trained on manually
transcribed Arabic corpora, while the second uses a universal ASR
system trained on automatically transcribed speech data of 8 languages
(not including Arabic). We demonstrate that with this approach we are
able to obtain recognition performances comparable to the ones
obtained with a fully supervised Arabic baseline.
Cvetana Krstev, Anđelka Zečević, Dusko Vitas and Tita Kyriacopoulou.
NERosetta – an Insight into Named Entity Tagging
Abstract:
Named Entity Recognition has been a hot topic in Natural Language
Processing for more than fifteen years. A number of systems for various
languages were developed using different approaches and based on
different named entity schemes and tagging strategies. We present
NERosetta, a web application that can be used to compare these various
approaches applied to aligned texts (bitexts). In order to
illustrate its functionalities we have used one literary text, its 5
bitexts involving 5 languages and 5 different NER systems. We present
some preliminary results and give guidelines for further development.
Abdualfatah Gendila and Abduelbaset Goweder.
The Pseudo Relevance Feedback for Expanding Arabic
Queries
Abstract:
With the explosive growth of the World Wide Web, Information Retrieval
Systems (IRS) have recently become a focus of research. Query expansion
is defined as the process of supplementing additional terms or phrases
to the original query to improve the information retrieval performance.
Arabic is a highly inflectional and derivational language, which makes
the query expansion process a hard task.
In this paper, the well-known Pseudo Relevance Feedback (PRF) approach
is adopted and applied to Arabic. Prior to applying PRF, the datasets
(three collections of Arabic documents) are pre-processed to create the
documents' inverted index vocabularies, and then the normal indexing
process is carried out. PRF is applied to create a modified (expanded)
version of the original query, and the target collection is indexed
once more. To judge the enhancement of the retrieval process, the
results of normal indexing and those of applying PRF are evaluated
against each other using precision and recall measures. The results
show that the PRF method significantly enhances the performance of the
Arabic Information Retrieval (AIR) system. Performance improves as the
number of expansion terms increases, up to a certain limit (35 terms);
beyond this limit, performance is unaffected or grows only
insignificantly.
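A minimal sketch of the pseudo relevance feedback step described above; the tf-idf term scoring and the toy documents are illustrative assumptions, only the 35-term cap comes from the abstract:

import math
from collections import Counter

def expand_query(query_terms, top_docs, all_docs, n_expansion=35):
    """Pseudo Relevance Feedback: treat the top-ranked documents as relevant,
    rank their terms by a simple tf-idf score, and append the best terms
    (up to `n_expansion`, the limit found effective in the paper) to the
    original query."""
    tf = Counter(term for doc in top_docs for term in doc)
    df = Counter(term for doc in all_docs for term in set(doc))
    scores = {t: tf[t] * math.log(len(all_docs) / df[t])
              for t in tf if t not in query_terms}
    expansion = sorted(scores, key=scores.get, reverse=True)[:n_expansion]
    return list(query_terms) + expansion

# Toy usage with tokenised pseudo-documents.
docs = [["economy", "growth", "bank"], ["bank", "loan"], ["sports", "match"]]
print(expand_query(["bank"], top_docs=docs[:2], all_docs=docs, n_expansion=2))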
Fadoua Ataa Allah and Jamal Frain.
Amazigh Converter based on WordprocessingML
Abstract:
Since the creation of the Royal Institute of Amazigh Culture, the
Amazigh language has been undergoing a process of standardization and
integration into information and communication technologies. This
process passes through several stages: after the stabilization of the
writing system, the encoding stage and the development of appropriate
standards for the keyboard layout, the computational linguistics stage
is now under way. Thus, with the aim of preserving the Amazigh cultural
heritage, many converters allowing the Tifinaghe ANSI-Unicode
transition and Arabic-Latin-Tifinaghe transliteration have been
developed. However, these converters cannot preserve the file layout or
process all parts of a document. To overcome these limitations, the new
WordprocessingML technology has been used.
Wei Yang, Hao Wang and Yves Lepage.
Automatic Acquisition of Rewriting Models for the
Generation of Quasi-parallel corpus
Abstract:
Bilingual or multilingual parallel corpora are an extremely important
resource as they are typically used in data-driven machine translation
systems. They make it possible to improve machine translation systems
continuously. There exist many freely available bilingual or
multilingual parallel corpora for language pairs that involve English,
especially for European languages. However, the constitution of large
collections of aligned sentences is a problem for less documented
language pairs, such as Chinese and Japanese. In this paper, we show
how to construct a free Chinese-Japanese quasi-parallel corpus by using
analogical associations based on short sentential resources collected
from the Web. We generate numerous new candidate sentences by analogy
and filter them with the attested N-sequences method to enforce fluency of
expression and adequacy of meaning. Finally, we construct a
Chinese-Japanese quasi-parallel corpus by computing similarities.
Satoshi Fukuda, Hidetsugu Nanba, Toshiyuki Takezawa and Akiko Aizawa.
Classification of Research Papers Focusing on
Elemental Technologies and Their Effects
Abstract:
We propose a method for the automatic classification of research papers
in the CiNii article database in terms of the KAKEN classification
index. This index was originally devised for classifying reports for
the KAKEN research fund in Japan, and it comprises three hierarchical
levels: Area, Discipline, and Research Field.
Traditionally, research papers have been classified using machine
learning algorithms, using the content words in each research paper as
features. In addition to these content words, we focus on elemental
technologies and their effects, as discussed in each research paper.
Examining the use of elemental technology terms used in each research
paper and their effects is important for characterizing the research
field to which a given research paper belongs. To investigate the
effectiveness of our method, we conducted an experiment using KAKEN
data. From the results, we obtained average recall scores of 0.6220,
0.7205, and 0.8530 at the Research Field, Discipline, and Area levels,
respectively.
Roman Grundkiewicz.
ErrAno: a Tool for Semi-Automatic Annotation of
Language Errors
Abstract:
Error corpora containing annotated naturally-occurring errors are
desirable resources for the development of automatic spelling and
grammar correction techniques. Studies on creating error corpora often
concern error tagging systems, but rarely the computer tools aimed at
improving manual error annotation, which is a very tedious and costly
task. In this paper we describe the development of the ErrAno tool for
semi-automatic annotation of language errors in texts composed in
Polish. The annotator's work is supported on several levels: by
providing access to the text edition history, by assisting the error
detection process, and by proposing attributes that describe the errors
found. We present in detail the rule-based system for automatic
attribute assignment, which contributes to a considerable speed-up of
the annotation process. We also propose a new error taxonomy suitable
from the point of view of automatic text correction, together with a
clear multi-level tagging model.
Abraham Hailu and Yaregal Assabie.
Itemsets-Based Amharic Document Categorization Using
an Extended A Priori Algorithm
Abstract:
Document categorization is gaining importance due to the large volume
of electronic information which requires automatic organization and
pattern identification. Due to the morphological complexity of the
language, automatic categorization of Amharic documents has become a
difficult task to carry out. This paper presents a system that
categorizes Amharic documents based on the frequency of itemsets
obtained after analyzing the morphology of the language. We selected
seven categories into which a given document is to be classified. The
task of categorization is achieved by employing an extended version of
the a priori algorithm, which has traditionally been used for the
purpose of knowledge mining in the form of association rules. The
system is tested
with a corpus containing Amharic news documents and experimental
results are reported.
Yoshimi Suzuki and Fumiyo Fukumoto.
Newspaper Article Classification using Topic Sentence
Detection and Integration of Classifiers
Abstract:
This paper presents a method for newspaper article classification
using topic sentence detection. In particular, we focus on topic
sentences and detect them using Support Vector Machines (SVMs) based on
features related to the topic of the articles. Then we classify
articles using term weighting of the words that appear in the detected
sentences. With this method, classification accuracy becomes higher
than for text classification based on word frequency alone.
We also integrate the classification results of the five classifiers.
To evaluate the proposed method, we performed text classification
experiments using Japanese newspaper articles. The classification
results were compared with the results of conventional classifiers. The
experiments showed that the proposed method is effective for newspaper
article classification.
Lidija Tepeš Golubić, Damir Boras and Nives Mikelić Preradović.
Semi-automatic detection of germanisms in Croatian
newspaper texts
Abstract:
In this paper we aimed to discover to what extent Germanisms
participate in the language of Croatian daily newspapers (more
precisely, newspaper headlines). In order to determine Germanisms in
the Croatian language, we used digitized copies of a daily newspaper,
Večernji list. According to our results, 114 Germanisms were found in
217 headlines. Compared to the 3088 headlines of the one-month corpus,
whose total number of words was 21,554, Germanisms appear in only
1.006% of cases.
Rajeev R R, Jisha P Jayan and Elizhabeth Sherly.
Interlingua Data Structure for Malayalam
Abstract:
The complexity of the Malayalam language poses special challenges to
computational natural language processing. The rich morphology, with
highly agglutinative word formation and inflection, the strong
grammatical behaviour of the language, and the formation of words with
multiple suffixes, all make research in Malayalam language technology
more challenging. In machine translation, the transfer grammar
component is essential, as it acts as a bridge between the source
language and the target language. A grammar of a language is a set of
rules which says how parts of speech can be put together to make
grammatical, or 'well-formed', sentences. This paper describes a hybrid
architecture for building a transfer grammar data structure for
Malayalam, by identifying the subject in a sentence with its syntactic
and semantic representation while tagging.
Łukasz Degórski.
Fine-tuning Chinese Whispers algorithm for a Slavonic
language POS tagging task and its evaluation
Abstract:
Chris Biemann's robust Chinese Whispers graph clustering algorithm,
working in the Structure Discovery paradigm, has been proven to perform
well enough to be used in many applications for Germanic languages. The
article presents its application to a Slavonic language (Polish),
focusing on fine-tuning the parameters and finding an evaluation method
for a POS tagging application aimed at obtaining a very small
(coarse-grained) tagset.
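For reference, the basic Chinese Whispers procedure is only a few lines; the toy word graph below is invented and the fine-tuning discussed in the paper is not reproduced:

import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20):
    """Basic Chinese Whispers graph clustering: each node starts in its own
    class and, visiting nodes in random order, adopts the class with the
    highest summed edge weight among its neighbours."""
    labels = {n: n for n in nodes}
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))
    for _ in range(iterations):
        order = list(nodes)
        random.shuffle(order)
        for n in order:
            if not neighbours[n]:
                continue
            weight = defaultdict(float)
            for m, w in neighbours[n]:
                weight[labels[m]] += w
            labels[n] = max(weight, key=weight.get)
    return labels

# Toy word graph: two loose clusters of distributionally similar words.
nodes = ["dog", "cat", "horse", "run", "walk"]
edges = [("dog", "cat", 3), ("cat", "horse", 2), ("dog", "horse", 2),
         ("run", "walk", 3), ("horse", "run", 0.5)]
print(chinese_whispers(nodes, edges))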
Mohamed Elmahdy, Mark Hasegawa-Johnson and Eiman Mustafawi.
A Transfer Learning Approach for Under-Resourced
Arabic Dialects Speech Recognition
Abstract:
A major problem with dialectal Arabic speech recognition is the
sparsity of speech resources. In this paper, we propose a transfer
learning framework to jointly use a large amount of Modern Standard
Arabic (MSA) data and a small amount of dialectal Arabic data to improve
acoustic and language modeling. We have chosen the Qatari Arabic (QA)
dialect as a typical example for an under-resourced Arabic dialect. A
wide-band speech corpus has been collected and transcribed from several
Qatari TV series and talk-show programs. A large vocabulary speech
recognition baseline system was built using the QA corpus. The proposed
MSA-based transfer learning technique was performed by applying
orthographic normalization, phone mapping, data pooling, and acoustic
model adaptation. The proposed approach achieves more than 28% relative
reduction in WER.
Mohamed Elmahdy, Mark Hasegawa-Johnson and Eiman Mustafawi.
A Framework for Conversational Arabic Speech Long
Audio Alignment
Abstract:
We propose a framework for long audio alignment for conversational
Arabic speech. Accurate alignments help in many speech processing
tasks such as audio indexing, speech recognizer acoustic model (AM)
training, audio summarizing and retrieving, etc. In this work, we have
collected more than 1400 hours of conversational Arabic besides the
corresponding non-aligned text transcriptions. For each episode,
automatic segmentation is applied using a split and merge approach. A
biased language model (LM) is trained using the corresponding text
after a pre-processing stage. A graphemic pronunciation model (PM) is
utilized because of the dominance of non-standard Arabic in
conversational speech. Unsupervised acoustic model adaptation is
applied to a generic standard Arabic AM. The adapted AM, along with the
biased LM and the graphemic PM, is used in a fast speech recognition
pass applied to the current podcast's segments. The recognizer output
is aligned with the processed transcriptions using the Levenshtein
distance algorithm. The proposed approach resulted in an alignment
accuracy of
97% on the evaluation set. A confidence scoring metric is proposed to
accept/reject aligner output. Using confidence scores, it was possible
to reject the majority of mis-aligned segments resulting in 99%
alignment accuracy.
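A minimal sketch of the word-level Levenshtein alignment used in the final step; the confidence-scoring metric of the paper is not reproduced and the example sequences are invented:

def align(hyp, ref):
    """Word-level Levenshtein alignment with backtrace, returning pairs of
    (hypothesis word or None, reference word or None)."""
    n, m = len(hyp), len(ref)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # match/substitution
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append((hyp[i - 1], ref[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None)); i -= 1
        else:
            pairs.append((None, ref[j - 1])); j -= 1
    return pairs[::-1]

# Toy usage: ASR output words versus reference transcription words.
print(align("the cat sat".split(), "the black cat sat".split()))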
Milan Rusko, Jozef Juhar, Marian Trnka, Jan Stas, Sakhia Darjaa, Daniel Hladek, Robert Sabo, Matus Pleva, Marian Ritomsky and Stanislav Ondas.
Recent Advances in the Slovak Dictation System for
Judicial Domain
Abstract:
The acceptance of speech recognition technology depends on user
friendly applications evaluated by professionals in the target field.
This paper describes the evaluation and recent advances in application
of speech recognition for the judicial domain. The evaluated dictation
system enables Slovak speech recognition through a plugin for a widely
used office word processor, and it was introduced recently after the
first evaluation in the Slovak courts. The system was improved
significantly using more acoustic databases for testing and acoustic
modeling. The textual language resources were extended in the meantime
and the language modeling techniques were improved, as described in the
paper. An end-user questionnaire on the user interface was also
evaluated and new functionalities were introduced in the final version.
According to the available feedback, it can be concluded that the final
dictation system can speed up court proceedings significantly for
experienced users willing to cooperate with this new technology
(acoustic model adaptation, insertion of proper names into documents,
etc.).
Imen Elleuch, Bilel Gargouri and Abdelmajid Ben Hamadou.
Syntactic enrichment of Arabic dictionaries normalized
LMF using corpora
Abstract:
In this paper, we deal with the representation of syntactic knowledge,
particularly syntactic behavior of Arabic verbs. In this context, we
propose an approach to identify the syntactic behavior from corpora in
order to enrich the syntactic extension of an LMF normalized Arabic
dictionary. Our approach is composed of the following steps: (i)
Identification of syntactic patterns, (ii) Construction of a grammar
suitable for each syntactic pattern, (iii) Application of the grammar
on corpora and (iv) Enrichment of the LMF Arabic dictionary. To
validate this approach, we carried out an experiment that focused on
the syntactic behavior of Arabic verbs. We used the NOOJ linguistic
platform, an Arabic corpus containing about 20500 vowelized words and
an LMF Arabic dictionary, available in our Laboratory, that contains
37000 entries and 10700 verbs. The results obtained for more than 5000
treated verbs show 76% precision and 83% recall.
Patrizia Grifoni, Maria Chiara Caschera, Arianna D'Ulizia and Fernando Ferri.
Dynamic Building of Multimodal Corpora Model
Abstract:
Designing a multimodal interaction environment, combining information
from different modalities, and providing correct interpretations of the
user's input are crucial issues in defining effective communication.
During both the information combination and the interpretation phases,
the use of corpora of multimodal sentences is very important because
they allow the integration of properties and linguistic knowledge which
are not formalised in the grammar. This paper presents the dynamic
generation of a multimodal corpus by example, which allows
human-computer dialogue to be improved because it consists of examples
that are able to generate grammatical rules and to train the model for
correctly interpreting the user's input.
Munshi Asadullah, Patrick Paroubek and Anne Vilnat
.
Converting from the French Treebank Dependencies into
PASSAGE syntactic annotations
Abstract:
We present here a converter for transforming the French Treebank
Dependency (FTB-DEP) annotations into the PASSAGE format. FTB-DEP is
the representation used by several freely available parsers and the
PASSAGE annotation was used to hand-annotate a relatively large sized
corpus, used as gold-standard in the PASSAGE evaluation campaigns. Our
converter will give the means to evaluate these parsers on the PASSAGE
corpus. We shall illustrate the mapping of important syntactic
phenomena using the corpus made of the examples of the FTB-DEP
annotation guidelines, which we have hand-annotated with PASSAGE
annotations and used to compute quantitative performance measures on
the FTB-DEP guidelines.
Quentin Pradet, Gaël de Chalendar and Guilhem Pujol
.
Revisiting knowledge-based semantic role labeling
Abstract:
Semantic role labeling has seen tremendous progress in recent years,
both for supervised and unsupervised approaches. Knowledge-based
approaches have been neglected even though they have been shown to bring
the best results in the related word sense disambiguation task. We
contribute a simple knowledge-based approach with an easy-to-reproduce
specification. We also present a novel approach to handling the passive
voice in the context of semantic role labeling that reduces the error
rate in F1 by 15.7%, showing that significant improvements can be
brought to knowledge-based approaches while retaining their key
advantages: a simple approach which facilitates the analysis of
individual errors, does not need any hand-annotated corpora and is not
domain-specific.
Abeba Ibrahim and Yaregal Assabie
.
Hierarchical Amharic Base Phrase Chunking Using HMM
With Error Pruning
Abstract:
Segmentation of a text into non-overlapping syntactic units (chunks)
has become an essential component of many applications of natural
language processing. This paper presents an Amharic base phrase chunker
that groups syntactically correlated words at different levels using an
HMM. Rules are used to correct phrases incorrectly chunked by the
HMM. For the identification of phrase boundaries, the IOB2 chunk
specification is selected and used in this work. To test the
performance of the system, a corpus was collected from Amharic news
outlets and books. The training and testing datasets were prepared
using the 10-fold cross-validation technique. Test results on the
corpus showed an average accuracy of 85.31% before applying the rules
for error correction and an average accuracy of 93.75% after applying
them.
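A small Python sketch of the IOB2 chunk specification mentioned above, decoding tag sequences into chunks; the tags and tokens are invented and the HMM itself is not shown:

def iob2_chunks(tokens, tags):
    """Return (chunk_type, chunk_tokens) pairs decoded from IOB2 tags."""
    chunks, start, kind = [], None, None
    for i, tag in enumerate(tags):
        # A chunk starts at B-X, or at I-X whose type differs from the open chunk.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and kind != tag[2:])
        if starts_new or tag == "O":
            if kind is not None:
                chunks.append((kind, tokens[start:i]))
            kind, start = (tag[2:], i) if tag != "O" else (None, None)
    if kind is not None:
        chunks.append((kind, tokens[start:]))
    return chunks

if __name__ == "__main__":
    toks = ["the", "big", "dog", "barked"]       # toy tokens
    tags = ["B-NP", "I-NP", "I-NP", "B-VP"]      # toy IOB2 tags
    print(iob2_chunks(toks, tags))               # [('NP', [...]), ('VP', [...])]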
Tugba Yildiz, Banu Diri and Savas Yildirim
.
Analysis of Lexico-syntactic Patterns for Meronym
Extraction from a Turkish Corpus
Abstract:
In this paper, we apply lexico-syntactic patterns to extract the
meronymy relation from a huge corpus of Turkish. Once the system takes
a huge raw corpus and extracts the matched cases for a given pattern, it
proposes a list of whole-part pairs ranked by their co-occurrence
frequency. For this purpose, we exploited and compared a list of pattern
clusters. The clusters fall into three types: general patterns,
dictionary-based patterns, and bootstrapped patterns. We examined how
these patterns improve the system performance, especially within a
corpus-based approach using distributional features of words. Finally, we
discuss all the experiments in a comparative analysis and show the
advantages and disadvantages of the approaches, with promising results.
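A toy Python sketch of the general recipe of ranking whole-part candidates by pattern-match frequency; the English-like regular expression is a stand-in, not one of the Turkish lexico-syntactic patterns used in the paper:

import re
from collections import Counter

# Stand-in pattern, e.g. "the wheel of the car" -> (part="wheel", whole="car").
PATTERN = re.compile(r"the (\w+) of the (\w+)")

def extract_pairs(corpus_lines):
    """Count whole-part candidates proposed by the pattern."""
    counts = Counter()
    for line in corpus_lines:
        for part, whole in PATTERN.findall(line.lower()):
            counts[(whole, part)] += 1
    return counts

if __name__ == "__main__":
    corpus = [
        "The wheel of the car was damaged.",
        "He painted the door of the house.",
        "The wheel of the car spun freely.",
    ]
    for (whole, part), freq in extract_pairs(corpus).most_common():
        print(f"{whole} -> {part}: {freq}")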
Sobha Lalitha Devi.
Word Boundary Identifier as a Catalyzer and
Performance Booster for Tamil Morphological Analyzer
Abstract:
In this paper we present an effective and novel approach for the
morphological analysis of Tamil using word-boundary identification
process. Our main aim is to primarily segment the agglutinated and
compound words into a set of simpler words by identifying the
constituent words' word-boundaries. Conditional Random Fields, a
machine learning technique, is used for word-boundary detection, while
the morphological analysis is performed by a rule-based morphological
analyzer developed using finite-state automata and paradigm-based
approaches. The whole process is essentially a three-step architecture.
Initially we use CRFs for word-boundary identification and
segmentation. Secondly, we perform Sandhi correction using word-level
contextual linguistic rules. Finally, the resultant words are analyzed
by the morphological analyzer engine. The main advantage of the
proposed architecture is that it completely avoids the failure of word
analysis due to compounding, agglutination, the lack of complex
orthographic rules and dialectal variation of words. On testing
the efficiency of this approach with randomly collected online data
of 64K words, the analyzer achieved an appreciable accuracy of 98.5%
while maintaining an analysis time of at most 3 ms for any word.
Pinkey Nainwani, Esha Banerjee and Shiv Kaushik
.
Implementing a Rule-Based Approach for Agreement
Checking in Hindi
Abstract:
The aim of this paper is to describe the issues and challenges while
handling the agreement features of Hindi from the perspective of
natural language processing in order to build a robust grammar checker
which can be scaled to other Indian languages. Grammar is an integral
part of any language, where each element of the language depends on the
others to generate the correct output; hence the notion of "agreement".
In Hindi, intra-sentential agreement happens at
various levels: a) demonstrative and noun, b) adjective and noun, c)
noun and verb, d) noun and other particles, and e) verb and other
particles. The approach here is to develop a grammar checker with
rule-based methods, which have been proven to provide qualitative results
for Indian languages, as they are inflecting in nature and have relatively
free word order. At present, the work is limited to syntactic
structures and does not deal with semantic aspects of the language.
Friedrich Neubarth, Barry Haddow, Adolfo Hernández Huerta and Harald Trost
.
A hybrid approach to statistical machine translation
between standard and dialectal varieties
Abstract:
Using statistical machine translation (SMT) for dialectal varieties
usually suffers from data sparsity, but combining word-level and
character-level models can yield good results even with small training
data by exploiting the relative proximity between the two varieties. In
this paper, we describe a specific problem and its solution, arising
with the translation between standard Austrian German and Viennese
dialect. In a phrase-based approach to SMT, complex lexical
transformations and syntactic reordering cannot be dealt with. These
are typical cases where rule-based preprocessing of the source data is
the preferable option, hence the hybrid character of the resulting
system. One such case is the transformation of imperfect verb
forms into perfect tense, which involves detection of clause boundaries
and identification of clause type. We present an approach that utilizes
a full parse of the source sentences and discuss the problems that
arise with such an approach. Within the developed SMT system, the
models trained on preprocessed data unsurprisingly fare better than
those trained on the original data, but also unchanged sentences gain
slightly better scores. This shows that including a rule-based layer
dealing with systematic non-local transformations increases the overall
performance of the system, most probably due to a higher accuracy in
the alignment.
Aleksander Pohl and Bartosz Ziółko
.
A Comparison of Polish Taggers
in the Application for Automatic Speech Recognition
Abstract:
In this paper we investigate the performance of Polish taggers in the
context of automatic speech recognition (ASR). We use a morphosyntactic
language model to improve speech recognition in an ASR system and seek
the best Polish tagger for our needs. Polish is an inflectional
language, and an n-gram model using morphosyntactic features, which
reduces data sparsity, seems to be a good choice. We investigate the
differences between the morphosyntactic taggers in that context. We
compare the results of tagging with respect to the reduction of word
error rate as well as the speed of tagging. As it turns out, at present the
taggers using conditional random field (CRF) models perform best
in the context of ASR. A broader audience might also be interested in
the other features of the taggers discussed here, such as ease of
installation and usage, which are usually not covered in the papers
describing such systems.
Pawel Dybala, Rafal Rzepka, Kenji Araki and Kohichi Sayama
.
Detecting false metaphors in Japanese
Abstract:
In this paper we propose a method to automatically distinguish between two
types of formally identical expressions in Japanese: metaphorical
similes and metonymical comparisons. An expression like "Kujira no you na
chiisai me" can be translated into English as "Eye small as whale's",
while in Japanese, due to the lack of a possessive case, it literally
reads as "Eye small as whale" (no apostrophe). This makes it
impossible to formally distinguish between expressions like this and
actual metaphorical similes, as both use the same template. In this
work we present a system able to distinguish between these two types of
expressions. The system takes Japanese expressions of simile-like forms
as input and uses the Internet to check possessive relations between
the elements constituting the expression. We propose a method of
calculating a score based on the co-occurrence of source and target pairs
in Google (e.g. "whale's eye"). An experimentally set threshold allowed
the system to distinguish between metaphors and non-metaphors with an
accuracy of 75%. We discuss the results and give some ideas for
future work.
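An illustrative Python sketch of the thresholding idea described above; the counts and the scoring formula are made-up stand-ins for the web-derived co-occurrence counts used by the authors:

def possessive_score(source, target, counts):
    """Normalised co-occurrence of the possessive pair, e.g. ("whale", "eye")."""
    pair = counts.get((source, target), 0)
    return pair / (counts.get(source, 1) * counts.get(target, 1)) ** 0.5

def is_metaphorical(source, target, counts, threshold=1e-4):
    """Low possessive co-occurrence -> treat the simile-like form as a metaphor."""
    return possessive_score(source, target, counts) < threshold

if __name__ == "__main__":
    # Toy counts standing in for search-engine hit counts.
    toy_counts = {("whale", "eye"): 120, "whale": 50000, "eye": 900000,
                  ("rock", "patience"): 0, "rock": 400000, "patience": 200000}
    print(is_metaphorical("whale", "eye", toy_counts))       # False: literal possessive
    print(is_metaphorical("rock", "patience", toy_counts))   # True: metaphorical simile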
Akanksha Bansal, Esha Banerjee and Girish Nath Jha
.
Corpora creation for Indian Language Technologies –
the ILCI project
Abstract:
This paper presents an overview of corpus classification and development in
electronic format for 16 language pairs, with Hindi as the source
language. In a multi-lingual country like India, the major thrust in
language technology lies in providing inter-communication services and
direct information access in one’s own language. As a result, language
technology in India has seen major developments over the last decade in
terms of machine translation and speech synthesis systems. As deeper
research advances, the need for high-quality standardised corpora is
seen as a primary challenge. To address these needs, the
government of India has initiated a mega project called the Indian
Languages Corpora Initiative (ILCI) to collect parallel annotated
corpora in 17 scheduled languages of the Indian constitution. The
project is currently in its second phase, within which it aims to
collect 850,000 parallel annotated sentences in 17 Indian languages in
the domains of Entertainment and Agriculture. Together with the
600,000 parallel sentences collected in Phase 1 in the domains of
Health and Tourism (Choudhary & Jha, 2011), the corpus being
developed is one of the largest known parallel annotated corpora for
any Indian language to date. This phase will ultimately also see the
development of chunking standards for processing the annotated corpus.
Kumar Nripendra Pathak and Girish Nath Jha
.
A Generic Search for Heritage Texts
Abstract:
At present, quick access to any information is desired in every field
with the help of IT research and development. Keeping this in mind,
this paper deals with the ongoing research on linguistic resources and
a generic search in the Special Centre for Sanskrit Studies, Jawaharlal
Nehru University, New Delhi, India. The system is based on lexical
resource databases with idiosyncratic structures accessed by
configuration files. The search results can be converted into a number
of Indian and Roman scripts using a converter. The paper will present
the system (http://sanskrit.jnu.ac.in/) and how it can be used to
quickly create searchable resources for heritage or other texts.
Vijay Sundar Ram and Sobha Lalitha Devi
.
Pronominal Resolution in Tamil using Tree CRFs
Abstract:
We describe our work on pronominal resolution in Tamil using Tree CRFs.
Pronominal resolution is the task of identifying the referent of a
pronominal. In this work we have studied third person pronouns in Tamil
such as ‘avan’, ‘aval’, ‘athu’ and ‘avar’ (he, she, it and they,
respectively). Tamil is a Dravidian language that is morphologically
rich and highly agglutinative. The features for learning are
developed using the morphological features of the language and
the dependency-parsed output. By doing this, we can learn the features
used in the salience factor approach and the constraints mentioned in
the structural analysis of anaphora resolution. The work is carried out
on tourism domain data from the web. We have obtained 70.8% precision
and 66.5% recall. The results are encouraging.
Paweł Skórzewski.
Gobio and PSI-Toolkit: Adapting a deep parser to an
NLP toolkit
Abstract:
The paper shows an example of how an existing stand-alone linguistic tool
may be adapted to an NLP toolkit that operates on different data
structures and a different POS tagset. PSI-Toolkit is an open-source
set of natural language processing tools. One of its main features is
the possibility of incorporating various independent processors. Gobio
is a deep natural language parser used in the Translatica
machine translation system. The paper describes the process of adapting
Gobio to PSI-Toolkit, namely the conversion of Gobio’s data structures
into the PSI-Toolkit lattice and the opening of Gobio’s rule files for
editing by a PSI-Toolkit user. The paper also covers the technical issues
of substituting Gobio’s tokenizer and lemmatizer with processors used in
PSI-Toolkit, e.g. Morfologik.
Jerid Francom and Mans Hulden
.
Diacritic error detection and restoration via
part-of-speech tags
Abstract:
In this paper we address the problem of diacritic error detection and
restoration—the task of identifying and correcting missing accents in
text. In particular, we evaluate the performance of a simple
part-of-speech tagger-based technique, comparing it to other
well-established methods for error detection/restoration: unigram
frequency, decision lists and grapheme-based approaches. In languages
such as Spanish, the current focus, diacritics play a key role in
disambiguation, and the results show that a straightforward modification to
an n-gram tagger can be used to achieve good performance in diacritic
error identification without resorting to any specialized machinery.
Our method should be applicable to any language where diacritics
distribute comparably and perform similar roles of disambiguation.
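For comparison, a minimal Python sketch of the unigram-frequency baseline mentioned above (not the tagger-based method itself); the training tokens are invented:

import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    """Remove combining accent marks, e.g. 'está' -> 'esta'."""
    norm = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in norm if not unicodedata.combining(ch))

def build_restoration_table(training_tokens):
    """Map each bare form to its most frequent accented variant."""
    table = defaultdict(Counter)
    for tok in training_tokens:
        table[strip_diacritics(tok)][tok] += 1
    return {bare: variants.most_common(1)[0][0] for bare, variants in table.items()}

def restore(tokens, table):
    return [table.get(strip_diacritics(t), t) for t in tokens]

if __name__ == "__main__":
    train = ["está", "esta", "está", "canción", "cancion", "canción"]  # toy data
    table = build_restoration_table(train)
    print(restore(["esta", "cancion", "nueva"], table))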
Kadri Muischnek, Kaili Müürisep and Tiina Puolakainen
.
Estonian Particle Verbs And Their Syntactic Analysis
Abstract:
This article investigates the role of particle verbs in Estonian
computational syntax. Estonian is a heavily inflecting language with
free word order. The authors look at two problems related to
particle verbs in the syntactic processing of Estonian: first, recognizing
these multi-word expressions in texts and, second, taking into account
the valency patterns of the particle verbs compared to the valency
patterns of the respective simplex verbs. A lexicon-based and rule-based
strategy in the Constraint Grammar framework is used for recognizing
particle verbs and for exploiting the knowledge of their valency patterns
in the subsequent dependency analysis.
Piotr Malak.
Information searching over Cultural Heritage objects,
and press news
Abstract:
This paper presents results and conclusions from a subtask of the CHiC
(Cultural Heritage in CLEF) 2013 campaign and from the realization of a
Sciex-NMS grant. Within those projects, press news as well as cultural
heritage (CH) object descriptions have been processed for IR,
and particularly for information searching. Problematic issues that
occurred during the automatic text processing of press news are
discussed. The paper presents an analysis of the results and comments on
the IR techniques used and their adequacy for information retrieval for Polish.
Filip Graliński.
Polish digital libraries as a text corpus
Abstract:
A large (71 GB), diachronic Polish text corpus extracted from digital
libraries is described in this paper. The corpus is of interest to
linguists, historians, sociologists as well as to NLP practitioners.
The sources of noise in the corpus are described and assessed.
Elena Yagunova, Anna Savina, Anna Chizhik and Ekaterina Pronoza
.
Linguistic and technical constituents of the
interdisciplinary domain "Intellectual technologies and computational
linguistics"
Abstract:
We aim to determine the specific features of the interdisciplinary
domain “Intellectual technologies and computational linguistics”
(IT&CL). Our objective is to identify keyword features as the most
informative structural elements describing the scientific domain of
the corpus. The data for investigating this domain are four corpora
based on the proceedings of the four most representative Russian
international conferences. The goal of this research is to determine
subdomain terminology, linguistic and technical constituents, and so on.
We assume that each corpus represents its own subdomain of the IT&CL
topic: the Dialog and CL corpora represent more linguistic topics, while
the CAI and RCDL corpora represent more technical topics. There are two
keyword sets (KS): KW1 is based on TF-IDF, and KW2 on a comparison of
local and global frequency (the term frequency in a particular corpus
versus the term frequency in all the corpora), weighted using the
weirdness measure. Our methodology places the main emphasis on evaluation
with assessors: the main evaluation step verifies the hypothesized
division of the conferences into “more linguistic” and “more technical”
and the hypothesized domain structure. We consider clustering a further
step of evaluation, and it is quite important to find suitable features
(keywords, ratios and so on).
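A small Python sketch of a weirdness-style keyword score of the kind referred to above (relative frequency in one corpus divided by relative frequency over all corpora); the toy corpora and the smoothing are illustrative assumptions:

from collections import Counter

def weirdness_scores(target_tokens, all_tokens):
    """Score each target term by its relative frequency ratio to the full collection."""
    tgt, glob = Counter(target_tokens), Counter(all_tokens)
    n_tgt, n_glob = len(target_tokens), len(all_tokens)
    return {t: (c / n_tgt) / ((glob[t] + 1) / (n_glob + 1))   # add-one smoothing
            for t, c in tgt.items()}

if __name__ == "__main__":
    dialog = "corpus annotation corpus syntax discourse corpus".split()
    rcdl = "retrieval index digital library retrieval metadata".split()
    scores = weirdness_scores(dialog, dialog + rcdl)
    for term, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term:12s} {s:.2f}")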
Marcin Woliński and Dominika Rogozińska
.
First Experiments in PCFG-like Disambiguation of
Constituency Parse Forests for Polish
Abstract:
The work presented here is the first attempt at creating a
probabilistic constituency parser for Polish. The described algorithm
disambiguates
parse forests obtained from the Świgra parser in a manner close to
Probabilistic Context Free Grammars. The experiment was carried out
and evaluated on the Składnica treebank. The idea behind the experiment
was to check what can be achieved with this well-known method.
The results are promising: the presented approach achieves up to 94.1% PARSEVAL
F-measure and 92.1% ULAS. The PCFG-like algorithm can be
evaluated against an existing Polish dependency parser, which achieves
92.2% ULAS.
Krešimir Šojat, Matea Srebačić, Tin Pavelić and Marko Tadić
.
From Morphology to Lexical Hierarchies
Abstract:
This paper deals with language resources for Croatian and discusses the
possibilities of their combining in order to improve their coverage and
density of structure. Two resources in focus are Croatian WordNet
(CroWN) and CroDeriV – a large database of Croatian verbs with
morphological and derivational data. The data from CroDeriV is used for
enlargement of CroWN and the enrichment of its lexical hierarchies. It
is argued that the derivational relatedness of Croatian verbs plays a
crucial role in establishing morphosemantic relations and an important
role in detecting semantic relations.
Michał Marcińczuk, Marcin Ptak, Adam Radziszewski and Maciej Piasecki
.
Open dataset for development of Polish Question
Answering systems
Abstract:
In this paper we discuss recent research on an open-domain question
answering system for Polish. We present an open dataset of questions
and relevant documents called CzyWiesz. The dataset consists of 4721
questions and assigned Wikipedia articles from the Czy wiesz (Do you
know) Wikipedia project. This dataset was used to evaluate four methods
for reranking a list of answer candidates: Tf.Idf on the base forms,
Tf.Idf on the base forms plus information from dependency analysis,
Minimal Span Weighting and modified Minimal Span Weighting. The
reranking methods improved the accuracy and MRR scores for both
datasets: the development and the CzyWiesz datasets. The best
improvement was obtained for Tf.Idf on base forms with dependency
information: nearly 5 percentage points for the development dataset and
nearly 7 percentage points for the CzyWiesz dataset.
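A rough Python sketch of Tf.Idf reranking over base forms, one of the methods listed above; lemmatization is assumed to have been done already and the data are toy inputs, not the CzyWiesz dataset:

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(question_lemmas, candidates):
    """Order candidate passages by tf-idf cosine similarity to the question."""
    docs = [c.split() for c in candidates]
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d))
           for d in docs for t in d}
    qv = tfidf_vector(question_lemmas, idf)
    scored = [(cosine(qv, tfidf_vector(d, idf)), c) for d, c in zip(docs, candidates)]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    question = "who invent telephone".split()          # already lemmatized (toy)
    candidates = ["bell invent telephone in 1876",
                  "telephone network grow fast",
                  "bell be a scientist"]
    for score, cand in rerank(question, candidates):
        print(f"{score:.3f}  {cand}")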
Aimilios Chalamandaris, Pirros Tsiakoulis, Sotiris Karabetsos and Spyros Raptis
.
An Automated Method for Creating New Synthetic Voices
from Audiobooks
Abstract:
Creating new voices for a TTS system often requires a costly procedure
of designing and recording an audio corpus, a time- and effort-consuming
task. Using publicly available audiobooks as the raw material of an
audio corpus for such systems creates new perspectives regarding the
possibility of creating new synthetic voices. This paper addresses the
issue of creating new synthetic voices based on audiobooks in a fully
automated manner. As an audiobook includes several types of speech,
such as narration, character playing, etc., special care is given to
identifying the audiobook subset that leads to a neutral and
general-purpose synthetic voice. Part of the work described in this paper was
performed during the participation of our TTS system in the Blizzard
Challenge 2013. Subjective experiments (MOS and SUS) were carried out
in the framework of the Blizzard Challenge itself and selected results
of the listening tests are also provided. The results indicate that the
synthetic speech is of high quality for both domain specific and
generic speech. Further plans for exploiting the diversity of the
speech incorporated in an audiobook are also described in the final
section where the conclusions are discussed.
Maciej Piasecki, Łukasz Burdka and Marek Maziarz
.
Wordnet Diagnostics in Development
Abstract:
The larger a wordnet is, the more difficult keeping it error free
becomes. Thus the need for sophisticated diagnostic tests emerges. In
this paper we present a set of diagnostic tests and diagnostic tools
dedicated to Polish WordNet, plWordNet. The wordnet has been in steady
development for seven years and has recently reached 120k synsets. We
propose a typology of the diagnostic levels, describe formal,
structural and semantic rules for seeking errors within plWordNet, and
present a new method of automated induction of the diagnostic rules.
Finally, we discuss the results and benefits of the approach.
Jacek Marciniak
.
Building wordnet based ontologies with expert knowledge
Abstract:
The article presents the principles of creating wordnet based ontologies which contain general knowledge about the world as well as specialist expert knowledge. Ontologies of this type are a new method of organizing lexical resources. They possess a wordnet structure expanded by domain relations and synsets ascribed to general domains and local-context taxonomies. Ontologies of this type are handy tools for indexers and searchers working on massive content resources such as internet services, repositories of digital images or e-learning repositories.
Ladan Baghai-Ravary.
Identifying and Discriminating between Coarticulatory
Effects
Abstract:
This paper uses a data-driven approach to characterise the degree of
similarity between realised instances of the same nominal phoneme in
different contexts. The results indicate the circumstances that produce
consistent deviations from a more generic pronunciation. We utilise
information local to the phonemes of interest, independent of the
phonetic or acoustic context, cluster the instances, and then identify
correlations between the clusters and the phonetic contexts and their
acoustic features. The results identify a number of phonetic contexts
that have a detectable effect on the pronunciation of their neighbours.
The methods employed make minimal use of prior assumptions or phonetic
theory. We demonstrate that coarticulation often makes the acoustic
properties of vowels more consistent across utterances.
Milos Jakubicek, Vojtech Kovar and Marek Medved
.
Towards taggers and parsers for Slovak
Abstract:
In this paper we present tools prepared for the morphological and
syntactic processing of Slovak: a model trained for tagging with
RFTagger and two syntactic analyzers, Synt and SET, whose Czech
grammars we adapted for Slovak. We describe the training process of
RFTagger using the r-mak corpus and the modifications of both parsers,
which have been performed partially in the lexical analysis and mainly
in the formal grammars used in both systems. Finally we provide an
evaluation of both tagging and parsing, the latter on two datasets: a
phrasal and a dependency treebank of Slovak.
Sardar Jaf.
The Hybridisation of a Data-driven parser for Natural
Languages
Abstract:
Identifying and establishing structural relations between words in
natural language sentences is called parsing. Ambiguities in natural
languages make parsing a difficult task. Parsing is more difficult when
dealing with a structurally complex natural language such as Arabic,
which contains a number of properties that make it particularly
difficult to handle. In this paper, we briefly highlight some of the
complex structures of Arabic, identify different parsing
approaches (grammar-driven and data-driven) and briefly
discuss their limitations. Our main goal is to combine different
parsing approaches and produce a hybrid parser, which retains the
advantages of data-driven approaches but is guided by grammatical rules
to produce more accurate results. We describe a novel technique for
directly combining different parsing approaches. Results of the initial
experiments that we have conducted in this work, as well as our plans for
future work, are also presented.
Ines Boujelben, Salma Jamoussi and Ben Hamadou Abdelmajid
.
Genetic algorithm for extracting relations between
named entities
Abstract:
In this paper, we tackle the problem of extracting relations that hold
between Arabic named entities. The main objective of our work is to
automatically extract interesting rules based on genetic algorithms.
We first annotate our training corpus using a set of
linguistic tools. Then, a set of rules is generated using association
rule mining methods. Finally, a genetic process is applied to
discover the most interesting rules from a given data set. The
experimental results prove the effectiveness of our process in
discovering the most interesting rules.
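A generic genetic-algorithm skeleton in Python of the kind the abstract refers to; the bit-mask rule representation and the fitness function are placeholders, not the paper's actual rule encoding:

import random

random.seed(0)

def fitness(mask, rule_scores):
    # Placeholder fitness: reward high-scoring rules, penalise large rule sets.
    return sum(s for bit, s in zip(mask, rule_scores) if bit) - 0.2 * sum(mask)

def evolve(rule_scores, pop_size=20, generations=50, mutation_rate=0.05):
    """Evolve bit masks selecting a subset of candidate rules."""
    n = len(rule_scores)
    population = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, rule_scores), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)                 # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(m, rule_scores))

if __name__ == "__main__":
    scores = [0.9, 0.1, 0.7, 0.05, 0.6, 0.3]   # toy "interestingness" of 6 rules
    best = evolve(scores)
    print("selected rules:", [i for i, bit in enumerate(best) if bit])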
Kais Dukes.
Semantic Annotation of Robotic Spatial Commands
Abstract:
The Robot Commands Treebank (www.TrainRobots.com) provides semantic
annotation for 3,394 spatial commands (41,158 words), for manipulating
colored blocks on a simulated discrete 3-dimensional board. The
treebank maps sentences collected through an online game-with-a-purpose
into a formal Robot Control Language (RCL). We describe the semantic
representation used in the treebank, which utilizes semantic categories
including qualitative spatial relations and named entities relevant to
the domain. Our annotation methodology for constructing RCL statements
from natural language constructions models compositional syntax,
multiword spatial expressions, anaphoric references, ellipsis,
expletives and negation. As a validation step, annotated RCL statements
are executed by a spatial planning component to ensure they are
semantically correct within the spatial context of the board.
Adam Radziszewski.
Evaluation of lemmatisation accuracy of four Polish
taggers
Abstract:
The last three years have seen an impressive rise of interest in the
morphosyntactic tagging of Polish. Four new taggers have
been developed, evaluated and made available. Although all of them are
able to perform lemmatisation along with tagging (and they are often used
for that purpose), it is hard to find any report of their lemmatisation
accuracy.
This paper discusses practical issues related to the assessment of
lemmatisation accuracy and reports the results of a detailed evaluation
of the lemmatisation capabilities of four Polish taggers.
Jakub Dutkiewicz and Czeslaw Jedrzejek
.
Ontology-based event extraction for the Polish
language
Abstract:
The paper presents an information extraction methodology that uses
shallow parsing structures and an ontology. The methods used in this
paper are designed for Polish. The methodology creates mappings from
phrases produced by shallow parsing into thematic roles. Thematic
roles are parts of an ontological semantic model of certain events.
Besides the methodology itself, the paper includes an overview of the
shallow grammar used to produce phrases, example results and a comparison
of methods used for English with the methods used in this paper.
Łukasz Kobyliński.
Improving the Accuracy of Polish POS Tagging by Using
Voting Ensembles
Abstract:
Recently, several new part-of-speech (POS) taggers for Polish have been
presented. This is highly desired, as the quality of morphosyntactic
annotation of textual resources (especially reference text corpora) has
a direct impact on the accuracy of many other language-related tasks in
linguistic engineering. For example, in the case of corpus annotation,
most automated methods of producing higher levels of linguistic
annotation expect an already POS-analyzed text on their input. In spite
of the improvement of Polish tagging quality, the accuracy of even the
best-performing taggers is still well below 100% and the mistakes made
in POS tagging propagate to higher layers of annotation. One possible
approach to further improving the tagging accuracy is to take advantage
of the fact that there are now quite a few taggers available and they
are based on different principles of operation. In this paper we
investigate this approach experimentally and show improved results of
POS tagging accuracy, achieved by combining the output of several
state-of-the-art methods.
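A minimal Python sketch of the voting idea described above: per-token majority voting over several taggers, with a fallback tagger for ties; the tagger outputs are invented and no real Polish tagger is called:

from collections import Counter

def vote(tag_sequences, fallback_index=0):
    """tag_sequences: one tag list per tagger, all over the same tokens."""
    voted = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            voted.append(position_tags[fallback_index])   # tie -> trust fallback tagger
        else:
            voted.append(counts[0][0])
    return voted

if __name__ == "__main__":
    tagger_a = ["subst:sg:nom", "fin:sg:ter", "adj:sg:acc"]    # invented outputs
    tagger_b = ["subst:sg:nom", "fin:sg:ter", "subst:sg:acc"]
    tagger_c = ["subst:sg:acc", "fin:sg:ter", "subst:sg:acc"]
    print(vote([tagger_a, tagger_b, tagger_c]))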
Esha Banerjee, Shiv Kaushik, Pinkey Nainwani, Akanksha Bansal
and Girish Nath Jha
.
Linking and Referencing Multi-lingual corpora in
Indian languages
Abstract:
This paper is an attempt to present the ILCI-MULRET (Indian Languages
Corpora Initiative – Multilingual Linked Resources Tool) which links
annotated data and linguistic resources in Indian languages from the
Web for use in NLP. The paper discusses the ILCI project, under which
parallel corpora are being created in 17 languages of India, including
English. This corpus contains parallel data in 4 domains – Health,
Tourism, Agriculture and Entertainment – and monolingual data in more
than 10 other domains. The MULRET tool acts as a visualizer for
word-level and phrase-level information retrieval from the ILCI corpus
in multiple languages and links the search query to other available
language resources on the Internet.
Adrien Barbaresi.
Challenges in web corpus construction for low-resource
languages in a post-BootCat world
Abstract:
The state-of-the-art tools of the ’Web as Corpus’ framework rely
heavily on URLs obtained from search engines. Recently, this querying
process became very slow or impossible to perform on a low budget.
Trying to find reliable data sources for Indonesian, we perform a case
study of different kinds of URL sources and crawling strategies. First,
we classify URLs extracted from the Open Directory Project and
Wikipedia for Indonesian, Malay, Danish and Swedish in order to enable
comparisons. Then we perform web crawls focusing on Indonesian and
using the mentioned sources as start URLs. Our scouting approach using
open-source software leads to a URL database with metadata.
Maarten Janssen.
POS Tags and Less Resources Languages - The CorpusWiki
Project
Abstract:
CorpusWiki is an online system for building POS resources for any
language. For many less-described languages, part of the problem in
creating a POS-annotated corpus is that it is not always clear
beforehand what the tagset should be. By using explicit feature/value
pairs, CorpusWiki attempts to provide the kind of flexibility
that is needed for defining the tagset in the process of annotating the
corpus.
Marianne Vergez-Couret.
Tagging Occitan using French and Spanish Tree Tagger
Abstract:
Part-of-speech (POS) tagging is the first step in any Natural Language
Processing chain. It usually requires substantial effort to annotate
corpora and produce lexicons. However, when these language resources
are missing, as in Occitan, rather than concentrating the effort on
creating them, methods can be devised to adapt taggers for existing
richly resourced languages. For this to work, these methods exploit the
etymological proximity between the under-resourced language and a
richly resourced language. In this article, we focus on Occitan, which
shares similarities with several Romance languages, including French and
Spanish. The method consists in running existing morpho-syntactic
tools, here TreeTagger, on Occitan texts after a
pre-transposition of the frequent words into a richly resourced language.
We performed two distinct experiments, one exploiting similarities
between Occitan and French and the second exploiting similarities
between Occitan and Spanish. This method only requires listing
the 300 most frequent words (based on a corpus) to construct two
bilingual lexicons (Occitan/French and Occitan/Spanish). Our results
are better than those obtained with the Apertium tagger using a larger
lexicon.
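A Python sketch of the pre-transposition step described above: frequent Occitan forms are replaced by French equivalents before the text is passed to a French tagger; the tiny lexicon and the example sentence are rough illustrations, and tag_with_french_treetagger is a stand-in rather than a real TreeTagger wrapper:

# Rough illustrative entries; the paper's lexicons cover the 300 most frequent words.
OCC_TO_FR = {"lo": "le", "la": "la", "e": "et", "es": "est", "sus": "sur"}

def pre_transpose(tokens, lexicon=OCC_TO_FR):
    """Swap only the listed frequent words; leave the rest untouched."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]

def tag_with_french_treetagger(tokens):
    # Stand-in for a call to TreeTagger with a French parameter file.
    return [(tok, "UNK") for tok in tokens]

if __name__ == "__main__":
    occitan_sentence = "Lo gat es sus la taula".split()
    transposed = pre_transpose(occitan_sentence)
    print(transposed)                        # frequent words now look French
    print(tag_with_french_treetagger(transposed))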
Carlo Zoli and Silvia Randaccio
.
Smallcodes and LinMiTech: two faces of the same new
business model for the development of LRTs for LRLs
Abstract:
LinMiTech and Smallcodes represent a new business model for language
resources and tools development with particular focus on endangered and
minority languages. The key idea is that small language communities
should not be regarded as mere users of NLP tools, but as stakeholders
and true co-developers of the tools, enforcing a virtuous cycle that
enables minorities and language activists to share and boost the scope
of the LRTs. In a word, Smallcodes and LinMiTech have made real and
financially sustainable the mantra everybody repeats: "best practices
should be spread and shared among researchers and users".
Ekaterina Pronoza, Elena Yagunova and Andrey Lyashin
.
Restaurant Information Extraction for the
Recommendation System
Abstract:
In this paper a method for the analysis of a corpus of Russian reviews
(as part of information extraction) is proposed. It is aimed at the
future development of an information extraction system. This system is
intended to be a module of a recommendation system, and it is to gather
restaurant parameters from users’ reviews, structure them and feed the
recommendation module with these data. The frames analyzed are service
and food quality, cuisine, price level, noise level, etc. In this paper
service quality, cuisine type and food quality are considered. The
authors’ aim is to develop patterns and to estimate their appropriateness
for information extraction.
Velislava Stoykova.
Representation of Lexical Information for Related
Languages with Universal Networking Language
Abstract:
The paper analyses related approaches used to develop a lexical database
in the Universal Networking Language framework for representing the
Bulgarian and Slovak languages. The idea is to use a specific combination
of grammar knowledge and techniques for representing lexical semantic
relations to model a lexical information representation scheme for
related languages. The formal representation is outlined with respect
to its multilingual application for machine translation.
Tomasz Obrebski.
Deterministic Dependency Parsing with Head Assignment
Revision
Abstract:
The paper presents a technique for deterministic dependency parsing
which allows revising already-made head-assignment
decisions. Traditional deterministic algorithms, in both the phrase
structure and dependency paradigms, are incremental and base their
decisions essentially on two kinds of information: the parse history and
look-ahead tokens (the as-yet-unparsed part of the input). Typical
examples are the well-known LR(k) algorithms for a subset of
context-free grammars or Nivre's memory-based algorithms for dependency
grammars. Our idea is to defer the ultimate decision until more input
is read and parsed. While processing the i-th token, the choice is
considered preliminary and may be changed as soon as a better option
appears. This strategy may be incorporated into other algorithms.
Robert Susmaga and Izabela Szczęch
.
Visualization of Interestingness Measures
Abstract:
The paper presents a visualization tool for interestingness measures,
which provides useful insights into different domain areas of the
visualized measure and thus effectively assists measure comprehension
and their selection for KDD methods. Assuming a common, 4-dimensional
domain form of the measures, the system generates a synthetic set of
contingency tables and visualizes them in three dimensions using a
tetrahedron-based barycentric coordinate system. At the same time, an
additional, scalar function of the data (referred to as the operational
function, e.g. any interestingness measure) is rendered using colour.
Throughout the paper a particular group of interestingness measures,
known as confirmation measures, is used to demonstrate the capabilities
of the visualization tool, which range from the determination of specific
values (extremes, zeros, etc.) of a single measure, to the localization
of pre-defined regions of interest, e.g. such domain areas for which
two or more measures do not differ at all or differ the most.
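A Python sketch of the tetrahedron-based barycentric mapping described above: a 2x2 contingency table is normalised and used as barycentric weights of four vertices, with a confirmation measure as an example operational function; the vertex placement and the table convention are assumptions, not necessarily those of the authors' tool:

import math

# Vertices of a regular tetrahedron centred at the origin (one possible choice).
VERTICES = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def barycentric_point(a, b, c, d):
    """Map a contingency table (a, b, c, d) to a 3-D point inside the tetrahedron."""
    total = a + b + c + d
    weights = [a / total, b / total, c / total, d / total]
    return tuple(sum(w * v[k] for w, v in zip(weights, VERTICES)) for k in range(3))

def confirmation_S(a, b, c, d):
    """Example operational function: confirmation measure S = P(H|E) - P(H|not E).
    Table convention assumed here: a=|E,H|, b=|E,not H|, c=|not E,H|, d=|not E,not H|."""
    return a / (a + b) - c / (c + d)

if __name__ == "__main__":
    table = (40, 10, 5, 45)   # toy counts
    print(barycentric_point(*table), round(confirmation_S(*table), 3))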
Mercedes García Martínez, Michael Carl and Bartolomé Mesa-Lao
.
Demo of the CASMACAT post-editing workbench -
Prototype-II: A research tool to investigate human translation
processes for Advanced Computer Aided Translation
Abstract:
The CASMACAT workbench is a new browser-based computer-assisted
translation workbench for post-editing machine translation outputs. It
builds on experience from (i) Translog-II
(http://bridge.cbs.dk/platform/?q=Translog-II), a tool written in C#,
designed for studying human reading and writing processes, and (ii)
Caitra (http://www.caitra.org/), a Computer Assisted Translation (CAT)
tool based on AJAX Web 2.0 technologies and the Moses decoder. The
CASMACAT workbench extends Translog's key-logging and eye-tracking
abilities with a browser-based front-end and an MT server in the
back-end.
The main features of this new workbench are:
1. Web-based technology, which allows for easier portability across
different machine platforms and versions.
2. Interactive translation prediction, suggesting to the human
translator how to complete the translation.
3. Interactive editing, providing additional information about the
confidence of its assistance.
4. Adaptive translation models, updating and adapting its models
instantly based on the translation choices of the user.
The translation field can be pre-filled by machine translation through
a server connection and also automatically updated online from an
interactive machine translation server. Shortcut keys are used for
functions such as navigating between segments.
The main innovation of the CASMACAT workbench is its exhaustive logging
function. This allows for completely new possibilities of analyzing
translators' behavior, both in a qualitative and a quantitative manner.
The extensive log file contains all kinds of events: keystrokes, mouse and
cursor navigation, as well as gaze information (if an eye-tracker is
connected) recorded during the translation session. The logged data can
also be replayed to visualize the moves and choices made by the
translator during the post-editing process.
Fernando Ferri, Patrizia Grifoni and Adam Wojciechowski
.
SNEP: Social Network Exchange Problem
Abstract:
This paper discusses the new frontiers of using the Internet and Social
Networks for exchanging goods by swapping, and the knowledge and
optimization algorithm proposed for this emerging business approach.
The approach is based on the use of Social Networks (SN) for sharing
knowledge and services and for exchanging goods. Exchanges are managed
according to knowledge about the community members, the domain knowledge
(the knowledge of products and goods) and the operational knowledge.
The exchange problem involves Social Networks, people and the goods
involved, defined in the form of a graph. The algorithm for optimization
is presented.
Waldir Edison Farfan Caro, Jose Martin Lozano Aparicio and Juan
Cruz
.
Syntactic Analyser for Quechua Language
Abstract:
This demo presents a morphological analyzer for Quechua which makes use
of a dynamic programming technique with a context-free grammar. The
construction of grammars for native languages is very important in order
to keep these languages alive. We focus on Quechua, a less-resourced
Native American language.
Brigitte Bigi.
SPPAS - DEMO
Abstract:
SPPAS is a tool to produce automatic annotations which include
utterance, word, syllabic and phonemic segmentation from a recorded
speech sound and its transcription. Main supported languages are:
French, English, Italian, Spanish, Chinese and Taiwanese. The resulting
alignments are a set of TextGrid files, the native file format of the
Praat software which has become the most popular tool for phoneticians
today. An important point for software which is intended to be widely
distributed is its licensing conditions. SPPAS uses only resources and
tools which can be distributed under the terms of the GNU General Public
License.
Aitor García-Pablos, Montse Cuadros and German Rigau.
OpeNER demo: Open Polarity Enhanced Named Entity Recognition
Abstract:
OpeNER is a project funded by the European Commission under the 7th Framework Programme. Its acronym stands for Open Polarity Enhanced Named Entity Recognition. OpeNER's main goal is to provide a set of open and ready-to-use tools to perform several NLP tasks in six languages: English, Spanish, Italian, Dutch, German and French. In order to display the OpeNER analysis output in a format suitable for a non-expert human reader, we have developed a Web application that displays this content in different ways. This Web application should serve as a demonstration of some of the OpeNER modules' capabilities after the first year of development.
Włodzimierz Gruszczyński, Bartosz Broda, Bartłomiej Nitoń and Maciej Ogrodniczuk.
Jasnopis: a new application for measuring readability of Polish texts
Abstract:
In the demo session we present a new application for the automatic measurement of the readability of Polish texts, making use of the two most common approaches to the topic, the Gunning FOG index and the Flesch-based Pisarek method, and two novel methods: measuring the distributional lexical similarity of a target text by comparing it to reference texts, and using statistical language modeling to automate a Taylor test.
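A rough Python sketch of the classic (English) Gunning FOG recipe mentioned above, with hard words approximated by vowel-group counting; the Polish adaptations used in Jasnopis count hard words differently, so this is only an illustration of the general formula:

import re

VOWELS = re.compile(r"[aeiouyąęó]+", re.IGNORECASE)

def syllable_estimate(word):
    """Crude syllable count: number of vowel groups, at least one."""
    return max(1, len(VOWELS.findall(word)))

def gunning_fog(text):
    """FOG = 0.4 * (words per sentence + 100 * hard words per word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    hard = [w for w in words if syllable_estimate(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(hard) / len(words))

if __name__ == "__main__":
    sample = ("Readability formulas estimate how difficult a text is. "
              "They combine sentence length with the proportion of long words.")
    print(round(gunning_fog(sample), 1))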
Shakila Shayan, Andre Moreira, Alexander Koenig, Sebastian Drude and Menzo Windhouwer.
Lexus, an online encyclopedic lexicon tool
Abstract:
Lexus is a flexible web-based lexicon tool, which can be used on any platform that provides a web browser and can run Flash Player. Lexus does not enforce any prescribed schema structure on the user; however, it supports the creation of lexica with the structure of the ISO LMF standard and promotes the usage of concept names and conventions proposed by the ISO data categories.
We will give a demo presentation of Lexus’ main features, which include embedding local and archived multimedia (image, audio, video) within different parts of a lexical entry, customizable visualization of the data, the possibility of creating data elements by accessing the ISOcat Data Category Registry or an MDF database (Toolbox/Shoebox), various within- and cross-lexica search options, and shared-access collaboration.
We will show how these features, which allow for an online shared-access multimedia encyclopedic lexicon, make Lexus a promising and suitable tool for language documentation projects.
Jakub Dutkiewicz and Czesław Jedrzejek.
A demo of ontology-based event extraction for the Polish and English languages
Abstract:
This paper describes the details of a demonstration of ontology-based event extraction for the Polish and English languages. The steps comprise the creation of a conceptual, ontological view of the event, the design of an extraction scheme, and the extraction itself, together with a presentation of the results.
Marek Maziarz, Maciej Piasecki, Ewa Rudnicka and Stanisław Szpakowicz
.
PlWordNet 2.1: DEMO
Abstract:
The first ever wordnet (WordNet) was built in the late 1980s at Princeton University. In the past two decades, hundreds of research teams followed in the footsteps of WordNet’s creators, including our team. Notably, plWordNet is one of the few such resources built not by translating WordNet, but from the ground up, in a joint effort of lexicographers and computer scientists. In 2009 the first version, with some 27000 lexical units, was made available on the Internet. Today plWordNet describes 108000 nouns, verbs and adjectives, contains nearly 162000 unique senses and 450000 relation instances. It is by far the largest wordnet for Polish, and one of the largest in the world.
PlWordNet is a semantic network which reflects the Polish lexical system. The nodes in plWordNet are lexical units (LUs, words with their senses), variously interconnected by semantic relations from a well-defined relation set. For instance, the synonymous LUs "kot 2" and "kot domowy 1" ‘cat, Felis domesticus’ have a hypernym "kot 1" ‘feline mammal, any member of the family Felidae’ and hyponyms such as "dachowiec 1" ‘alley cat’ or "angora turecka 1" ‘Turkish Angora’. Any lexical unit acquires its meaning from its relatedness to other lexical units within the system; we can reason about it by considering the relations in which it participates. Thus "kot 2" is defined as a kind of animal from the family Felidae, and "dachowiec 1" and "angora turecka 1" are kinds of Felis domesticus. Lexical units which enter the same lexicosemantic relations (but not the same derivational relations) are treated as synonyms and linked into synsets, that is, synonym sets (this is the case of "kot 2" and "kot domowy 1").
The continued growth of plWordNet has been made possible by grants from the Polish Ministry of Science and Higher Education and from the European Union. Now we
work on it in the Clarin Poland Project. We aim to build a conceptual dictionary fully representative of contemporary Polish, comparable with the largest wordnets in the world. We have made an effort to ensure that version 2.1 has the same high quality as the best wordnets out there – Princeton WordNet, EuroWordNet (a joint initiative of a dozen or so members of the European Union) or GermaNet from
Tübingen University. PlWordNet is available free of charge for any application (including commercial applications), under a licence modelled on that of Princeton WordNet.
Piotr Bański, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Piotr Pęzik, Carsten Schnober and Andreas Witt
.
KorAP: the new corpus analysis platform at IDS Mannheim
Abstract:
The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening up perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. The proposed demo will present the project deliverables at the current stage, as we switch from the creation phase to the testing and fine-tuning phase.
Maarten Janssen
.
CorpusWiki: an online, language independent POS tag corpus builder
Abstract:
In this demonstration we will present the key aspects of CorpusWiki. CorpusWiki is an online environment for the creation of morphosyntactically annotated corpora, and can be used for any written language. The system attempts to keep the required (computational) knowledge on the part of the users to a bare minimum, and to maximize the output of the efforts put in by the users. Given its language-independent design, CorpusWiki primarily aims at allowing users to create basic resources for less-resourced languages (and dialects).
Zygmunt Vetulani, Marek Kubis and Bartłomiej Kochanowski
.
Collocations in PolNet 2.0. New release of "PolNet - Polish Wordnet"
Abstract:
The recent evolution from PolNet 1.0 to PolNet 2.0 consisted mainly in the further development of the verbal component, with the inclusion of concepts (synsets) represented (in many cases uniquely) by compound constructions in the form of verb-noun collocations. This extension brings new synsets to PolNet, some of which are closely related to the already existing verb synsets of PolNet 1.0. Adding collocations to PolNet turned out to be non-trivial because of specific syntactic phenomena related to collocations in Polish.
Massimo Moneglia
.
IMAGACT. A Multilingual Ontology of Action based on visual representations (DEMO)
Abstract:
Automatic translation systems often have difficulty choosing the appropriate verb when the translation of a simple sentence is required, since one verb can refer, in its own meaning, to many different actions and there is no certainty that the same set of alternatives is allowed in another language. The problem is a significant one, because reference to action is very frequent in ordinary communication and high-frequency verbs are general; i.e. they refer to different action types. IMAGACT has delivered a cross-linguistic ontology of action. Using spoken corpora, we have identified 1010 high-frequency action concepts and visually represented them with prototypical scenes. The ontology allows the definition of cross-linguistic correspondences between verbs and actions in English, Italian, Chinese and Spanish. Thanks to the visual representation it can potentially be extended to any language.
Andrea Marchetti, Maurizio Tesconi, Stefano Abbate, Angelica Lo Duca, Andrea D'Errico, Francesca Frontini and Monica Monachini
.
Tour-pedia: a web application for the analysis and visualization of opinions for tourism domain
Abstract:
We present Tour-pedia, an interactive web application that extracts opinions from reviews of accommodation gathered from different sources available online. Polarity markers display the different opinions on a map. This tool is intended to help business operators manage their online reputation.
Ewa Lukasik and Magdalena Sroczan
.
Innovation of Technology and Innovation of Meaning:
Assessing Websites of Companies
Abstract:
The research reported in this paper has been inspired by the concept of
design-driven innovation introduced by Roberto Verganti. He claimed
that the rules of innovation should be changed by radically changing
the meaning of things. The word design is etymologically derived from
a Latin expression that relates to distinguishing things by signs (de
signum). Therefore the sign and the language of signs play an important
role in catching the user’s interest in the product. The Internet is one
of the most important marketing media for a firm; therefore the meaning
of its web image, a website, should be a matter of great concern for a
company wishing to attract clients. The paper tries to answer the question
whether the websites of firms are innovative or only correctly designed
according to user-centered rules. The design-driven innovation of
the websites of selected firms has been assessed by student testers. Three
groups of Internet portals were taken into account: electricity
suppliers, banks and portals of cities. The test results showed that
the design of most of the assessed websites falls into the first
quadrant of Verganti’s technology-meaning space, i.e. neither radical
innovation of technology nor innovation of meaning was observed.
Rajaram Belakawadi, Shiva Kumar H R and Ramakrishnan A G
.
An Accessible Translation System between Simple
Kannada and Tamil Sentences
Abstract:
A first-level, rule-based machine translation system is designed and
developed for words and simple sentences of the language pair Kannada –
Tamil. These languages are Dravidian languages and have the status of
classical languages. Both grammatical and colloquial translations are
made available to the user. One can also give English words and
sentences as input and the system returns both the Kannada and Tamil
equivalents. With accessibility as the key focus, the system has an
integrated Text-To-Speech system and gives transliterated output in
Roman script for both Kannada and Tamil. This makes the tool
accessible to the visually or hearing challenged. The system has been
tested by 5 native users each of Tamil and Kannada on isolated words
and sentences of up to three words in length and was found to be
user-friendly and acceptable. The system handles sentences of the
following types: greetings, introductions, enquiries, directions and
other general ones useful for a newcomer.
Adam Wojciechowski and Krzysztof Gorzynski
.
A Method for Measuring Similarity of Books:
a Step Towards an Objective Recommender System for Readers
Abstract:
In this paper we propose a method for book comparison based on the intersection area of graphical radar charts. The method was designed as a universal tool and its most important parameter is the document feature vector (DFV), which defines a set of text descriptors used to measure particular properties of the analyzed text. The numerical values of the DFV that define a book's characteristics are plotted on a radar chart, and the intersection area drawn for two books is interpreted as a measure of bilateral similarity in the sense of the defined DFV. An experiment conducted on a relatively simple definition of the DFV gave promising results in recognizing books’ similarity (in the sense of author and literary domain). Such an approach may be used for building a recommender system for readers willing to select a book matching their preferences, recognized through the objective properties of a reference book.
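A Python sketch of the radar-chart idea described above: two DFVs are plotted on evenly spaced spokes and similarity is taken as shared area over union area; taking the pointwise minimum and maximum per spoke is a simplification of exact polygon intersection, and the feature values are invented:

import math

def radar_area(values):
    """Area of the polygon spanned by radii `values` on evenly spaced spokes."""
    n = len(values)
    angle = 2 * math.pi / n
    return sum(0.5 * values[i] * values[(i + 1) % n] * math.sin(angle)
               for i in range(n))

def similarity(dfv_a, dfv_b):
    """Shared radar area divided by union radar area (per-spoke min/max simplification)."""
    inter = radar_area([min(a, b) for a, b in zip(dfv_a, dfv_b)])
    union = radar_area([max(a, b) for a, b in zip(dfv_a, dfv_b)])
    return inter / union if union else 0.0

if __name__ == "__main__":
    # Hypothetical normalised DFVs: e.g. sentence length, type/token ratio, dialogue share...
    book_a = [0.70, 0.55, 0.30, 0.80, 0.40]
    book_b = [0.65, 0.60, 0.35, 0.75, 0.50]
    print(round(similarity(book_a, book_b), 3))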