The North Sámi active learning morphological segmentation annotations
======================================================================

This is a corpus of hand-annotated morphological segmentations for actively selected North Sámi words.
The words were selected using Morfessor from the UIT-SME-TTS corpus, kindly provided to us by the University of Tromsø.
The annotations were produced by Katri Hiovain, a trained linguist who is not a native speaker of Sámi.

If you use this data in a scientific publication, please cite the following paper:

Stig-Arne Grönroos, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja.
Low-resource active learning of North Sami morphological segmentation.
In proceedings of IWCLUL 2015.


Data format and annotation conventions
======================================

The files ending in .analysis contain surface segmentations with category tags appended to each morph.
Two morph categories occur in the data: STM for word stems, and SUF for suffixes.
The format is:
<word><tab><morph><slash><category>[<space><morph><slash><category>]*
For example:
galmmihanrusttegat	galmmi/STM ha/SUF n/SUF rusttega/STM t/SUF

Hyphens and colons are separated from stems for technical reasons, and marked as belonging to the suffix category:
aa-joavkku	aa/STM -/SUF joavkku/STM
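
The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the package: the function names parse_analysis_line and is_consistent are hypothetical, and it assumes that each line carries exactly one analysis. Because the segmentations are surface segmentations, joining the morphs should give back the word, which is a useful sanity check when loading the files.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (morph, category), e.g. ("galmmi", "STM")


def parse_analysis_line(line: str) -> Tuple[str, List[Morph]]:
    """Split one .analysis line into the word and its (morph, category) pairs.

    Assumption: one analysis per line, in the format
    <word><tab><morph>/<category>[ <morph>/<category>]*
    """
    word, analysis = line.rstrip("\n").split("\t", 1)
    morphs = []
    for token in analysis.split(" "):
        # Split on the last slash, so a morph containing "/" is kept whole.
        morph, category = token.rsplit("/", 1)
        morphs.append((morph, category))
    return word, morphs


def is_consistent(word: str, morphs: List[Morph]) -> bool:
    """Check that the concatenated morphs reproduce the surface word."""
    return "".join(morph for morph, _ in morphs) == word
```

For the two example lines above, is_consistent holds, and the hyphen in aa-joavkku is returned as its own morph with the category SUF.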

Most of the annotated word tokens had an unambiguous segmentation agreeing
with established linguistic interpretation. These words contain only easily separated
suffixes: markers for case and person, and derivational endings. However, some
words required the annotator to make choices on where to place the boundary.

One challenge was posed by the extensive stem alternation and fusion in Sámi.
To maximize consistency, the segmentation boundary was usually placed so that all
of the morphophonological alternation remains in the stem. Exceptions include the
passive derivational suffix, which occurs as the variants -ojuvvo- and -juvvo- depending
on the inflectional category and stem type. Lexicalized stems posed another challenge.
These stems appear to end with a derivational suffix, but removal of the suffix does not
yield a morpheme at all, or results in a morpheme with very weak semantic relation
to the lexicalized stem. An example is ráhkadit (make, produce).


Contents of the package
=======================

The package sme_al_annotations.tar.gz contains the following files:

words   file
------------------------------------
  100   development.analysis
------------------------------------
  237   train.selected.analysis
  197   train.firstapproach.analysis
  311   train.random.analysis
  643   train.full.analysis
------------------------------------
  357   test.analysis
------------------------------------
    5   nonwords
------------------------------------

The collected words are divided into four sets: development, training, test, and nonwords.
The development set (development.analysis) is intended for hyper-parameter optimization.
It consists of 100 randomly selected word types.

The training set contains a total of 643 unique word types, selected in different ways.
train.firstapproach.analysis contains actively selected words from the first active learning setup.
train.selected.analysis contains actively selected words from the restarted active learning setup.
train.random.analysis contains randomly selected words used as an evaluation baseline.
The three sets described above are NOT disjoint.
No word was annotated twice: when a word was selected again in a later phase, its annotation was taken from the earlier annotations.

train.full.analysis is a set of all unique training words, regardless of collection method.
It was created by issuing this command:
cat train.firstapproach.analysis train.random.analysis train.selected.analysis | sort | uniq > train.full.analysis
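
For readers without a Unix shell, the same union can be approximated in Python. This is a sketch under stated assumptions, not the command used to build the package: merge_unique and read_lines are hypothetical helper names, and the files are assumed to be UTF-8 encoded. Like the pipeline above, it removes duplicate lines rather than duplicate words. Unlike the pipeline, it skips empty lines, and it sorts by code point, so the line order may differ from the output of a locale-aware sort.

```python
from typing import Iterable, List


def read_lines(path: str) -> List[str]:
    """Read a file as a list of lines (assumed to be UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()


def merge_unique(*line_groups: Iterable[str]) -> List[str]:
    """Return the sorted union of all lines, without duplicates."""
    unique = set()
    for lines in line_groups:
        for line in lines:
            line = line.rstrip("\n")
            if line:  # skip empty lines (a deviation from sort | uniq)
                unique.add(line)
    return sorted(unique)
```

A typical call would be merge_unique(read_lines("train.firstapproach.analysis"), read_lines("train.random.analysis"), read_lines("train.selected.analysis")). Because the three sets overlap, the result is shorter than the three inputs combined.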

The unannotated training data word list is found in the separate package unannotated.gz.
It is a tokenized, lightly cleaned word list extracted from Den Samiske tekstbanken.
Note that it still contains some nonwords.
Nonwords encountered during the annotation elicitation are listed in the file nonwords.


License
=======

This data is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
http://creativecommons.org/licenses/by/4.0/

To fulfill the attribution clause, please cite the following paper:

Stig-Arne Grönroos, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja.
Low-resource active learning of North Sami morphological segmentation.
In proceedings of IWCLUL 2015.
