The North Sámi active learning morphological segmentation annotations
======================================================================

This is a corpus of hand-annotated morphological segmentations for actively selected North Sámi words.
The words have been selected using Morfessor, from the UIT-SME-TTS corpus kindly provided to us by the University of Tromsø.
The annotations were produced by Katri Hiovain, a trained linguist, who is not a native speaker of Sámi.

If you use this data in a scientific publication, please cite the following papers:

Stig-Arne Grönroos, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja.
Low-resource active learning of North Sami morphological segmentation.
In proceedings of IWCLUL 2015.

Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, and Sami Virpioja.
Low-Resource Active Learning of Morphological Segmentation.
Northern European Journal of Language Technology. 2016.


Data format and annotation conventions
======================================

The files ending in .analysis contain surface segmentations with category tags appended to each morph.
Two morph categories occur in the data: STM for word stems, and SUF for suffixes.
The format is:
<word><tab><morph><slash><category>[<space><morph><slash><category>]*
For example:
galmmihanrusttegat      galmmi/STM ha/SUF n/SUF rusttega/STM t/SUF

Hyphens and colons are separated from stems for technical reasons, and marked as belonging to the suffix category:
aa-joavkku      aa/STM -/SUF joavkku/STM
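
The format above can be read with a few lines of Python. This is a minimal sketch, not part of the distributed package: the function name parse_analysis_line is hypothetical, and the code assumes exactly one analysis per line, as in the format description. It also checks that the concatenated morphs spell the original word, which is what a surface segmentation should satisfy.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (surface form, category tag)


def parse_analysis_line(line: str) -> Tuple[str, List[Morph]]:
    """Parse one .analysis line into (word, [(morph, category), ...])."""
    # The word is separated from the analysis by whitespace (a tab in the files).
    word, analysis = line.rstrip("\n").split(None, 1)
    morphs = []
    for token in analysis.split():
        # The category tag follows the last slash of each token.
        form, category = token.rsplit("/", 1)
        morphs.append((form, category))
    return word, morphs


word, morphs = parse_analysis_line(
    "galmmihanrusttegat\tgalmmi/STM ha/SUF n/SUF rusttega/STM t/SUF"
)
print(word, morphs)
# Sanity check: the morphs of a surface segmentation concatenate to the word.
print("".join(form for form, _ in morphs) == word)
```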

Most of the annotated word tokens had an unambiguous segmentation agreeing
with established linguistic interpretation. These words contain only easily separated
suffixes: markers for case and person, and derivational endings. However, some
words required the annotator to make choices on where to place the boundary.

One challenge was posed by the extensive stem alternation and fusion in Sámi.
To maximize consistency, the segmentation boundary was usually placed so that all
of the morphophonological alternation remains in the stem. Exceptions include the
passive derivational suffix, which is found as variants -ojuvvo- and -juvvo- depending
on the inflectional category and stem type. Another challenge was posed by lexicalized stems.
These stems appear to end with a derivational suffix, but removal of the suffix does not
yield a morpheme at all, or results in a morpheme with very weak semantic relation
to the lexicalized stem. An example is ráhkadit (make, produce).


Versions of the package
=======================

There are two versions of the data set:
Version 1 (sme_al_annotations.tar.gz) contains the data collected for the paper published in IWCLUL 2015.
Version 2 (sme_al_annotations_v2.tar.gz) contains an extended data set collected for the paper published in NEJLT.

A third package (systems_agree.tar.gz) contains additional data for the paper published in IWCLUL 2019.


Contents of the v1 package
==========================

The package sme_al_annotations.tar.gz contains the following files:

words   file
------------------------------------
  100   development.analysis
------------------------------------
  237   train.selected.analysis
  197   train.firstapproach.analysis
  311   train.random.analysis
  643   train.full.analysis
------------------------------------
  357   test.analysis
------------------------------------
    5   nonwords
------------------------------------

The collected words are divided into 4 sets: development, training, testing and nonwords.
The development set (development.analysis) is intended for hyper-parameter optimization.
It consists of 100 randomly selected word types.

The training set contains a total of 643 unique word types, selected in different ways.
train.firstapproach.analysis contains actively selected words from the first active learning setup.
train.selected.analysis contains actively selected words from the restarted active learning setup.
train.random.analysis contains randomly selected words used as an evaluation baseline.
The three sets described above are NOT disjoint.
No word was annotated twice: when a word was selected again in a later phase, its existing annotation was reused.
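
The overlap is also visible in the table: the three subset counts add up to 745 (237 + 197 + 311), against 643 unique word types in train.full.analysis. The sketch below shows one way to count the words shared by each pair of sets. The helper names word_types and pairwise_overlap are hypothetical, and the input is toy data, not the contents of the actual files.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def word_types(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct words (first field) from .analysis lines."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}


def pairwise_overlap(sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count the words shared by each pair of named word sets."""
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


# Toy input for illustration only; replace with lines read from the files.
toy = {
    "firstapproach": word_types(["aa-joavkku\taa/STM -/SUF joavkku/STM"]),
    "selected": word_types([
        "aa-joavkku\taa/STM -/SUF joavkku/STM",
        "galmmihanrusttegat\tgalmmi/STM ha/SUF n/SUF rusttega/STM t/SUF",
    ]),
}
print(pairwise_overlap(toy))  # {('firstapproach', 'selected'): 1}
```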

train.full.analysis is a set of all unique training words, regardless of collection method.
It was created by issuing this command:
cat train.firstapproach.analysis train.random.analysis train.selected.analysis | sort | uniq > train.full.analysis
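
Where a Unix shell is not available, roughly the same result can be obtained in Python. This is a sketch under stated assumptions: the function name merge_unique_lines is hypothetical, and the files are assumed to be UTF-8 encoded. As with the shell command, duplicates are removed per whole line, so a word would appear twice if it had two different annotations. Python sorts by code point, so the line order may differ from that of a locale-aware sort, although the set of lines is the same.

```python
from typing import Iterable, List


def merge_unique_lines(paths: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated lines of all given files."""
    unique = set()
    for path in paths:
        # UTF-8 is an assumption about the encoding of the data files.
        with open(path, encoding="utf-8") as handle:
            unique.update(line.rstrip("\n") for line in handle)
    # Code-point order; may differ from the order produced by sort(1).
    return sorted(unique)
```

For example, merge_unique_lines(["train.firstapproach.analysis", "train.random.analysis", "train.selected.analysis"]) returns the merged lines as a list, which can then be written out one per line.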


Contents of the v2 package
==========================

The package sme_al_annotations_v2.tar.gz contains the following files:

words   file
------------------------------------
  2311 all.analysis
------------------------------------
   199 development.analysis
------------------------------------
   300 train.ifsubstrings.analysis
   297 train.uncertainty_rs.analysis
   500 train.random.analysis
------------------------------------
   796 test.analysis
------------------------------------

The development set has been expanded to 199 words, and the test set to 796 words.

There are 3 training sets, according to the method of selection.
train.ifsubstrings.analysis contains words actively selected using the Initial/Final Substrings query strategy.
train.uncertainty_rs.analysis contains words actively selected using the Uncertainty + Representative sampling query strategy.
train.random.analysis contains randomly selected words used as an evaluation baseline.

all.analysis is a set of all unique words, regardless of collection method.
This set also contains the words from the v1 package which were not reused in the second experiment.

Unannotated training data
=========================

The unannotated training data word list is found in the separate package unannotated.gz.
It is a tokenized, lightly cleaned word list extracted from Den Samiske tekstbanken.
Note that it still contains some nonwords.

Contents of the systems_agree.tar.gz package
============================================

The package systems_agree.tar.gz contains the following files:

words     file
----------------------------------------------
  352606  systems_agree
  310224  systems_disagree
     300  systems_disagree.300.ifsubstrings_5n
----------------------------------------------

The files systems_agree and systems_disagree divide a further cleaned subset of the unannotated training data
into two disjoint subsets.
The words in systems_agree were segmented identically by the FlatCat + CRF and FlatCat + Neural Sequence Tagger systems.
For the words in systems_disagree, the two systems produced different segmentations.
The third file, systems_disagree.300.ifsubstrings_5n,
contains 300 words selected using the Initial/Final Substrings query strategy from the words for which the systems disagree.

The systems_agree file may be useful if you are looking for a large number of segmentations with relatively high reliability.
The systems_disagree files may be useful if you want to contribute to extending the set of annotations.


License
=======

This data is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
http://creativecommons.org/licenses/by/4.0/

To fulfill the attribution clause, please cite the following paper:

Stig-Arne Grönroos, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja.
Low-resource active learning of North Sami morphological segmentation.
In proceedings of IWCLUL 2015.