
Automatic score reader -- OMR system and text-based musical notation

* Overview
Overscore is an automatic score reader that recognizes sheet music and plays it using [[http://overtone.github.com/][Overtone]]. Overscore consists of three main stages:
  - an /Optical Music Recognition/ (OMR) system that recognizes scores and compiles them into MusicXML files
  - a MusicXML parser that compiles MusicXML files into Clojure files
  - multiple helper macros and functions built on top of Overtone, providing a system better suited to encoding classical music

This project is the result of work done at the ULB (Université Libre de Bruxelles), for the course PROJ-H-402.

* Dependencies
Aside from the dependencies listed in =project.clj=, OpenCV is also needed for the classification step, and the classifier (which is written in C++) should be compiled:

#+BEGIN_SRC shell
cd src/overscore/recognition/classification
g++ opencv_knn.cpp -o opencv_knn `pkg-config opencv --libs --cflags`
#+END_SRC

The Gamera Python plugin [[http://music-staves.sf.net/][music-staves]] is also needed for staff line detection.

* Usage
The application can be launched using [[http://leiningen.org/][leiningen]]:

#+BEGIN_SRC shell
$ lein2 run help
Possible arguments:
convert <in> <out>
    Convert symbols from the Audiveris training set to png
    images in the <out> directory (from the <in> directory)

preprocessing <in> <out> <out-ref>
    Preprocess the image <in>, saving the output image to
    <out> and the reference lengths descriptions in
    <out-ref>

staffline <in>
    Isolate the systems and find the staffline positions
    on each system, from the image <in>. Saves the output
    for each system to <in>-n.png and <in>-n.txt, where n
    is an integer

segmentation <in-img> <in-refs> <out-segs>
    Segment the image <in-img>, with reference lengths
    described in <in-refs>, isolating each symbol. Save
    the segments descriptions in <out-segs>

classification <training-set> <in-img> <in-segs> <out-classes>
    Classify the segments of <in-img> described by the
    file <in-segs> (generated by the segmentation step),
    using the training set at location <training-set>
    (created by the convert step). Save the segments along
    with their classes in <out-classes>

semantics <in-classes> <in-refs> <in-stafflines> <out-xml>

    Convert the recognized score to a MusicXML
    document (<out-xml>), from the classes (<in-classes>) output
    by the classification step, the reference lengths (<in-refs>)
    output by the preprocessing step, and the staff lines
    positions (<in-stafflines>, the .txt file) output by the
    staffline step.

generate <in> <out> <name>
    Parse the <in> MusicXML file and generate the song
    <name> in the clojure file <out>

play <file> <ns>/<name>
    play the song called <name> defined in the file <file>, which
    defines the namespace <ns>. If no namespace is defined, only
    <name> can be given

#+END_SRC

Some example data is provided in the [[example]] directory, which contains both input data and (almost) perfect output data. For example, after running the following command:

#+BEGIN_SRC shell
$ lein2 run preprocessing example/input/furelise.png /tmp/furelise.png /tmp/refs.txt
#+END_SRC

The image =/tmp/furelise.png= should (roughly) correspond to the image =example/preprocessed/furelise.png=, and the file =/tmp/refs.txt= to [[example/preprocessed/refs.txt]].

The other possible commands, on the example files, are:

#+BEGIN_SRC shell
$ lein2 run staffline example/preprocessed/furelise.png
$ lein2 run segmentation example/staffline/furelise-0.png example/preprocessed/refs.txt /tmp/segs.txt
$ lein2 run classification training-set/ example/staffline/furelise-0.png example/segmentation/segs.txt /tmp/classes.txt
$ lein2 run semantics example/classification/classes.txt example/preprocessed/refs.txt example/staffline/furelise-0.txt /tmp/furelise-0.xml
$ lein2 run generate example/semantics/furelise-0.xml /tmp/furelise-0.clj furelise
$ lein2 run play example/generate/furelise-0.clj furelise/furelise
#+END_SRC

Note: the =staffline= step stores its output in the same directory as its input.

* How it works
** OMR
The initial plan for the OMR part is documented in [[doc/design/design.pdf]]. In this section we briefly describe how it is currently implemented and discuss some results.
*** Part 1: Image Preprocessing
**** Binarization
Since the input image might be in grayscale or color, a binarization step is sometimes necessary. However, most sheet music found on the internet is already binarized, so this step is only performed when needed.

It is done in two steps:

  1. If the image is in color, convert it into grayscale using the [[http://en.wikipedia.org/wiki/Grayscale#Converting_color_to_grayscale][luma component]], implemented in [[src/overscore/preprocessing/gray.clj][preprocessing/gray.clj]].
  2. Convert the grayscale image into binary using [[http://en.wikipedia.org/wiki/Otsu%27s_method][Otsu's method]], implemented in [[src/overscore/preprocessing/otsu.clj][preprocessing/otsu.clj]].
**** Reference lengths
Once the image is binarized, the reference lengths (the staff line height and the staff space height) are found by analyzing the most common vertical runs (black runs for the staff line height, white runs for the staff space height), as described in [[RebeloFujinaga2012][Rebelo, Fujinaga et al., 2012]].
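As an illustration, this run-length analysis can be sketched as follows (a minimal sketch assuming a binarized =BufferedImage= where black pixels have RGB value 0; the function names are illustrative, the actual implementation lives under =src/overscore/preprocessing/=):

#+BEGIN_SRC clojure
(import '(java.awt.image BufferedImage))

(defn column-runs
  "Lengths of the vertical runs of the given color (:black or :white)
  in column x of a binarized image."
  [^BufferedImage img x color]
  (let [black? (fn [y] (zero? (bit-and (.getRGB img x y) 0xFFFFFF)))
        wanted? (if (= color :black) black? (complement black?))]
    (->> (range (.getHeight img))
         (partition-by wanted?)
         (filter (comp wanted? first))
         (map count))))

(defn reference-length
  "Most common vertical run length of the given color over the whole
  image: :black approximates the staff line height, :white the staff
  space height."
  [^BufferedImage img color]
  (->> (range (.getWidth img))
       (mapcat #(column-runs img % color))
       frequencies
       (apply max-key val)
       key))
#+END_SRC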

**** Improvements
Multiple steps could be added to the preprocessing part. The most useful would be skew detection and correction, since scanned documents are often skewed. Another potentially useful step would be noise reduction, depending on the quality of the scanned documents.
*** Part 2: Staff line processing
**** System isolation
The first step of this part is to isolate the systems, so that they can be processed independently. This is done with a technique similar to the one described in [[Fujinaga1988][Fujinaga, 1988]].

First, a y-projection of the pixels is computed (see an example result in [[overscore/tree/master/doc/yproj.png][doc/yproj.png]]), and the systems are located by looking for five distinct peaks of black pixels. Then, the boundaries of each system are found by looking around it for the rows where the number of black pixels is minimal.

Each system found is then saved in its own image.
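The projection used below can be sketched as follows (assuming the same binarized-image convention as in the preprocessing sketch; the real =projection= function is part of the Overscore sources):

#+BEGIN_SRC clojure
(import '(java.awt.image BufferedImage))

(defn projection
  "Number of black pixels on each row (:y) or column (:x) of a
  binarized image. Systems show up as wide peaks in the :y projection."
  [^BufferedImage img axis]
  (let [black? (fn [x y] (zero? (bit-and (.getRGB img x y) 0xFFFFFF)))
        [outer inner at] (if (= axis :y)
                           [(.getHeight img) (.getWidth img) (fn [i j] (black? j i))]
                           [(.getWidth img) (.getHeight img) (fn [i j] (black? i j))])]
    (vec (for [i (range outer)]
           (count (filter #(at i %) (range inner)))))))
#+END_SRC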

The projection image in doc/yproj.png was generated with:

#+BEGIN_SRC clojure
(def img (ImageIO/read (File. "/home/quentin/p/overscore/data/furelise.png")))
(def p (projection img :y))
(def chart
  (set-background-alpha
   (bar-chart (range (length p)) p :vertical false)
   0))
(.setVisible (.getRangeAxis (.getCategoryPlot chart)) false)
(.setVisible (.getDomainAxis (.getCategoryPlot chart)) false)
(save chart "foo.png" :width 2745 :height 3611)
#+END_SRC

and then assembled with the png of the sheet.

This step is implemented in [[src/overscore/staffline/identification.clj][staffline/identification.clj]].
**** Staff line identification and removal
Staff line removal is done using [[http://gamera.informatik.hsnr.de/index.html][Gamera]]'s [[http://music-staves.sf.net/][music-staves]] plugin. A Python script is called with the input image; it outputs the same image with the staff lines removed, as well as the staff line positions (in a text file).

If no staff lines are found in an image, the image can be discarded, since it is most likely not relevant (e.g. text, such as the title of the piece).
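Schematically, the call from Clojure to the Python script could look like this (a sketch using =clojure.java.shell=; the exact command-line interface of removal.py is an assumption here):

#+BEGIN_SRC clojure
(require '[clojure.java.shell :as shell])

(defn remove-stafflines!
  "Run the Gamera-based staff line removal script on a system image.
  The script's arguments and output conventions are assumed here,
  not taken from the actual script."
  [in-image]
  (let [res (shell/sh "python" "src/overscore/staffline/removal.py" in-image)]
    (when-not (zero? (:exit res))
      (throw (ex-info "Staff line removal failed" res)))))
#+END_SRC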

This step is implemented in [[src/overscore/staffline/removal.clj][staffline/removal.clj]], which calls the script [[src/overscore/staffline/removal.py][staffline/removal.py]].
**** Improvements
If the staff line removal does not work as expected because the image is skewed, a skew correction algorithm should be added to the [[Part 1: Image Preprocessing][preprocessing]] part.
*** Part 3: Symbol Recognition
**** Symbol Segmentation
The segmentation is done in a way similar to [[OpenOMR][OpenOMR]]. All the sub-parts of the segmentation process are assembled in [[src/recognition/segmentation/segmentation.clj][recognition/segmentation/segmentation.clj]], which takes as input a path to an image (output by the preprocessing step) and a path to a text file containing the reference lengths (also output by the preprocessing step). It outputs all the segments, represented as a vector of 4-element vectors, in a text file.
***** Level-0 Segmentation
The first segmentation pass finds runs of consecutive columns that contain black pixels (a sketch follows the list below). The results are refined by:

  1. Grouping nearby segments, which arise for example in the case of a dotted note
  2. Dropping small segments, which are probably due to noise in the scanned image
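A minimal sketch of this pass, assuming a =black-column?= predicate built from the image, with illustrative parameter names (the tuned parameters live in level0.clj):

#+BEGIN_SRC clojure
(defn level0-segments
  "Find [start end] ranges of consecutive columns containing black
  pixels, merge ranges separated by fewer than merge-gap columns,
  and drop ranges narrower than min-width columns."
  [black-column? width merge-gap min-width]
  (let [raw (->> (range width)
                 (partition-by black-column?)
                 (filter (comp black-column? first))
                 (map (fn [cols] [(first cols) (last cols)])))
        merged (reduce (fn [acc [s e]]
                         (let [[ps pe] (peek acc)]
                           (if (and pe (< (- s pe) merge-gap))
                             (conj (pop acc) [ps e])
                             (conj acc [s e]))))
                       [] raw)]
    (filter (fn [[s e]] (>= (inc (- e s)) min-width)) merged)))
#+END_SRC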

The level-0 segmentation is implemented in [[src/overscore/recognition/segmentation/level0.clj][recognition/segmentation/level0.clj]]. An example of level-0 segmentation output can be found in [[overscore/tree/master/doc/level0-segments.png][doc/level0-segments.png]].
***** Note Head Detection
For each level-0 segment, we need to know whether it contains a note head.

To detect whether a segment contains note heads, the following algorithm (taken from [[OpenOMR][OpenOMR]]) is used; a sketch follows the list:

  - For each column, find (if present) the biggest black run that is:
    - smaller than 3/2 of the staff space height
    - bigger than 2 times the staff line height
    Remember the columns where such runs are present.
  - Find segments of columns having those black runs, such that the length of the segment is at least half the staff space height. Those segments correspond to the note heads.
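A sketch of this detection, reusing the hypothetical =column-runs= helper from the preprocessing sketch above:

#+BEGIN_SRC clojure
(defn note-head-columns
  "Columns of the segment [x0 x1] whose biggest black run is bigger
  than twice the staff line height and smaller than 3/2 of the staff
  space height."
  [img [x0 x1] staffline-h staffspace-h]
  (filter (fn [x]
            (when-let [runs (seq (column-runs img x :black))]
              (let [m (apply max runs)]
                (and (> m (* 2 staffline-h))
                     (< m (* 3/2 staffspace-h))))))
          (range x0 (inc x1))))

(defn note-head-segments
  "Ranges of consecutive note-head columns that span at least half a
  staff space; these correspond to the note heads."
  [cols staffspace-h]
  (->> cols
       (reduce (fn [acc x]
                 (let [[s e] (peek acc)]
                   (if (and e (= (inc e) x))
                     (conj (pop acc) [s x])
                     (conj acc [x x]))))
               [])
       (filter (fn [[s e]] (>= (inc (- e s)) (/ staffspace-h 2))))))
#+END_SRC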

Segments containing note heads are further decomposed into multiple level-1 segments. The others can be used directly as level-1 segments, without further decomposition.

The note head detection is implemented in [[src/overscore/recognition/segmentation/notehead.clj][recognition/segmentation/notehead.clj]].
***** Level-1 Segmentation
Level-1 segmentation uses the data computed by the note head detection: for each note head found, it creates a level-1 segment. The space between the note heads is also saved as a level-1 segment.
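Schematically, this amounts to splitting the level-0 column range at the note head boundaries (an illustrative sketch):

#+BEGIN_SRC clojure
(defn level1-segments
  "Split the level-0 column range [x0 x1] into one segment per note
  head plus one segment per gap between or around them. Note heads
  are sorted, disjoint [start end] column ranges."
  [[x0 x1] heads]
  (let [gaps (->> (concat [[nil (dec x0)]] heads [[(inc x1) nil]])
                  (partition 2 1)
                  (keep (fn [[[_ e] [s _]]]
                          (when (>= (dec s) (inc e))
                            [(inc e) (dec s)]))))]
    (sort-by first (concat heads gaps))))

;; (level1-segments [10 30] [[14 18] [22 26]])
;; => ([10 13] [14 18] [19 21] [22 26] [27 30])
#+END_SRC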

Level-1 segmentation is implemented in [[src/overscore/recognition/segmentation/level1.clj][recognition/segmentation/level1.clj]], and an example of its output on level-0 segments that contain notes can be found in [[overscore/tree/master/doc/level1-segments.png][doc/level1-segments.png]].
***** Level-2 Segmentation
The level-2 segmentation separates the symbols contained in each level-1 segment vertically. The resulting segments should then correspond to individual musical features (eg. a note head, a sharp, ...) and can be classified.

Level-2 segmentation is implemented in [[src/recognition/segmentation/level2.clj][recognition/segmentation/level2.clj]].
**** Symbol recognition
Multiple symbol recognition methods are implemented. The one used by default relies on the [[https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm][k-nearest-neighbors algorithm]] provided by [[http://opencv.org/][OpenCV]], using [[http://audiveris.kenai.com/][Audiveris]]' training set.

Since Audiveris stores its training set as XML files describing the vertical runs of each image, we first need to convert it to "normal" binary PNG images (for easier manipulation). This is done in [[src/overscore/tools/audiveris.clj][tools/audiveris.clj]].
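For instance, a column described by alternating run lengths could be decoded as follows (a sketch; the actual XML layout is handled in tools/audiveris.clj, and the assumption that runs start with white is hypothetical):

#+BEGIN_SRC clojure
(defn runs->column
  "Decode a column given as alternating run lengths, assumed to start
  with a white run, into a vector of pixels (1 = black, 0 = white)."
  [run-lengths]
  (vec (mapcat (fn [color len] (repeat len color))
               (cycle [0 1])
               run-lengths)))

;; (runs->column [2 3 1]) => [0 0 1 1 1 0]
#+END_SRC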

The training set is then loaded in [[src/recognition/classification/training.clj][recognition/classification/training.clj]], each image being resized to 20x20 pixels and represented as a vector of 400 integers (1 meaning the pixel is on, i.e. black; 0 meaning it is off).
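This representation can be sketched as follows (same black-pixel convention as in the earlier sketches; the actual resizing code is in training.clj):

#+BEGIN_SRC clojure
(import '(java.awt Image)
        '(java.awt.image BufferedImage))

(defn image->vector
  "Resize an image to 20x20 pixels and flatten it into a vector of
  400 integers: 1 for black pixels, 0 for white ones."
  [^BufferedImage img]
  (let [scaled (.getScaledInstance img 20 20 Image/SCALE_SMOOTH)
        out (BufferedImage. 20 20 BufferedImage/TYPE_INT_RGB)]
    (.drawImage (.createGraphics out) scaled 0 0 nil)
    (vec (for [y (range 20), x (range 20)]
           (if (zero? (bit-and (.getRGB out x y) 0xFFFFFF)) 1 0)))))
#+END_SRC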

OpenCV's k-nearest-neighbors method is called from a C++ program, [[src/recognition/classification/opencv_knn.cpp][recognition/classification/opencv_knn.cpp]], and the resulting binary is invoked from Clojure in [[src/recognition/classification/opencv_knn.clj][recognition/classification/opencv_knn.clj]]. Once OpenCV is installed, the C++ program can be compiled with:

#+BEGIN_SRC shell
$ g++ opencv_knn.cpp -o opencv_knn `pkg-config opencv --libs --cflags`
#+END_SRC

Another kNN classifier, implemented by hand, is available in [[src/recognition/classification/knn.clj][recognition/classification/knn.clj]]; it can use either the [[http://en.wikipedia.org/wiki/Hausdorff_distance][Hausdorff distance]] or the [[https://en.wikipedia.org/wiki/Euclidean_distance][Euclidean distance]] to compare two images. The Hausdorff distance is implemented in [[src/recognition/classification/hausdorff.clj][recognition/classification/hausdorff.clj]], and the Euclidean distance in [[src/recognition/classification/euclidian.clj][recognition/classification/euclidian.clj]]. However, this implementation of the kNN algorithm is very slow, which is why OpenCV's kNN is used by default.
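For reference, the Euclidean variant of such a kNN classifier fits in a few lines (a sketch with illustrative names; the actual code is in knn.clj and euclidian.clj):

#+BEGIN_SRC clojure
(defn euclidean-distance
  "Distance between two images represented as equal-length 0/1 vectors."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn knn-classify
  "Most common class among the k training examples ([class pixel-vector]
  pairs) closest to the input vector v."
  [k training v]
  (->> training
       (sort-by (fn [[_ tv]] (euclidean-distance tv v)))
       (take k)
       (map first)
       frequencies
       (apply max-key val)
       key))
#+END_SRC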

A neural network classifier using [[http://www.heatonresearch.com/][Encog]] is also implemented, in [[src/recognition/classification/nn.clj][recognition/classification/nn.clj]].

All the parts of the classification step are assembled in [[src/recognition/classification/classification.clj][recognition/classification/classification.clj]], which takes as input the image (output by the preprocessing step) and a file describing the segments (output by the segmentation step), and outputs a file describing the class of each segment, as a vector of 5-element vectors where the first 4 elements are the segment description and the last element is the segment's class (as a symbol).
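For example, such a file might contain entries of the following shape (the interpretation of the four numbers as bounding-box coordinates, and the class names, are assumptions here):

#+BEGIN_SRC clojure
;; [x-start y-start x-end y-end class], one entry per segment
[[120 34 138 52 black_note_head]
 [140 10 146 52 stem]]
#+END_SRC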

**** Improvements
The segmentation might be improved by fine-tuning its parameters. The level-0 and level-1 segmentations work quite accurately, but the level-2 segmentation currently performs poorly.

The symbol recognition process is currently not accurate enough. This might be because a large part (around 25%) of the training set consists of black note heads. That part could be reduced, and the rest of the training set could be improved.

The kNN algorithm implemented by hand also suffers from huge performance issues.

*** TODO Part 4: Musical Semantics
The musical semantics are defined by a set of rules, given by the following LL(1) grammar:

#+BEGIN_SRC text
<P'> → ε
     | <note_body>

<note_body> → sharp
            | flat
            | natural
            | dot_set
#+END_SRC