Motivic pattern classification of music audio signals combining residual and LSTM networks

  1. Aitor Arronte Alvarez, Universidad Politécnica de Madrid, Madrid, Spain (ROR: https://ror.org/03n6nwv02)
  2. Francisco Gómez, University of Hawaii at Manoa, Honolulu, United States (ROR: https://ror.org/01wspgy28)

Journal: IJIMAI

ISSN: 1989-1660

Year of publication: 2021

Volume: 6

Issue: 6

Pages: 208-214

Type: Article

DOI: 10.9781/IJIMAI.2021.01.003

Abstract

Motivic pattern classification from music audio recordings is a challenging task, all the more so for a cappella flamenco cantes, which are characterized by complex melodic variations, pitch instability, timbre changes, extreme vibrato oscillations, microtonal ornamentations, and noisy recording conditions. Convolutional Neural Networks (CNNs) have proven to be very effective in image classification, and recent work in large-scale audio classification has shown that CNN architectures originally developed for image problems can be applied successfully to audio event recognition and classification with little or no modification to the networks. In this paper, CNN architectures are tested on a more nuanced problem: intra-style classification of flamenco cantes using small motivic patterns. A new architecture is proposed that combines the advantages of residual CNNs as feature extractors with a bidirectional LSTM layer that exploits the sequential nature of musical audio data. We present a full end-to-end pipeline for audio music classification that includes a sequential pattern mining technique and a contour simplification method to extract relevant motifs from audio recordings. Mel-spectrograms of the extracted motifs then serve as input to the different architectures tested. We investigate the usefulness of motivic patterns for the automatic classification of music recordings, as well as the effect of audio length and corpus size on overall classification accuracy. Results show a relative accuracy improvement of up to 20.4% when CNN architectures are trained on acoustic representations of motivic patterns.
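
The abstract describes the model only at a high level. As a rough illustration of the general idea (log-mel spectrograms of extracted motifs as input, a residual CNN as feature extractor, and a bidirectional LSTM over the time axis), the following Python sketch may be helpful. It is not the authors' implementation: the file name motif.wav and all hyperparameters (n_mels, channel count, hidden size, number of classes) are placeholder assumptions, not values taken from the paper.

```python
# A minimal sketch, assuming placeholder hyperparameters (not the paper's):
# mel-spectrogram input -> residual CNN -> bidirectional LSTM -> class logits.
import librosa
import numpy as np
import torch
import torch.nn as nn

def motif_to_melspec(path, sr=22050, n_mels=128):
    """Load an extracted motif and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # shape: (n_mels, time)

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, in the style of ResNet."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (skip) connection

class ResBiLSTMClassifier(nn.Module):
    """Residual CNN over the spectrogram, then a BiLSTM along the time axis."""
    def __init__(self, n_mels=128, n_classes=4, channels=32, hidden=64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.res = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.pool = nn.MaxPool2d((2, 1))  # halve frequency, keep all time frames
        self.lstm = nn.LSTM(channels * (n_mels // 2), hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        h = self.pool(self.res(self.stem(x)))  # (batch, C, n_mels // 2, time)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time, C * n_mels // 2)
        out, _ = self.lstm(h)                  # (batch, time, 2 * hidden)
        return self.fc(out[:, -1])             # class logits

# Usage with a dummy batch of ten ~3-second motifs (about 130 frames at hop 512):
model = ResBiLSTMClassifier()
logits = model(torch.randn(10, 1, 128, 130))
```

In this sketch, pooling reduces only the frequency axis so that each spectrogram frame remains a separate time step for the LSTM, which mirrors the sequential treatment of the audio that the abstract describes.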

References

  • Dannenberg, R. B., and Hu, N. “Pattern discovery techniques for music audio,” Journal of New Music Research, vol. 32, no. 2, pp. 153-163, 2003.
  • Pikrakis, A., Gómez, F., Oramas, S., Díaz-Báñez, J. M., Mora, J., Escobar-Borrego, F., Gómez, E., and Salamon, J. “Tracking Melodic Patterns in Flamenco Singing by Analyzing Polyphonic Music Recordings,” in International Society for Music Information Retrieval Conference, ISMIR, Porto, Portugal, 2012, pp. 421-426.
  • Gulati, S., Serrà, J., Ishwar, V., and Serra, X. “Mining melodic patterns in large audio collections of Indian art music,” in 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech, Morocco, 2014, pp. 264-271.
  • Volk, A., de Haas, W. B., and van Kranenburg, P. “Towards modelling variation in music as foundation for similarity,” in Proceedings of the 12th International Conference on Music Perception and Cognition and the 8th Triennial Conference of the European Society for the Cognitive Sciences of Music, Thessaloniki, Greece, 2012, pp. 1085-1094.
  • Mora, J., Gómez Martín, F., Gómez, E., Escobar-Borrego, F. J., and Díaz-Báñez, J. M. “Characterization and melodic similarity of a cappella flamenco cantes,” in International Society for Music Information Retrieval Conference, ISMIR, Utrecht, The Netherlands, 2010, pp. 9-13.
  • Kroher, N., and Díaz-Báñez, J. M. “Audio-based melody categorization: Exploring signal representations and evaluation strategies,” Computer Music Journal, vol. 41, no. 4, pp. 64-82, 2018.
  • Kroher, N., and Díaz-Báñez, J. M. “Modelling melodic variation and extracting melodic templates from flamenco singing performances,” Journal of Mathematics and Music, vol. 13, no. 2, pp. 150-170, 2019.
  • Choi, K., Fazekas, G., and Sandler, M. “Automatic tagging using deep convolutional neural networks,” arXiv preprint arXiv:1606.00298, 2016.
  • Kim, T., Lee, J., and Nam, J. “Sample-level CNN architectures for music auto-tagging using raw waveforms,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018, pp. 366-370.
  • Dieleman, S., and Schrauwen, B. “End-to-end learning for music audio,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 6964-6968.
  • Choi, K., Fazekas, G., and Sandler, M. “Transfer learning for music classification and regression tasks,” arXiv preprint arXiv:1703.09179, 2017.
  • Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., and Slaney, M. “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 131-135.
  • Durand, S., Bello, J. P., David, B., and Richard, G. “Downbeat tracking with multiple features and deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015, pp. 409-413.
  • Schlüter, J., and Böck, S. “Improved musical onset detection with convolutional neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 6979-6983.
  • Corbera, F., and Serra, X. “Tempo estimation for music loops and a simple confidence measure,” in Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, New York, USA, 2016, pp. 269-275.
  • Korzeniowski, F., and Widmer, G. “A fully convolutional deep auditory model for musical chord recognition,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Salerno, Italy, 2016, pp. 1-6.
  • Nam, J., Choi, K., Lee, J., Chou, S., and Yang, Y. “Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 41-51, 2019.
  • Murthy, Y., Jeshventh, T. K. R., Zoeb, M., Saumyadip, M., and Shashidhar, G. K. “Singer identification from smaller snippets of audio clips using acoustic features and DNNs,” in 2018 Eleventh International Conference on Contemporary Computing (IC3), Noida, India, 2018, pp. 1-6.
  • Gómez, F., Díaz-Báñez, J. M., Gómez, E., and Mora, J. “Flamenco music and its computational study,” in Mathematical Music Theory: Algebraic, Geometric, Combinatorial, Topological and Applied Approaches to Understanding Musical Phenomena, World Scientific Publishing, Singapore, ch. 8, pp. 303-315.
  • Kroher, N., Díaz-Báñez, J. M., Mora, J., and Gómez, E. “Corpus COFLA: A research corpus for the computational study of flamenco music,” Journal on Computing and Cultural Heritage (JOCCH), vol. 9, no. 2, pp. 1-21, 2016.
  • Serra, X. “Creating research corpora for the computational study of music: The case of the CompMusic project,” in Audio Engineering Society Conference: 53rd International Conference: Semantic Audio, London, UK, 2014, article 1-1, 9 pp.
  • Mora, J., Gómez, F., Gómez, E., and Díaz-Báñez, J. M. “Melodic contour and mid-level global features applied to the analysis of flamenco cantes,” Journal of New Music Research, vol. 45, no. 2, pp. 145-159, 2016.
  • Wang, J., and Han, J. “BIDE: Efficient mining of frequent closed sequences,” in Proceedings of the 20th International Conference on Data Engineering, Boston, MA, USA, 2004, pp. 79-90.
  • Salamon, J., Gómez, E., and Bonada, J. “Sinusoid extraction and salience function design for predominant melody estimation,” in International Conference on Digital Audio Effects, Paris, France, 2011, pp. 73-80.
  • Díaz-Báñez, J. M., and Mesa, A. “Fitting rectilinear polygonal curves to a set of points in the plane,” European Journal of Operational Research, vol. 130, no. 1, pp. 214-222, 2001.
  • Choi, K., Fazekas, G., Sandler, M., and Cho, K. “Convolutional recurrent neural networks for music classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 2017, pp. 2392-2396.
  • He, K., Zhang, X., Ren, S., and Sun, J. “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
  • Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 2017, pp. 4700-4708.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 2818-2826.
  • Pons Puig, J., Nieto, O., Prockup, M., Schmidt, E. M., Ehmann, A. F., and Serra, X. “End-to-end learning for music audio tagging at scale,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 2018, pp. 637-644.
  • Graves, A., and Schmidhuber, J. “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol.18, no. 5-6, pp. 602-610, 2005.
  • Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., and Le, Q. V. “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
  • Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 2015, pp. 3586-3589.
  • McFee, B., Humphrey, E. J., and Bello, J. P. “A software framework for musical data augmentation,” in 16th International Society for Music Information Retrieval Conference, Malaga, Spain, 2015, pp. 248-254.
  • Kingma, D. P., and Ba, J. “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.