ParallelDots Research & Development

Testing the limits of unsupervised learning for semantic similarity.

Distributed representations (embeddings) of sentences are important tools in any data scientist's NLP arsenal. They can be used for many tasks, ranging from text classification and recommendation to multimodal learning. One task to which embeddings can be applied is semantic similarity. Embeddings can be trained for semantic similarity in different ways: you can train supervised embeddings on small semantic similarity datasets, or use embeddings trained on larger NLI datasets, like the state-of-the-art InferSent (https://github.com/facebookresearch/InferSent). These two methods have one thing in common: they are trained using human tags, not the grammatical structure itself.

In this paper we analyze how unsupervised autoencoders can be used to measure the semantic similarity of two sentences given no human feedback, just the structure of grammar. We reach a respectable correlation of 0.34, against 0.68 for supervised models.
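
To make the evaluation concrete, here is a minimal sketch of how such a correlation could be computed. It rests on assumptions not stated above: that the similarity of two sentence embeddings is taken as their cosine similarity, and that the reported figure is a Pearson correlation against human similarity ratings. The helper names (`cosine_similarity`, `pearson`), the toy vectors, and the toy ratings are hypothetical and stand in for real autoencoder outputs and a real dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: each pair holds the (hypothetical) embeddings of two sentences.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-paraphrases
    ([1.0, 0.0], [0.5, 0.5]),  # loosely related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
human_scores = [4.8, 2.5, 0.2]  # hypothetical human similarity ratings

predicted = [cosine_similarity(u, v) for u, v in pairs]
correlation = pearson(predicted, human_scores)
print(f"correlation with human ratings: {correlation:.2f}")
```

In this setup the embedding model never sees the human ratings; they are used only to score the predicted similarities, which is what makes the comparison between unsupervised and supervised models meaningful.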