CLIENT GOALS

Our client, a member of many word-appreciation communities, saw an opportunity to create a tool for exploring etymologies in a graphic interface. The problem was that etymology data existed only as paragraphs of text, which are not machine-readable.

We devised a plan to unlock this information by using Natural Language Processing (NLP) to extract data and relationships from open-source repositories of etymologies, such as Wiktionary. From there we would craft a mobile app and web app to provide the visual representation of that data to users.

“ETYMOLOGY” DEFINED

Etymology is a description of how a word has evolved from ancient languages into its present-day form. It is a dive into the culture, history, and meaning of a word.

Etymologies are usually written in natural language as a paragraph, which makes them unreadable by machines.


EtymologyExplorer is an application for Android, iOS, and web (beta) that automatically constructs graphical representations of more than 1M words across all languages and serves 5k monthly active users. It provides several views for exploring how words are related and how they evolved. Other features include a freemium revenue model with gated access, a feedback system, an admin panel, and daily notifications.

 

Natural Language Processing

(Project Highlight)


To create the database, our team gathered 6M pages of raw Wiktionary content from available data dumps and web scraping. We then applied several deep-learning techniques to the extracted etymologies to uncover inter-word connections. We expected Named Entity Recognition (NER) to extract connections best, but experiments showed that Sentiment Analysis produced better results. Sentiment Analysis typically determines the emotional tone of text (e.g. whether a review is positive or negative). In this case, it determined whether a sentence contributed a new connection to the database. The data was then scrubbed and stored in AWS Relational Database Service (RDS).
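The idea of repurposing a sentence classifier can be illustrated with a minimal sketch. It is not the project's model: a tiny naive Bayes classifier stands in for the deep-learning Sentiment Analysis model, and the training sentences, labels, and function names are all hypothetical. It only shows the shape of the task, which is labeling each sentence as either stating a connection (1) or not (0).

```python
import math
import re
from collections import Counter

# Hypothetical hand-labeled sentences (assumption, not project data):
# 1 = the sentence states a word-to-word connection, 0 = it does not.
TRAINING = [
    ("From Middle English animal, borrowed from Latin animal.", 1),
    ("Borrowed from Old French beste, from Latin bestia.", 1),
    ("Inherited from Proto-Germanic *windaz.", 1),
    ("Derived from Ancient Greek anemos.", 1),
    ("The spelling was standardized in the eighteenth century.", 0),
    ("This sense is now considered archaic.", 0),
    ("Compare the usage in modern poetry.", 0),
    ("The word appears frequently in legal texts.", 0),
]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def train(examples):
    """Count tokens per label for a naive Bayes model (a stand-in for the real model)."""
    word_counts = {0: Counter(), 1: Counter()}
    label_counts = Counter()
    for sentence, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, label_counts, vocab


def predict(model, sentence):
    """Return 1 if the sentence likely contributes a connection, else 0."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(sentence):
            if token in vocab:  # ignore tokens never seen in training
                # Laplace smoothing avoids log(0) for tokens unseen in this label
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(predict(model, "Borrowed from Latin ventus."))
print(predict(model, "The spelling is archaic in modern texts."))
```

Sentences classified as 1 would then be passed on for extraction of the actual word pair, while the rest are skipped.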

Dimensionality Reduction

(Project Highlight)


In the related-word visualization, users had difficulty understanding the content because of its sheer quantity (as many as 10k words). Our team noticed this during testing and devised a way to reduce the dimensionality of the results.

 

Among the thousands of words in a family (those sharing a root word), most share one of a few meanings. For example, words related to “animation” (such as “animal” and “anemone”) mean either “soul”, “breath”, or “wind”. By using sentence vectors, our team clustered words into groups with common definitions, which the app offers as filters. This allows users to grok the various meanings that descend from an ancestor word.
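The clustering step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: a bag-of-words count vector stands in for a real sentence vector, the glosses are hypothetical, and the greedy similarity-threshold grouping (with an arbitrary threshold of 0.3) is only one of several ways to cluster vectors.

```python
import math
import re
from collections import Counter

# Hypothetical short glosses for words in one family (assumption, not project data).
GLOSSES = {
    "animal": "a living creature with breath",
    "anemone": "flower of the wind",
    "animism": "belief that things have a soul",
    "inanimate": "lacking breath",
    "anemometer": "an instrument measuring wind speed",
    "unanimous": "of one soul or mind",
}

STOPWORDS = {"a", "an", "the", "of", "or", "that", "with", "have"}


def embed(text):
    """Bag-of-words stand-in for a sentence vector (assumption, not the real model)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(glosses, threshold=0.3):
    """Greedily add each word to its most similar cluster, or start a new one."""
    clusters = []  # each entry: {"words": [...], "vector": summed Counter}
    for word, gloss in glosses.items():
        vec = embed(gloss)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["vector"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"words": [word], "vector": vec})
        else:
            best["words"].append(word)
            best["vector"] = best["vector"] + vec  # update the cluster's summed vector
    return [c["words"] for c in clusters]


for group in cluster(GLOSSES):
    print(group)
```

With these glosses the words fall into three groups, corresponding to "breath", "wind", and "soul". Each group could then back one filter option in the visualization.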

 

Conclusion


Mountain Dev was able to overcome significant technical challenges to produce a novel result and deliver the Etymos mobile app to word lovers all over the world. The mobile app was released in 2020 and now has over 5k monthly active users! The web app is in beta as of 2021 and is planned to be released in early 2022.

The team learned a great deal about etymologies and NLP and was able to demonstrate its data-analytics and machine-learning prowess. We are excited to implement similar projects in the future!

Access the app at https://app.etymologyexplorer.com