כי קרוב אליך הדבר ("For the matter is close to you" - Devarim 30:14)
Joseph Hostyk and Alex Zaloum
Sefer Similarity Map is an exploratory tool to help uncover the substructure of books/chapters of Tanach, Talmud, and other Jewish texts, and visualize the underlying relationships between them.
These relationships can illuminate historical, authorial, linguistic, and stylistic connections between texts. The tool can also surface texts related to whatever one is currently learning or teaching.
To start measuring similarity between texts, we consider the rates of words and phrases in each individual text. Other characteristics of the texts, such as theorized date of composition or authorship, were not included in the model, allowing similarities along those attributes to emerge from the comparison of the texts themselves. This approach has been shown to work well: texts can be clustered together based merely on counts of words.
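As a minimal sketch of this counting step (assuming each text is already a plain string of whitespace-separated words; `load_text` and `sfarim` below are hypothetical placeholders, not names from our codebase):

```python
from collections import Counter

def word_rates(text):
    """Return each word's frequency as a fraction of the text's total length."""
    words = text.split()
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

# Hypothetical usage: one rate dictionary per sefer.
# rates = {sefer: word_rates(load_text(sefer)) for sefer in sfarim}
```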
Overall, this is a lot of data! There are around 40,000 unique words in Tanach, and we have to keep track of each sefer's rate of every one of them. That's already a large table when comparing just the 39 sfarim; if we switch to subsections (e.g. the 929 prakim of Tanach), it grows even larger.
Let's make those tables even bigger!
We wanted to select features that would help identify true substructure. Counts of single words are useful: only two sfarim contain the word "מדינתא" (Ezra and Nehemiah), so we would expect them to be similar to each other, and probably not so similar to other sfarim. But a “bag of words” model (pure word counts, where order doesn’t matter) can only contain so much information: "Rav Huna" and "Rav Yosef" both contain "Rav", but the two names should be treated as distinct.
We therefore expanded to phrases - or “n-grams”. (A bigram is a two-word phrase; a trigram is three.) By including the counts of these n-grams, we retain more contextual information.
We add the counts of these n-grams to our matrix, so that for each text we track both the counts of single words and those of longer phrases. Depending on the size of the texts being compared, we used n-grams of up to size 4 in the analyses below.
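A simple way to extract these n-grams (a sketch, not our exact implementation):

```python
from collections import Counter

def ngram_counts(text, max_n=4):
    """Count all n-grams of length 1 through max_n in a text."""
    words = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# The bigrams "רב הונא" and "רב יוסף" now get their own counts,
# alongside the unigram "רב" itself.
```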
This process leaves us with a giant matrix: rows are the different texts, columns are the words and phrases within the corpus, and each cell is the frequency of that term within that text. Texts that have similar counts of the same words will have similar rows in the table. If we view the table as a representation of a high-dimensional space, those texts will also be near each other in that space. Visualizing it directly, however, would be really difficult! We can’t draw a plot in a hundred thousand dimensions.
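In practice, a library like scikit-learn can build this matrix directly. A sketch with two toy "texts" (the token_pattern here simply treats any run of non-whitespace characters as a word, so Hebrew tokens pass through intact):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["בראשית ברא אלהים", "ברא אלהים את השמים"]  # one string per text

vectorizer = CountVectorizer(ngram_range=(1, 4), token_pattern=r"\S+")
X = vectorizer.fit_transform(texts)  # sparse matrix: rows = texts, columns = n-grams
print(X.shape)
print(vectorizer.get_feature_names_out())
```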
Dimensionality reduction is a technique for converting this high-dimensional data into something more manageable. We used a method called UMAP (Uniform Manifold Approximation and Projection) to reduce the dimensionality of our n-gram-count matrix so that we could visualize the final result.
Texts that were near each other in the original high-dimensional space (because they have similar rates of the same words) should also be close in their UMAP projections onto the new 2-D space. Clusters in the UMAP plots below should therefore represent texts that truly were similar in the high-dimensional space - that is, texts that use similar rates of the same words.
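With a count matrix X like the one sketched above, the reduction itself is only a few lines using the umap-learn package (the parameters here are illustrative, not necessarily the ones we used):

```python
import umap

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
embedding = reducer.fit_transform(X)  # shape: (number of texts, 2)
# Each row of `embedding` is one text's (x, y) position in the plot.
```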
Using raw counts is not ideal: longer texts will have higher values overall, which can throw off the clustering. Typically, one would normalize each term's value by its frequency in the corpus as a whole; tf-idf is a common method. We experimented with some custom scoring metrics, but did not include them below.
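For reference, the standard tf-idf normalization is one line in scikit-learn (a sketch of the typical approach, not the custom metrics we tried):

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Downweight n-grams that are common across the whole corpus, and
# L2-normalize each row so that text length no longer dominates.
X_tfidf = TfidfTransformer(norm="l2").fit_transform(X)
```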
Natural Language Processing (NLP) techniques could add more nuance to the method. For example, we don’t account for prefixes: "והיה" ("and it was") is counted as a different word from "היה" ("it was"), whereas ideally we would lemmatize them and count them as the same word. However, the sheer amount of data should largely compensate for forgoing the benefits NLP could bring to the table.
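Real lemmatization requires a morphological analyzer, but even a crude heuristic illustrates the idea: strip a leading vav when the remaining letters are themselves a known word. (This sketch will make mistakes - it is only illustrative, and is not something we did.)

```python
def strip_vav(word, vocabulary):
    """Naively merge a vav-prefixed form into its bare form, if that form is known."""
    if word.startswith("ו") and word[1:] in vocabulary:
        return word[1:]
    return word

vocab = {"היה", "אמר"}
print(strip_vav("והיה", vocab))  # prints: היה
```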
Similarly, we did not remove extremely common words, which makes comparisons of highly repetitive texts difficult. When we ran our method on amudim of the Talmud, we produced two large clusters instead of many well-differentiated ones; we believe this is because most amudim share so many common words that the shared vocabulary overwhelms their distinctive words.
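Had we wanted to drop these, one option would have been scikit-learn's max_df cutoff, which ignores any term appearing in more than a given fraction of the texts (a sketch; the 0.9 threshold is arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Ignore any n-gram that appears in more than 90% of the texts.
vectorizer = CountVectorizer(ngram_range=(1, 4), token_pattern=r"\S+", max_df=0.9)
```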
Runtime is driven by the number of texts selected for comparison. The number of features does not affect UMAP’s runtime significantly: a corpus of 200,000 words does not run much slower than one of 50,000 words. However, a comparison of 40 texts runs extremely quickly, 900 texts take a few minutes, and the 5,350 amudim of the Talmud took far longer.
Our code is below, with interactive results at the bottom. We ran some analyses comparing full sfarim, and some on subsections (e.g. prakim of Tanach, amudim of Talmud).
By mousing over different dots, you can see the texts in the comparison. You can also use the search box above each graph to highlight specific texts within it.
We included some observations with each comparison. Let us know if you find any other interesting connections!
We can be reached at jhostyk [at] gmail.com and abzaloum [at] gmail.com.
Please reach out with any questions, or suggestions/requests for additional plots!
Our code is available on our GitHub.
This plot shows every individual book of Sefaria.
You can search for sfarim by title in the box below!
This comprehensive plot shows distinct clusters, identifiable by author, topic, or language.
In order to zoom in on different sections, we created specialized plots below. For some, we further broke down the texts (e.g. into prakim or amudim) to get a more fine-grained view.
This plot shows an interesting result: most pages of the Talmud are not that differentiable from each other! Most of the amudim do not form isolated clusters, other than those of Brachot + Shabbos, and Nedarim + Niddah + Sotah.
We believe this is because of the repetition of common words/phrases in the Talmud.
Nevertheless, even among the large central cluster, we can still see general groupings of different masechtot and topics.
We believe this can be extremely useful as a pedagogical aid! If you've begun learning a certain page of gemara, you can search for it here, and then pan over nearby amudim to find ones that have similar vocabulary/word usage.
Analyzed by Jeremy Tibbetts.
The Tanach and Talmud graphs compare texts that share many words and topics, so while those maps show general patterns, their points do not form distinct clusters. This graph of Kabbalah texts shows the strength of the method when comparing wildly varied texts: the Zohar/Tikkunei Zohar and Maggid Meisharim texts (all in Aramaic) sit completely on their own to the side! The central cluster shows texts that are more similar to one another, and that consequently form one connected cluster. We can still see subclusters within it: