SeferSimilarityMap

Sefer Similarity Map:

Visualizing the landscape of Jewish texts.

כי קרוב אליך הדבר ("For the matter is near to you")

Joseph Hostyk and Alex Zaloum

Sefer Similarity Map is an exploratory tool to help uncover the substructure of books/chapters of Tanach, Talmud, and other Jewish texts, and visualize the underlying relationships between them.
These relationships can illuminate historical, authorial, linguistic, and stylistic connections between texts. The tool can also pick out related texts to any that one is currently learning or teaching.

Introduction

To start measuring similarity between texts, we consider the rates of words and phrases in each individual text. Other characteristics of the texts, such as theorized date of composition or author, were not included in the model, allowing the similarities along these attributes to emerge from the comparison of the texts themselves. This has been shown to work well, allowing texts to be clustered together merely based on counts of words.

Overall, this is a lot of data! There are around 40,000 unique words in Tanach. We have to keep track of each sefer’s rates of each of those words! That’s big enough if we just use 39 sfarim. If we switch to using subsections (e.g. the 929 prakim of Tanach), our table grows even larger.
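As a rough illustration of the table described above, here is a minimal sketch in Python that turns each text into a row of word rates. The function names (`word_rates`, `build_rate_table`) and the toy English texts are ours, made up for illustration; they are not taken from the project's code.

```python
from collections import Counter


def word_rates(text):
    """Return each word's share of the text (its count / total words)."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


def build_rate_table(texts):
    """Build (vocabulary, rows): one row per text, one column per word."""
    rates = {name: word_rates(body) for name, body in texts.items()}
    # The columns are every word seen anywhere in the corpus.
    vocabulary = sorted({word for r in rates.values() for word in r})
    rows = {
        name: [r.get(word, 0.0) for word in vocabulary]
        for name, r in rates.items()
    }
    return vocabulary, rows


# Toy stand-ins for real sfarim (hypothetical content).
texts = {
    "text_a": "the king spoke to the people",
    "text_b": "the people spoke",
}
vocabulary, rows = build_rate_table(texts)
```

With around 40,000 columns and 929 rows, the same structure is what makes the Tanach table so large; a sparse representation would be the natural choice at that size.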

n-grams

Let's make those tables even bigger!
We wanted to select features that would be helpful in identifying true substructure. The counts of single words are useful: only two sfarim have the word "מדינתא" (Ezra and Nehemiah), and we would expect them to be similar to each other, and probably not so similar to other sfarim. But a “bag of words” model (pure word counts, where order doesn’t matter) can only contain so much information. "Rav Huna" and "Rav Yosef" both contain "Rav", but should be treated as distinct.
We therefore expanded to phrases - or “n-grams”. (A bigram is a two-word phrase; a trigram is three.) By including the counts of these n-grams, we retain more contextual information.

We can add the counts of these n-grams to our matrix, so that for each text, we have the counts of both single words and longer phrases. Below, depending on the size of the texts we were comparing, we run on n-grams of up to size 4.
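The n-gram step can be sketched as follows. This is our own minimal version, not the project's code: the helper names are hypothetical, and the rule of dropping multi-word n-grams that appear in only one text (while keeping all single words) is inferred from the "Removed ... that only appeared in one text" notes in the Details lists further down.

```python
from collections import Counter


def ngrams(words, n):
    """Yield every run of n consecutive words as one space-joined phrase."""
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_counts(text, max_n):
    """Count all n-grams of size 1..max_n in a text."""
    words = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(words, n))
    return counts


def drop_singletons(counts_by_text):
    """Drop multi-word n-grams found in only one text; keep all single words."""
    doc_freq = Counter()
    for counts in counts_by_text.values():
        doc_freq.update(counts.keys())
    return {
        name: Counter({
            gram: c for gram, c in counts.items()
            if " " not in gram or doc_freq[gram] > 1
        })
        for name, counts in counts_by_text.items()
    }


# Toy transliterated texts (hypothetical content).
texts = {
    "a": "rav huna said",
    "b": "rav yosef said",
    "c": "rav huna asked",
}
counts_by_text = {name: ngram_counts(body, max_n=2) for name, body in texts.items()}
filtered = drop_singletons(counts_by_text)
```

Here the bigram "rav huna" survives because it occurs in two texts, while "rav yosef" is dropped; the single words "huna" and "yosef" are still counted separately from "rav".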

UMAP

This process leaves us with a giant matrix, with rows being the different texts, columns being the words and phrases within the corpus, and each cell being the frequency of that word or phrase within that text. Texts that have similar counts of the same words will have similar rows in the table. If we view the table as a representation of a high-dimensional space, then those texts would also be near each other in this space. However, visualizing this would be really difficult! We can’t draw a plot in 100-thousand dimensions.
Dimensionality reduction is a technique to convert this high-dimensional data into something more manageable for us. We used a method called UMAP to reduce the dimensionality of our ngram-count matrix so that we could visualize the final result.
Texts that were near each other in the original high-dimensional space (because they have similar rates of the same words) should also be close in their UMAP projections into this new 2-D space. This means that texts that group together in the UMAP space (shown in our plots below), should truly have been similar in the high-dimensional space, meaning they should have similar values in the matrix, and should use similar rates of the same words. Clusters in our plots below should therefore truly represent similar texts!
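To make "near each other in the high-dimensional space" concrete, the sketch below measures distances between rows directly and then shows how the projection could be requested from the umap-learn package. It is an illustration under assumptions: the function names and toy rows are ours, Euclidean distance is umap-learn's default metric rather than a confirmed choice of this project, and `project_to_2d` imports the third-party `umap` and `numpy` packages only when it is called.

```python
import math


def euclidean(row_a, row_b):
    """Straight-line distance between two rows of the count matrix."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def nearest_text(name, rows):
    """Return the other text whose row is closest to `name` in the full space."""
    return min(
        (other for other in rows if other != name),
        key=lambda other: euclidean(rows[name], rows[other]),
    )


def project_to_2d(rows, n_neighbors=15, seed=42):
    """Project rows to 2-D with umap-learn (third-party, imported lazily)."""
    import numpy as np
    import umap

    names = list(rows)
    matrix = np.array([rows[n] for n in names], dtype=float)
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors, random_state=seed)
    return dict(zip(names, reducer.fit_transform(matrix).tolist()))


# Toy count rows (hypothetical values): text_a and text_b share vocabulary.
rows = {
    "text_a": [5, 0, 2, 0],
    "text_b": [4, 1, 2, 0],
    "text_c": [0, 6, 0, 3],
}
closest_to_a = nearest_text("text_a", rows)
```

The pure-Python part runs as is. `project_to_2d` is meant for realistically sized corpora, and `n_neighbors` should be smaller than the number of texts.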

Limitations:

Using raw counts is not ideal - longer texts will have higher values overall, which can throw off the clustering. Typically, you would normalize the values based on their frequency in the corpus as a whole. tf-idf is a common method; we played around with some custom scoring metrics, but did not include them below.
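For reference, a textbook tf-idf weighting looks roughly like the sketch below. This is the standard formula, not the custom scoring metrics mentioned above (those are not shown in this document); the function name and toy counts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(counts_by_text):
    """Weight each term by (share of the text) * log(n_texts / texts containing it)."""
    n_texts = len(counts_by_text)
    doc_freq = Counter()
    for counts in counts_by_text.values():
        doc_freq.update(counts.keys())

    weighted = {}
    for name, counts in counts_by_text.items():
        total = sum(counts.values())
        weighted[name] = {
            term: (count / total) * math.log(n_texts / doc_freq[term])
            for term, count in counts.items()
        }
    return weighted


# A long and a short text (hypothetical counts).
counts_by_text = {
    "long": Counter({"the": 40, "king": 10}),
    "short": Counter({"the": 4, "altar": 1}),
}
weights = tf_idf(counts_by_text)
```

Dividing by the text's total removes the advantage of longer texts, and a term that appears in every text (here "the") gets weight zero, so it no longer dominates the comparison.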
Natural Language Processing (NLP) techniques can add more nuance to the method. For example, we don’t account for prefixes: "והיה" is counted differently than "היה", while ideally we should lemmatize them and count them as the same word. However, the vast amount of data should compensate for missing out on the benefits NLP could bring to the table.
Similarly, we did not remove extremely common words. This makes comparisons of highly-repetitive texts difficult. When we ran our method on amudim in the Talmud, we produced two large clusters, instead of many highly differentiable ones. We believe this is because most amudim share a lot of similar words, which overwhelms their unique ones.
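One common way to address this, which we did not apply, is to drop terms that occur in nearly every text before building the matrix. The sketch below is a hypothetical illustration: the function name, the `max_share` threshold of 0.9, and the transliterated toy counts are all assumptions.

```python
from collections import Counter


def drop_common_terms(counts_by_text, max_share=0.9):
    """Remove terms that appear in more than `max_share` of the texts."""
    n_texts = len(counts_by_text)
    doc_freq = Counter()
    for counts in counts_by_text.values():
        doc_freq.update(counts.keys())
    too_common = {
        term for term, df in doc_freq.items() if df / n_texts > max_share
    }
    return {
        name: Counter({t: c for t, c in counts.items() if t not in too_common})
        for name, counts in counts_by_text.items()
    }


# Toy amudim (hypothetical counts): "amar" appears on every page.
pages = {
    "page_1": Counter({"amar": 9, "rav": 7, "shabbat": 2}),
    "page_2": Counter({"amar": 8, "rav": 5, "neder": 3}),
    "page_3": Counter({"amar": 6, "sotah": 2}),
}
trimmed = drop_common_terms(pages)
```

After trimming, each page is described by the words that distinguish it, which is the information the clustering needs.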
Runtime depends mainly on the number of texts selected for comparison, not on the number of features: a corpus of 200,000 words does not run much slower than one of 50,000 words. However, a comparison of 40 texts runs extremely quickly, 900 texts takes a few minutes, and the 5,350 amudim of the Talmud took very long to run.

Results/Usage

Our code is below, with interactive results at the bottom. We ran some analyses comparing full sfarim, and some on subsections (e.g. prakim of Tanach, amudim of Talmud).
By mousing over different dots, you can see the texts in the comparison. You can also search above each graph to highlight specific texts within it.

We included some observations with each comparison. Let us know if you find any other interesting connections!

Contact

We can be reached at jhostyk [at] gmail.com and abzaloum [at] gmail.com.
Please reach out with any questions, or suggestions/requests for additional plots!

Our code is available on our GitHub.

Results

All Books of Sefaria

This plot shows every individual book of Sefaria.
You can search for sfarim by title in the box below!

Details:
4338 texts.
1923441 unique n-grams of size 1.

Interpretation

This comprehensive plot shows distinct clusters, identifiable by author, topic, or language.

  • For example, all the Targum texts, in Aramaic, are together on the left.
  • Authors whose works are collected into multiple sfarim (e.g. Rashash, Hon Ashir) comprise their own clusters.
  • Almost all of Tanach clusters together (search for Amos). The rishonim who comment upon them cluster together nearby.
  • The cluster in the middle demonstrates less differentiable texts. Instead of one author who uses unique language, we have texts surrounded by commentaries upon them (e.g. Mishneh Torah broken into sections, with subcommentaries upon them; by searching for "Nazir", we can see the Mishna surrounded by texts on the topic).

In order to zoom in on different sections, we created specialized plots below. For some, we further broke down the texts (e.g. into prakim or amudim) in order to get a more fine-grained view of the texts.

929 Prakim of Tanach

Details:
929 prakim.
39995 unique n-grams of size 1.
28101 n-grams of size 2. (Removed 156551 that only appeared in one text.)
15003 n-grams of size 3. (Removed 215330 that only appeared in one text.)

Interpretation:

  • Prakim within individual sfarim are usually close to each other.
  • However, some are found in separate clusters. For example, certain chapters of Ezra/Daniel cluster on their own. These are the only Aramaic chapters in Tanach, so they don't share any words with other chapters, and cluster separately!
  • Ezra and Nehemiah are near each other, because of their duplicated passages.
  • The 5 books of the Torah are all mostly near each other, with each comprising a mostly distinct cluster.
  • Job 1, 2, and 42, which set up the narrative structure of the sefer and are distinct from the other chapters, are found in a different cluster than the other chapters.

Shulchan Aruch and Commentary

Details:
66 items.
383227 unique n-grams of size 1.

Interpretation:

  • Each of the four sections of the Shulchan Aruch is in a separate, primary cluster.
  • Each of those sections is surrounded by its commentaries.
  • The Kaf HaChaim is near the Mishnah Berurah. According to Wikipedia: "Kaf HaChaim is often compared to the Mishna Berura in terms of scope and approach, but differs in its more extensive reliance upon quotations."
  • The Dagul Mervava is next to the Yad Efraim. The two were known to be in correspondence!
  • Not far off from these two are the Eshel Avraham and Shaarei Teshuvah, whose authors lived in the same time and place (Poland/Galicia, end of the 18th/beginning of the 19th century).
  • The Magen Avraham on Orach Chaim is near three of its subcommentators: Machatzit HaShekel, Levushei Serad, and Rabbi Akiva Eiger. The latter is also near Turei Zahav (the Taz), upon which Rabbi Akiva Eiger wrote commentary.
  • The Be’ur HaGra and Be’er HaGolah are both terse commentaries whose main purpose is to provide cross-references, hence their separation from the rest of the supercommentaries and their proximity to one another.

Sfarim of Tanach + Apocrypha

Details:
53 items.
55534 unique n-grams of size 1.

Interpretation

  • Here, we ran on the full sfarim, without breaking them down into prakim. Many of the apocryphal texts are separate, at the top.
  • The Wisdom of Solomon is near Ecclesiastes. This is pretty interesting, because the Wisdom of Solomon was most likely originally written in Greek, and wasn't written by Solomon. Yet the Hebrew translation (written in 1885) from which we're working, is still rather similar to the Hebrew Ecclesiastes - a testament to the translator's skill.
  • Ben Sira can be found near Proverbs - both Wisdom literature and known to be very similar.
  • Ezra, Nehemiah, and Daniel cluster together (on the left), as do most of the Major and Minor Prophets (on the bottom right).

Amudim of the Talmud

Details:
5350 amudim.
112522 unique n-grams of size 1.
180521 n-grams of size 2. (Removed 572684 that only appeared in one text.)
164825 n-grams of size 3. (Removed 1002596 that only appeared in one text.)
123580 n-grams of size 4. (Removed 1198009 that only appeared in one text.)

Interpretation:

This plot shows an interesting result: most pages of the Talmud are not that differentiable from each other! Most of the amudim do not form isolated clusters, other than those of Brachot + Shabbos, and Nedarim + Niddah + Sotah.
We believe this is because of the repetition of common words/phrases in the Talmud.
Nevertheless, even among the large central cluster, we can still see general groupings of different masechtot and topics.

We believe this can be extremely useful as a pedagogical aid! If you've begun learning a certain page of gemara, you can search for it here, and then pan over nearby amudim to find ones that have similar vocabulary/word usage.
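The same lookup can be done outside the interactive plot. The sketch below assumes you already have each amud's 2-D position from the projection; the function name and the coordinates are made up for illustration and are not results from our plots.

```python
import math


def nearby_texts(name, coords, k=3):
    """Return the k texts closest to `name` in the 2-D projection."""
    origin = coords[name]
    others = [other for other in coords if other != name]
    others.sort(key=lambda other: math.dist(origin, coords[other]))
    return others[:k]


# Hypothetical 2-D coordinates, not taken from the actual plot.
coords = {
    "Brachot 2a": (0.0, 0.0),
    "Brachot 2b": (0.2, 0.1),
    "Shabbos 2a": (0.5, 0.4),
    "Nedarim 2a": (5.0, 5.0),
}
suggestions = nearby_texts("Brachot 2a", coords, k=2)
```

Since the projection only approximates the original distances, suggestions found this way are best treated as starting points to check against the texts themselves.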

Kabbalah

Details:
37 items.
250070 unique n-grams of size 1.

Interpretation

Analyzed by Jeremy Tibbetts.

The Tanach and Talmud graphs compare texts that share many words and topics, and so while those maps can show general patterns, the points on the map did not form distinct clusters. This graph of Kabbalah texts shows the strength of the method when comparing wildly varied texts. The Zohar/Tikkunei Zohar and Maggid Meisharim texts (all in Aramaic) are completely on their own to the side! The central cluster shows texts that are more similar, and that consequently form one connected cluster. We can still see subclusters within it:

  • Within that top-left cluster: the leftmost texts are Lurianic, the middle are Cordoverian, and the right-most are pre-Cordoverian.
  • The Etz Chayim and the Pri Etz Chayim (16th century) are clustered with older texts, such as the Avodat HaKodesh (13th century).
  • Milchamot Hashem ("The Wars of God") is by Gersonides (the Ralbag).
  • Avodat HaKodesh is considered one of the first “comprehensive” Kabbalistic works. Sefer HaKana and Pri Etz Hadar are both earlier works it would’ve drawn on.
  • The Gra’s work is in that cluster. His works are mainly commentaries that bridge the two time periods.
  • The top-left of the cluster is all the Sefer Yetzirah content: the texts themselves, and their commentaries.
  • Chesed L’Avraham is bundled very closely with the Lurianic works on the left. So is the Ramchal Asara Perakim, a summary of Lurianic Kabbalah.
  • Interestingly, Shaarei Kedusha was written by the Arizal's (Luria’s) primary student but is much closer to the earlier texts in the plot.
  • Biur Eser Sefirot is with them also, despite being the most recently written text on here. It’s written by the Baal HaSulam on the Zohar.
  • Beginning of Wisdom is an intro to Kabbalah based on the writings of the Vilna Gaon and the Ramchal, and is directly in between them in the plot.
  • Kadmonim (Hechalot Rabbati) are towards the top right - united by Pardes Rimonim. All post-Cordoverian - Ramak, who took the Kadmonim and summarized them. The Arizal took those, and built his own system on the foundation of the Ramak's work, in a later time period. The Ramak also wrote the Or Neerav, an introduction to Kabbalah, which on the plot is right next to the Pardes Rimonim.

Chasidut

Details:
58 sfarim.
392491 n-grams of size 1.
940190 n-grams of size 2. (Removed 4468422 that only appeared in one text.)
554502 n-grams of size 3. (Removed 9913202 that only appeared in one text.)
228916 n-grams of size 4. (Removed 12021827 that only appeared in one text.)

Interpretation:

  • The works of Rav Tzadok (e.g. Sichat Shedim, Divrei Chalomot), Rav Nachman (e.g. Likutei Moharan, Sefer HaMidot), and the Baal Shem Tov (Keter Shem Tov, Tzavaat HaRivash) each form their own cluster.
  • The two Chabad sfarim (Tanya and Derech Mitzvosecha) cluster together.