Split text file into multiple smaller files

lineskc · May 23, 2021, 1:42am

I have a massive text file with a lot of markdown headings that I would like to split into separate text files based on the # Heading. I have a Mac as well as an iPad and iPhone. I am looking for any suggestions on the best way to accomplish this. I am not good with coding/scripting, so if you have a suggestion using that I may need some really simplified explanations. But I was hoping there was something using Keyboard Maestro, Hazel, or Shortcuts that could help me out.

sylumer · May 23, 2021, 7:53am

How “massive” is the text file? Can you give a size in KB/MB/GB? And roughly how many split out files do you expect to generate?

You can get performance issues when dumping thousands of files into a single directory.

Do you use only one level of heading? If not, do you want to split on, say, level 1 only, or on all levels?
If splitting on all levels, do you need a folder structure to split into?
Are your headings unique?
- This is an important consideration if you are using it as the basis of the file name.
What would you want to name each new file?
- If the name is based on the heading, do any of your headings contain characters that are not valid in file names?
- This may be an additional piece of processing to carry out.
Is this a one off process, or something you wish to do regularly?
- If it is a regular process, a solution on your Mac will likely run faster and be more efficient, whereas based on your background, a Shortcuts solution might be the easiest for you to understand, provided Shortcuts can actually process the file.
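If you want to check the uniqueness and file name questions before splitting anything, something like the following rough Python 3 sketch could do it. It assumes level 1 headings only, and both the function name `check_headings` and the list of "unsafe" characters are my own choices for illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Characters that commonly cause trouble in file names.
# This list is an assumption for illustration; adjust it to suit.
UNSAFE = re.compile(r'[\\/:*?"<>|]')

def check_headings(markdown_text):
    """Return (duplicates, unsafe) for the level 1 headings in the text."""
    headings = [
        line[2:].strip()
        for line in markdown_text.splitlines()
        if line.startswith("# ")
    ]
    counts = Counter(headings)
    # Headings that appear more than once would clash as file names.
    duplicates = sorted(h for h, n in counts.items() if n > 1)
    # Headings containing any of the characters listed in UNSAFE.
    unsafe = sorted({h for h in headings if UNSAFE.search(h)})
    return duplicates, unsafe

if __name__ == "__main__":
    sample = "# Alpha\ntext\n# Beta/Gamma\ntext\n# Alpha\n"
    dupes, bad = check_headings(sample)
    print("Duplicate headings:", dupes)
    print("Headings with unsafe characters:", bad)
```

If both lists come back empty, the headings should be safe to use directly as file names.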

Out of interest, what is the driver for now needing to split up the file? It may be that there are additional steps that could be carried out to prepare the files for their future use. E.g. setting up internal links between files for a PKM solution, auto importing into something, etc.

lineskc · May 23, 2021, 7:08pm

The text file (actually a .md file) is 1MB and I would expect about 200 files when complete.

Only 1 heading level so no structure needed, it would be split with the heading and the contents under the heading as 1 file.

All headings are unique, this would ideally become the filename. Headings are all compatible with filenames. Only text and a dash here and there.

This is a one off process. I have a religious reference book that I converted from PDF to text using OCR and now would like to put it into Obsidian as a personal reference.

Hope I answered all your questions, let me know if I missed something.

sylumer · May 23, 2021, 9:31pm

Okay, in the grand scheme of computing those are small numbers, so in the first instance I would suggest maybe mdsaw. It seems to have been created to do exactly what you describe.

It is a Python script (Python 3 I think) that can split a Markdown file by heading, creating new files with that heading name.

I just tried it on a test file to make sure it worked, and it did for me.

In case it helps, here is what I did:

I copied the mdsaw script into a new file called mdsaw (no file extension) that I put in the same folder as my test markdown file (test.md).
In the terminal, I navigated to the folder with these files in and then entered the following command to allow mdsaw to be executed as a script.
- chmod +x mdsaw
I entered a command in the terminal to decompose (split) the markdown file and place the output in the current folder.
- ./mdsaw -d test.md ./
  - The ./mdsaw says to execute the mdsaw script.
  - The -d tells the script to decompose the file.
  - The test.md is specifying the file name (/path) to decompose.
  - The ./ at the end is telling mdsaw to output to the current directory.
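In case it helps to see what a split like this involves, here is a minimal Python 3 sketch of the general idea. To be clear, this is not mdsaw's code, just my own illustration, and the function names `split_by_heading` and `write_sections` are made up. It assumes every section starts with a level 1 heading, that headings are unique and valid as file names, and that no line inside a code block starts with "# ". Any text before the first heading is ignored.

```python
from pathlib import Path

def split_by_heading(markdown_text):
    """Map each level 1 heading to the text of its section (heading included)."""
    sections = {}
    current = None
    for line in markdown_text.splitlines(keepends=True):
        if line.startswith("# "):
            # A new section starts here; a repeated heading would
            # replace the earlier section of the same name.
            current = line[2:].strip()
            sections[current] = []
        if current is not None:
            sections[current].append(line)
    return {title: "".join(lines) for title, lines in sections.items()}

def write_sections(source_path, output_dir):
    """Write one .md file per section, named after the heading text."""
    text = Path(source_path).read_text(encoding="utf-8")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for title, body in split_by_heading(text).items():
        (out / f"{title}.md").write_text(body, encoding="utf-8")
```

Calling `write_sections("test.md", "output")` would then write one file per heading into a folder called output.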

It looks like the to_filename section of the script near the top is modifying the file name a bit, so you could modify that to suit your own naming preferences. I think even with no coding experience you can tell most of what it is doing and what you might want to change or remove. By default it is switching to lowercase and replacing spaces with hyphens.

I think the code you would want to swap in for this so that the case and spacing is retained would be like so:

def to_filename(name, extension):
	name = name.replace('.txt', '')
	#name = name.lower()
	#name = re.sub(r'[\W.]+', '-', name)
	name += f'.{extension}'
	return name

This is just commenting out two lines that make the file name lower case and substitute the spaces for hyphens.

If you needed to install Python 3, you can use your favourite search engine for any number of articles on how to update Python on the Mac to version 3.

I did suspect that you might be converting something for Zettelkasten like purposes for Obsidian (it is the in thing right now in PKM). The reason I was interested in the background is that you could probably create an automation to create backlinks or cross references in the resulting files that would then be your back links (/ graph connections) in Obsidian. But that obviously requires more effort. With only 200 files, that is not an insurmountable effort to do manually, but a scripted solution could be a lot quicker.

Now it may be that you also just want to create an index page: something that reads through the original unsplit Markdown file, finds each heading in turn, and writes it out to a new index file as a link to the generated file.

For example, using the modified to_filename code, this terminal command should create a simple index file that would show up as a centralised node for the other nodes of the book, once you deposit them in your Obsidian vault.

cat "test.md" | grep "^# " | sed -e 's/^# //' | sed -e 's/.*/- [[&]]/' > "index.md"

The command takes your original file content, finds the lines which are headings (the assumption being that everything is a level 1 heading, as you asserted above), removes the Markdown heading syntax from each heading, prefixes it with a Markdown list identifier, wraps the heading text in Obsidian’s double square bracket internal link syntax, and then outputs the result into a file called index.md.

This would create the index file in the same directory as the original file (here test.md), containing a list of sections in the same order as the original file.
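If you would rather stay in Python than chain terminal commands, the same index could be built with a few lines like these. Again this is only a sketch under the same assumption (level 1 headings only), and `build_index` is just a name I picked.

```python
from pathlib import Path

def build_index(markdown_text):
    """Return an Obsidian-style link list, one entry per level 1 heading."""
    entries = []
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            # Strip the "# " prefix and wrap the heading in [[...]].
            entries.append(f"- [[{line[2:].strip()}]]\n")
    return "".join(entries)

if __name__ == "__main__":
    source = Path("test.md")
    if source.exists():
        index_text = build_index(source.read_text(encoding="utf-8"))
        Path("index.md").write_text(index_text, encoding="utf-8")
```

The links only resolve in Obsidian if the generated file names match the heading text exactly, which is why this goes hand in hand with the modified to_filename code.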

I realise that might be quite a lot to take in, but I’ve tried to break things down and explain things step-by-step along with some of the thinking behind it.

Hope that helps.

lineskc · May 23, 2021, 10:10pm

Wow, thanks so much! I appreciate all the help.

Each topic/heading actually has content underneath it that links to other entries that are similar. I have already used BBEdit to do a regex search/replace to add [[]] around all those similar topics, so once I have them in Obsidian they will all be backlinked and show up in the graph.

Thanks again!

Tjluoma · May 24, 2021, 12:21am

So, when you say “one heading level” I assume that you mean they are all using the same level, for example, each section begins with something like: ## Words Words Words ?

lineskc · May 24, 2021, 1:36am

Everything is a level 1 heading, so every heading in the document is prefaced by one #. For example:

# 1st section

# 2nd section

# 3rd section

lineskc · May 24, 2021, 3:42pm

I felt the need to thank you again. I did this last night and it worked perfectly. It was extremely fast and literally did exactly what I needed. Thanks!

sylumer · May 24, 2021, 8:18pm

Brilliant. Glad it worked so well for you.