
Why Does My NodeJS Script Bog Down During Fs.readFile And Fs.appendFile Handling Large Numbers Of Files.

I have a folder with about 120k HTML pages that I need to open (each file is about 70 KB), parse some data out with XPath, and append that data to a .csv file. Below is my code:

Solution 1:

Node.js works asynchronously.

The problem

So the way your code is structured, this happens:

  1. The function iterateFiles issues 120k fs.readFile calls in a row, which causes Node.js to queue 120k filesystem read operations.

  2. When the read operations are complete, Node.js will call the 120k callbacks for fs.readFile, and each of these will issue an fs.appendFile call, which will cause Node.js to queue 120k filesystem write operations.

  3. Eventually Node.js will call the 120k callbacks that were passed to fs.appendFile. Until these write operations are completed, Node.js must hang on to the data that is to be written.

The Solution

For a task like this I would suggest using the synchronous versions of the fs calls: fs.readFileSync and fs.appendFileSync.

When writing code for a web server, or for anything else that is event-driven, you don't want to use the synchronous versions of these calls because they will block your application. But if you are writing code that does batch processing of data (for instance, code that operates the way a shell script would), it is simpler to use the synchronous versions of these calls.
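
For reference, here is a minimal sketch of what that synchronous approach might look like, reusing the ParserRules names from the illustration below. parseWithXPath is a hypothetical stand-in for your own XPath extraction step, and the try/catch simply skips directories and unreadable entries:

var fs = require('fs');
var path = require('path');

var ParserRules = {
    saveFile: 'output.csv',
    parseFolder: '/tmp'
};

// Hypothetical stand-in for your XPath extraction; replace with your real logic.
function parseWithXPath(html) {
    return html.length + '\n';
}

var files = fs.readdirSync(ParserRules.parseFolder);
for (var i = 0; i < files.length; i++) {
    var file = path.join(ParserRules.parseFolder, files[i]);
    try {
        // Each read completes before the append, and the append completes
        // before the next file is read, so only one file's data is held in
        // memory at any given time.
        var data = fs.readFileSync(file, {encoding: 'utf8'});
        if (data.length) {
            fs.appendFileSync(ParserRules.saveFile, parseWithXPath(data),
                              {encoding: 'utf8'});
        }
    } catch (e) {
        // Skip directories and files that cannot be read.
    }
}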

Illustration

The following code is a simplified model of your code and illustrates the problem. It reads from /tmp because that's as good a source of files as any. I've also set it to skip any work beyond parseFile when a file is empty.

var fs = require('fs');

var ParserRules = {
    saveFile: 'output.csv',
    parseFolder: '/tmp'
};

start();

function start() {
    console.log('Starting...');
    fs.readdir(ParserRules.parseFolder, iterateFiles);
}

function iterateFiles(err, filesToParse) {
    if (err)
        return;

    for (var i = 0; i < filesToParse.length; i++) {
        var file = ParserRules.parseFolder + '/' + filesToParse[i];
        console.log('Beginning read of file number ' + i);
        fs.readFile(file, {encoding: 'utf8'}, parseFile);
    }
}

var parse_count = 0;
function parseFile(err, data) {
    if (err)
        return;

    if (data.length) {
        console.log("Parse: " + parse_count++);
        getContent(data);
    }
}

function getContent(data) {
    saveToCsv(data, ParserRules.saveFile);
}

var save_count = 0;
function saveToCsv(data, filePath) {
    fs.appendFile(filePath, data, {encoding: 'utf8', flag: 'a'},
                  function(err){
        if (err) {
            console.log('Write Error: ' + err);
        } else {
            console.log('Saved: ' + save_count++);
        }
    });
}

If you run this code you'll see that all the Parse: messages appear contiguously, and only after all the Parse: messages have been output do you get the Saved: messages. So you'd see something like:

Beginning read of file number N
Beginning read of file number N+1
Parse: 0
Parse: 1
... more parse messages ...
Parse: 18
Parse: 19
Saved: 0
Saved: 1
... more saved messages...
Saved: 18
Saved: 19

What this tells you is that Node does not start to save until all the files have been parsed. Since Node cannot release the data associated with a file until it knows that data won't be used again (in this case, until the file has been saved), at some point Node will need a minimum of 120,000 * 70 KB (roughly 8 GB) of memory to hold the data from all the files.
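
If you would rather keep the I/O asynchronous, a different approach (not the one suggested above, and it assumes a Node.js version that ships fs/promises) is to process the files strictly one at a time with the promise-based fs API, which also bounds memory to roughly one file's worth of data. A minimal sketch, again with a hypothetical extractRow stand-in for the XPath step:

const fs = require('fs/promises');
const path = require('path');

const ParserRules = {
    saveFile: 'output.csv',
    parseFolder: '/tmp'
};

// Hypothetical stand-in for the XPath extraction step.
function extractRow(html) {
    return html.length + '\n';
}

async function run() {
    const files = await fs.readdir(ParserRules.parseFolder);
    for (const name of files) {
        const file = path.join(ParserRules.parseFolder, name);
        try {
            // Awaiting each read and append before moving on keeps at most
            // one file's contents in memory at a time.
            const data = await fs.readFile(file, 'utf8');
            if (data.length) {
                await fs.appendFile(ParserRules.saveFile, extractRow(data), 'utf8');
            }
        } catch (err) {
            // Skip directories and unreadable entries.
        }
    }
}

run();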
