Why does my Node.js script bog down during fs.readFile and fs.appendFile when handling large numbers of files?
Solution 1:
Node.js works asynchronously.
The problem
So the way your code is structured, this happens:
1. The function iterateFiles issues 120k fs.readFile calls in a row, which causes Node.js to queue 120k filesystem read operations.
2. When the read operations are complete, Node.js will call the 120k callbacks for fs.readFile, and each of these will issue a fs.appendFile operation, which will cause Node.js to queue 120k filesystem write operations.
3. Eventually Node.js will call the 120k callbacks that were passed to fs.appendFile. Until these write operations are completed, Node.js must hang onto the data that is to be written.
The Solution
For a task like this I would suggest using the synchronous versions of the fs calls: fs.readFileSync and fs.appendFileSync.
When writing code for a web server, or code that is otherwise event-driven, you don't want to use the synchronous versions of these calls because they will block your application. But if you are writing code that does batch processing of data (for instance, code that operates like a shell script would), it is simpler to use the synchronous versions.
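Here is a rough sketch of what that synchronous approach could look like, reusing the ParserRules object from the illustration below. The startSync name and the try/catch around unreadable entries are choices made for this sketch, not part of your original code:

var fs = require('fs');

var ParserRules = {
    saveFile: 'output.csv',
    parseFolder: '/tmp'
};

function startSync() {
    // Get the directory listing up front; this blocks until it is complete.
    var filesToParse = fs.readdirSync(ParserRules.parseFolder);
    for (var i = 0; i < filesToParse.length; i++) {
        var file = ParserRules.parseFolder + '/' + filesToParse[i];
        var data;
        try {
            // Read one file at a time, so only one file's data is in memory at once.
            data = fs.readFileSync(file, {encoding: 'utf8'});
        } catch (e) {
            continue; // skip subdirectories and unreadable entries
        }
        if (data.length) {
            // Append immediately; this file's data can be released before the next read.
            fs.appendFileSync(ParserRules.saveFile, data, {encoding: 'utf8'});
        }
    }
}

startSync();

Because each file is read, and its contents appended, before the next read starts, memory use stays at roughly one file's worth of data instead of 120k files' worth.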
Illustration
The following code is a simplified model of your code and illustrates the problem. It is set to read from /tmp because that's as good a source of files as any. I've also set it to skip any work beyond parseFile if a file is empty.
var fs = require('fs');

var ParserRules = {
    saveFile: 'output.csv',
    parseFolder: '/tmp'
};

start();

function start() {
    console.log('Starting...');
    fs.readdir(ParserRules.parseFolder, iterateFiles);
}

function iterateFiles(err, filesToParse) {
    for (var i = 0; i < filesToParse.length; i++) {
        var file = ParserRules.parseFolder + '/' + filesToParse[i];
        console.log('Beginning read of file number ' + i);
        fs.readFile(file, {encoding: 'utf8'}, parseFile);
    }
}

var parse_count = 0;
function parseFile(err, data) {
    if (err)
        return;

    if (data.length) {
        console.log("Parse: " + parse_count++);
        getContent(data);
    }
}

function getContent(data) {
    saveToCsv(data, ParserRules.saveFile);
}

var save_count = 0;
function saveToCsv(data, filePath) {
    fs.appendFile(filePath, data, {encoding: 'utf8', flag: 'a'},
        function(err) {
            if (err) {
                console.log('Write Error: ' + err);
            } else {
                console.log('Saved: ' + save_count++);
            }
        });
}
If you run this code you'll see that all the Parse: messages appear contiguously. Only after all the Parse: messages are output do you get the Saved: messages. So you'd see something like:
Beginning read of file number N
Beginning read of file number N+1
Parse: 0
Parse: 1
... more parse messages ...
Parse: 18
Parse: 19
Saved: 0
Saved: 1
... more saved messages...
Saved: 18
Saved: 19
What this tells you is that Node does not start to save until all files are parsed. Since Node cannot release the data associated with a file until it knows it won't be used again (in this case, until the file has been saved), at some point Node will need a minimum of 120,000 * 70 KB of memory, roughly 8 GB, to hold all the data from all the files.
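If you want to watch that growth happen, one option (not part of the code above, just a small sketch) is to log Node's resident set size while the batch runs:

// Sketch: print the resident set size once a second while the batch runs.
var memTimer = setInterval(function () {
    var rss = process.memoryUsage().rss;
    console.log('RSS: ' + Math.round(rss / 1024 / 1024) + ' MB');
}, 1000);
memTimer.unref(); // don't let this timer keep the process alive on its own

With the fully asynchronous version you should see the RSS climb steadily until the writes finally complete; with the synchronous version it should stay roughly flat.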